Apache Lucene database

5/26/2023

Search plays a pivotal role in just about any modern application, from shopping sites to social networks to points of interest. The Lucene search library is today's de facto standard for implementing search engines. It is used by Apple, IBM, Atlassian (Jira), Wolfram; pick your favorite company. As a result, any implementation that improves Lucene's scalability and performance is of great interest.

Searchable entities in Lucene are represented as documents comprised of fields and their values. Every field value is comprised of one or more searchable elements, called terms. Lucene search is based on an inverted index containing information about searchable documents. Unlike normal indexes, where you look up a document to know what fields it contains, in an inverted index you look up a field's term to know all the documents it appears in.

A high-level Lucene architecture is presented in Figure 1. Its main components are IndexSearcher, IndexReader, IndexWriter and Directory. IndexSearcher implements the search logic. IndexWriter writes inverted indexes for each inserted document. IndexReader reads the content of indexes in support of IndexSearcher. Both IndexReader and IndexWriter rely on Directory, which provides APIs for manipulating index data sets that directly mimic the file system API.

The drawback of a standard file system-based backend (Directory implementation) is performance degradation caused by index growth. Different techniques have been used to overcome this problem, including load balancing and index sharding, that is, splitting indexes between multiple Lucene instances. Although powerful, sharding complicates the overall implementation architecture and requires a certain amount of a priori knowledge about the expected documents in order to partition Lucene indexes properly.

A different approach is to let the index backend itself shard data correctly and to build an implementation on top of such a backend. One such backend can be a NoSQL database. In this article we will describe an implementation based on HBase.

Implementation approach

At a very high level, Lucene operates on two distinct data sets:

Index data set keeps all the Field/Term pairs (with additional information such as term frequency, position, etc.) and the documents containing these terms in the appropriate fields.
Document data set stores all the documents, including stored fields, etc.

As we have mentioned above, directly implementing the Directory interface is not always the simplest (most convenient) approach to porting Lucene to a new backend. As a result, several Lucene ports, including the limited memory index support from the Lucene contrib module, Lucandra and HBasene, took a different approach and overrode not the directory but Lucene's higher-level classes, IndexReader and IndexWriter, thus bypassing the Directory APIs (Figure 2).

Figure 2: Integration of Lucene with a back end without a file system

Although such an approach often requires more work, it leads to significantly more powerful implementations, allowing full use of the back end's native capabilities. The implementation presented in this article follows this approach.

The overall implementation (Figure 3) is based on a memory-based backend used as an in-memory cache and a mechanism for synchronizing this cache with the HBase backend.

Figure 3: Overall architecture of the HBase-based Lucene implementation

The implementation tries to balance two conflicting requirements. The first is performance: an in-memory cache can drastically improve performance by minimizing the number of HBase reads for search and document retrieval. The second is scalability: the ability to run as many Lucene instances as required to support a growing population of search clients. The latter requires minimizing the cache lifetime in order to keep content synchronized with the HBase instance (the single copy of truth). A compromise is achieved by implementing a configurable cache time-to-live parameter, limiting cache presence in each Lucene instance.

Underlying data model for in memory cache

As mentioned before, the internal Lucene data model is based on two main data sets, index and documents, which are implemented as two models: IndexMemoryModel and DocumentMemoryModel.
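
The configurable cache time to live described in this article can be illustrated with a minimal sketch. It is not the article's actual IndexMemoryModel or DocumentMemoryModel code: the class name TtlCache, the injected clock, and the backendLoader function (standing in for an HBase read) are all assumptions made for illustration. The sketch only shows the trade-off itself: entries younger than the time to live are served from memory, while older ones are re-read from the backend.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical in-memory cache with a time to live; names are illustrative only. */
final class TtlCache<K, V> {

    /** A cached value together with the time it was loaded from the backend. */
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;           // assumption: injected so expiry is testable
    private final Function<K, V> backendLoader; // assumption: stands in for an HBase read

    TtlCache(long ttlMillis, LongSupplier clock, Function<K, V> backendLoader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.backendLoader = backendLoader;
    }

    /** Returns the cached value while it is fresh; otherwise reloads it from the backend. */
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> cached = entries.get(key);
        if (cached != null && now - cached.loadedAtMillis < ttlMillis) {
            return cached.value; // fresh enough: no backend read
        }
        V fresh = backendLoader.apply(key); // expired or missing: go to the backend
        entries.put(key, new Entry<>(fresh, now));
        return fresh;
    }
}

final class TtlCacheDemo {
    public static void main(String[] args) {
        // The loader here just fabricates a string; a real one would query the backend.
        TtlCache<String, String> cache =
                new TtlCache<>(5_000L, System::currentTimeMillis, key -> "document for " + key);
        System.out.println(cache.get("doc1")); // first call loads from the "backend"
        System.out.println(cache.get("doc1")); // second call is served from memory
    }
}
```

A shorter time to live keeps each Lucene instance closer to the backend's content at the cost of more backend reads; a longer one favors read performance. That is the compromise the configurable parameter exposes.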
