Before you can perform any search operation, you must add the documents you want to search to the index. In the coming chapters, you will learn different ways to index a document.
Lucene does not understand data formats such as RDBMS tables, PDF files, Word documents, or spreadsheets. It is your responsibility to extract the text from those sources, map it to a Lucene document, and add the document to the index.
Adding documents to the index is not a single-step process. First, the field values of the documents are analyzed: tokens are generated, unnecessary data such as stop words (a, an, the, and so on) is removed, and the tokens may be stemmed and converted to lowercase to support case-insensitive search. Finally, the resulting tokens are written to the index.
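If you are on a recent Lucene release (roughly 5.x or later), the whole flow looks like the sketch below. The index path and the field names ("name" and "contents") are only placeholder choices for this illustration; in real code the text would come from whatever source you extracted it from.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexOneDocument {

    public static void main(String[] args) throws Exception {
        // Text that you extracted yourself from a PDF, spreadsheet, database row, etc.
        String extractedText = "Java is a popular programming language. Lucene is a Java Library.";

        // Open (or create) an index directory on disk; the path is just an example.
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));

        // StandardAnalyzer tokenizes and lowercases field values during indexing.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            // Stored, not analyzed: identifies the document in search results.
            doc.add(new StringField("name", "Document 1", Field.Store.YES));
            // Analyzed: its tokens are what end up in the inverted index.
            doc.add(new TextField("contents", extractedText, Field.Store.YES));

            writer.addDocument(doc);   // analysis happens here, then tokens go to the index
            writer.commit();
        }
    }
}

A TextField is analyzed into tokens, while a StringField is indexed as a single exact value, which is why it suits identifiers such as the document name.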
Let me explain how Lucene organizes the index with an example.
Document 1
Java is a popular programming language. Lucene is a Java Library.
Document 2
Lucene Java Library used to perform search operations.
Document 3
Lucene in Action.
We have three documents: Document 1, Document 2, and Document 3. Whenever a request is issued to add these documents to the index, Lucene creates a unique identifier for each document and assigns it to that document.
For example,
Document    | id
Document 1  | 1
Document 2  | 2
Document 3  | 3
As you can see in the table above, identifier 1 is assigned to ‘Document 1’, identifier 2 to ‘Document 2’, and identifier 3 to ‘Document 3’.
Now Lucene tokenizes the words in each document and maps each token to the ids of the documents in which it appears.
The resulting inverted index looks like the table below (a toy sketch of this mapping in plain Java follows the table).
Term        | Available in Documents
Java        | 1, 2
is          | 1
a           | 1
popular     | 1
programming | 1
language    | 1
Lucene      | 1, 2, 3
Library     | 1, 2
used        | 2
to          | 2
perform     | 2
search      | 2
operations  | 2
in          | 3
Action      | 3
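To make the term-to-document mapping concrete, here is a toy sketch of an inverted index built with plain Java collections. It only illustrates the idea; it is not Lucene's real data structure or on-disk format, and it uses a naive split on non-letter characters instead of a Lucene analyzer.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyInvertedIndex {

    public static void main(String[] args) {
        String[] docs = {
            "Java is a popular programming language. Lucene is a Java Library.",
            "Lucene Java Library used to perform search operations.",
            "Lucene in Action."
        };

        // term -> ids of the documents that contain the term
        Map<String, List<Integer>> index = new LinkedHashMap<>();

        for (int docId = 1; docId <= docs.length; docId++) {
            // Naive tokenization on non-letter characters; real analyzers do much more.
            for (String token : docs[docId - 1].split("[^a-zA-Z]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                if (!postings.contains(docId)) {
                    postings.add(docId);
                }
            }
        }

        // Prints lines such as: Lucene -> [1, 2, 3]
        index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}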
Lucene uses this inverted index to serve search queries. For example, if a user searches for the term ‘programming’, Lucene can quickly see that the word appears only in Document 1. If the user searches for the term ‘Lucene’, Lucene can serve Documents 1, 2 and 3.
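A lookup against the example index could look like the sketch below, assuming the index was written with the "contents" field from the earlier indexing sketch and lives at the same placeholder path. TermQuery matches a single indexed token, and because StandardAnalyzer lowercases tokens at index time, the query term is given in lowercase.

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchExample {

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Look up a single term in the inverted index of the "contents" field.
            Query query = new TermQuery(new Term("contents", "programming"));

            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                // Retrieve the stored "name" field of each matching document.
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
        }
    }
}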
This example is deliberately simple. In practice, Lucene's analyzers do more work, such as removing stop words (words like a, an, the, is, in, which carry little weight in search) and supporting case-insensitive search by converting tokens to lowercase while indexing.
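If you want to see exactly which tokens survive analysis, the sketch below runs a stop-word-aware StandardAnalyzer over the first example sentence and prints the result. It assumes Lucene 7 or later with the analysis-common (formerly analyzers-common) module on the classpath; the EnglishAnalyzer stop-word set is passed explicitly because the default stop-word behaviour of StandardAnalyzer has changed between versions.

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {

    public static void main(String[] args) throws Exception {
        // Pass an English stop-word set explicitly instead of relying on version-specific defaults.
        StandardAnalyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

        String text = "Java is a popular programming language. Lucene is a Java Library.";

        try (TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // Expected output: java, popular, programming, language, lucene, java, library
                System.out.println(term.toString());
            }
            stream.end();
        }

        analyzer.close();
    }
}

Notice that the stop words ‘is’ and ‘a’ are dropped and everything is lowercased, which is exactly why a search for ‘java’ also matches documents that contain ‘Java’.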