Programming for beginners: Introduction to Lucene

Lucene is a library that indexes your data to make it searchable. If you want to add search capability to your application, Lucene is your buddy.

Basic Process of Search Application

There are three steps involved in a typical search application.

a. Fetch data. The data can come from the Internet, documents, databases (such as SQL databases), log files, etc.

b. Add the crawled or fetched data to an index.

c. Search indexed data.

Features of Lucene

a. Lucene is scalable and offers high-performance indexing (over 150GB/hour on modern hardware).

b. Supports incremental indexing.

c. Supports rank-based search. The best-matching results come first.

d. Supports a wide variety of query types, such as range queries, proximity queries and wildcard queries.

e. Supports field-level searching.

f. Supports sorting by field, and many more.

g. Lucene itself is written in Java; ports and bindings exist for other languages, such as PyLucene for Python and Lucene.NET for C#.

h. It is an open source project that has matured over many years.

You can refer to the following link for the latest features of Lucene.

https://lucene.apache.org/core/

Does Lucene perform crawling?

No, Lucene is just a search library. You must handle crawling and filtering of the data yourself before feeding it to a Lucene index.

Is Lucene a complete search engine?

No, Lucene is a Java library that performs document indexing and information retrieval.

What is Indexing?

Indexing is the process of adding documents to a datastore. Whenever a document is added to the datastore, Lucene creates a unique identifier and assigns it to the document.

Let me explain with an example.

Document 1

Java is a popular programming language. Lucene is a Java Library.

Document 2

Lucene Java Library used to perform search operations.

Document 3

Lucene in Action.

We have three documents: Document 1, Document 2 and Document 3. When a request is issued to add these documents to the datastore, Lucene creates a unique identifier for each document and assigns it.

For example,

Document	id
Document 1	1
Document 2	2
Document 3	3

As the table above shows, identifier 1 is assigned to ‘Document 1’, identifier 2 is assigned to ‘Document 2’ and identifier 3 is assigned to ‘Document 3’.
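The id assignment described above can be sketched in plain Java. This is only an illustration, not Lucene code: the class `DocumentStoreSketch` and its methods `add` and `get` are hypothetical names invented for this example, and the ids start at 1 simply to match the table. Lucene manages its own internal document numbers, which may not behave exactly like this.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory document store (hypothetical, not part of Lucene). */
final class DocumentStoreSketch {
    private final List<String> documents = new ArrayList<>();

    /** Stores the text and returns the identifier assigned to it (1, 2, 3, ...). */
    int add(String text) {
        documents.add(text);
        return documents.size();
    }

    /** Returns the text stored under the given identifier. */
    String get(int id) {
        return documents.get(id - 1);
    }

    public static void main(String[] args) {
        DocumentStoreSketch store = new DocumentStoreSketch();
        int id1 = store.add("Java is a popular programming language. Lucene is a Java Library.");
        int id2 = store.add("Lucene Java Library used to perform search operations.");
        int id3 = store.add("Lucene in Action.");
        System.out.println(id1 + ", " + id2 + ", " + id3); // prints "1, 2, 3"
    }
}
```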

Next, Lucene tokenizes the words in each document and maps the tokens to document ids.

The Lucene inverted index looks like the table below.

Term	Available in Documents
Java	1, 2
is	1
a	1
popular	1
programming	1
language	1
Lucene	1, 2, 3
Library	1, 2
used	2
to	2
perform	2
search	2
operations	2
in	3
Action	3

Lucene uses this inverted index to serve search queries. For example, if a user searches for the term ‘programming’, Lucene can quickly determine that this word is in document 1. If a user searches for the term ‘Lucene’, Lucene can return documents 1, 2 and 3.
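To make the idea concrete, here is a minimal sketch of an inverted index in plain Java, built from the three example documents. It does not use the Lucene API; the class `InvertedIndexSketch` and its methods `add` and `search` are hypothetical names, and the tokenizer (splitting on anything that is not a letter or digit) is an assumption made for this example. Lucene's real index structures are far more sophisticated.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not part of Lucene). */
final class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new LinkedHashMap<>();

    /** Tokenizes the text and records the document id under every token. */
    void add(int docId, String text) {
        // Split on any run of characters that are not letters or digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the exact (case-sensitive) term. */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Java is a popular programming language. Lucene is a Java Library.");
        index.add(2, "Lucene Java Library used to perform search operations.");
        index.add(3, "Lucene in Action.");

        System.out.println(index.search("programming")); // prints "[1]"
        System.out.println(index.search("Lucene"));      // prints "[1, 2, 3]"
    }
}
```

A lookup is a single map access, which is why the index can answer a term query without scanning the text of every document.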

This example is simple, but Lucene has more capabilities, such as removing stop words (words like a, an, the, is and in, which have lower importance in search) and supporting case-insensitive search by converting tokens to lowercase while indexing.
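The two steps mentioned above, lowercasing and stop-word removal, can be sketched in plain Java as follows. This is an illustration only: in Lucene this work is done by analyzers, and the class `AnalyzerSketch`, the method `analyze` and the five-word stop list (taken from the examples in the previous paragraph) are assumptions made for this example, not Lucene's actual implementation or defaults.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy text analysis step (hypothetical, not part of Lucene). */
final class AnalyzerSketch {
    // Assumed stop-word list, taken from the examples in the text.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "in"));

    /** Splits the text into tokens, lowercases them and drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Java Library.")); // prints "[lucene, java, library]"
        System.out.println(analyze("Lucene in Action."));         // prints "[lucene, action]"
    }
}
```

For case-insensitive search to work, the same analysis has to be applied to the query term, so that a search for ‘JAVA’ or ‘java’ is looked up as the same token.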

Other Open Source Applications Built on Top of Lucene

Apache Solr and Elasticsearch are built on top of Lucene.

Lucene: Maven Dependencies Used

	<dependencies>
	
		<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-core</artifactId>
			<version>8.4.1</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queryparser</artifactId>
			<version>8.4.1</version>
		</dependency>
		
		<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-analyzers-common</artifactId>
			<version>8.4.1</version>
		</dependency>

	</dependencies>


Wednesday, 16 June 2021
