Programming for beginners: Lucene: Analyzers

Analyzers tokenize the given text and apply filtering techniques that make the search results more relevant.

Some of these filtering techniques are listed below.

a. Stopword Filtering

Stopwords such as 'a', 'an', and 'the' add little meaning to the given text. Analyzers can be used to filter out these stopwords.
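To make the idea concrete, here is a minimal sketch of stopword filtering in plain Java. It is not Lucene's implementation: the class name StopwordSketch, the method removeStopwords, and the tiny stopword set are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopwordSketch {
    // Tiny illustrative stopword set (an assumption); real analyzers use longer lists.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of");

    // Splits the text on whitespace and keeps only tokens that are not stopwords.
    static List<String> removeStopwords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            // Compare case-insensitively so "The" is treated like "the".
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

For example, StopwordSketch.removeStopwords("The cat sat on a mat") keeps cat, sat, on, and mat, because 'on' is not in this small set.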

b. Text Normalization

Data can be standardized by applying normalizations such as lowercasing the text and removing accents (diacritics).
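The two normalizations mentioned above can be sketched with the standard java.text.Normalizer class. This is only an illustration in plain Java, not how Lucene does it internally; the class name NormalizeSketch and the method normalize are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

class NormalizeSketch {
    // Lowercases the text and strips accents (combining marks).
    static String normalize(String text) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then removed.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, 'Crème Brûlée' and 'creme brulee' normalize to the same string, so a search for one can match the other.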

c. Stemming

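Stemming is a reduction process that converts words to their base form, typically by stripping suffixes. For example, words like eat, eats, and eating are stemmed to the word eat. Irregular forms such as ate are generally not reduced by suffix stripping and are better handled by lemmatization. Lucene provides built-in stemmers like Snowball, PorterStem, and KStem.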
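The following toy stemmer shows the suffix-stripping idea in plain Java. It is a deliberately naive sketch and not the Porter, Snowball, or KStem algorithm; the class name StemSketch, the suffix list, and the minimum stem length of three characters are assumptions chosen for this example.

```java
class StemSketch {
    // Suffixes to strip, checked in this order (an illustrative choice).
    static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Removes the first matching suffix, as long as at least
    // three characters remain, so short words like "is" or "red" are untouched.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= 3) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }
}
```

Here eating and eats both become eat, while ate stays as it is, which is why irregular forms need a different technique.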

d. Lemmatization

Lemmatization reduces words to their dictionary form (lemma) by considering their meaning and grammatical rules. Lemmatization is language- and context-sensitive.
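A very small sketch of the idea in plain Java is a dictionary lookup keyed by the word and its part of speech. Real lemmatizers use large dictionaries and language models; the class name LemmaSketch, the part-of-speech labels, and the five dictionary entries here are hypothetical and exist only for this example.

```java
import java.util.Map;

class LemmaSketch {
    // Hand-written dictionary keyed by "word/PART_OF_SPEECH" (illustrative only).
    static final Map<String, String> LEMMAS = Map.of(
            "ate/VERB", "eat",
            "eaten/VERB", "eat",
            "saw/VERB", "see",
            "saw/NOUN", "saw",
            "better/ADJ", "good");

    // Returns the lemma for the word in the given grammatical role,
    // or the word itself when the dictionary has no entry.
    static String lemmatize(String word, String partOfSpeech) {
        return LEMMAS.getOrDefault(word + "/" + partOfSpeech, word);
    }
}
```

The word saw shows the context sensitivity: as a verb it maps to see, but as a noun (the tool) it stays saw.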

e. Synonym Expansion

Words in the given data can be expanded to their synonyms to get better results. For example, the word 'angry' can be expanded to 'annoyed', 'outraged', 'heated', 'furious', etc.
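The 'angry' example above can be sketched in plain Java with a simple lookup map. This is not Lucene's synonym handling; the class name SynonymSketch and the method expand are assumptions for illustration, and the map holds only the one entry from the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SynonymSketch {
    // One-entry synonym table taken from the example in the text.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "angry", List.of("annoyed", "outraged", "heated", "furious"));

    // Keeps every original token and adds its synonyms right after it.
    static List<String> expand(List<String> tokens) {
        List<String> expanded = new ArrayList<>();
        for (String token : tokens) {
            expanded.add(token);
            expanded.addAll(SYNONYMS.getOrDefault(token, List.of()));
        }
        return expanded;
    }
}
```

Expanding the tokens angry and customer gives angry, annoyed, outraged, heated, furious, and customer, so a document containing any of those words can match.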

Lucene Built-in Analyzers

Lucene provides some built-in analyzers. You can get these analyzers by adding the 'lucene-analyzers-common' artifact to your project.

I am using the dependency below for this tutorial.

<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-analyzers-common</artifactId>
	<version>8.4.1</version>
</dependency>

In my next posts, I am going to explain the analyzers below.

a. WhitespaceAnalyzer

b. SimpleAnalyzer

c. StopAnalyzer

d. StandardAnalyzer

e. KeywordAnalyzer

f. Language Analyzers
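As a rough preview of how some of the analyzers listed above differ, the sketch below imitates three tokenization styles in plain Java. It only approximates the behavior and does not use the Lucene classes; the class name AnalyzerPreview and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalyzerPreview {
    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split at every non-letter, then lowercase.
    static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // KeywordAnalyzer-like: the whole input is a single token.
    static List<String> keywordTokens(String text) {
        return List.of(text);
    }
}
```

For the input 'Hello, Lucene-8 World!', the whitespace style keeps 'Hello,', 'Lucene-8', and 'World!', the simple style gives hello, lucene, and world, and the keyword style returns the entire string as one token.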

Monday, 21 June 2021
