Analyzers tokenize the given text and apply filtering techniques to the resulting tokens so that searches return better results.
Some of these filtering techniques are listed below.
a. Stopword Filtering
Stop words like a, an, the, etc., do not add any specific meaning to the given text. Analyzers can be used to filter out these stop words, as in the sketch below.
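Here is a minimal sketch of stop word filtering with Lucene 8.4.1's StopAnalyzer; the field name 'content' and the sample sentence are just placeholders.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopwordFilteringDemo {
    public static void main(String[] args) throws Exception {
        // StopAnalyzer splits on non-letters, lowercases, and drops the supplied stop words
        Analyzer analyzer = new StopAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        try (TokenStream stream = analyzer.tokenStream("content", "The fox is an animal")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // prints: fox, animal
            }
            stream.end();
        }
    }
}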
b. Text Normalization
Data can be standardized by applying various normalizations, such as lowercasing the text and removing accents.
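A minimal sketch of such normalization, assuming a custom analyzer that chains LowerCaseFilter and ASCIIFoldingFilter behind a StandardTokenizer; the field name and sample text are placeholders.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NormalizationDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();
                // lowercase each token, then fold accented characters to their ASCII equivalents
                TokenStream stream = new ASCIIFoldingFilter(new LowerCaseFilter(tokenizer));
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
        try (TokenStream stream = analyzer.tokenStream("content", "Café CRÈME")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // prints: cafe, creme
            }
            stream.end();
        }
    }
}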
c. Stemming
Stemming is a reduction process that converts words to their base form. For example, words like eat, eats, and eating are all stemmed to eat. (Irregular forms such as ate and eaten generally require lemmatization instead.) Lucene provides built-in stemmers like Snowball, PorterStem, and KStem.
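A minimal sketch using PorterStemFilter, one of the built-in stemmers mentioned above; the sample sentence is a placeholder.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemmingDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();
                // PorterStemFilter expects lowercased tokens, so lowercase first
                TokenStream stream = new PorterStemFilter(new LowerCaseFilter(tokenizer));
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
        try (TokenStream stream = analyzer.tokenStream("content", "she eats while we are eating")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // eats -> eat, eating -> eat
            }
            stream.end();
        }
    }
}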
d. Lemmatization
Lemmatization reduces words to their dictionary form (lemma) by considering their meaning and grammatical rules; it can map irregular forms like ate and eaten back to eat. Lemmatization is language- and context-sensitive.
e. Synonym Expansion
Words in the given data can be expanded to their synonyms to get better results. For example, the word 'angry' can be expanded to 'annoyed', 'outraged', 'heated', 'furious', etc., as in the sketch below.
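A minimal sketch of synonym expansion using SynonymGraphFilter with a hand-built SynonymMap; the mappings, field name, and sample text are placeholders.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;

public class SynonymExpansionDemo {
    public static void main(String[] args) throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("angry"), new CharsRef("furious"), true); // keep the original token too
        builder.add(new CharsRef("angry"), new CharsRef("annoyed"), true);
        SynonymMap synonyms = builder.build();

        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new WhitespaceTokenizer();
                TokenStream stream = new SynonymGraphFilter(tokenizer, synonyms, true);
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
        try (TokenStream stream = analyzer.tokenStream("content", "angry customer")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // emits angry, furious, and annoyed at the same position, then customer
                System.out.println(term);
            }
            stream.end();
        }
    }
}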
Lucene Built-in Analyzers
Lucene provides some built-in analyzers. You can get all of them by adding the 'lucene-analyzers-common' artifact to your project.
I am using the dependency below for this tutorial.
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.4.1</version>
</dependency>
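Once the artifact is on the classpath, any of the built-in analyzers can be used directly. A quick smoke test with WhitespaceAnalyzer (the field name and sample text are placeholders):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSmokeTest {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer simply splits the text on whitespace, with no further filtering
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
        try (TokenStream stream = analyzer.tokenStream("content", "Hello Lucene Analyzers")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // Hello, Lucene, Analyzers
            }
            stream.end();
        }
    }
}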
In my next posts, I am going to explain the analyzers below.
a. WhitespaceAnalyzer
b. SimpleAnalyzer
c. StopAnalyzer
d. StandardAnalyzer
e. KeywordAnalyzer
f. Language Analyzers