Monday, 21 June 2021

Lucene: SimpleAnalyzer

SimpleAnalyzer tokenizes text at non-letter characters (it uses the 'Character.isLetter()' method to decide whether a given character is a letter) and normalizes every token to lowercase. Digits, punctuation and whitespace are therefore discarded rather than kept as tokens.
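To make the splitting rule concrete, here is a minimal plain-Java sketch that imitates it without any Lucene dependency. The class name SimpleAnalyzerSketch and the tokenize helper are made up for illustration and are not part of Lucene; the sketch only assumes the rule described above (split wherever Character.isLetter() is false, then lowercase).

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleAnalyzerSketch {

	// Hypothetical helper, not a Lucene API: splits at non-letters and lowercases.
	public static List<String> tokenize(String text) {
		List<String> tokens = new ArrayList<>();
		StringBuilder current = new StringBuilder();

		for (int i = 0; i < text.length();) {
			int codePoint = text.codePointAt(i);
			i += Character.charCount(codePoint);

			if (Character.isLetter(codePoint)) {
				// Letters extend the current token, already lowercased.
				current.appendCodePoint(Character.toLowerCase(codePoint));
			} else if (current.length() > 0) {
				// Any non-letter ends the current token and is itself dropped.
				tokens.add(current.toString());
				current.setLength(0);
			}
		}

		// Flush the last token if the text ended on a letter.
		if (current.length() > 0) {
			tokens.add(current.toString());
		}
		return tokens;
	}

	public static void main(String[] args) {
		System.out.println(tokenize("Hello, How Are you\t\tI am Fine \n Thank you"));
		System.out.println(tokenize("Lucene4 rocks, version 8.9!"));
	}
}
```

The second call shows the effect on digits: "Lucene4" becomes the single token "lucene" and "8.9" produces no token at all, so this style of analysis is a poor fit for fields such as product codes or version numbers.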

 

How to get a SimpleAnalyzer instance?

Analyzer simpleAnalyzer = new SimpleAnalyzer();

 

Below is a complete working application.

App.java

package com.sample.app;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class App {

	public static List<String> getTokens(String text, String fieldName, Analyzer analyzer) throws IOException {

		List<String> result = new ArrayList<String>();

		// try-with-resources closes the stream, which the analyzer needs before it can be reused
		try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
			CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

			// reset() must be called before the first incrementToken()
			tokenStream.reset();

			while (tokenStream.incrementToken()) {
				result.add(charTermAttribute.toString());
			}

			// end() signals that the stream has been fully consumed
			tokenStream.end();
		}
		return result;
	}

	public static void main(String args[]) throws IOException {
		Analyzer simpleAnalyzer = new SimpleAnalyzer();

		// SimpleAnalyzer ignores the field name, so null is passed here
		List<String> tokens = getTokens("Hello, How Are you\t\tI am Fine \n Thank you", null, simpleAnalyzer);

		for (String token : tokens) {
			System.out.println(token);
		}

	}

}

Output

hello
how
are
you
i
am
fine
thank
you


Because every token is lowercased, using this analyzer for both indexing and querying makes searches case-insensitive.
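The idea behind case-insensitive matching can be sketched in plain Java: if the document text and the query text go through the same lowercase normalization, "HELLO" and "hello" end up as the same term. The class CaseInsensitiveMatchSketch and its analyze and matches methods are hypothetical stand-ins for illustration only; a real application would index and search through Lucene itself.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CaseInsensitiveMatchSketch {

	// Hypothetical stand-in for the analyzer: split at non-letters, then lowercase.
	public static Set<String> analyze(String text) {
		Set<String> terms = new HashSet<>();
		for (String part : text.split("[^\\p{L}]+")) {
			if (!part.isEmpty()) {
				terms.add(part.toLowerCase(Locale.ROOT));
			}
		}
		return terms;
	}

	// True when every analyzed query term is present among the analyzed document terms.
	public static boolean matches(String document, String query) {
		return analyze(document).containsAll(analyze(query));
	}

	public static void main(String[] args) {
		String document = "Hello, How Are you";
		System.out.println(matches(document, "HELLO"));   // true
		System.out.println(matches(document, "hello"));   // true
		System.out.println(matches(document, "goodbye")); // false
	}
}
```

The match only works because the same normalization is applied on both sides; analyzing the document but comparing against the raw query "HELLO" would find nothing.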

