Thursday 24 June 2021

Lucene: StandardAnalyzer

StandardAnalyzer performs the following operations.

a. Splits text into tokens using the Word Break rules from the Unicode Text Segmentation algorithm (http://unicode.org/reports/tr29/).

b. Normalizes token text to lower case.

c. Removes stop words, if a stop word set is supplied.
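Steps a and b can be approximated in plain Java to get a feel for what the analyzer produces. The sketch below is not Lucene code: it uses the JDK's java.text.BreakIterator as a stand-in for Lucene's tokenizer, on the assumption that its word boundaries are close enough for simple English text. The class name WordBreakSketch and the method tokenize are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only; Lucene uses its own tokenizer, not BreakIterator.
class WordBreakSketch {

	// Splits text at word boundaries and lower-cases each word.
	static List<String> tokenize(String text) {
		List<String> tokens = new ArrayList<>();
		BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
		words.setText(text);

		int start = words.first();
		for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
			String piece = text.substring(start, end);
			// Keep segments that contain a letter or digit; skip spaces and punctuation.
			if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
				tokens.add(piece.toLowerCase(Locale.ROOT));
			}
		}
		return tokens;
	}

	public static void main(String[] args) {
		// prints [java, is, a, programming, language]
		System.out.println(tokenize("Java is a Programming Language!"));
	}
}
```

Note that nothing is dropped here except whitespace and punctuation; words such as "is" and "a" remain in the result because this sketch does no stop word filtering.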

 

How to get a StandardAnalyzer instance?

The StandardAnalyzer class provides the following constructors.

public StandardAnalyzer() 
public StandardAnalyzer(CharArraySet stopWords) 
public StandardAnalyzer(Reader stopwords) throws IOException
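The Reader based constructor reads the stop words from a character stream, one word per line. Below is a minimal sketch, assuming Lucene core is on the classpath; the class name ReaderStopWordsDemo, the method analyze and the field name "content" are made up for illustration, and the StringReader stands in for what would usually be a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class ReaderStopWordsDemo {

	static List<String> analyze(String text) throws IOException {
		List<String> tokens = new ArrayList<>();

		// One stop word per line. Lower case is used because tokens are
		// lower-cased before the stop words are removed.
		try (Reader stopWords = new StringReader("is\na\nto\n");
				Analyzer analyzer = new StandardAnalyzer(stopWords);
				TokenStream stream = analyzer.tokenStream("content", text)) {

			CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
			stream.reset();
			while (stream.incrementToken()) {
				tokens.add(term.toString());
			}
			stream.end();
		}
		return tokens;
	}

	public static void main(String[] args) throws IOException {
		for (String token : analyze("Java is a programming Language")) {
			System.out.println(token);
		}
	}
}
```

With these three stop words, the sentence above should come out as java, programming and language.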

Example

CharArraySet stopWordsCharArraySet = new CharArraySet(10, true);
stopWordsCharArraySet.add("a");
stopWordsCharArraySet.add("an");
stopWordsCharArraySet.add("are");
stopWordsCharArraySet.add("is");
stopWordsCharArraySet.add("the");
stopWordsCharArraySet.add("to");
stopWordsCharArraySet.add("you");

Analyzer standardAnalyzer = new StandardAnalyzer(stopWordsCharArraySet);


Find the complete working application below.

 

App.java

package com.sample.app;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class App {

	public static List<String> getTokens(String text, String fieldName, Analyzer analyzer) throws IOException {

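		List<String> result = new ArrayList<>();

		// try-with-resources closes the stream so the analyzer can be reused
		try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
			CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

			// reset() must be called before the first incrementToken()
			tokenStream.reset();
			while (tokenStream.incrementToken()) {
				result.add(charTermAttribute.toString());
			}
			// end() finalizes the stream state after the last token
			tokenStream.end();
		}
		return result;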
	}

	public static void main(String[] args) throws IOException {
		CharArraySet stopWordsCharArraySet = new CharArraySet(10, true);
		stopWordsCharArraySet.add("a");
		stopWordsCharArraySet.add("an");
		stopWordsCharArraySet.add("are");
		stopWordsCharArraySet.add("is");
		stopWordsCharArraySet.add("the");
		stopWordsCharArraySet.add("to");
		stopWordsCharArraySet.add("you");

		Analyzer standardAnalyzer = new StandardAnalyzer(stopWordsCharArraySet);

		List<String> tokens = getTokens("Java is a programming Language to build Enterprise Applications", null,
				standardAnalyzer);

		for (String token : tokens) {
			System.out.println(token);
		}

	}

}


Output

java
programming
language
build
enterprise
applications


 

 
