Wednesday 23 June 2021

Lucene: StopAnalyzer

StopAnalyzer performs the following operations.

a.   Tokenizes the text at non-letters (a letter is any character for which 'Character.isLetter()' returns true).

b.   Normalizes token text to lower case.

c.   Removes stop words from the token stream.
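
The three steps above can be sketched in plain Java without Lucene. This is only an illustration of the behaviour, not Lucene's implementation: the class 'StopAnalyzerSketch' and its 'analyze' method are hypothetical names, and the sketch works on 'char' values, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopAnalyzerSketch {

	// Hypothetical helper that mimics the three steps; no Lucene classes are involved.
	public static List<String> analyze(String text, Set<String> stopWords) {
		List<String> tokens = new ArrayList<>();
		StringBuilder current = new StringBuilder();

		// Iterate one position past the end so the last token is flushed too.
		for (int i = 0; i <= text.length(); i++) {
			boolean isLetter = i < text.length() && Character.isLetter(text.charAt(i));
			if (isLetter) {
				// Step b: lower-case each letter while building the token.
				current.append(Character.toLowerCase(text.charAt(i)));
			} else if (current.length() > 0) {
				// Step a: a non-letter (or end of text) ends the current token.
				String token = current.toString();
				// Step c: keep the token only if it is not a stop word.
				if (!stopWords.contains(token)) {
					tokens.add(token);
				}
				current.setLength(0);
			}
		}
		return tokens;
	}

	public static void main(String[] args) {
		Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "are", "is", "the", "to", "you"));
		// Digits, hyphens, underscores and punctuation all act as token boundaries.
		System.out.println(analyze("Java8 is-a programming_Language to build apps!", stopWords));
	}
}
```

Because the split happens at every non-letter, input such as "Java8" yields the token "java" and the digit is dropped. This differs from a whitespace-based split, which would keep "Java8" intact.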

 

How to get a StopAnalyzer instance?

The StopAnalyzer class provides the following constructors. The first takes the stop words directly; the other two load them from a file or from a Reader.

public StopAnalyzer(CharArraySet stopWords)
public StopAnalyzer(Path stopwordsFile) throws IOException
public StopAnalyzer(Reader stopwords) throws IOException 

 

Example

// Initial capacity is 10; 'true' makes lookups ignore case
CharArraySet charArraySet = new CharArraySet(10, true);
charArraySet.add("a");
charArraySet.add("an");
charArraySet.add("are");
charArraySet.add("is");
charArraySet.add("the");
charArraySet.add("to");
charArraySet.add("you");

Analyzer stopAnalyzer = new StopAnalyzer(charArraySet);

 

The complete working application is shown below.

 

App.java

package com.sample.app;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class App {

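	public static List<String> getTokens(String text, String fieldName, Analyzer analyzer) throws IOException {

		List<String> result = new ArrayList<String>();

		// try-with-resources calls close(), which the TokenStream contract requires
		// before the analyzer can hand out another stream
		try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
			CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

			// reset() must be called before the first incrementToken()
			tokenStream.reset();

			while (tokenStream.incrementToken()) {
				result.add(charTermAttribute.toString());
			}

			// end() performs end-of-stream operations such as setting the final offset
			tokenStream.end();
		}
		return result;
	}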

	public static void main(String[] args) throws IOException {
		CharArraySet charArraySet = new CharArraySet(10, true);
		charArraySet.add("a");
		charArraySet.add("an");
		charArraySet.add("are");
		charArraySet.add("is");
		charArraySet.add("the");
		charArraySet.add("to");
		charArraySet.add("you");

		Analyzer stopAnalyzer = new StopAnalyzer(charArraySet);

		List<String> tokens = getTokens("Java is a programming Language to build Enterprise Applications", null,
				stopAnalyzer);

		for (String token : tokens) {
			System.out.println(token);
		}

	}

}

 

Output

java
programming
language
build
enterprise
applications
