StandardAnalyzer performs the following operations.
a. Splits text into tokens using the Word Break rules from the Unicode Text Segmentation algorithm (http://unicode.org/reports/tr29/).
b. Normalizes token text to lower case.
c. Removes stop words, if a stop word set is supplied.
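To make these operations concrete, here is a rough sketch in plain Java that imitates them without Lucene. It is not how StandardAnalyzer is implemented: java.text.BreakIterator stands in for Lucene's tokenizer, its word boundaries may differ from Lucene's on some inputs, and the class name AnalyzerSketch and method analyze are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: approximates the analysis steps.
public class AnalyzerSketch {

    // Splits text at word boundaries, lower-cases each word, and drops stop words.
    public static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Skip segments that contain no letter or digit (spaces, punctuation).
            if (candidate.codePoints().noneMatch(Character::isLetterOrDigit)) {
                continue;
            }
            String token = candidate.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "to"));
        // Prints [java, programming, language, build, enterprise, applications]
        System.out.println(analyze(
                "Java is a programming Language to build Enterprise Applications", stopWords));
    }
}
```

The sketch lower-cases before checking the stop word set, so a stop word matches whatever its case in the input.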
How to get a StandardAnalyzer instance?
The StandardAnalyzer class provides the following constructors.
public StandardAnalyzer()
public StandardAnalyzer(CharArraySet stopWords)
public StandardAnalyzer(Reader stopwords)
Example
CharArraySet stopWordsCharArraySet = new CharArraySet(10, true);
stopWordsCharArraySet.add("a");
stopWordsCharArraySet.add("an");
stopWordsCharArraySet.add("are");
stopWordsCharArraySet.add("is");
stopWordsCharArraySet.add("the");
stopWordsCharArraySet.add("to");
stopWordsCharArraySet.add("you");
Analyzer standardAnalyzer = new StandardAnalyzer(stopWordsCharArraySet);
The complete working application is shown below.
App.java
package com.sample.app;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
public class App {

    // Runs the analyzer over the text and collects the terms it emits.
    public static List<String> getTokens(String text, String fieldName, Analyzer analyzer) throws IOException {
        List<String> result = new ArrayList<String>();
        // try-with-resources closes the stream once all tokens are consumed.
        try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                result.add(charTermAttribute.toString());
            }
            tokenStream.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The second argument (true) makes stop word matching case-insensitive.
        CharArraySet stopWordsCharArraySet = new CharArraySet(10, true);
        stopWordsCharArraySet.add("a");
        stopWordsCharArraySet.add("an");
        stopWordsCharArraySet.add("are");
        stopWordsCharArraySet.add("is");
        stopWordsCharArraySet.add("the");
        stopWordsCharArraySet.add("to");
        stopWordsCharArraySet.add("you");

        try (Analyzer standardAnalyzer = new StandardAnalyzer(stopWordsCharArraySet)) {
            List<String> tokens = getTokens("Java is a programming Language to build Enterprise Applications", null,
                    standardAnalyzer);
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }
}
Output
java
programming
language
build
enterprise
applications