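WhitespaceAnalyzer tokenizes a given string by splitting it at whitespace characters. A character counts as whitespace when the 'isWhitespace' method of the Character class returns true for it. Nothing else is changed: punctuation stays attached to the token and case is preserved.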
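To make the whitespace rule concrete, here is a minimal JDK-only sketch that splits a string wherever Character.isWhitespace returns true. It is not Lucene's implementation, only an illustration of the rule; the class name WhitespaceSplitSketch and the helper splitOnWhitespace are hypothetical, and the sketch ignores details such as token length limits.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitSketch {

    // Hypothetical helper (not part of Lucene): collects runs of
    // non-whitespace code points, using Character.isWhitespace as the test.
    public static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isWhitespace(codePoint)) {
                // Whitespace ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        // Flush the last token when the text does not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tabs, newlines and repeated spaces all act as separators.
        System.out.println(splitOnWhitespace("Hello, How Are you\t\tI am Fine \n Thank you"));
        // A non-breaking space (U+00A0) is NOT whitespace for Character.isWhitespace,
        // so "non" and "breaking" stay in one token here.
        System.out.println(splitOnWhitespace("non\u00A0breaking space"));
    }
}
```

The first call yields nine tokens, with the comma still attached to "Hello,". The second call yields only two tokens, because Character.isWhitespace deliberately excludes non-breaking spaces.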
How to get an instance of WhitespaceAnalyzer?
Analyzer whitespaceAnalyzer = new WhitespaceAnalyzer();
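The following helper method returns the tokens for a given text. It calls end() and closes the TokenStream after consuming it, which the TokenStream contract requires before the same analyzer can be used again.
public static List<String> getTokens(String text, String fieldName, Analyzer analyzer) throws IOException {
    List<String> result = new ArrayList<>();
    // try-with-resources closes the stream so the analyzer can be reused
    try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // must be called before the first incrementToken()
        while (tokenStream.incrementToken()) {
            result.add(charTermAttribute.toString());
        }
        tokenStream.end(); // signals that all tokens have been consumed
    }
    return result;
}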
Below is the complete working application.
App.java
package com.sample.app;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class App {

    public static List<String> getTokens(String text, String fieldName, Analyzer analyzer) throws IOException {
        List<String> result = new ArrayList<>();
        // try-with-resources closes the stream so the analyzer can be reused
        try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset(); // must be called before the first incrementToken()
            while (tokenStream.incrementToken()) {
                result.add(charTermAttribute.toString());
            }
            tokenStream.end(); // signals that all tokens have been consumed
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Analyzer whitespaceAnalyzer = new WhitespaceAnalyzer();
        // The input mixes spaces, tabs and a newline as separators.
        List<String> tokens = getTokens("Hello, How Are you\t\tI am Fine \n Thank you", null, whitespaceAnalyzer);
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
Output
Hello,
How
Are
you
I
am
Fine
Thank
you