PositionIncrementAttribute gives the position of a token relative to the previous token in a TokenStream. Lucene uses this information in phrase searching.
For example, take the string 'Java is a Programming Language'. When you tokenize it using EnglishAnalyzer, you get the following 3 tokens.
java
program
languag
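As you can see, two words (is, a) are missing between the tokens 'java' and 'program', because EnglishAnalyzer removes them as stop words. You can detect this gap using PositionIncrementAttribute: the position increment of 'program' is 3 instead of 1.
Find the working application below.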
App.java
package com.sample.app;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class App {

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new EnglishAnalyzer()) {
            Reader reader = new StringReader("Java is a Programming Language");

            try (TokenStream tokenStream = analyzer.tokenStream("myField", reader)) {
                // Register the attributes to read for every token
                OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
                CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
                PositionIncrementAttribute positionIncrementAttribute = tokenStream
                        .addAttribute(PositionIncrementAttribute.class);

                // reset() must be called before the first incrementToken()
                tokenStream.reset();

                while (tokenStream.incrementToken()) {
                    String token = charTermAttribute.toString();
                    System.out.println("[" + token + "]");
                    System.out.println("Token starting offset: " + offsetAtt.startOffset());
                    System.out.println("Token ending offset: " + offsetAtt.endOffset());
                    System.out.println("Position Increment: " + positionIncrementAttribute.getPositionIncrement());
                    System.out.println("");
                }

                // Signal end of stream once all tokens are consumed
                tokenStream.end();
            }
        }
    }
}
Output
[java]
Token starting offset: 0
Token ending offset: 4
Position Increment: 1

[program]
Token starting offset: 10
Token ending offset: 21
Position Increment: 3

[languag]
Token starting offset: 22
Token ending offset: 30
Position Increment: 1
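To see how these increments relate to absolute token positions, here is a minimal sketch in plain Java with no Lucene dependency. It accumulates the increments 1, 3, 1 from the output into positions. The class name PositionDemo and the helper absolutePositions are hypothetical names used only for illustration, and the starting value of -1 is an assumption that follows the usual Lucene convention, where the first token with an increment of 1 lands at position 0.

```java
import java.util.Arrays;

public class PositionDemo {

    // Hypothetical helper (not part of Lucene): turns position
    // increments into absolute positions by keeping a running sum.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        // Assumption: start at -1 so that a first increment of 1 gives position 0
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Tokens and increments copied from the output above
        String[] tokens = { "java", "program", "languag" };
        int[] increments = { 1, 3, 1 };

        int[] positions = absolutePositions(increments);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> position " + positions[i]);
        }
        System.out.println(Arrays.toString(positions));
    }
}
```

Under these assumptions, 'java' is at position 0, 'program' at position 2 and 'languag' at position 3. Positions 1 and 2 are not both filled, which is how a phrase query can tell that 'java' and 'program' were not adjacent in the original text.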