Monday 28 June 2021

Lucene: PositionIncrementAttribute: Get the gap between two tokens

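PositionIncrementAttribute reports the position of the current token relative to the previous token in a TokenStream. The default increment is 1, meaning the token immediately follows the previous one. Lucene uses these positions in phrase searching.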

 

For example, take the string 'Java is a Programming Language'. When you tokenize this string using EnglishAnalyzer, you get the following 3 tokens.

java

program

languag

 

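As you can see, EnglishAnalyzer stems the words and drops the two stop words (is, a) that sit between the tokens 'java' and 'program'. You can get this gap using PositionIncrementAttribute: the increment reported for 'program' is 3, that is, 1 for the token itself plus 2 for the removed words.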
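
The arithmetic behind the gap can be sketched without Lucene. The following is a minimal standalone illustration: the class PositionGaps and its methods are hypothetical names, not part of the Lucene API, and the increments are hard-coded from this example. It assumes the usual convention that counting starts at -1, so a first token with increment 1 lands at position 0.

```java
class PositionGaps {

    // Turns per-token position increments into absolute positions.
    // Assumption: counting starts at -1, so increment 1 puts the first token at 0.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    // Number of positions skipped before a token.
    // An increment of 0 (token stacked on the previous one) yields no gap.
    static int gapBefore(int increment) {
        return Math.max(0, increment - 1);
    }

    public static void main(String[] args) {
        // Tokens and increments as produced for 'Java is a Programming Language'
        String[] tokens = { "java", "program", "languag" };
        int[] increments = { 1, 3, 1 };

        int[] positions = absolutePositions(increments);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " position=" + positions[i]
                    + " gap=" + gapBefore(increments[i]));
        }
    }
}
```

For 'program' this reports position 3 and a gap of 2, which corresponds to the two removed words.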

 

Find the working application below.

 

App.java

package com.sample.app;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class App {

	public static void main(String[] args) throws IOException {

		try (Analyzer analyzer = new EnglishAnalyzer()) {
			Reader reader = new StringReader("Java is a Programming Language");

			try (TokenStream tokenStream = analyzer.tokenStream("myField", reader)) {
				OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
				CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
				PositionIncrementAttribute positionIncrementAttribute = tokenStream
						.addAttribute(PositionIncrementAttribute.class);

				// reset() must be called before the first incrementToken()
				tokenStream.reset();

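				while (tokenStream.incrementToken()) {
					String token = charTermAttribute.toString();
					System.out.println("[" + token + "]");

					// Character offsets of the token in the original text
					System.out.println("Token starting offset: " + offsetAtt.startOffset());
					System.out.println("Token ending offset: " + offsetAtt.endOffset());

					// 1 means the token directly follows the previous one;
					// a larger value means positions were skipped (for example, removed stop words)
					System.out.println("Position Increment: " + positionIncrementAttribute.getPositionIncrement());

					System.out.println("");
				}

				// The TokenStream contract expects end() after the last token and before close()
				tokenStream.end();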
			}

		}

	}

}

 

Output

[java]
Token starting offset: 0
Token ending offset: 4
Position Increment: 1

[program]
Token starting offset: 10
Token ending offset: 21
Position Increment: 3

[languag]
Token starting offset: 22
Token ending offset: 30
Position Increment: 1
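
To see why these positions matter for phrase searching, here is a toy model written in plain Java. It is only a sketch of the idea, not Lucene's actual matching code: the class PhrasePositions, the matches method, and the hand-built position map are all hypothetical, and the map simply encodes the positions implied by the increments above (java at 0, program at 3, languag at 4).

```java
import java.util.List;
import java.util.Map;

class PhrasePositions {

    // Returns true if every term occurs at the requested position relative to the first term.
    // Assumption: terms is non-empty and relativePositions has the same length.
    static boolean matches(Map<String, List<Integer>> index, String[] terms, int[] relativePositions) {
        List<Integer> starts = index.getOrDefault(terms[0], List.of());
        for (int start : starts) {
            boolean allFound = true;
            for (int i = 1; i < terms.length && allFound; i++) {
                int expected = start + relativePositions[i] - relativePositions[0];
                allFound = index.getOrDefault(terms[i], List.of()).contains(expected);
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Term -> positions, derived from the increments 1, 3, 1
        Map<String, List<Integer>> index = Map.of(
                "java", List.of(0),
                "program", List.of(3),
                "languag", List.of(4));

        String[] phrase = { "java", "program" };

        // Adjacent terms: 'program' would have to sit at position 1
        System.out.println(matches(index, phrase, new int[] { 0, 1 })); // false

        // Same gap as in the indexed text: 'program' at position 3
        System.out.println(matches(index, phrase, new int[] { 0, 3 })); // true
    }
}
```

In this model, an exact phrase of 'java' directly followed by 'program' does not match, because the two removed stop words left a gap between them. A phrase that carries the same gap does match.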

 

 
