Monday 28 June 2021

Lucene: Token Attributes

Token attributes give access to the values associated with a token. For example, 'CharTermAttribute' exposes the term text of a token. Lucene provides a number of attribute classes for different purposes.

 

The following table summarizes the token attribute interfaces.

 

| Interface | Description |
|---|---|
| Attribute | Base interface for attributes. |
| BoostAttribute | Controls the boost factor for each matching term in a MultiTermQuery. |
| CharTermAttribute | Holds the term text of a token. |
| FlagsAttribute | Passes flags down the tokenizer chain, for example to hand information from one TokenFilter to the next. |
| KeywordAttribute | Marks a token as a keyword; keyword-aware filters such as stemmers leave marked tokens unmodified. |
| MaxNonCompetitiveBoostAttribute | Add this attribute to a fresh AttributeSource before calling MultiTermQuery.getTermsEnum(Terms, AttributeSource). FuzzyQuery uses it to return only competitive terms. |
| OffsetAttribute | Holds the start and end character offsets of a token. |
| PayloadAttribute | Stores the payload at each index position; generally useful in scoring with payload-based queries. |
| PositionIncrementAttribute | Determines the position of a token relative to the previous token in a TokenStream; used in phrase searching. |
| PositionLengthAttribute | Determines how many positions a token spans. |
| TermFrequencyAttribute | Sets a custom term frequency for a term within one document. |
| TermToBytesRefAttribute | Requested by TermsHashPerField to index the contents; can be used to customize the final byte[] encoding of terms. |
| TypeAttribute | A token's lexical type. The default value is "word". |
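
To make the PositionIncrementAttribute entry more concrete, here is a minimal plain-Java sketch (no Lucene dependency) of how position increments add up to absolute token positions. The class name PositionSketch and the sample increments are assumptions for illustration only: they are the values one would expect when two stop words are dropped ahead of the second token and a gap is left for them.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionSketch {

	// Converts per-token position increments into absolute positions.
	// Counting starts at -1, so a first token with increment 1 lands on
	// position 0.
	static List<Integer> absolutePositions(int[] increments) {
		List<Integer> positions = new ArrayList<>();
		int position = -1;
		for (int increment : increments) {
			position += increment;
			positions.add(position);
		}
		return positions;
	}

	public static void main(String[] args) {
		// Assumed increments: the second token follows two removed stop words.
		int[] increments = { 1, 3, 1 };
		System.out.println(absolutePositions(increments)); // prints [0, 3, 4]
	}
}
```

An exact phrase match requires consecutive positions, so a gap like the one between 0 and 3 keeps the first two tokens from being treated as adjacent.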

 

To read attribute values, first register the attributes you need with the token stream, as shown below.

OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

 

Each call to incrementToken() advances the stream to the next token, and Lucene populates these attributes with that token's values. Call reset() on the stream before the first incrementToken() call.

while (tokenStream.incrementToken()) {
	String token = charTermAttribute.toString();
	System.out.println("[" + token + "]");

	System.out.println("Token starting offset: " + offsetAtt.startOffset());
	System.out.println("Token ending offset: " + offsetAtt.endOffset());

	System.out.println("");
}
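
The loop above copies the term with toString() on every iteration because the attribute object stays the same and only its contents change from token to token. The following is a toy model of that reuse pattern in plain Java, not Lucene code: TermHolder, ToyStream, and collect are hypothetical names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class AttributeReuseSketch {

	// Hypothetical stand-in for an attribute: one mutable holder per stream.
	static final class TermHolder {
		private final StringBuilder buffer = new StringBuilder();

		void set(String term) {
			buffer.setLength(0);
			buffer.append(term);
		}

		@Override
		public String toString() {
			return buffer.toString();
		}
	}

	// Hypothetical stand-in for a token stream over pre-split words.
	static final class ToyStream {
		private final String[] words;
		private final TermHolder term = new TermHolder();
		private int next = 0;

		ToyStream(String... words) {
			this.words = words;
		}

		// Always returns the same holder instance.
		TermHolder termHolder() {
			return term;
		}

		// Overwrites the holder with the next word; returns false when exhausted.
		boolean incrementToken() {
			if (next >= words.length) {
				return false;
			}
			term.set(words[next++]);
			return true;
		}
	}

	// Collects every term by copying the holder's current value to a String.
	static List<String> collect(ToyStream stream) {
		TermHolder holder = stream.termHolder();
		List<String> terms = new ArrayList<>();
		while (stream.incrementToken()) {
			terms.add(holder.toString()); // copy now; the holder is overwritten next turn
		}
		return terms;
	}

	public static void main(String[] args) {
		System.out.println(collect(new ToyStream("java", "program", "languag")));
	}
}
```

Storing the holder itself in the list instead of its string copy would leave every entry pointing at the same object, which is why values should be copied out before the next incrementToken() call.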

 

The complete working application is shown below.

 

App.java

package com.sample.app;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class App {

	public static void main(String[] args) throws IOException {

		try (Analyzer analyzer = new EnglishAnalyzer()) {
			Reader reader = new StringReader("Java is a Programming Language");

			try (TokenStream tokenStream = analyzer.tokenStream("myField", reader)) {
				// Register the attributes before consuming the stream.
				OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
				CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

				// reset() must be called before the first incrementToken().
				tokenStream.reset();

				while (tokenStream.incrementToken()) {
					String token = charTermAttribute.toString();
					System.out.println("[" + token + "]");

					System.out.println("Token starting offset: " + offsetAtt.startOffset());
					System.out.println("Token ending offset: " + offsetAtt.endOffset());

					System.out.println("");
				}

				// Signal that the last token has been consumed.
				tokenStream.end();
			}

		}

	}

}

 

Output

[java]
Token starting offset: 0
Token ending offset: 4

[program]
Token starting offset: 10
Token ending offset: 21

[languag]
Token starting offset: 22
Token ending offset: 30
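
The offsets in this output point into the original input string, with the end offset being exclusive. As a small plain-Java sketch (no Lucene needed; OffsetSketch and surfaceForm are names made up for this illustration), the offset pairs above can be used to recover the original, unstemmed words:

```java
public class OffsetSketch {

	// Returns the slice of the original text covered by [startOffset, endOffset).
	static String surfaceForm(String text, int startOffset, int endOffset) {
		return text.substring(startOffset, endOffset);
	}

	public static void main(String[] args) {
		String text = "Java is a Programming Language";
		// Offset pairs copied from the output above.
		int[][] offsets = { { 0, 4 }, { 10, 21 }, { 22, 30 } };
		for (int[] pair : offsets) {
			// Prints Java, Programming, Language on separate lines.
			System.out.println(surfaceForm(text, pair[0], pair[1]));
		}
	}
}
```

This is the kind of lookup a highlighter relies on: the indexed term may be stemmed ("program"), while the offsets still identify the exact characters the user typed ("Programming").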

 

 

 
