Programming for beginners: Lucene: Token Attributes

Token attributes give access to the values associated with a token. For example, 'CharTermAttribute' exposes the term text of a token. Lucene provides a number of other attribute classes for different purposes.

The following table summarizes the token attribute interfaces.

Interface	Description
Attribute	Base interface for attributes.
BoostAttribute	Used to control the boost factor for each matching term in MultiTermQuery.
CharTermAttribute	Used to get the term text of a token.
FlagsAttribute	Can be used to pass flags down the Tokenizer chain, for example to pass information from one TokenFilter to another.
KeywordAttribute	Used to mark a token as a keyword.
MaxNonCompetitiveBoostAttribute	Add this attribute to a fresh AttributeSource before calling MultiTermQuery.getTermsEnum(Terms,AttributeSource). FuzzyQuery uses it to control its internal behaviour so that only competitive terms are returned.
OffsetAttribute	Used to get the start and end character offset of a Token.
PayloadAttribute	This stores the payload at each index position and is generally useful in scoring when used with Payload-based queries.
PositionIncrementAttribute	Determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.
PositionLengthAttribute	Determines how many positions this token spans.
TermFrequencyAttribute	Sets the custom term frequency of a term within one document.
TermToBytesRefAttribute	This attribute is requested by TermsHashPerField to index the contents. This attribute can be used to customize the final byte[] encoding of terms.
TypeAttribute	A token's lexical type. The default value is "word".

To read attribute values, first add the attributes to the token stream, as shown below.

OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

Call reset() once before the first incrementToken(). Each call to incrementToken() then advances to the next token, and Lucene automatically populates these attributes with that token's values.

while (tokenStream.incrementToken()) {
	String token = charTermAttribute.toString();
	System.out.println("[" + token + "]");

	System.out.println("Token starting offset: " + offsetAtt.startOffset());
	System.out.println("Token ending offset: " + offsetAtt.endOffset());

	System.out.println("");
}
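
To make the reuse pattern concrete, here is a toy model in plain Java with no Lucene dependency. The names ToyTokenStreamDemo, ToyTokenStream and TermView are hypothetical, and this is not how Lucene implements attributes internally. It is only a sketch of the idea that you obtain one attribute reference up front and its contents change on every incrementToken() call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy classes for illustration only; none of these are part of Lucene.
class ToyTokenStreamDemo {

    /** Mutable holder standing in for attributes such as CharTermAttribute and OffsetAttribute. */
    static final class TermView {
        private String term = "";
        private int startOffset;
        private int endOffset;

        @Override
        public String toString() { return term; }
        int startOffset() { return startOffset; }
        int endOffset() { return endOffset; }
    }

    /** Splits text on spaces and exposes each token through one shared TermView. */
    static final class ToyTokenStream {
        private final String text;
        private final TermView view = new TermView();
        private int pos = 0;

        ToyTokenStream(String text) { this.text = text; }

        /** Always returns the same instance, much like asking for an attribute once. */
        TermView termView() { return view; }

        /** Advances to the next token and overwrites the shared view in place. */
        boolean incrementToken() {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;
            if (pos >= text.length()) return false;
            int start = pos;
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;
            view.term = text.substring(start, pos);
            view.startOffset = start;
            view.endOffset = pos;
            return true;
        }
    }

    /** Collects "term:start-end" for every token, reading through the single view. */
    static List<String> describe(String text) {
        ToyTokenStream stream = new ToyTokenStream(text);
        TermView view = stream.termView(); // fetched once, before the loop
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            // String concatenation copies the current values before the next overwrite.
            out.add(view + ":" + view.startOffset() + "-" + view.endOffset());
        }
        return out;
    }

    public static void main(String[] args) {
        describe("Java is a Programming Language").forEach(System.out::println);
    }
}
```

Because the holder is overwritten on every step, any value you want to keep has to be copied out inside the loop, which is why the Lucene snippet above calls charTermAttribute.toString() for each token.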

The complete working Lucene application is shown below.

App.java

package com.sample.app;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class App {

	public static void main(String[] args) throws IOException {

		try (Analyzer analyzer = new EnglishAnalyzer()) {
			Reader reader = new StringReader("Java is a Programming Language");

			try (TokenStream tokenStream = analyzer.tokenStream("myField", reader)) {
				OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
				CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

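				// reset() must be called once before the first incrementToken().
				tokenStream.reset();

				while (tokenStream.incrementToken()) {
					// Copy the term text; the attribute instance is reused for every token.
					String token = charTermAttribute.toString();
					System.out.println("[" + token + "]");

					System.out.println("Token starting offset: " + offsetAtt.startOffset());
					System.out.println("Token ending offset: " + offsetAtt.endOffset());

					System.out.println("");
				}

				// Signal end of stream so final state (such as the final offset) is set.
				tokenStream.end();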
			}

		}

	}

}

Output

[java]
Token starting offset: 0
Token ending offset: 4

[program]
Token starting offset: 10
Token ending offset: 21

[languag]
Token starting offset: 22
Token ending offset: 30
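
The same loop can read other attributes from the table. The following is a minimal sketch, assuming the same Lucene dependencies as App.java are on the classpath; the class name PositionDemo and the method analyze are hypothetical. It reads PositionIncrementAttribute and TypeAttribute alongside the term text.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical demo class; only the Lucene types are real library API.
class PositionDemo {

    /** Returns one "term posInc=N type=T" row per token produced by EnglishAnalyzer. */
    static List<String> analyze(String text) throws IOException {
        List<String> rows = new ArrayList<>();
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("myField", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                rows.add(term + " posInc=" + posInc.getPositionIncrement()
                        + " type=" + type.type());
            }
            ts.end();
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        analyze("Java is a Programming Language").forEach(System.out::println);
    }
}
```

Since the output above shows that "is" and "a" are dropped as stop words, the token after them ("program") should report a position increment greater than 1, which is how phrase queries can account for the gap.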


Monday, 28 June 2021
