Token attributes are used to access the values associated with a token. For example, CharTermAttribute is used to access the term text of a token. Lucene provides a number of attribute classes for different purposes.
The following table summarizes the token attribute classes.
Interface | Description
----------|------------
Attribute | Base interface for attributes.
BoostAttribute | Used to control the boost factor for each matching term in a MultiTermQuery.
CharTermAttribute | Used to get the term text of a token.
FlagsAttribute | Used to pass different flags down the tokenizer chain, e.g. to pass information from one TokenFilter to another.
KeywordAttribute | Used to mark a token as a keyword.
MaxNonCompetitiveBoostAttribute | Add this attribute to a fresh AttributeSource before calling MultiTermQuery.getTermsEnum(Terms, AttributeSource). FuzzyQuery uses this to control its internal behaviour so that it returns only competitive terms.
OffsetAttribute | Used to get the start and end character offsets of a token.
PayloadAttribute | Stores the payload at each index position; generally useful in scoring when used with payload-based queries.
PositionIncrementAttribute | Determines the position of this token relative to the previous token in a TokenStream; used in phrase searching.
PositionLengthAttribute | Determines how many positions this token spans.
TermFrequencyAttribute | Sets a custom term frequency for a term within one document.
TermToBytesRefAttribute | Requested by TermsHashPerField to index the contents; can be used to customize the final byte[] encoding of terms.
TypeAttribute | A token's lexical type. The default value is "word".
To read the values of attributes, first add the attributes to the token stream, like below.

OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

Each call to incrementToken() advances the stream to the next token, and Lucene automatically populates these attributes with that token's values.
while (tokenStream.incrementToken()) {
   String token = charTermAttribute.toString();
   System.out.println("[" + token + "]");
   System.out.println("Token starting offset: " + offsetAtt.startOffset());
   System.out.println("Token ending offset: " + offsetAtt.endOffset());
   System.out.println();
}
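Internally, addAttribute(Class) returns one shared instance per attribute class, so the tokenizer that produces tokens and the code that consumes them all read and write the same object. The following is a hypothetical plain-Java sketch of that idea (not Lucene's actual implementation; the class and field names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mini-sketch of the idea behind Lucene's AttributeSource:
// addAttribute(Class) creates an instance on the first request and returns
// the same shared instance on every later request for that class.
public class AttributeRegistrySketch {

   // Simplified stand-in for Lucene's Attribute marker interface.
   interface Attribute {}

   // Simplified stand-in for CharTermAttribute: holds the term text.
   static class TermAttr implements Attribute {
      String term = "";
   }

   // Simplified stand-in for OffsetAttribute: holds character offsets.
   static class OffsetAttr implements Attribute {
      int start, end;
   }

   private final Map<Class<? extends Attribute>, Attribute> attributes = new HashMap<>();

   // Like tokenStream.addAttribute(...): create on first request, reuse afterwards.
   @SuppressWarnings("unchecked")
   <T extends Attribute> T addAttribute(Class<T> cls) {
      return (T) attributes.computeIfAbsent(cls, c -> {
         try {
            return c.getDeclaredConstructor().newInstance();
         } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(e);
         }
      });
   }

   public static void main(String[] args) {
      AttributeRegistrySketch source = new AttributeRegistrySketch();
      TermAttr term1 = source.addAttribute(TermAttr.class);
      TermAttr term2 = source.addAttribute(TermAttr.class);
      // Both references point at the same shared instance.
      System.out.println(term1 == term2); // true
      term1.term = "java";
      System.out.println(term2.term);     // java
   }
}
```

Because the instance is shared, a TokenFilter earlier in the chain can update an attribute and every later consumer sees the new value without any copying.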
The complete working application is shown below.
App.java
package com.sample.app;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class App {

   public static void main(String args[]) throws IOException {
      try (Analyzer analyzer = new EnglishAnalyzer()) {
         Reader reader = new StringReader("Java is a Programming Language");

         try (TokenStream tokenStream = analyzer.tokenStream("myField", reader)) {
            OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

            tokenStream.reset();
            while (tokenStream.incrementToken()) {
               String token = charTermAttribute.toString();
               System.out.println("[" + token + "]");
               System.out.println("Token starting offset: " + offsetAtt.startOffset());
               System.out.println("Token ending offset: " + offsetAtt.endOffset());
               System.out.println();
            }
            // Signal that the stream is fully consumed.
            tokenStream.end();
         }
      }
   }
}
Output
[java]
Token starting offset: 0
Token ending offset: 4

[program]
Token starting offset: 10
Token ending offset: 21

[languag]
Token starting offset: 22
Token ending offset: 30
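Notice that the offsets in the output refer to character positions in the original input string, not to the stemmed terms: "Programming" still occupies characters 10 through 21 even though it is indexed as "program", and the stopwords "is" and "a" are dropped entirely. A quick plain-Java check (no Lucene needed) makes this concrete:

```java
public class OffsetCheck {
   public static void main(String[] args) {
      String input = "Java is a Programming Language";

      // startOffset is the index of the token's first character;
      // endOffset is one past its last character (start + length).
      int start = input.indexOf("Programming");
      int end = start + "Programming".length();
      System.out.println(start + " " + end); // 10 21

      // The stemmed term "program" is shorter, but the offsets still
      // point at the original surface form in the input text.
      System.out.println(input.substring(start, end)); // Programming
   }
}
```

This is why offsets are what highlighters use to mark up the original text, while the (possibly stemmed) term text is what goes into the index.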