OpenNLP
provides number of tokenizers to tokenize given sentence. Tokens are usually
words, punctuation, numbers, etc.
Open NLP
provides following tokenizer implementations.
Tokenizer
|
Description
|
Whitespace
|
Non
whitespace sequences are identified as tokens
|
Simple
|
Sequences
of the same character class are tokens
|
Learnable
|
Detects
token boundaries based on probability model
|
Tokenization
is a two step process.
1.
Sentence
boundaries are identified in first stage
2.
Token
are found in each sentence
Tokenization using CLI (Command line interface)
OpenNLP provides command line tools for simple, learnable tokenizers (SimpleTokenizer,
TokenizerME).
To get the
usage of a tokenizaer use the command ‘opennlp tokenizer’.
For example,
following command give help to TokenizerME.
$ opennlp
TokenizerME
Usage:
opennlp TokenizerME model < sentences
Note:
To get help
for simple tokenizer use the command ‘$ opennlp SimpleTokenizer’.
Tokenizing file
I am going
to use learnable tokenizer for following example, if you want you can try with
simple tokenizer. Following command is used to tokenize the data in a file.
opennlp TokenizerME en-token.bin < input.txt
> output.txt
Download
‘en-token.bin’ from following location.
Lets say
‘input.txt’ contains following data.
We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents. So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately. The Java Programming Language solves all the above problems. The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.
$ opennlp TokenizerME ./en-token.bin <input.txt >output.txt Loading Tokenizer model ... done (0.113s) Average: 363.6 sent/s Total: 4 sent Runtime: 0.011s
Tokenization using Java API
Following is the complete java application to tokenize data.
Following is the complete java application to tokenize data.
import static java.nio.file.Files.readAllBytes; import static java.nio.file.Paths.get; import java.io.IOException; import java.util.Objects; public class FileUtils { /** * Get file data as string * * @param fileName * @return */ public static String getFileDataAsString(String fileName) { Objects.nonNull(fileName); try { String data = new String(readAllBytes(get(fileName))); return data; } catch (IOException e) { System.out.println(e.getMessage()); return null; } } }
import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.IOException; import java.io.InputStream; import java.util.Objects; import opennlp.tools.tokenize.Tokenizer; import opennlp.tools.tokenize.TokenizerME; import opennlp.tools.tokenize.TokenizerModel; import opennlp.tools.tokenize.WhitespaceTokenizer; public class TokenizerUtil { TokenizerModel model = null; Tokenizer learnableTokenizer = null; public TokenizerUtil(String modelFile) { initTokenizerModel(modelFile); learnableTokenizer = new TokenizerME(model); } private void initTokenizerModel(String modelFile) { Objects.nonNull(modelFile); InputStream modelIn = null; try { modelIn = new FileInputStream(modelFile); } catch (FileNotFoundException e) { System.out.println(e.getMessage()); return; } try { model = new TokenizerModel(modelIn); } catch (IOException e) { e.printStackTrace(); } finally { if (modelIn != null) { try { modelIn.close(); } catch (IOException e) { } } } } public Tokenizer getLearnableTokenizer() { return learnableTokenizer; } public Tokenizer getWhitespaceTokenizer() { return WhitespaceTokenizer.INSTANCE; } public String[] tokenizeFileUsingLearnableTokenizer(String file) { String data = FileUtils.getFileDataAsString(file); return learnableTokenizer.tokenize(data); } public String[] tokenizeUsingLearnableTokenizer(String data) { return learnableTokenizer.tokenize(data); } public String[] tokenizeFileUsingWhiteSpaceTokenizer(String file) { String data = FileUtils.getFileDataAsString(file); return getWhitespaceTokenizer().tokenize(data); } public String[] tokenizeUsingWhiteSpaceTokenizer(String data) { return getWhitespaceTokenizer().tokenize(data); } }
import java.io.IOException; public class Main { public static void main(String args[]) throws IOException { String modelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-token.bin"; String data = "We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents. So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately. The Java Programming Language solves all the above problems. The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment."; TokenizerUtil util = new TokenizerUtil(modelFile); String[] tokens = util.getLearnableTokenizer().tokenize(data); for (String s : tokens) { System.out.println(s); } } }
Output
We are living in an Environment , where multiple Hardware Architectures and Multiple platforms presents . So it is very difficult to write , compile and link the same Application , for each platform and each Architecture separately . The Java Programming Language solves all the above problems . The Java programming language platform provides a portable , interpreted , high-performance , simple , object-oriented programming language and supporting run-time environment .
No comments:
Post a Comment