Thursday 1 October 2015

OpenNLP: Tokenization

OpenNLP provides number of tokenizers to tokenize given sentence. Tokens are usually words, punctuation, numbers, etc.

Open NLP provides following tokenizer implementations.
Tokenizer
Description
Whitespace
Non whitespace sequences are identified as tokens
Simple
Sequences of the same character class are tokens
Learnable
Detects token boundaries based on probability model

Tokenization is a two step process.
1.   Sentence boundaries are identified in first stage
2.   Token are found in each sentence

Tokenization using CLI (Command line interface)
OpenNLP provides command line tools for simple, learnable tokenizers (SimpleTokenizer, TokenizerME).

To get the usage of a tokenizaer use the command ‘opennlp tokenizer’.

For example, following command give help to TokenizerME.

$ opennlp TokenizerME
Usage: opennlp TokenizerME model < sentences

Note:
To get help for simple tokenizer use the command ‘$ opennlp SimpleTokenizer’.

Tokenizing file
I am going to use learnable tokenizer for following example, if you want you can try with simple tokenizer. Following command is used to tokenize the data in a file.

opennlp TokenizerME en-token.bin < input.txt > output.txt

Download ‘en-token.bin’ from following location.


Lets say ‘input.txt’ contains following data.

We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents. So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately.

The Java Programming Language solves all the above problems. The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.


 $ opennlp TokenizerME  ./en-token.bin <input.txt >output.txt
Loading Tokenizer model ... done (0.113s)


Average: 363.6 sent/s 
Total: 4 sent
Runtime: 0.011s

Tokenization using Java API
Following is the complete java application to tokenize data.

import static java.nio.file.Files.readAllBytes;
import static java.nio.file.Paths.get;

import java.io.IOException;
import java.util.Objects;

public class FileUtils {
 /**
  * Get file data as string
  * 
  * @param fileName
  * @return
  */
 public static String getFileDataAsString(String fileName) {
  Objects.nonNull(fileName);
  try {
   String data = new String(readAllBytes(get(fileName)));
   return data;
  } catch (IOException e) {
   System.out.println(e.getMessage());
   return null;
  }
 }
}


import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TokenizerUtil {
 TokenizerModel model = null;
 Tokenizer learnableTokenizer = null;

 public TokenizerUtil(String modelFile) {
  initTokenizerModel(modelFile);
  learnableTokenizer = new TokenizerME(model);
 }

 private void initTokenizerModel(String modelFile) {
  Objects.nonNull(modelFile);

  InputStream modelIn = null;
  try {
   modelIn = new FileInputStream(modelFile);
  } catch (FileNotFoundException e) {
   System.out.println(e.getMessage());
   return;
  }

  try {
   model = new TokenizerModel(modelIn);
  } catch (IOException e) {
   e.printStackTrace();
  } finally {
   if (modelIn != null) {
    try {
     modelIn.close();
    } catch (IOException e) {
    }
   }
  }
 }

 public Tokenizer getLearnableTokenizer() {
  return learnableTokenizer;
 }

 public Tokenizer getWhitespaceTokenizer() {
  return WhitespaceTokenizer.INSTANCE;
 }

 public String[] tokenizeFileUsingLearnableTokenizer(String file) {
  String data = FileUtils.getFileDataAsString(file);
  return learnableTokenizer.tokenize(data);
 }

 public String[] tokenizeUsingLearnableTokenizer(String data) {
  return learnableTokenizer.tokenize(data);
 }

 public String[] tokenizeFileUsingWhiteSpaceTokenizer(String file) {
  String data = FileUtils.getFileDataAsString(file);
  return getWhitespaceTokenizer().tokenize(data);
 }

 public String[] tokenizeUsingWhiteSpaceTokenizer(String data) {
  return getWhitespaceTokenizer().tokenize(data);
 }
}


import java.io.IOException;

public class Main {
 public static void main(String args[]) throws IOException {
  String modelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-token.bin";
  String data = "We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents. So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately. The Java Programming Language solves all the above problems. The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.";

  TokenizerUtil util = new TokenizerUtil(modelFile);

  String[] tokens = util.getLearnableTokenizer().tokenize(data);

  for (String s : tokens) {
   System.out.println(s);
  }
 }
}


Output

We
are
living
in
an
Environment
,
where
multiple
Hardware
Architectures
and
Multiple
platforms
presents
.
So
it
is
very
difficult
to
write
,
compile
and
link
the
same
Application
,
for
each
platform
and
each
Architecture
separately
.
The
Java
Programming
Language
solves
all
the
above
problems
.
The
Java
programming
language
platform
provides
a
portable
,
interpreted
,
high-performance
,
simple
,
object-oriented
programming
language
and
supporting
run-time
environment
.



Prevoius                                                 Next                                                 Home

No comments:

Post a Comment