Thursday, 1 October 2015

OpenNLP: Detokenizing

Detokenization is the reverse of tokenization: it reconstructs the original, non-tokenized string from a sequence of tokens.

The following rules are applied when forming a sentence from tokens.

Rule                  Description
MOVE_BOTH             Attaches the token to the tokens on both its left and right sides.
MOVE_LEFT             Attaches the token to the token on its left side.
MOVE_RIGHT            Attaches the token to the token on its right side.
RIGHT_LEFT_MATCHING   Attaches the token to the token on its right on the first occurrence, and to the token on its left on the second occurrence.
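
To make the four rules concrete, here is a minimal sketch in plain Java. It is not OpenNLP's implementation: the `Rule` enum, the `detokenize` method and the class name are hypothetical and exist only for this illustration. The sketch assumes that a token with no rule is separated from its neighbours by a single space.

```java
import java.util.HashMap;
import java.util.Map;

public class DetokenizeRulesSketch {

    // Hypothetical enum that mirrors the four rule names listed above.
    public enum Rule { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH, RIGHT_LEFT_MATCHING }

    // Joins tokens into a string; tokens without a rule are space-separated.
    public static String detokenize(String[] tokens, Map<String, Rule> rules) {
        StringBuilder text = new StringBuilder();
        // Remembers which RIGHT_LEFT_MATCHING tokens are currently "open".
        Map<String, Boolean> open = new HashMap<>();
        // True when the next token must be glued to the current text.
        boolean glueNext = true; // no space before the first token

        for (String token : tokens) {
            Rule rule = rules.get(token);
            boolean attachLeft = false;
            boolean attachRight = false;

            if (rule == Rule.MOVE_LEFT) {
                attachLeft = true;
            } else if (rule == Rule.MOVE_RIGHT) {
                attachRight = true;
            } else if (rule == Rule.MOVE_BOTH) {
                attachLeft = true;
                attachRight = true;
            } else if (rule == Rule.RIGHT_LEFT_MATCHING) {
                // First occurrence attaches right, second attaches left.
                boolean isOpen = open.getOrDefault(token, false);
                if (isOpen) {
                    attachLeft = true;
                } else {
                    attachRight = true;
                }
                open.put(token, !isOpen);
            }

            if (!glueNext && !attachLeft) {
                text.append(' ');
            }
            text.append(token);
            glueNext = attachRight;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, Rule> rules = new HashMap<>();
        rules.put(",", Rule.MOVE_LEFT);
        rules.put(".", Rule.MOVE_LEFT);
        rules.put("-", Rule.MOVE_BOTH);
        rules.put("\"", Rule.RIGHT_LEFT_MATCHING);

        String[] tokens = { "He", "said", ",", "\"", "run", "-", "time", "\"", "." };
        // prints: He said, "run-time".
        System.out.println(detokenize(tokens, rules));
    }
}
```

In this sketch the comma and full stop attach to the word before them, the hyphen joins both neighbours, and the quotation mark attaches to the right when it opens and to the left when it closes.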


import java.util.Arrays;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.DetokenizationDictionary.Operation;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class DeTokenizerUtil {

 /**
  * Detokenizes the given tokens, applying the same operation to every token.
  */
 public static String deTokenize(String[] tokens, Operation operation) {
  // Map every token to the same detokenization operation.
  Operation[] operations = new Operation[tokens.length];
  Arrays.fill(operations, operation);

  // The dictionary tells the detokenizer which rule applies to which token.
  DetokenizationDictionary dictionary = new DetokenizationDictionary(
    tokens, operations);
  DictionaryDetokenizer detokenizer = new DictionaryDetokenizer(
    dictionary);

  // The second argument is the split marker, which is inserted
  // wherever two tokens are merged.
  return detokenizer.detokenize(tokens, " ");
 }
}


import opennlp.tools.tokenize.DetokenizationDictionary;

public class Main {
 public static void main(String[] args) {
  String[] tokens = { "We", "are", "living", "in", "an", "Environment",
    ",", "where", "multiple", "Hardware", "Architectures", "and",
    "Multiple", "platforms", "presents", ".", "So", "it", "is",
    "very", "difficult", "to", "write", ",", "compile", "and",
    "link", "the", "same", "Application", ",", "for", "each",
    "platform", "and", "each", "Architecture", "separately", ".",
    "The", "Java", "Programming", "Language", "solves", "all",
    "the", "above", "problems", ".", "The", "Java", "programming",
    "language", "platform", "provides", "a", "portable", ",",
    "interpreted", ",", "high-performance", ",", "simple", ",",
    "object-oriented", "programming", "language", "and",
    "supporting", "run-time", "environment", "." };

  String data = DeTokenizerUtil.deTokenize(tokens,
    DetokenizationDictionary.Operation.MOVE_LEFT);
  System.out.println(data);
 }
}


Output

We are living in an Environment , where multiple Hardware Architectures and Multiple platforms presents . So it is very difficult to write , compile and link the same Application , for each platform and each Architecture separately . The Java Programming Language solves all the above problems . The Java programming language platform provides a portable , interpreted , high-performance , simple , object-oriented programming language and supporting run-time environment .

Note that the output still has a space before every punctuation mark. Every token, not just the punctuation, is registered with MOVE_LEFT, and the second argument of detokenize is a split marker that is inserted at each point where tokens are merged. Passing " " as the marker therefore puts a space back between every pair of tokens. To get natural-looking text, register only the punctuation tokens (such as "," and ".") in the dictionary and pass null as the split marker.



