Thursday, 1 October 2015

openNLP: Parts Of Speech tagger

A parts-of-speech (POS) tagger reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word.

Using the Command Line Interface (CLI)
OpenNLP provides the command line tool ‘POSTagger’ to identify parts of speech.

$ opennlp POSTagger
Usage: opennlp POSTagger model < sentences

Download the 'en-pos-maxent.bin' model file from the following location.


Suppose input.txt contains the following data.
Pierre , 27 years old, is a software Engineer joined xyz organization.
Mr . Vinken is the project manager, team size 10.

Use the following command to tag each word in the input with its part of speech.

opennlp POSTagger en-pos-maxent.bin < input.txt


$ opennlp POSTagger ./en-pos-maxent.bin < input.txt
Loading POS Tagger model ... done (0.703s)

Pierre_NNP ,_, 27_CD years_NNS old,_, is_VBZ a_DT software_NN Engineer_NNP joined_VBD xyz_NNP organization._.
Mr_NNP ._. Vinken_NNP is_VBZ the_DT project_NN manager,_NN team_NN size_NN 10._.


Average: 375.0 sent/s 
Total: 3 sent
Runtime: 0.008s


As the output shows, each token is followed by an underscore and a tag. For example, ‘Pierre’ is tagged _NNP (NNP stands for proper noun, singular).
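
Because the tag always follows the last underscore of each item, the CLI output is easy to post-process. The following is a minimal sketch of that idea and is not part of OpenNLP: the class name TaggedTokenParser and its parse method are made up for illustration. It splits one output line on whitespace and then splits each item at its last underscore.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class) that parses token_TAG output. */
class TaggedTokenParser {

    /** Splits one line of POSTagger output into {token, tag} pairs. */
    public static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            // Use the last underscore so tokens that contain '_' stay intact.
            int idx = item.lastIndexOf('_');
            if (idx <= 0) {
                continue; // skip empty items and anything without a token_TAG shape
            }
            pairs.add(new String[] { item.substring(0, idx), item.substring(idx + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample items copied from the CLI output shown earlier.
        for (String[] pair : parse("Pierre_NNP ,_, 27_CD years_NNS")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Splitting at the last underscore rather than the first is a design choice: a tag never contains an underscore, but a token might.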

The following table describes each tag.

S.No  Tag    Description
1     CC     Coordinating conjunction
2     CD     Cardinal number
3     DT     Determiner
4     EX     Existential there
5     FW     Foreign word
6     IN     Preposition or subordinating conjunction
7     JJ     Adjective
8     JJR    Adjective, comparative
9     JJS    Adjective, superlative
10    LS     List item marker
11    MD     Modal
12    NN     Noun, singular or mass
13    NNS    Noun, plural
14    NNP    Proper noun, singular
15    NNPS   Proper noun, plural
16    PDT    Predeterminer
17    POS    Possessive ending
18    PRP    Personal pronoun
19    PRP$   Possessive pronoun
20    RB     Adverb
21    RBR    Adverb, comparative
22    RBS    Adverb, superlative
23    RP     Particle
24    SYM    Symbol
25    TO     to
26    UH     Interjection
27    VB     Verb, base form
28    VBD    Verb, past tense
29    VBG    Verb, gerund or present participle
30    VBN    Verb, past participle
31    VBP    Verb, non-3rd person singular present
32    VBZ    Verb, 3rd person singular present
33    WDT    Wh-determiner
34    WP     Wh-pronoun
35    WP$    Possessive wh-pronoun
36    WRB    Wh-adverb
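
The tags in the table above follow the Penn Treebank convention, in which related tags share a prefix (NN, NNS, NNP and NNPS are all nouns, for instance). When only the broad word class matters, that prefix can be used to collapse the fine-grained tags. The sketch below illustrates one possible grouping; the class TagCategory and its category names are assumptions made for this example and are not part of OpenNLP.

```java
/** Hypothetical helper (not part of OpenNLP) that groups fine-grained tags. */
class TagCategory {

    /** Maps a tag from the table to a coarse word class chosen for this example. */
    public static String coarse(String tag) {
        if (tag.startsWith("NN")) {
            return "noun";       // NN, NNS, NNP, NNPS
        }
        if (tag.startsWith("VB")) {
            return "verb";       // VB, VBD, VBG, VBN, VBP, VBZ
        }
        if (tag.startsWith("JJ")) {
            return "adjective";  // JJ, JJR, JJS
        }
        if (tag.startsWith("RB") || tag.equals("WRB")) {
            return "adverb";     // RB, RBR, RBS, WRB
        }
        return "other";          // everything else, including punctuation tags
    }

    public static void main(String[] args) {
        for (String tag : new String[] { "NNP", "VBZ", "JJR", "WRB", "DT" }) {
            System.out.println(tag + " -> " + coarse(tag));
        }
    }
}
```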
  

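Using the Java API
The following is a complete working application. It consists of four classes, each in its own source file (FileUtils, TokenizerUtil, POSTaggerUtil and Test), and needs the OpenNLP tools jar on the classpath.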
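import static java.nio.file.Files.readAllBytes;
import static java.nio.file.Paths.get;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class FileUtils {
 /**
  * Reads the whole file into a string, decoding it as UTF-8.
  * 
  * @param fileName path of the file to read
  * @return the file contents
  * @throws UncheckedIOException if the file cannot be read
  */
 public static String getFileDataAsString(String fileName) {
  // requireNonNull throws on null; Objects.nonNull only returns a boolean
  Objects.requireNonNull(fileName, "fileName");
  try {
   return new String(readAllBytes(get(fileName)), StandardCharsets.UTF_8);
  } catch (IOException e) {
   // fail fast: callers pass the result straight to a tokenizer
   throw new UncheckedIOException(e);
  }
 }
}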

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Objects;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TokenizerUtil {
 TokenizerModel model = null;
 Tokenizer learnableTokenizer = null;

 public TokenizerUtil(String modelFile) {
  initTokenizerModel(modelFile);
  learnableTokenizer = new TokenizerME(model);
 }

 private void initTokenizerModel(String modelFile) {
  // requireNonNull throws on null; Objects.nonNull only returns a boolean
  Objects.requireNonNull(modelFile, "modelFile");

  // try-with-resources closes the stream even if loading fails
  try (InputStream modelIn = new FileInputStream(modelFile)) {
   model = new TokenizerModel(modelIn);
  } catch (IOException e) {
   // fail fast instead of continuing with a null model
   throw new UncheckedIOException(e);
  }
 }

 public Tokenizer getLearnableTokenizer() {
  return learnableTokenizer;
 }

 public Tokenizer getWhitespaceTokenizer() {
  return WhitespaceTokenizer.INSTANCE;
 }

 public String[] tokenizeFileUsingLearnableTokenizer(String file) {
  String data = FileUtils.getFileDataAsString(file);
  return learnableTokenizer.tokenize(data);
 }

 public String[] tokenizeUsingLearnableTokenizer(String data) {
  return learnableTokenizer.tokenize(data);
 }

 public String[] tokenizeFileUsingWhiteSpaceTokenizer(String file) {
  String data = FileUtils.getFileDataAsString(file);
  return getWhitespaceTokenizer().tokenize(data);
 }

 public String[] tokenizeUsingWhiteSpaceTokenizer(String data) {
  return getWhitespaceTokenizer().tokenize(data);
 }
}

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Objects;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class POSTaggerUtil {

 POSModel posModel = null;
 String modelFile = null;
 POSTaggerME tagger = null;
 TokenizerUtil tokenizerUtil = null;

 public POSTaggerUtil(String posModelFile, String tokenizerModelFile) {
  // requireNonNull throws on null; Objects.nonNull only returns a boolean
  Objects.requireNonNull(posModelFile, "posModelFile");
  Objects.requireNonNull(tokenizerModelFile, "tokenizerModelFile");
  modelFile = posModelFile;
  initModel();
  tagger = new POSTaggerME(posModel);
  tokenizerUtil = new TokenizerUtil(tokenizerModelFile);
 }

 private void initModel() {
  // try-with-resources closes the stream even if loading fails
  try (InputStream modelIn = new FileInputStream(modelFile)) {
   posModel = new POSModel(modelIn);
  } catch (IOException e) {
   // fail fast instead of continuing with a null model
   throw new UncheckedIOException(e);
  }
 }

 public String[] getTags(String sentence) {
  String[] tokens = tokenizerUtil
    .tokenizeUsingWhiteSpaceTokenizer(sentence);
  return tagger.tag(tokens);
 }

 public String[] getTagsForFile(String fileName) {
  // requireNonNull throws on null; Objects.nonNull only returns a boolean
  Objects.requireNonNull(fileName, "fileName");
  String data = FileUtils.getFileDataAsString(fileName);
  return getTags(data);
 }

}

import java.io.IOException;

public class Test {
 public static void main(String[] args) throws IOException {
  String tokenizerModelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-token.bin";
  String posModelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-pos-maxent.bin";
  String inputFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/input.txt";

  POSTaggerUtil util = new POSTaggerUtil(posModelFile, tokenizerModelFile);
  String[] tags = util.getTagsForFile(inputFile);

  for (String tag : tags)
   System.out.println(tag);

 }
}


Output
NNP
,
CD
NNS
,
VBZ
DT
NN
NNP
VBD
NNP
NNP
NNP
.
NNP
VBZ
DT
NN
NN
NN
NN
.
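
The program above prints only the tags. Since the tagger returns one tag per input token, in the same order, the two arrays can be paired up to reproduce the token_TAG format of the command line tool. The sketch below shows that pairing under stated assumptions: the class TaggedSentenceFormatter is a made-up helper, and the tokens and tags in main are hard-coded from the CLI output earlier in this post rather than produced by a live model. In a real program they would come from the tokenizer and from POSTaggerME.tag.

```java
/** Hypothetical helper (not part of OpenNLP) that joins tokens with their tags. */
class TaggedSentenceFormatter {

    /** Returns "token_TAG token_TAG ..." for parallel token and tag arrays. */
    public static String format(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                "Expected one tag per token, got " + tokens.length
                + " tokens and " + tags.length + " tags");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' '); // single space between tagged tokens
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative values taken from the CLI output above.
        String[] tokens = { "Mr", ".", "Vinken", "is" };
        String[] tags = { "NNP", ".", "NNP", "VBZ" };
        System.out.println(format(tokens, tags)); // Mr_NNP ._. Vinken_NNP is_VBZ
    }
}
```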

