A part-of-speech (POS) tagger reads text in a given language and assigns a part of speech, such as noun, verb, or adjective, to each word.
Using Command line Interface (CLI)
OpenNLP provides the command line tool 'POSTagger' to identify parts of speech.
$ opennlp POSTagger
Usage: opennlp POSTagger model < sentences
Download the 'en-pos-maxent.bin' model file from the following location.
Let's say input.txt contains the following data.
Pierre , 27 years old, is a software Engineer joined xyz organization. Mr . Vinken is the project manager, team size 10.
Use the following command to tag the parts of speech in input.txt.
opennlp POSTagger en-pos-maxent.bin < input.txt
$ opennlp POSTagger ./en-pos-maxent.bin < input.txt
Loading POS Tagger model ... done (0.703s)
Pierre_NNP ,_, 27_CD years_NNS old,_, is_VBZ a_DT software_NN Engineer_NNP joined_VBD xyz_NNP organization._. Mr_NNP ._. Vinken_NNP is_VBZ the_DT project_NN manager,_NN team_NN size_NN 10._.
Average: 375.0 sent/s
Total: 3 sent
Runtime: 0.008s
As the output shows, each token has a tag appended to it, separated by an underscore. For example, 'Pierre' is tagged _NNP (NNP stands for proper noun, singular).
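If you want to process this output in another program, the token_TAG items can be split apart again. The following is a minimal sketch using only the Java standard library; the class name TaggedTokenParser and its parse method are hypothetical helpers for illustration, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): parses "token_TAG" items
// as printed by the POSTagger command line tool.
class TaggedTokenParser {

    /** Splits one line of tagger output into {token, tag} pairs. */
    static List<String[]> parse(String taggedLine) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : taggedLine.trim().split("\\s+")) {
            // Split on the LAST underscore so a token that itself
            // contains '_' keeps its full text.
            int sep = item.lastIndexOf('_');
            if (sep <= 0) {
                continue; // not in token_TAG form, skip it
            }
            pairs.add(new String[] { item.substring(0, sep), item.substring(sep + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "Pierre_NNP ,_, 27_CD years_NNS old,_, is_VBZ";
        for (String[] pair : parse(line)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Splitting on the last underscore rather than the first is a design choice: the tag never contains an underscore, while a token might.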
The following table describes each tag.
| S.NO | Tag | Description |
|------|-----|-------------|
| 1 | CC | Coordinating conjunction |
| 2 | CD | Cardinal number |
| 3 | DT | Determiner |
| 4 | EX | Existential there |
| 5 | FW | Foreign word |
| 6 | IN | Preposition or subordinating conjunction |
| 7 | JJ | Adjective |
| 8 | JJR | Adjective, comparative |
| 9 | JJS | Adjective, superlative |
| 10 | LS | List item marker |
| 11 | MD | Modal |
| 12 | NN | Noun, singular or mass |
| 13 | NNS | Noun, plural |
| 14 | NNP | Proper noun, singular |
| 15 | NNPS | Proper noun, plural |
| 16 | PDT | Predeterminer |
| 17 | POS | Possessive ending |
| 18 | PRP | Personal pronoun |
| 19 | PRP$ | Possessive pronoun |
| 20 | RB | Adverb |
| 21 | RBR | Adverb, comparative |
| 22 | RBS | Adverb, superlative |
| 23 | RP | Particle |
| 24 | SYM | Symbol |
| 25 | TO | to |
| 26 | UH | Interjection |
| 27 | VB | Verb, base form |
| 28 | VBD | Verb, past tense |
| 29 | VBG | Verb, gerund or present participle |
| 30 | VBN | Verb, past participle |
| 31 | VBP | Verb, non-3rd person singular present |
| 32 | VBZ | Verb, 3rd person singular present |
| 33 | WDT | Wh-determiner |
| 34 | WP | Wh-pronoun |
| 35 | WP$ | Possessive wh-pronoun |
| 36 | WRB | Wh-adverb |
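Several tags in the table share a prefix: the noun tags start with NN, the verb tags with VB, the adjective tags with JJ, and the adverb tags with RB. When you only need a coarse word class, that prefix is enough. The sketch below illustrates the idea; the class CoarseTag and the category labels it returns are assumptions made for this example and are not defined by OpenNLP.

```java
// Hypothetical helper (not an OpenNLP class): collapses the fine-grained
// tags from the table above into a few coarse word classes.
class CoarseTag {

    /** Returns a coarse word class for a tag, or "other". */
    static String of(String tag) {
        if (tag.startsWith("NN")) return "noun";      // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return "verb";      // VB, VBD, VBG, VBN, VBP, VBZ
        if (tag.startsWith("JJ")) return "adjective"; // JJ, JJR, JJS
        if (tag.startsWith("RB")) return "adverb";    // RB, RBR, RBS
        return "other";                               // everything else, e.g. DT, WRB
    }

    public static void main(String[] args) {
        for (String tag : new String[] { "NNP", "VBZ", "JJR", "RB", "DT" }) {
            System.out.println(tag + " -> " + of(tag));
        }
    }
}
```

Note that WRB (wh-adverb) falls into "other" here because it does not start with RB; extend the mapping if your use case needs it.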
Using Java API
The following is a complete working application. It consists of four classes: FileUtils, TokenizerUtil, POSTaggerUtil, and Test.
import static java.nio.file.Files.readAllBytes;
import static java.nio.file.Paths.get;

import java.io.IOException;
import java.util.Objects;

public class FileUtils {

    /**
     * Get file data as a string.
     *
     * @param fileName path of the file to read
     * @return the file contents, or null if the file cannot be read
     */
    public static String getFileDataAsString(String fileName) {
        // requireNonNull throws on null; Objects.nonNull only returns a boolean.
        Objects.requireNonNull(fileName);
        try {
            return new String(readAllBytes(get(fileName)));
        } catch (IOException e) {
            System.out.println(e.getMessage());
            return null;
        }
    }
}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TokenizerUtil {
    TokenizerModel model = null;
    Tokenizer learnableTokenizer = null;

    public TokenizerUtil(String modelFile) {
        initTokenizerModel(modelFile);
        learnableTokenizer = new TokenizerME(model);
    }

    // Loads the tokenizer model; the stream is closed automatically.
    private void initTokenizerModel(String modelFile) {
        Objects.requireNonNull(modelFile);
        try (InputStream modelIn = new FileInputStream(modelFile)) {
            model = new TokenizerModel(modelIn);
        } catch (IOException e) {
            // Also covers FileNotFoundException; model stays null in that case.
            System.out.println(e.getMessage());
        }
    }

    public Tokenizer getLearnableTokenizer() {
        return learnableTokenizer;
    }

    public Tokenizer getWhitespaceTokenizer() {
        return WhitespaceTokenizer.INSTANCE;
    }

    public String[] tokenizeFileUsingLearnableTokenizer(String file) {
        String data = FileUtils.getFileDataAsString(file);
        return learnableTokenizer.tokenize(data);
    }

    public String[] tokenizeUsingLearnableTokenizer(String data) {
        return learnableTokenizer.tokenize(data);
    }

    public String[] tokenizeFileUsingWhiteSpaceTokenizer(String file) {
        String data = FileUtils.getFileDataAsString(file);
        return getWhitespaceTokenizer().tokenize(data);
    }

    public String[] tokenizeUsingWhiteSpaceTokenizer(String data) {
        return getWhitespaceTokenizer().tokenize(data);
    }
}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class POSTaggerUtil {
    POSModel posModel = null;
    String modelFile = null;
    POSTaggerME tagger = null;
    TokenizerUtil tokenizerUtil = null;

    public POSTaggerUtil(String posModelFile, String tokenizerModelFile) {
        // requireNonNull throws on null; Objects.nonNull only returns a boolean.
        Objects.requireNonNull(posModelFile);
        Objects.requireNonNull(tokenizerModelFile);

        modelFile = posModelFile;
        initModel();
        tagger = new POSTaggerME(posModel);
        tokenizerUtil = new TokenizerUtil(tokenizerModelFile);
    }

    // Loads the POS model; the stream is closed automatically.
    private void initModel() {
        try (InputStream modelIn = new FileInputStream(modelFile)) {
            posModel = new POSModel(modelIn);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }

    // Splits the text on whitespace and returns one tag per token.
    public String[] getTags(String sentence) {
        String[] tokens = tokenizerUtil
                .tokenizeUsingWhiteSpaceTokenizer(sentence);
        return tagger.tag(tokens);
    }

    public String[] getTagsForFile(String fileName) {
        Objects.requireNonNull(fileName);
        String data = FileUtils.getFileDataAsString(fileName);
        return getTags(data);
    }
}
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        String tokenizerModelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-token.bin";
        String posModelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-pos-maxent.bin";
        String inputFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/input.txt";

        POSTaggerUtil util = new POSTaggerUtil(posModelFile, tokenizerModelFile);

        String[] tags = util.getTagsForFile(inputFile);

        for (String tag : tags) {
            System.out.println(tag);
        }
    }
}
Output
NNP
,
CD
NNS
,
VBZ
DT
NN
NNP
VBD
NNP
NNP
NNP
.
NNP
VBZ
DT
NN
NN
NN
NN
.
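The program prints only the tags, so the reader has to match them back to the tokens by position. Because the tagger returns exactly one tag per token, the two arrays can be joined index by index to get the same token_TAG form that the command line tool prints. The following is a minimal standard-library sketch; the class TagPairing and its join method are hypothetical, and the sample arrays are hard-coded from the "Mr . Vinken is" portion of the example rather than produced by a tagger.

```java
// Hypothetical helper (not an OpenNLP class): joins tokens with their tags.
class TagPairing {

    /** Builds "token_TAG token_TAG ..." from two parallel arrays. */
    static String join(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hard-coded sample; in the application above these arrays would come
        // from the whitespace tokenizer and from the tagger respectively.
        String[] tokens = { "Mr", ".", "Vinken", "is" };
        String[] tags = { "NNP", ".", "NNP", "VBZ" };
        System.out.println(join(tokens, tags)); // Mr_NNP ._. Vinken_NNP is_VBZ
    }
}
```

To use this with POSTaggerUtil, getTags would need to return or expose the token array alongside the tags, which the version above does not do.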