A part-of-speech (POS) tagger reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word.
Using Command Line Interface (CLI)
OpenNLP provides the command-line tool ‘POSTagger’ to identify parts of speech.
$ opennlp POSTagger
Usage: opennlp POSTagger model < sentences
Download the 'en-pos-maxent.bin' model file from the following location.
Let’s say input.txt contains the following data.
Pierre , 27 years old, is a software Engineer joined xyz organization. Mr . Vinken is the project manager, team size 10.
Use the following command to tag each token in the input with its part of speech.
opennlp POSTagger en-pos-maxent.bin < input.txt
$ opennlp POSTagger ./en-pos-maxent.bin < input.txt
Loading POS Tagger model ... done (0.703s)
Pierre_NNP ,_, 27_CD years_NNS old,_, is_VBZ a_DT software_NN Engineer_NNP joined_VBD xyz_NNP organization._. Mr_NNP ._. Vinken_NNP is_VBZ the_DT project_NN manager,_NN team_NN size_NN 10._.
Average: 375.0 sent/s
Total: 3 sent
Runtime: 0.008s
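As the output shows, each token is followed by an underscore and its tag. For example, ‘Pierre’ is tagged as Pierre_NNP (NNP stands for proper noun, singular). Punctuation that is attached to a word in the input, as in ‘old,’, stays part of that token.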
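To work with this output in a program, each token_TAG item has to be split back into its token and its tag. The following is a minimal sketch of one way to do that, assuming the items are separated by whitespace and the tag follows the last underscore, as in the output above. The class and method names (TaggedLineParser, parse) are hypothetical and are not part of OpenNLP.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TaggedLineParser {

    /** Splits one line of whitespace-separated token_TAG items into (token, tag) pairs. */
    static List<Map.Entry<String, String>> parse(String line) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            // Split on the LAST underscore so a token that itself
            // contains '_' keeps its text intact.
            int sep = item.lastIndexOf('_');
            if (sep <= 0) {
                continue; // not a token_TAG item (for example, an empty line)
            }
            pairs.add(new SimpleEntry<>(item.substring(0, sep), item.substring(sep + 1)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "Pierre_NNP ,_, 27_CD years_NNS old,_, is_VBZ a_DT software_NN";
        for (Map.Entry<String, String> pair : parse(line)) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```

Splitting on the last underscore rather than the first is a design choice for this sketch: it keeps a token such as a_b intact when it is written as a_b_NN.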
The following table describes each tag.
S.No | Tag | Description
1 | CC | Coordinating conjunction
2 | CD | Cardinal number
3 | DT | Determiner
4 | EX | Existential there
5 | FW | Foreign word
6 | IN | Preposition or subordinating conjunction
7 | JJ | Adjective
8 | JJR | Adjective, comparative
9 | JJS | Adjective, superlative
10 | LS | List item marker
11 | MD | Modal
12 | NN | Noun, singular or mass
13 | NNS | Noun, plural
14 | NNP | Proper noun, singular
15 | NNPS | Proper noun, plural
16 | PDT | Predeterminer
17 | POS | Possessive ending
18 | PRP | Personal pronoun
19 | PRP$ | Possessive pronoun
20 | RB | Adverb
21 | RBR | Adverb, comparative
22 | RBS | Adverb, superlative
23 | RP | Particle
24 | SYM | Symbol
25 | TO | to
26 | UH | Interjection
27 | VB | Verb, base form
28 | VBD | Verb, past tense
29 | VBG | Verb, gerund or present participle
30 | VBN | Verb, past participle
31 | VBP | Verb, non-3rd person singular present
32 | VBZ | Verb, 3rd person singular present
33 | WDT | Wh-determiner
34 | WP | Wh-pronoun
35 | WP$ | Possessive wh-pronoun
36 | WRB | Wh-adverb
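Many applications only need a coarse word class rather than the full tag. In the table, all noun tags begin with NN, all verb tags with VB, all adjective tags with JJ, and the plain adverb tags with RB, so a prefix check is enough for a rough grouping. The following is a small sketch of that idea; the class name CoarseTag and the category labels are assumptions chosen for illustration, not something OpenNLP provides.

```java
class CoarseTag {

    /** Maps a fine-grained tag from the table to a coarse word class. */
    static String of(String tag) {
        if (tag.startsWith("NN")) {
            return "noun";      // NN, NNS, NNP, NNPS
        }
        if (tag.startsWith("VB")) {
            return "verb";      // VB, VBD, VBG, VBN, VBP, VBZ
        }
        if (tag.startsWith("JJ")) {
            return "adjective"; // JJ, JJR, JJS
        }
        if (tag.startsWith("RB")) {
            return "adverb";    // RB, RBR, RBS (WRB is left in "other")
        }
        return "other";         // everything else, including punctuation tags
    }

    public static void main(String[] args) {
        String[] tags = {"NNP", "VBZ", "DT", "JJS", "RB"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + of(tag));
        }
    }
}
```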
Using Java API
The following is a complete working application. It consists of four classes: FileUtils, TokenizerUtil, POSTaggerUtil, and Test.
import static java.nio.file.Files.readAllBytes;
import static java.nio.file.Paths.get;

import java.io.IOException;
import java.util.Objects;

public class FileUtils {

    /**
     * Get file data as string.
     *
     * @param fileName path of the file to read
     * @return the file contents, or null if the file cannot be read
     */
    public static String getFileDataAsString(String fileName) {
        // requireNonNull throws on null; Objects.nonNull only returns a boolean
        Objects.requireNonNull(fileName);
        try {
            return new String(readAllBytes(get(fileName)));
        } catch (IOException e) {
            System.out.println(e.getMessage());
            return null;
        }
    }
}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TokenizerUtil {
    TokenizerModel model = null;
    Tokenizer learnableTokenizer = null;

    public TokenizerUtil(String modelFile) {
        initTokenizerModel(modelFile);
        learnableTokenizer = new TokenizerME(model);
    }

    private void initTokenizerModel(String modelFile) {
        Objects.requireNonNull(modelFile);
        // try-with-resources closes the stream even if loading fails
        try (InputStream modelIn = new FileInputStream(modelFile)) {
            model = new TokenizerModel(modelIn);
        } catch (IOException e) {
            // Fail fast: TokenizerME cannot be built without a model
            throw new IllegalStateException("Cannot load tokenizer model: " + modelFile, e);
        }
    }

    public Tokenizer getLearnableTokenizer() {
        return learnableTokenizer;
    }

    public Tokenizer getWhitespaceTokenizer() {
        return WhitespaceTokenizer.INSTANCE;
    }

    public String[] tokenizeFileUsingLearnableTokenizer(String file) {
        String data = FileUtils.getFileDataAsString(file);
        return learnableTokenizer.tokenize(data);
    }

    public String[] tokenizeUsingLearnableTokenizer(String data) {
        return learnableTokenizer.tokenize(data);
    }

    public String[] tokenizeFileUsingWhiteSpaceTokenizer(String file) {
        String data = FileUtils.getFileDataAsString(file);
        return getWhitespaceTokenizer().tokenize(data);
    }

    public String[] tokenizeUsingWhiteSpaceTokenizer(String data) {
        return getWhitespaceTokenizer().tokenize(data);
    }
}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class POSTaggerUtil {
    POSModel posModel = null;
    String modelFile = null;
    POSTaggerME tagger = null;
    TokenizerUtil tokenizerUtil = null;

    public POSTaggerUtil(String posModelFile, String tokenizerModelFile) {
        // requireNonNull throws on null; Objects.nonNull only returns a boolean
        Objects.requireNonNull(posModelFile);
        Objects.requireNonNull(tokenizerModelFile);
        modelFile = posModelFile;
        initModel();
        tagger = new POSTaggerME(posModel);
        tokenizerUtil = new TokenizerUtil(tokenizerModelFile);
    }

    private void initModel() {
        // try-with-resources closes the stream even if loading fails
        try (InputStream modelIn = new FileInputStream(modelFile)) {
            posModel = new POSModel(modelIn);
        } catch (IOException e) {
            // Fail fast: POSTaggerME cannot be built without a model
            throw new IllegalStateException("Cannot load POS model: " + modelFile, e);
        }
    }

    /** Tags a whitespace-tokenized sentence; returns one tag per token. */
    public String[] getTags(String sentence) {
        String[] tokens = tokenizerUtil.tokenizeUsingWhiteSpaceTokenizer(sentence);
        return tagger.tag(tokens);
    }

    public String[] getTagsForFile(String fileName) {
        Objects.requireNonNull(fileName);
        String data = FileUtils.getFileDataAsString(fileName);
        return getTags(data);
    }
}
public class Test {
    public static void main(String[] args) {
        String tokenizerModelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-token.bin";
        String posModelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-pos-maxent.bin";
        String inputFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/input.txt";

        POSTaggerUtil util = new POSTaggerUtil(posModelFile, tokenizerModelFile);
        String[] tags = util.getTagsForFile(inputFile);

        // Print one tag per line, in token order
        for (String tag : tags) {
            System.out.println(tag);
        }
    }
}
Output
NNP
,
CD
NNS
,
VBZ
DT
NN
NNP
VBD
NNP
NNP
NNP
.
NNP
VBZ
DT
NN
NN
NN
NN
.
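The program prints only the tags, so the reader has to count positions to see which token each tag belongs to. Since the tagger returns one tag per input token, the two arrays can be joined back into the token_TAG form that the command-line tool prints. The following is a sketch of such a helper; TaggedSentenceFormatter and format are hypothetical names, and the tokens and tags in main are copied by hand from the sample output rather than produced by a model.

```java
class TaggedSentenceFormatter {

    /** Joins parallel token and tag arrays into "token_TAG token_TAG ..." form. */
    static String format(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                    "Expected one tag per token, got " + tokens.length
                            + " tokens and " + tags.length + " tags");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Whitespace tokens of the second sample sentence and their tags
        String[] tokens = {"Mr", ".", "Vinken", "is", "the", "project", "manager,"};
        String[] tags = {"NNP", ".", "NNP", "VBZ", "DT", "NN", "NN"};
        // prints: Mr_NNP ._. Vinken_NNP is_VBZ the_DT project_NN manager,_NN
        System.out.println(format(tokens, tags));
    }
}
```

In the application above, the same helper could be called with the token array passed to the tagger and the array it returns, provided getTags is changed to expose the tokens as well.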