You can train a tokenizer model using either the command line interface or the Java API.
Using CLI (Command Line Interface):
opennlp
provides 'TokenizerTrainer' tool to train data. The OpenNLP format contains one
sentence per line. You can also specify tokens either separated by a whitespace
or by a special <SPLIT> tag.
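For example, a training line may look like the following (an illustrative line, not part of the training file used below); the <SPLIT> tag marks a token boundary that whitespace alone does not indicate.

He is 27 years old<SPLIT>, and works as a software engineer<SPLIT>.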
To get help for 'TokenizerTrainer', run the command 'opennlp TokenizerTrainer'.
$ opennlp TokenizerTrainer
Usage: opennlp TokenizerTrainer[.ad|.pos|.conllx|.namefinder|.parse] [-factory factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
	-factory factoryName
		A sub-class of TokenizerFactory where to get implementation and resources.
	-abbDict path
		abbreviation dictionary in XML format.
	-alphaNumOpt isAlphaNumOpt
		Optimization flag to skip alpha numeric tokens for further tokenization
	-params paramsFile
		training parameters file.
	-lang language
		language which is being processed.
	-model modelFile
		output model file.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.
The following command takes training data from 'input.txt' and generates the 'token_model.bin' file.
opennlp TokenizerTrainer -model token_model.bin -alphaNumOpt false -lang en -data input.txt -encoding UTF-8
'input.txt' contains the following data.
Hari krishna Gurram<SPLIT>, 27 years old<SPLIT>, is a software Engineer<SPLIT> joined xyz organization. Mr. Ananad Bandaru <SPLIT> is the project manager,<SPLIT> team size 10<SPLIT>.
$ opennlp TokenizerTrainer -model ./token_model.bin -alphaNumOpt false -lang en -data ./input.txt -encoding UTF-8
Indexing events using cutoff of 5
Computing event counts... done. 94 events
Indexing... done.
Sorting and merging events... done. Reduced 94 events to 89.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 89
Number of Outcomes: 2
Number of Predicates: 30
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-65.15583497263488 0.9680851063829787
2: ... loglikelihood=-27.919978747347933 0.9680851063829787
3: ... loglikelihood=-18.830459832857034 0.9680851063829787
4: ... loglikelihood=-15.404986758262185 0.9680851063829787
5: ... loglikelihood=-13.72901035449004 0.9680851063829787
6: ... loglikelihood=-12.736402318032512 0.9680851063829787
7: ... loglikelihood=-12.058379907327403 0.9680851063829787
8: ... loglikelihood=-11.546862469089756 0.9680851063829787
9: ... loglikelihood=-11.135581655830746 0.9680851063829787
10: ... loglikelihood=-10.79190969013221 0.9787234042553191
11: ... loglikelihood=-10.49809101645788 0.9787234042553191
12: ... loglikelihood=-10.243346245232049 0.9787234042553191
13: ... loglikelihood=-10.020385951866078 0.9787234042553191
14: ... loglikelihood=-9.823823683852883 0.9787234042553191
15: ... loglikelihood=-9.649432894870161 0.9680851063829787
16: ... loglikelihood=-9.493781940674689 0.9680851063829787
17: ... loglikelihood=-9.354037635587453 0.9680851063829787
18: ... loglikelihood=-9.227844090785963 0.9680851063829787
19: ... loglikelihood=-9.113236646005166 0.9680851063829787
20: ... loglikelihood=-9.008574348661416 0.9680851063829787
21: ... loglikelihood=-8.912484387352526 0.9680851063829787
22: ... loglikelihood=-8.823815723133473 0.9680851063829787
23: ... loglikelihood=-8.741600464033226 0.9680851063829787
24: ... loglikelihood=-8.665021902495617 0.9680851063829787
25: ... loglikelihood=-8.593388239589714 0.9680851063829787
26: ... loglikelihood=-8.526111085322134 0.9680851063829787
27: ... loglikelihood=-8.46268790979203 0.9680851063829787
28: ... loglikelihood=-8.402687724984585 0.9680851063829787
29: ... loglikelihood=-8.345739388224226 0.9680851063829787
30: ... loglikelihood=-8.291522024176455 0.9680851063829787
31: ... loglikelihood=-8.239757156276054 0.9680851063829787
32: ... loglikelihood=-8.190202218191189 0.9680851063829787
33: ... loglikelihood=-8.142645181598036 0.9680851063829787
34: ... loglikelihood=-8.096900089605423 0.9680851063829787
35: ... loglikelihood=-8.052803327557944 0.9680851063829787
36: ... loglikelihood=-8.010210496588895 0.9680851063829787
37: ... loglikelihood=-7.968993781919306 0.9680851063829787
38: ... loglikelihood=-7.929039728962359 0.9680851063829787
39: ... loglikelihood=-7.890247356978089 0.9680851063829787
40: ... loglikelihood=-7.852526553274457 0.9680851063829787
41: ... loglikelihood=-7.81579670150884 0.9680851063829787
42: ... loglikelihood=-7.7799855060881375 0.9680851063829787
43: ... loglikelihood=-7.745027981445366 0.9680851063829787
44: ... loglikelihood=-7.710865580437717 0.9680851063829787
45: ... loglikelihood=-7.677445440537027 0.9680851063829787
46: ... loglikelihood=-7.644719730081948 0.9680851063829787
47: ... loglikelihood=-7.6126450797987175 0.9680851063829787
48: ... loglikelihood=-7.581182087204402 0.9680851063829787
49: ... loglikelihood=-7.55029488348716 0.9680851063829787
50: ... loglikelihood=-7.519950754092984 0.9680851063829787
51: ... loglikelihood=-7.490119805603684 0.9680851063829787
52: ... loglikelihood=-7.460774672617553 0.9680851063829787
53: ... loglikelihood=-7.431890259284192 0.9680851063829787
54: ... loglikelihood=-7.403443510932034 0.9680851063829787
55: ... loglikelihood=-7.375413211887318 0.9680851063829787
56: ... loglikelihood=-7.347779806140007 0.9680851063829787
57: ... loglikelihood=-7.320525237981628 0.9680851063829787
58: ... loglikelihood=-7.29363281013785 0.9680851063829787
59: ... loglikelihood=-7.267087057256732 0.9680851063829787
60: ... loglikelihood=-7.240873632900681 0.9680851063829787
61: ... loglikelihood=-7.214979208436198 0.9680851063829787
62: ... loglikelihood=-7.189391382424853 0.9680851063829787
63: ... loglikelihood=-7.164098599299557 0.9680851063829787
64: ... loglikelihood=-7.139090076264654 0.9680851063829787
65: ... loglikelihood=-7.11435573749169 0.9680851063829787
66: ... loglikelihood=-7.0898861547979495 0.9680851063829787
67: ... loglikelihood=-7.065672494094207 0.9680851063829787
68: ... loglikelihood=-7.041706466974542 0.9680851063829787
69: ... loglikelihood=-7.017980286895933 0.9680851063829787
70: ... loglikelihood=-6.994486629460506 0.9680851063829787
71: ... loglikelihood=-6.971218596370116 0.9680851063829787
72: ... loglikelihood=-6.948169682672612 0.9680851063829787
73: ... loglikelihood=-6.925333746962407 0.9680851063829787
74: ... loglikelihood=-6.902704984236104 0.9680851063829787
75: ... loglikelihood=-6.880277901137151 0.9680851063829787
76: ... loglikelihood=-6.858047293352859 0.9680851063829787
77: ... loglikelihood=-6.836008224953014 0.9680851063829787
78: ... loglikelihood=-6.814156009481834 0.9680851063829787
79: ... loglikelihood=-6.792486192635316 0.9680851063829787
80: ... loglikelihood=-6.7709945363736495 0.9680851063829787
81: ... loglikelihood=-6.749677004334072 0.9680851063829787
82: ... loglikelihood=-6.728529748423526 0.9680851063829787
83: ... loglikelihood=-6.70754909648284 0.9680851063829787
84: ... loglikelihood=-6.686731540925016 0.9680851063829787
85: ... loglikelihood=-6.66607372826015 0.9680851063829787
86: ... loglikelihood=-6.645572449428095 0.9680851063829787
87: ... loglikelihood=-6.6252246308677805 0.9680851063829787
88: ... loglikelihood=-6.605027326259023 0.9680851063829787
89: ... loglikelihood=-6.584977708878869 0.9680851063829787
90: ... loglikelihood=-6.565073064520063 0.9680851063829787
91: ... loglikelihood=-6.545310784924165 0.9680851063829787
92: ... loglikelihood=-6.525688361686311 0.9680851063829787
93: ... loglikelihood=-6.5062033805926305 0.9680851063829787
94: ... loglikelihood=-6.486853516354962 0.9680851063829787
95: ... loglikelihood=-6.46763652771055 0.9680851063829787
96: ... loglikelihood=-6.448550252857571 0.9680851063829787
97: ... loglikelihood=-6.429592605199842 0.9680851063829787
98: ... loglikelihood=-6.4107615693763105 0.9680851063829787
99: ... loglikelihood=-6.392055197553354 0.9680851063829787
100: ... loglikelihood=-6.373471605959484 0.9680851063829787
Writing tokenizer model ... done (0.013s)
Wrote tokenizer model to
path: /Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/./token_model.bin
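To quickly sanity-check the generated model from the command line, you can pipe a sentence through the built-in 'TokenizerME' tool, which loads the model from the given file, reads plain text from standard input, and writes the tokenized text to standard output (the sentence here is just an illustrative example).

$ echo "Mr. Anand is the project manager, team size 10." | opennlp TokenizerME token_model.bin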
Using Java Training API
The following is a complete working application.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Objects;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TokenizerTrainer {

	/**
	 * @param inputFile
	 *            contains training data
	 * @param modelFile
	 *            generated model file after training
	 * @throws IOException
	 */
	public static void trainModel(String inputFile, String modelFile) throws IOException {
		Objects.requireNonNull(inputFile);
		Objects.requireNonNull(modelFile);

		Charset charset = Charset.forName("UTF-8");

		// Read the training data line by line and convert each line to a TokenSample
		MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(new File(inputFile));
		ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
		ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

		TokenizerModel model;
		try {
			// Train a tokenizer model with the default training parameters
			TokenizerFactory tokenizerFactory = new TokenizerFactory("en", new Dictionary(), false, null);
			model = TokenizerME.train(sampleStream, tokenizerFactory, TrainingParameters.defaultParams());
		} finally {
			sampleStream.close();
		}

		// Serialize the trained model to disk
		OutputStream modelOut = null;
		try {
			modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
			model.serialize(modelOut);
		} finally {
			if (modelOut != null)
				modelOut.close();
		}
	}
}
import java.io.IOException;

public class Main {
	public static void main(String args[]) throws IOException {
		String inputFile = "/Users/harikrishna_gurram/input.txt";
		String modelFile = "/Users/harikrishna_gurram/model_sample";

		TokenizerTrainer.trainModel(inputFile, modelFile);
	}
}
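Once the model file has been written, it can be loaded and used for tokenization. The sketch below is an illustrative example (the model path and sample sentence are assumptions): it deserializes the model and tokenizes a sentence with TokenizerME.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerTester {
	public static void main(String args[]) throws IOException {
		// Path to the model generated by TokenizerTrainer (assumed location)
		String modelFile = "/Users/harikrishna_gurram/model_sample";

		try (InputStream modelIn = new FileInputStream(modelFile)) {
			// Load the serialized tokenizer model and wrap it in a TokenizerME
			TokenizerModel model = new TokenizerModel(modelIn);
			TokenizerME tokenizer = new TokenizerME(model);

			// Tokenize a sample sentence and print one token per line
			String[] tokens = tokenizer.tokenize("Mr. Anand is the project manager, team size 10.");
			for (String token : tokens) {
				System.out.println(token);
			}
		}
	}
}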