You can train a tokenizer model using either the command line interface (CLI) or the Java API.
Using CLI (Command Line Interface):
OpenNLP provides the 'TokenizerTrainer' tool to train a tokenizer on your own data. The OpenNLP training format contains one sentence per line. Token boundaries are marked either by whitespace or, where two tokens are not separated by whitespace in the original text, by a special <SPLIT> tag.
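For example, both lines below encode the same tokenization of a four-token sentence; they are illustrative samples only and are not part of the input.txt used later in this post. The first line separates every token with whitespace, while the second uses <SPLIT> to mark the boundary before the period, which is attached to the previous word in the original text.

Weather is nice .
Weather is nice<SPLIT>.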
To get help for 'TokenizerTrainer', run the command 'opennlp TokenizerTrainer'.
$ opennlp TokenizerTrainer
Usage: opennlp TokenizerTrainer[.ad|.pos|.conllx|.namefinder|.parse] [-factory factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
        -factory factoryName
                A sub-class of TokenizerFactory where to get implementation and resources.
        -abbDict path
                abbreviation dictionary in XML format.
        -alphaNumOpt isAlphaNumOpt
                Optimization flag to skip alpha numeric tokens for further tokenization
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -model modelFile
                output model file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
The following command takes training data from 'input.txt' and generates the 'token_model.bin' model file.
opennlp TokenizerTrainer -model token_model.bin -alphaNumOpt false -lang en -data input.txt -encoding UTF-8
'input.txt' contains the following data.
Hari krishna Gurram<SPLIT>, 27 years old<SPLIT>, is a software Engineer<SPLIT> joined xyz organization. Mr. Ananad Bandaru <SPLIT> is the project manager,<SPLIT> team size 10<SPLIT>.
$ opennlp TokenizerTrainer -model ./token_model.bin -alphaNumOpt false -lang en -data ./input.txt -encoding UTF-8
Indexing events using cutoff of 5
Computing event counts... done. 94 events
Indexing... done.
Sorting and merging events... done. Reduced 94 events to 89.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 89
Number of Outcomes: 2
Number of Predicates: 30
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-65.15583497263488 0.9680851063829787
2: ... loglikelihood=-27.919978747347933 0.9680851063829787
3: ... loglikelihood=-18.830459832857034 0.9680851063829787
4: ... loglikelihood=-15.404986758262185 0.9680851063829787
5: ... loglikelihood=-13.72901035449004 0.9680851063829787
6: ... loglikelihood=-12.736402318032512 0.9680851063829787
7: ... loglikelihood=-12.058379907327403 0.9680851063829787
8: ... loglikelihood=-11.546862469089756 0.9680851063829787
9: ... loglikelihood=-11.135581655830746 0.9680851063829787
10: ... loglikelihood=-10.79190969013221 0.9787234042553191
11: ... loglikelihood=-10.49809101645788 0.9787234042553191
12: ... loglikelihood=-10.243346245232049 0.9787234042553191
13: ... loglikelihood=-10.020385951866078 0.9787234042553191
14: ... loglikelihood=-9.823823683852883 0.9787234042553191
15: ... loglikelihood=-9.649432894870161 0.9680851063829787
16: ... loglikelihood=-9.493781940674689 0.9680851063829787
17: ... loglikelihood=-9.354037635587453 0.9680851063829787
18: ... loglikelihood=-9.227844090785963 0.9680851063829787
19: ... loglikelihood=-9.113236646005166 0.9680851063829787
20: ... loglikelihood=-9.008574348661416 0.9680851063829787
21: ... loglikelihood=-8.912484387352526 0.9680851063829787
22: ... loglikelihood=-8.823815723133473 0.9680851063829787
23: ... loglikelihood=-8.741600464033226 0.9680851063829787
24: ... loglikelihood=-8.665021902495617 0.9680851063829787
25: ... loglikelihood=-8.593388239589714 0.9680851063829787
26: ... loglikelihood=-8.526111085322134 0.9680851063829787
27: ... loglikelihood=-8.46268790979203 0.9680851063829787
28: ... loglikelihood=-8.402687724984585 0.9680851063829787
29: ... loglikelihood=-8.345739388224226 0.9680851063829787
30: ... loglikelihood=-8.291522024176455 0.9680851063829787
31: ... loglikelihood=-8.239757156276054 0.9680851063829787
32: ... loglikelihood=-8.190202218191189 0.9680851063829787
33: ... loglikelihood=-8.142645181598036 0.9680851063829787
34: ... loglikelihood=-8.096900089605423 0.9680851063829787
35: ... loglikelihood=-8.052803327557944 0.9680851063829787
36: ... loglikelihood=-8.010210496588895 0.9680851063829787
37: ... loglikelihood=-7.968993781919306 0.9680851063829787
38: ... loglikelihood=-7.929039728962359 0.9680851063829787
39: ... loglikelihood=-7.890247356978089 0.9680851063829787
40: ... loglikelihood=-7.852526553274457 0.9680851063829787
41: ... loglikelihood=-7.81579670150884 0.9680851063829787
42: ... loglikelihood=-7.7799855060881375 0.9680851063829787
43: ... loglikelihood=-7.745027981445366 0.9680851063829787
44: ... loglikelihood=-7.710865580437717 0.9680851063829787
45: ... loglikelihood=-7.677445440537027 0.9680851063829787
46: ... loglikelihood=-7.644719730081948 0.9680851063829787
47: ... loglikelihood=-7.6126450797987175 0.9680851063829787
48: ... loglikelihood=-7.581182087204402 0.9680851063829787
49: ... loglikelihood=-7.55029488348716 0.9680851063829787
50: ... loglikelihood=-7.519950754092984 0.9680851063829787
51: ... loglikelihood=-7.490119805603684 0.9680851063829787
52: ... loglikelihood=-7.460774672617553 0.9680851063829787
53: ... loglikelihood=-7.431890259284192 0.9680851063829787
54: ... loglikelihood=-7.403443510932034 0.9680851063829787
55: ... loglikelihood=-7.375413211887318 0.9680851063829787
56: ... loglikelihood=-7.347779806140007 0.9680851063829787
57: ... loglikelihood=-7.320525237981628 0.9680851063829787
58: ... loglikelihood=-7.29363281013785 0.9680851063829787
59: ... loglikelihood=-7.267087057256732 0.9680851063829787
60: ... loglikelihood=-7.240873632900681 0.9680851063829787
61: ... loglikelihood=-7.214979208436198 0.9680851063829787
62: ... loglikelihood=-7.189391382424853 0.9680851063829787
63: ... loglikelihood=-7.164098599299557 0.9680851063829787
64: ... loglikelihood=-7.139090076264654 0.9680851063829787
65: ... loglikelihood=-7.11435573749169 0.9680851063829787
66: ... loglikelihood=-7.0898861547979495 0.9680851063829787
67: ... loglikelihood=-7.065672494094207 0.9680851063829787
68: ... loglikelihood=-7.041706466974542 0.9680851063829787
69: ... loglikelihood=-7.017980286895933 0.9680851063829787
70: ... loglikelihood=-6.994486629460506 0.9680851063829787
71: ... loglikelihood=-6.971218596370116 0.9680851063829787
72: ... loglikelihood=-6.948169682672612 0.9680851063829787
73: ... loglikelihood=-6.925333746962407 0.9680851063829787
74: ... loglikelihood=-6.902704984236104 0.9680851063829787
75: ... loglikelihood=-6.880277901137151 0.9680851063829787
76: ... loglikelihood=-6.858047293352859 0.9680851063829787
77: ... loglikelihood=-6.836008224953014 0.9680851063829787
78: ... loglikelihood=-6.814156009481834 0.9680851063829787
79: ... loglikelihood=-6.792486192635316 0.9680851063829787
80: ... loglikelihood=-6.7709945363736495 0.9680851063829787
81: ... loglikelihood=-6.749677004334072 0.9680851063829787
82: ... loglikelihood=-6.728529748423526 0.9680851063829787
83: ... loglikelihood=-6.70754909648284 0.9680851063829787
84: ... loglikelihood=-6.686731540925016 0.9680851063829787
85: ... loglikelihood=-6.66607372826015 0.9680851063829787
86: ... loglikelihood=-6.645572449428095 0.9680851063829787
87: ... loglikelihood=-6.6252246308677805 0.9680851063829787
88: ... loglikelihood=-6.605027326259023 0.9680851063829787
89: ... loglikelihood=-6.584977708878869 0.9680851063829787
90: ... loglikelihood=-6.565073064520063 0.9680851063829787
91: ... loglikelihood=-6.545310784924165 0.9680851063829787
92: ... loglikelihood=-6.525688361686311 0.9680851063829787
93: ... loglikelihood=-6.5062033805926305 0.9680851063829787
94: ... loglikelihood=-6.486853516354962 0.9680851063829787
95: ... loglikelihood=-6.46763652771055 0.9680851063829787
96: ... loglikelihood=-6.448550252857571 0.9680851063829787
97: ... loglikelihood=-6.429592605199842 0.9680851063829787
98: ... loglikelihood=-6.4107615693763105 0.9680851063829787
99: ... loglikelihood=-6.392055197553354 0.9680851063829787
100: ... loglikelihood=-6.373471605959484 0.9680851063829787
Writing tokenizer model ... done (0.013s)
Wrote tokenizer model to path: /Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/./token_model.bin
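Once the model file is written, you can sanity-check it from the command line. The 'opennlp TokenizerME' tool loads a tokenizer model and tokenizes text read from standard input; the command below is a usage sketch based on the standard OpenNLP CLI, not output reproduced from this training run, and 'test_sentences.txt' is a hypothetical file containing raw sentences, one per line.

opennlp TokenizerME ./token_model.bin < test_sentences.txt

The tokens produced by the trained model are written to standard output, separated by whitespace.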
Using the Java Training API
The following is the complete working application.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Objects;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TokenizerTrainer {

    /**
     * @param inputFile contains training data
     * @param modelFile generated model file after training
     * @throws IOException
     */
    public static void trainModel(String inputFile, String modelFile) throws IOException {
        // Fail fast if either path is null
        Objects.requireNonNull(inputFile);
        Objects.requireNonNull(modelFile);

        Charset charset = Charset.forName("UTF-8");

        // Read the training file line by line and convert each line into a TokenSample
        MarkableFileInputStreamFactory factory =
                new MarkableFileInputStreamFactory(new File(inputFile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

        TokenizerModel model;
        try {
            // Language "en", empty abbreviation dictionary, alphanumeric optimization disabled
            TokenizerFactory tokenizerFactory =
                    new TokenizerFactory("en", new Dictionary(), false, null);
            model = TokenizerME.train(sampleStream, tokenizerFactory,
                    TrainingParameters.defaultParams());
        } finally {
            sampleStream.close();
        }

        // Serialize the trained model to disk
        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}
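The trainer above uses TrainingParameters.defaultParams(), which corresponds to the defaults visible in the CLI log (100 iterations, cutoff of 5). If you want different settings, you can build the parameters yourself. The helper below is only a sketch; the iteration and cutoff values in it are illustrative, not taken from this post.

import opennlp.tools.util.TrainingParameters;

public class CustomTrainingParams {

    // Returns training parameters with explicit settings; the values are illustrative only.
    public static TrainingParameters customParams() {
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "200"); // maximum number of training iterations
        params.put(TrainingParameters.CUTOFF_PARAM, "3");       // minimum number of times a feature must occur
        return params;
    }
}

Pass the result of customParams() to TokenizerME.train(...) in place of TrainingParameters.defaultParams(). The Main class below then drives the training end to end.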
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        String inputFile = "/Users/harikrishna_gurram/input.txt";
        String modelFile = "/Users/harikrishna_gurram/model_sample";
        TokenizerTrainer.trainModel(inputFile, modelFile);
    }
}
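Once the model file has been written, it can be loaded and used for tokenization with TokenizerME. The class below is a minimal sketch: it reuses the model path from Main, while the sample sentence is just a placeholder.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerTester {
    public static void main(String[] args) throws IOException {
        // Path to the model produced by TokenizerTrainer
        String modelFile = "/Users/harikrishna_gurram/model_sample";

        try (InputStream modelIn = new FileInputStream(modelFile)) {
            TokenizerModel model = new TokenizerModel(modelIn);
            Tokenizer tokenizer = new TokenizerME(model);

            // Tokenize a placeholder sentence and print one token per line
            String[] tokens = tokenizer.tokenize("John, 25 years old, joined xyz organization.");
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }
}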