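The 'opennlp' command-line tool provides the 'TokenNameFinderTrainer' tool to train a name finder model. The documentation recommends at least 15000 sentences of training data to create a model that performs well.
The training data contains one tokenized sentence per line, with each name wrapped in <START:type> and <END> tags, as shown below.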
<START:person> Hari Krishna <END> , 27 years old , working in xyz organisation
Mr . <START:person> Phalgun <END> is chairman of abcd organization
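If the sentences are already tokenized and the name positions are known, lines in this format can be generated rather than typed by hand. The following is a minimal sketch that uses only the JDK; the class TrainingLineBuilder and its annotate method are hypothetical helpers written for this illustration and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

public class TrainingLineBuilder {

    // Hypothetical helper (not an OpenNLP API): wraps the tokens in
    // [start, end) with <START:type> ... <END> and joins everything
    // with single spaces, matching the sample lines above.
    public static String annotate(String[] tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.length || start >= end) {
            throw new IllegalArgumentException("Invalid span: " + start + ".." + end);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) {
                out.add("<START:" + type + ">");
            }
            out.add(tokens[i]);
            if (i == end - 1) {
                out.add("<END>");
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr", ".", "Phalgun", "is", "chairman", "of", "abcd", "organization"};
        // The name is the single token at index 2, so the span is [2, 3).
        System.out.println(annotate(tokens, 2, 3, "person"));
    }
}
```

The sketch handles one name per sentence; sentences with several names would need a list of spans instead of a single start and end.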
$ opennlp TokenNameFinderTrainer
Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] [-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
        -factory factoryName
                A sub-class of TokenNameFinderFactory
        -resources resourcesDir
                The resources directory
        -type modelType
                The type of the token name finder model
        -featuregen featuregenFile
                The feature generator descriptor file
        -nameTypes types
                name types to use for training
        -sequenceCodec codec
                sequence codec used to code name spans
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -model modelFile
                output model file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
Use the following command to train an English model on en-ner-person.train and write it to en-ner-person.bin.
opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data en-ner-person.train -encoding UTF-8
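Hand-edited training files can easily end up with a missing <END> or a stray <START:...> tag, so it may be worth checking the file before training. The following is a rough sketch that uses only the JDK; TrainingDataCheck and countSpans are hypothetical names for this illustration, not OpenNLP classes, and the check assumes that tags are whitespace-separated tokens and that spans are never nested.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class TrainingDataCheck {

    // Matches <START> or <START:type> as a whole token.
    private static final Pattern START = Pattern.compile("<START(:[^:>\\s]*)?>");

    // Returns the number of name spans on the line, or -1 if the
    // START/END tags are unbalanced or nested.
    public static int countSpans(String line) {
        boolean open = false;
        int spans = 0;
        for (String token : line.trim().split("\\s+")) {
            if (START.matcher(token).matches()) {
                if (open) {
                    return -1; // a second START before the previous END
                }
                open = true;
            } else if (token.equals("<END>")) {
                if (!open) {
                    return -1; // END without a preceding START
                }
                open = false;
                spans++;
            }
        }
        return open ? -1 : spans;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java TrainingDataCheck <trainingFile>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        int sentences = 0;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue; // blank lines are not counted as sentences
            }
            sentences++;
            if (countSpans(line) < 0) {
                System.out.println("Malformed annotation on line " + (i + 1));
            }
        }
        System.out.println(sentences + " sentences found");
    }
}
```

Running it as "java TrainingDataCheck en-ner-person.train" also reports the sentence count, which can be compared against the 15000-sentence recommendation.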
Using Java API
The following is a complete working example.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class NameFinderTrainUtil {

    public static void trainModel(String inputFile, String modelFile)
            throws IOException {
        Charset charset = Charset.forName("UTF-8");

        // Read the training file line by line and parse each line into a NameSample.
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(
                new File(inputFile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        // Train an English "person" model with the default training parameters.
        TokenNameFinderModel model;
        try {
            model = NameFinderME.train("en", "person", sampleStream,
                    TrainingParameters.defaultParams(),
                    new TokenNameFinderFactory());
        } finally {
            sampleStream.close();
        }

        // Write the trained model to the output file.
        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}