Thursday, 1 October 2015

OpenNLP: Name Finder Training

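OpenNLP provides the 'TokenNameFinderTrainer' tool to train a name finder model. The documentation recommends at least 15000 sentences to create a model that performs well.

The training data must contain one sentence per line, with tokens separated by whitespace and each name wrapped in <START:type> and <END> tags, as shown below.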

<START:person> Hari Krishna <END> , 27 years old , working in xyz organisation
Mr . <START:person> Phalgun <END> is chairman of abcd organization
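
If the sentences are already tokenized and the name positions are known, lines in this format can be generated rather than typed by hand. The following is a minimal sketch that uses only the Java standard library; `TrainingLineFormatter` and `annotate` are hypothetical names introduced for illustration and are not part of OpenNLP. It assumes one name span per sentence, with an exclusive end index.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of OpenNLP): renders a tokenized sentence as a training line. */
final class TrainingLineFormatter {

    /**
     * Wraps tokens[start..end) in <START:type> ... <END> and joins all
     * tokens with single spaces. The end index is exclusive.
     */
    static String annotate(String[] tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.length || start >= end) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) {
                line.add("<START:" + type + ">");   // opening tag goes before the first name token
            }
            line.add(tokens[i]);
            if (i == end - 1) {
                line.add("<END>");                  // closing tag goes after the last name token
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr", ".", "Phalgun", "is", "chairman", "of", "abcd", "organization"};
        // prints: Mr . <START:person> Phalgun <END> is chairman of abcd organization
        System.out.println(annotate(tokens, 2, 3, "person"));
    }
}
```

Sentences with several names would need a list of spans instead of a single start and end; the sketch leaves that out to stay short.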

Running the tool without arguments prints its usage.

$ opennlp TokenNameFinderTrainer
Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] [-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
 -factory factoryName
  A sub-class of TokenNameFinderFactory
 -resources resourcesDir
  The resources directory
 -type modelType
  The type of the token name finder model
 -featuregen featuregenFile
  The feature generator descriptor file
 -nameTypes types
  name types to use for training
 -sequenceCodec codec
  sequence codec used to code name spans
 -params paramsFile
  training parameters file.
 -lang language
  language which is being processed.
 -model modelFile
  output model file.
 -data sampleData
  data to be used, usually a file name.
 -encoding charsetName
  encoding for reading and writing text, if absent the system default is used.
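
The file passed with -params is a plain Java properties file. As a rough sketch, it could be written from Java as shown here. The key names "Algorithm", "Iterations" and "Cutoff" are assumed to match the ones OpenNLP reads in the version you use, and the values are only illustrative; `ParamsFileWriter` and the file name `ner-params.txt` are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper (not part of OpenNLP): writes a training parameters file. */
final class ParamsFileWriter {

    /** Builds the key/value pairs; key names are assumed to match OpenNLP's. */
    static Properties buildParams(int iterations, int cutoff) {
        Properties params = new Properties();
        params.setProperty("Algorithm", "MAXENT");
        params.setProperty("Iterations", Integer.toString(iterations));
        params.setProperty("Cutoff", Integer.toString(cutoff));
        return params;
    }

    /** Stores the properties as key=value lines in the target file. */
    static void write(Properties params, Path target) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            params.store(out, "Training parameters (illustrative values)");
        }
    }

    public static void main(String[] args) throws IOException {
        write(buildParams(100, 5), Paths.get("ner-params.txt"));
    }
}
```

The resulting file can then be supplied to the trainer by adding -params ner-params.txt to the command line.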

Use the following command to train the model.
opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data en-ner-person.train -encoding UTF-8
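
Before starting a long training run, it can be worth checking the training file for obvious problems. The sketch below, using only the Java standard library, counts the sentences and flags lines whose <START:...> and <END> tags do not pair up. `TrainingDataCheck` is a hypothetical helper, not an OpenNLP class, and the check is deliberately simple: it assumes tags are separated from tokens by whitespace, as in the sample lines above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical pre-training sanity check (not part of OpenNLP). */
final class TrainingDataCheck {

    /** Returns true if every start tag on the line is closed by <END> before the next start tag. */
    static boolean isWellFormed(String line) {
        boolean open = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START") && token.endsWith(">")) {
                if (open) {
                    return false;   // nested or repeated start tag
                }
                open = true;
            } else if (token.equals("<END>")) {
                if (!open) {
                    return false;   // end tag without a start tag
                }
                open = false;
            }
        }
        return !open;               // a span left open is an error
    }

    /** Counts non-empty lines, i.e. sentences. */
    static long countSentences(List<String> lines) {
        return lines.stream().filter(l -> !l.trim().isEmpty()).count();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (int i = 0; i < lines.size(); i++) {
            if (!isWellFormed(lines.get(i))) {
                System.out.println("Unbalanced tags on line " + (i + 1));
            }
        }
        System.out.println("Sentences: " + countSentences(lines));
    }
}
```

Running it with the training file as the only argument, for example en-ner-person.train, prints the sentence count, which can be compared against the recommended minimum of 15000.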

Using the Java API
The following is a complete working application.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class NameFinderTrainUtil {

 /**
  * Trains a person name finder on inputFile and writes the model to modelFile.
  */
 public static void trainModel(String inputFile, String modelFile)
   throws IOException {
  Charset charset = Charset.forName("UTF-8");

  // Read the training file line by line; each line is one annotated sentence.
  MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(
    new File(inputFile));
  ObjectStream<String> lineStream = new PlainTextByLineStream(factory,
    charset);

  // Parse the <START:type> ... <END> annotations into NameSample objects.
  ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
    lineStream);

  TokenNameFinderModel model;

  // Train with the default parameters, then release the input stream.
  try {
   model = NameFinderME.train("en", "person", sampleStream,
     TrainingParameters.defaultParams(),
     new TokenNameFinderFactory());
  } finally {
   sampleStream.close();
  }

  // Serialize the trained model; try-with-resources closes the stream.
  try (OutputStream modelOut = new BufferedOutputStream(
    new FileOutputStream(modelFile))) {
   model.serialize(modelOut);
  }
 }
}


