Thursday 1 October 2015

OpenNLP: Document Categorizer Training


OpenNLP provides a way to train a model that categorizes a given set of documents.

$ opennlp DoccatTrainer
Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-tokenizer tokenizer] [-featureGenerators fg] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
 -factory factoryName
  A sub-class of DoccatFactory where to get implementation and resources.
 -tokenizer tokenizer
  Tokenizer implementation. WhitespaceTokenizer is used if not specified.
 -featureGenerators fg
  Comma separated feature generator classes. Bag of words is used if not specified.
 -params paramsFile
  training parameters file.
 -lang language
  language which is being processed.
 -model modelFile
  output model file.
 -data sampleData
  data to be used, usually a file name.
 -encoding charsetName
  encoding for reading and writing text, if absent the system default is used.
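
The two defaults mentioned above (whitespace tokenization and bag-of-words features) can be pictured with a small plain-Java sketch. This is only an illustration of the idea, not OpenNLP's implementation: the class name BagOfWordsSketch and the "bow=" feature prefix are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; it does not use or reproduce OpenNLP classes.
final class BagOfWordsSketch {

    // Splits the text on runs of whitespace, as a whitespace tokenizer would.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Emits one feature per token. The "bow=" prefix is an assumed naming
    // scheme chosen for this sketch. Word order carries no meaning here;
    // only the presence of each token matters.
    static List<String> features(String[] tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add("bow=" + token);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(features(tokenize("Suitable for low quantity of clothes")));
    }
}
```

Because each token becomes an independent feature, short phrases such as the ones in the training data further down are enough to produce usable samples.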

The following command trains a model:
opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8

en-doccat.train file format
Each line holds one document: the category, then whitespace, then the text of the document.

history some text
history some text
politics some text
politics some text
…….
…….
…….

As the file above shows, ‘history’ and ‘politics’ are the two categories.
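
To make the line format concrete, here is a minimal plain-Java sketch that splits one training line into its category and text. The class name TrainingLineSketch is hypothetical and the code does not use OpenNLP; it only mirrors the "category, whitespace, text" layout described above.

```java
// Illustrative sketch only; TrainingLineSketch is a hypothetical helper.
final class TrainingLineSketch {
    final String category;
    final String text;

    private TrainingLineSketch(String category, String text) {
        this.category = category;
        this.text = text;
    }

    // The first whitespace-delimited token is the category;
    // everything after it is the document text.
    static TrainingLineSketch parse(String line) {
        String trimmed = line.trim();
        String[] parts = trimmed.split("\\s+", 2);
        if (parts.length < 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected '<category> <text>' but got: " + line);
        }
        return new TrainingLineSketch(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        TrainingLineSketch sample = parse("history some text");
        System.out.println(sample.category + " -> " + sample.text);
    }
}
```

Since the category ends at the first whitespace, a category name cannot contain spaces, which is why multi-word labels are written with underscores.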

Training using Java API
Suppose the following is the training data.


training_data.txt

washing_machine  Easy to use
washing_machine  Suitable for more variety of cloth types
washing_machine  Suitable for high quantity of clothes
washing_machine  Joint families
washing_machine  Commercial use
washing_machine  washing clothes
washing_machine  Suitable for specific variety of cloth types
washing_machine  Suitable for low quantity of clothes
washing_machine  Lower electricity bills
washing_machine  Small Family
washing_machine  Domestic Use
washing_machine  Suitable for minimum variety of cloth types
washing_machine  Suitable for medium quantity of clothes
washing_machine  Medium families
washing_machine  Domestic use
Bike Reasonable Speed
Bike Good Mileage
Bike No of Gears X
Bike Support Center Close By XXX
Bike For Ladies
Bike City driving
Bike Highway Riding
Bike High performance
Bike High Speed
Mobile Advanced
Mobile Latest phones
Mobile Great features
Mobile High end
Mobile Resolution
Mobile Take photos
Mobile Travel frequently
Mobile Hangout
Mobile Skype
Mobile Rear camera
Mobile Front camera
Mobile Quality pictures
Mobile Least noise
Mobile Play games
Mobile WIFI
Mobile FM
Mobile GPS
Mobile Travel frequently
Mobile More time
Mobile Less charge
Mobile User friendly
Mobile Fastest
Mobile More apps
Mobile Entertainment
Mobile Songs
Mobile Photos
Mobile Movies
Mobile Micro SD
Mobile View photos
Mobile Audio message
Mobile Mail
Mobile Ear phones
Mobile Voice features
Mobile Active voice
Mobile Loud speaker
Mobile FM
Mobile Voice recognition
Mobile Watch movies
Mobile Browse net
Mobile Read books
Mobile Listen music
Mobile MMS
Mobile SMS
TV want to stream movies directly from Internet Wi-Fi connection
Laptop average sound system
Laptop MS office
Laptop browse internet
Laptop reading books
Laptop listening songs
Laptop watching movies
Laptop chatting
Laptop for kids
Laptop service warranty
Laptop security
Laptop water proof
Laptop Desktop replacement
Laptop elder people
Laptop college student
Laptop programming
Laptop and shows
Laptop Watch movies
Laptop view pictures and videos
Laptop listen to music
Laptop audio books
Laptop read comics
Laptop edit documents
Laptop add notes on documents
Laptop purchase newspapers
Laptop download magazines
Laptop browsing books
Laptop reading books
Laptop less weight
Laptop easy to carry
Laptop portable
Laptop watch movies
Laptop reading books
Laptop data warehousing
Laptop spatial data analysis
Laptop document processing
Laptop simulation
Laptop save movies
Laptop Store movies
Laptop data analysis.
Laptop Extreme gaming
Laptop photoshop
Laptop every day usage
Laptop multi media
Laptop High End gaming
Laptop simulation
Laptop Scientific calculations
Laptop Network simulation
Laptop Engineering simulation
Laptop image processing
Laptop Algorithmic computations
Laptop document processing
Laptop Touch
Laptop High visibility
Laptop HD movies
Laptop simulations
Laptop gaming
Laptop elder people
Laptop college student
Laptop programming
Laptop audio chat…
Laptop video calls
Laptop listen songs
Laptop watch movies
Laptop built-in speakers
Laptop easy to carry
Laptop portable
Laptop Home theatre model
Laptop video conference
Laptop broader screen
Laptop watch movies
Laptop portable
Laptop video chatting
sleeping_bags protective bag to sleep
sleeping_bags excellent fit
sleeping_bags backcountry skiing
sleeping_bags Luxurious three-season camping
sleeping_bags Car camping
sleeping_bags short backpacks
sleeping_bags big wall climbing
sleeping_bags high quality materials
sleeping_bags Lightweight and compressible for a synthetic bag
sleeping_bags comfortable
sleeping_bags warm
sleeping_bags best hood design
sleeping_bags waterproof-breathable shell material
sleeping_bags mountaineering
sleeping_bags Backpacking
sleeping_bags alpine climbing
sleeping_bags a warm lined padded bag to sleep in
sleeping_bags durable
sleeping_bags warm
sleeping_bags inexpensive for a down bag
sleeping_bags inexpensive
sleeping_bags Lightweight sleeping bag
sleeping_bags Extremely comfortable
sleeping_bags very comfortable
sleeping_bags child’s sleepovers
sleeping_bags extended wet weather trips
Clothes Belt loops
Clothes Button and zipped closure
Clothes Machine wash warm
Clothes Elasticated waistband
Clothes Cotton
Clothes Machine wash cold
Clothes Pocket at sides
Clothes Do not dry in direct sunlight
Clothes Warm iron
Clothes Wash separately in cold water
Clothes Use mild detergent
Clothes Gentle wash
Clothes Zipped pocket
Clothes Ribbed collar
Clothes Long sleeves
Clothes Full zipper closure
Clothes Polyester blend
Clothes Dry clean
Clothes Stand collar
Clothes Polyester
Clothes Cotton
Clothes Full zip closure
Clothes Machine wash warm
Clothes Machine wash cold
Clothes Short sleeves
Clothes Long sleeves
Clothes Ribbed round neck
Clothes Cotton
Clothes Pockets on the sides
Clothes Machine wash warm
Clothes Machine wash cold
Clothes Polyester
Clothes Machine wash cold
Clothes Machine wash warm
Clothes Henley
Clothes Jersey
Clothes Sleeveless
Clothes Sport
Clothes Casual
Clothes Giza cotton
Clothes Dry clean
Clothes Button cuffs
Clothes Long sleeves
Clothes Formal shirts
Clothes T-shirts
Clothes Machine wash
Clothes Polyester
Clothes Cotton blend
Clothes full button placket
Clothes round neck
Clothes Synthetic
Clothes Hand made
Clothes Ribbed spread collar
Clothes Short sleeves
Clothes Longer back
Clothes Machine wash warm
Clothes Skinny fit
Clothes Slim fit
Clothes Regular fit
Clothes Cotton
Clothes Machine wash cold
Clothes Hand made
Clothes Button and a zip fly closure
Clothes Scoop pockets
Clothes Elastane
Clothes Line dry
Clothes Zip fly
Clothes Darts on the front
Clothes No wash
Clothes Heavily washed
Clothes Lightly washed
kindle play games
kindle download games
kindle dictionary support
kindle read comics
kindle edit documents
kindle add notes on documents
kindle purchase newspapers
kindle download magazines
kindle browsing books
kindle reading books
kindle and shows
kindle Watch movies
kindle view pictures and videos
kindle listen to music
kindle audio books
kindle surf the net
kindle Access twitter
kindle Access facebook
kindle Wi-Fi
kindle check emails
kindle Browse net
TV Normal Cable Content
TV Better picture quality
TV want to play games
TV memory disk
TV want to play movies from my laptop
TV want to play HD movies/channels
TV Small Hall
TV Bedroom
TV Better display
TV Blu Ray Player
TV Want to play Fully HD Movies
TV Home Theater
TV Marriage Halls
TV Spacious Room
TV Big Hall
TV Living Room
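
Before training on a file like this, it can help to tally how many samples each category has. The sketch below is plain Java with no OpenNLP dependency, and the class name CategoryCountSketch is hypothetical. Such a tally makes class imbalance easy to see, and it also exposes labels that differ only in spelling or letter case, which would otherwise be trained as separate categories.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; CategoryCountSketch is a hypothetical helper.
final class CategoryCountSketch {

    // Counts samples per category, where the category is the first
    // whitespace-delimited token of each non-blank line.
    static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String category = trimmed.split("\\s+", 2)[0];
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, tally that file; otherwise use a few sample lines.
        List<String> lines = args.length > 0
            ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
            : Arrays.asList("Bike Good Mileage", "Bike High Speed",
                            "Mobile WIFI", "TV Bedroom");
        countByCategory(lines).forEach(
            (category, count) -> System.out.println(category + "\t" + count));
    }
}
```

Running it with the path to training_data.txt as the only argument prints one line per category with its sample count.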


import java.io.*;
import java.util.Objects;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class CategoryTrainUtil {

 public static void trainModel(String inputFile, String modelFile)
   throws IOException {
  // requireNonNull throws on null; Objects.nonNull only returns a boolean.
  Objects.requireNonNull(inputFile, "inputFile");
  Objects.requireNonNull(modelFile, "modelFile");

  MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(
    new File(inputFile));
  ObjectStream<String> lineStream = new PlainTextByLineStream(
    factory, "UTF-8");

  // Each line becomes one sample: category, whitespace, text.
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(
    lineStream);

  DoccatModel model;
  try {
   model = DocumentCategorizerME.train("en", sampleStream,
     TrainingParameters.defaultParams(), new DoccatFactory());
  } finally {
   // Closing the sample stream also closes the underlying line stream.
   sampleStream.close();
  }

  // try-with-resources flushes and closes the model file.
  // I/O errors propagate to the caller instead of being swallowed.
  try (OutputStream modelOut = new BufferedOutputStream(
    new FileOutputStream(modelFile))) {
   model.serialize(modelOut);
  }
 }
}


import java.io.IOException;

public class Main {
 public static void main(String args[]) throws IOException {
  String modelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/cat_train.bin";
  String inputFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/training_data.txt";

  CategoryTrainUtil.trainModel(inputFile, modelFile);

 }
}


The following code finds the best category for a given text.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class CategoryDetectorUtil {
 private final DocumentCategorizerME myCategorizer;

 public CategoryDetectorUtil(String modelFile) throws IOException {
  // requireNonNull throws on null; Objects.nonNull only returns a boolean.
  Objects.requireNonNull(modelFile, "modelFile");

  // The stream is closed as soon as the model is loaded. A load failure
  // propagates to the caller instead of leaving the categorizer null.
  try (InputStream inputStream = new FileInputStream(modelFile)) {
   myCategorizer = new DocumentCategorizerME(new DoccatModel(inputStream));
  }
 }

 // Returns the most likely category for the given text.
 public String getCategory(String text) {
  double[] outcomes = myCategorizer.categorize(text);
  return myCategorizer.getBestCategory(outcomes);
 }
}


import java.io.IOException;

public class Test {
 public static void main(String args[]) throws IOException {
  String modelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/cat_train.bin";

  CategoryDetectorUtil detector = new CategoryDetectorUtil(modelFile);

  System.out.println(detector.getCategory("read books"));
 }
}


Output

Laptop
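
Conceptually, the categorizer produces one score per category and the best category is the one with the highest score. The plain-Java sketch below illustrates that selection step under this assumption. It is not OpenNLP's source code, the class name BestCategorySketch is hypothetical, and the scores in main are made-up values for illustration, not output from the trained model.

```java
// Illustrative sketch only; it does not call OpenNLP.
final class BestCategorySketch {

    // Returns the category whose score is highest (first one wins on ties).
    static String bestCategory(String[] categories, double[] outcomes) {
        if (categories.length == 0 || categories.length != outcomes.length) {
            throw new IllegalArgumentException(
                "categories and outcomes must be non-empty and of equal length");
        }
        int best = 0;
        for (int i = 1; i < outcomes.length; i++) {
            if (outcomes[i] > outcomes[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        // Made-up scores, chosen only to demonstrate the selection.
        String[] categories = {"Bike", "Laptop", "Mobile"};
        double[] outcomes = {0.2, 0.5, 0.3};
        System.out.println(bestCategory(categories, outcomes));
    }
}
```

Looking at the full score array rather than only the winner is useful when the top two categories are close, since a short input such as "read books" can match samples from more than one category.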



