OpenNLP provides a way to train a model that categorizes a given set of documents.
$ opennlp DoccatTrainer
Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-tokenizer tokenizer] [-featureGenerators fg] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
    -factory factoryName
        A sub-class of DoccatFactory where to get implementation and resources.
    -tokenizer tokenizer
        Tokenizer implementation. WhitespaceTokenizer is used if not specified.
    -featureGenerators fg
        Comma separated feature generator classes. Bag of words is used if not specified.
    -params paramsFile
        training parameters file.
    -lang language
        language which is being processed.
    -model modelFile
        output model file.
    -data sampleData
        data to be used, usually a file name.
    -encoding charsetName
        encoding for reading and writing text, if absent the system default is used.
The following command trains a model:
opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8
Format of the en-doccat.train file
One document per line, containing the category and the text separated by whitespace.
history some text
history some text
politics some text
politics some text
…….
As you can see in the file above, ‘history’ and ‘politics’ are the two categories.
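To make the line format concrete, here is a minimal sketch that splits one training line into its category and its text at the first run of whitespace. TrainingLineParser and parse are hypothetical names made up for illustration; they are not part of OpenNLP, which does its own parsing when it reads the file.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class TrainingLineParser {

    // Hypothetical helper (not an OpenNLP API): splits "<category> <text>"
    // at the first run of whitespace and returns (category, text).
    public static Map.Entry<String, String> parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            // A line with a category but no text is not a usable sample.
            throw new IllegalArgumentException("Expected '<category> <text>': " + line);
        }
        return new SimpleEntry<>(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> sample = parse("politics some text");
        // Prints: politics -> some text
        System.out.println(sample.getKey() + " -> " + sample.getValue());
    }
}
```

Because only the first token is the category, a category name cannot contain spaces, which is presumably why the training data further down uses names such as washing_machine.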
Training using Java API
Suppose the following is my training data.
training_data.txt
washing_machine Easy to use
washing_machine Suitable for more variety of cloth types
washing_machine Suitable for high quantity of clothes
washing_machine Joint families
washing_machine Commercial use
washing_machine washing clothes
washing_machine Suitable for specific variety of cloth types
washing_machine Suitable for low quantity of clothes
washing_machine Lower electricity bills
washing_machine Small Family
washing_machine Domestic Use
washing_machine Suitable for minimum variety of cloth types
washing_machine Suitable for medium quantity of clothes
washing_machine Medium families
washing_machine Domestic use
Bike Reasonable Speed
Bike Good Mileage
Bike No of Gears X
Bike Support Center Close By XXX
Bike For Ladies
Bike City driving
Bike Highway Riding
Bike High performance
Bike High Speed
Mobile Advanced
Mobile Latest phones
Mobile Great features
Mobile High end
Mobile Resolution
Mobile Take photos
Mobile Travel frequently
Mobile Hangout
Mobile Skype
Mobile Rear camera
Mobile Front camera
Mobile Quality pictures
Mobile Least noise
Mobile Play games
Mobile WIFI
Mobile FM
Mobile GPS
Mobile Travel frequently
Mobile More time
Mobile Less charge
Mobile User friendly
Mobile Fastest
Mobile More apps
Mobile Entertainment
Mobile Songs
Mobile Photos
Mobile Movies
Mobile Micro SD
Mobile View photos
Mobile Audio message
Mobile Mail
Mobile Ear phones
Mobile Voice features
Mobile Active voice
Mobile Loud speaker
Mobile FM
Mobile Voice recognition
Mobile Watch movies
Mobile Browse net
Mobile Read books
Mobile Listen music
Mobile MMS
Mobile SMS
Tv want to stream movies directly from Internet Wi-Fi connection
Laptop average sound system
Laptop MS office
Laptop browse internet
Laptop reading books
Laptop listening songs
Laptop watching movies
Laptop chatting
Laptop for kids
Laptop service warranty
Laptop security
Laptop water proof
Laptop Desktop replacement
Laptop elder people
Laptop college student
Laptop programming
Laptop and shows
Laptop Watch movies
Laptop view pictures and videos
Laptop listen to music
Laptop audio books
Laptop read comics
Laptop edit documents
Laptop add notes on documents
Laptop purchase newspapers
Laptop download magazines
Laptop browsing books
Laptop reading books
Laptop less weight
Laptop easy to carry
Laptop portable
Laptop watch movies
Laptop reading books
Laptop data warehousing
Laptop spatial data analysis
Laptop document processing
Laptop simulation
Laptop save movies
Laptop Store movies
Laptop data analysis.
Laptop Extreme gaming
Laptop photoshop
Laptop every day usage
Laptop multi media
Laptop High End gaming
Laptop simulation
Laptop Scientific calculations
Laptop Network simulation
Laptop Engineering simulation
Laptop image processing
Laptop Algoritmic computations
Laptop document processing
Laptop Touch
Laptop High visibility
Laptop HD movies
Laptop simulations
Laptop gaming
Laptop elder people
Laptop college student
Laptop programming
Laptop audio chat…
Laptop video calls
Laptop listen songs
Laptop watch movies
Laptop pre in buit speakers
Laptop easy to carry
Laptop portable
Laptop Home theatre model
Laptop video conference
Laptop broader screen
Laptop watch movies
Laptop portable
Laptop video chatting
sleeping_bags protective bag to sleep
sleeping_bags excellent fit
sleeping_bags backcountry skiing
sleeping_bags Luxurious three-season camping
sleeping_bags Car camping
sleeping_bags short backpacks
sleeping_bags big wall climbing
sleeping_bags high quality materials
sleeping_bags Lightweight and compressible for a synthetic bag
sleeping_bags comfortable
sleeping_bags warm
sleeping_bags best hood design
sleeping_bags waterproof-breathable shell material
sleeping_bags mountaineering
sleeping_bags Backpacking
sleeping_bags alpine climbing
sleeping_bags a warm lined padded bag to sleep in
sleeping_bags durable
sleeping_bags warm
sleeping_bags inexpensive for a down bag
sleeping_bags inexpensive
sleeping_bags Lightweight sleeping bag
sleeping_bags Extremely comfortable
sleeping_bags very comfortable
sleeping_bags child’s sleepovers
sleeping_bags extended wet weather trips
Clothes Belt loops
Clothes Button and zipped closure
Clothes Machine wash warm
Clothes Elasticated waistband
Clothes Cotton
Clothes Machine wash cold
Clothes Pocket at sides
Clothes Do not dry in direct sunlight
Clothes Warm iron
Clothes Wash separately in cold water
Clothes Use mild detergent
Clothes Gentle wash
Clothes Zipped pocket
Clothes Ribbed collar
Clothes Llong sleeves
Clothes Full zipper closure
Clothes Polyester blend
Clothes Dry clean
Clothes Stand collar
Clothes Polyester
Clothes Cotton
Clothes Full zip closure
Clothes Machine wash warm
Clothes Machine wash cold
Clothes Short sleeves
Clothes Long sleeves
Clothes Ribbed round neck
Clothes Cotton
Clothes Pockets on the sides
Clothes Machine wash warm
Clothes Machine wash cold
Clothes Polyester
Clothes Machine wash cold
Clothes Machine wash warm
Clothes Henley
Clothes Jersey
Clothes Sleeveless
Clothes Sport
Clothes Casual
Clothes Giza cotton
Clothes Dry clean
Clothes Button cuffs
Clothes Long sleeves
Clothes Formal shirts
Clothes T-shirts
Clothes Machine wash
Clothes Polyester
Clothes Cotton blend
Clothes full button placket
Clothes round neck
Clothes Synthetic
Clothes Hand made
Clothes Ribbed spread collar
Clothes Short sleeves
Clothes Longer back
Clothes Machine wash warm
Clothes Skinny fit
Clothes Slim fit
Clothes Regular fit
Clothes Cotton
Clothes Machine wash cold
Clothes Hand made
Clothes Button and a zip fly closure
Clothes Scoop pockets
Clothes Elastane
Clothes Line dry
Clothes Zip fly
Clothes Darts on the front
Clothes No wash
Clothes Heavily washed
Clothes Lightly washed
kindle play games
kindle download ganes
kindle dictionary support
kindle read comics
kindle edit documents
kindle add notes on documents
kindle purchase newspapers
kindle download magazines
kindle browsing books
kindle reading books
kindle and shows
kindle Watch movies
kindle view pictures and videos
kindle listen to music
kindle audio books
kindle surf the net
kindle Access twitter
kindle Access facebook
kindle Wi-Fi
kindle check emails
kindle Browse net
TV Normal Cable Content
TV Better picture quality
TV want to play games
TV memory disk
TV want to play movies from my laptop
TV want to play HD movies/channels
TV Small Hall
TV Bedroom
TV Better display
TV Blu Ray Player
TV Want to play Fully HD Movies
TV Home Theater
TV Marriage Halls
TV Spacious Room
TV Big Hall
TV Living Room
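Before training, it can help to check how many samples each category has. Since the category is simply the first token on a line, differently cased labels such as Tv and TV are presumably treated as two separate categories. The sketch below counts samples per label; CategoryCounter and countByCategory are hypothetical names for illustration, not OpenNLP classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CategoryCounter {

    // Hypothetical helper (not an OpenNLP API): counts samples per category,
    // where the category is the first whitespace-delimited token of a line.
    public static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String category = trimmed.split("\\s+", 2)[0];
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three lines taken from the training data above.
        List<String> lines = Arrays.asList(
                "TV Better picture quality",
                "Tv want to stream movies directly from Internet Wi-Fi connection",
                "TV Living Room");
        // Prints: {TV=2, Tv=1}
        System.out.println(countByCategory(lines));
    }
}
```

To run it against the whole file, the list could be loaded with Files.readAllLines(Paths.get("training_data.txt"), StandardCharsets.UTF_8).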
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Objects;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class CategoryTrainUtil {

    public static void trainModel(String inputFile, String modelFile) throws IOException {
        // Objects.nonNull only returns a boolean; requireNonNull rejects null arguments.
        Objects.requireNonNull(inputFile);
        Objects.requireNonNull(modelFile);

        // Read the training file line by line: one document per line.
        MarkableFileInputStreamFactory factory =
                new MarkableFileInputStreamFactory(new File(inputFile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, "UTF-8");
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        DoccatModel model;
        try {
            // Train with the default parameters and the default factory.
            model = DocumentCategorizerME.train("en", sampleStream,
                    TrainingParameters.defaultParams(), new DoccatFactory());
        } finally {
            // Closing the sample stream also closes the underlying line stream.
            sampleStream.close();
        }

        // Write the model to disk; try-with-resources flushes and closes the stream.
        try (OutputStream modelOut =
                new BufferedOutputStream(new FileOutputStream(new File(modelFile)))) {
            model.serialize(modelOut);
        }
    }
}
import java.io.IOException;

public class Main {

    public static void main(String[] args) throws IOException {
        String modelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/cat_train.bin";
        String inputFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/training_data.txt";

        CategoryTrainUtil.trainModel(inputFile, modelFile);
    }
}
The following code finds the best category for a given text.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class CategoryDetectorUtil {

    private DoccatModel docCatModel;
    private DocumentCategorizerME myCategorizer;

    public CategoryDetectorUtil(String modelFile) throws IOException {
        // Objects.nonNull only returns a boolean; requireNonNull rejects a null path.
        Objects.requireNonNull(modelFile);
        initModel(modelFile);
    }

    private void initModel(String modelFile) throws IOException {
        // Load the model and close the file as soon as loading is done.
        // Failures propagate instead of leaving the categorizer null.
        try (InputStream inputStream = new FileInputStream(modelFile)) {
            docCatModel = new DoccatModel(inputStream);
            myCategorizer = new DocumentCategorizerME(docCatModel);
        }
    }

    public String getCategory(String text) {
        // One score per category; OpenNLP 1.6.0 accepts the raw text here.
        double[] outcomes = myCategorizer.categorize(text);
        // Pick the category with the highest score.
        return myCategorizer.getBestCategory(outcomes);
    }
}
import java.io.IOException;

public class Test {

    public static void main(String[] args) throws IOException {
        String modelFile = "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/cat_train.bin";

        CategoryDetectorUtil detector = new CategoryDetectorUtil(modelFile);
        System.out.println(detector.getCategory("read books"));
    }
}
Output
Laptop
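The categorizer returns one score per category, and getBestCategory picks the category with the highest score. Conceptually that selection is an argmax over the scores, which the sketch below shows without OpenNLP. BestCategorySketch and bestCategory are hypothetical names, and the scores are made up for illustration; they are not the values the trained model actually produced.

```java
public class BestCategorySketch {

    // Hypothetical helper (not OpenNLP code): returns the label whose score is highest.
    // On ties, the earliest label wins.
    public static String bestCategory(String[] categories, double[] outcomes) {
        if (categories.length == 0 || categories.length != outcomes.length) {
            throw new IllegalArgumentException("Need one score per category");
        }
        int best = 0;
        for (int i = 1; i < outcomes.length; i++) {
            if (outcomes[i] > outcomes[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"Laptop", "kindle", "Mobile"};
        double[] outcomes = {0.5, 0.3, 0.2}; // made-up scores, for illustration only
        // Prints: Laptop
        System.out.println(bestCategory(categories, outcomes));
    }
}
```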