Thursday 1 October 2015

OpenNLP: Sentence detection

In this post, I explain how to detect sentences using OpenNLP.

Approach 1: Using the opennlp command line interface.
OpenNLP provides a number of trained models for tokenization, sentence detection, name finding, etc.

Download the ‘en-sent.bin’ model from the following link.

The following command extracts all sentences from a given document.

opennlp SentenceDetector en-sent.bin < input.txt > output.txt

Let's say ‘input.txt’ contains the following data.
We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents. So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately.

The Java Programming Language solves all the above problems. The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.

The following command parses input.txt and writes the detected sentences, one per line, to output.txt.

$ opennlp SentenceDetector ./en-sent.bin < ./input.txt > output.txt
Loading Sentence Detector model ... done (0.037s)


Average: 1333.3 sent/s 
Total: 4 sent
Runtime: 0.003s

$ cat output.txt 
We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents.
So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately.

The Java Programming Language solves all the above problems.
The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.
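
Because the tool writes one sentence per line and keeps the blank line between paragraphs, output.txt is easy to read back from another program. The following is a minimal sketch using only the JDK. The class name SentenceFileReader and the method readSentences are hypothetical names chosen for illustration, and the file is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SentenceFileReader {
    // Reads a file that holds one sentence per line and skips the
    // blank lines that separate paragraphs.
    static List<String> readSentences(Path file) throws IOException {
        List<String> sentences = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: output.txt is in the current directory unless a path is passed
        Path file = Paths.get(args.length > 0 ? args[0] : "output.txt");
        if (!Files.exists(file)) {
            System.out.println("File not found: " + file);
            return;
        }
        List<String> sentences = readSentences(file);
        System.out.println("Sentences read: " + sentences.size());
        for (String s : sentences) {
            System.out.println(s);
        }
    }
}
```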


Approach 2: Using the Java API
a. Load the sentence model.

private SentenceModel initSentenceModel(String file) {
 InputStream modelIn;
 try {
  modelIn = new FileInputStream(file);
 } catch (FileNotFoundException e) {
  System.out.println(e.getMessage());
  return null;
 }

 try {
  model = new SentenceModel(modelIn);
 } catch (IOException e) {
  e.printStackTrace();
 } finally {
  if (modelIn != null) {
   try {
    modelIn.close();
   } catch (IOException e) {
   }
  }
 }
 return model;
}
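
The method above returns null when the model cannot be loaded, which only postpones the failure until the model is used. One possible alternative is to let try-with-resources close the stream and let the IOException propagate to the caller. The following is a minimal sketch of that idea using only the JDK. ModelLoader, StreamReader and load are hypothetical names, and the OpenNLP call is shown only in a comment because it needs the library on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ModelLoader {
    // Hypothetical callback that turns an open stream into a model object.
    interface StreamReader<T> {
        T read(InputStream in) throws IOException;
    }

    // Opens the file, hands the stream to the reader and always closes it.
    // Failures propagate as IOException instead of being turned into null.
    static <T> T load(String file, StreamReader<T> reader) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return reader.read(in);
        }
    }

    public static void main(String[] args) {
        String file = args.length > 0 ? args[0] : "en-sent.bin";
        try {
            // With OpenNLP on the classpath the call would look like:
            // SentenceModel model = load(file, SentenceModel::new);
            Integer firstByte = load(file, in -> in.read());
            System.out.println("First byte of " + file + ": " + firstByte);
        } catch (IOException e) {
            System.out.println("Could not load model: " + e.getMessage());
        }
    }
}
```

With this shape, the caller decides how to react to a missing or corrupt model file, and the stream is closed on both the success and the failure path.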


b. Initialize SentenceDetectorME
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

c. Use the ‘sentDetect’ method to get the sentences.
String[] sentences = sentenceDetector.sentDetect("string of information");
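
‘sentDetect’ returns only the sentence strings. SentenceDetectorME also has a ‘sentPosDetect’ method that returns spans into the input text. When only the String array is available, the positions can be recovered with the JDK alone. The following is a minimal sketch, assuming every returned sentence appears verbatim and in order in the input. SentenceOffsets and locate are hypothetical names, and the array in main is a hand-written stand-in for the detector's result.

```java
import java.util.Arrays;

public class SentenceOffsets {
    // Returns {start, end} (end exclusive) for each sentence, searching
    // left to right so that repeated sentences map to successive positions.
    static int[][] locate(String text, String[] sentences) {
        int[][] offsets = new int[sentences.length][];
        int cursor = 0;
        for (int i = 0; i < sentences.length; i++) {
            int start = text.indexOf(sentences[i], cursor);
            if (start < 0) {
                throw new IllegalArgumentException("Sentence not found in text: " + sentences[i]);
            }
            int end = start + sentences[i].length();
            offsets[i] = new int[] { start, end };
            cursor = end;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "The Java Programming Language solves all the above problems. It is portable.";
        // Stand-in for the array returned by sentenceDetector.sentDetect(text)
        String[] sentences = {
            "The Java Programming Language solves all the above problems.",
            "It is portable."
        };
        for (int[] o : locate(text, sentences)) {
            System.out.println(Arrays.toString(o) + " " + text.substring(o[0], o[1]));
        }
    }
}
```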

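The following is the complete working application. It consists of three classes, FileUtils, SentenceDetectorUtil and Main, each in its own source file.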

import static java.nio.file.Files.readAllBytes;
import static java.nio.file.Paths.get;

import java.io.IOException;
import java.util.Objects;

public class FileUtils {
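 /**
  * Reads the whole file and returns its content as a string.
  * 
  * @param fileName path of the file to read
  * @return the file content, or null if the file cannot be read
  */
 public static String getFileDataAsString(String fileName) {
  // requireNonNull throws NullPointerException; nonNull only returns a boolean
  Objects.requireNonNull(fileName);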
  try {
   String data = new String(readAllBytes(get(fileName)));
   return data;
  } catch (IOException e) {
   System.out.println(e.getMessage());
   return null;
  }
 }
}

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectorUtil {
 private SentenceModel model = null;
 SentenceDetectorME sentenceDetector = null;

 public SentenceDetectorUtil(String modelFile) {
  Objects.requireNonNull(modelFile);
  initSentenceModel(modelFile);
  initSentenceDetectorME();
 }

 private void initSentenceDetectorME() {
  sentenceDetector = new SentenceDetectorME(model);
 }

 private SentenceModel initSentenceModel(String file) {
  InputStream modelIn;
  try {
   modelIn = new FileInputStream(file);
  } catch (FileNotFoundException e) {
   System.out.println(e.getMessage());
   return null;
  }

  try {
   model = new SentenceModel(modelIn);
  } catch (IOException e) {
   e.printStackTrace();
  } finally {
   if (modelIn != null) {
    try {
     modelIn.close();
    } catch (IOException e) {
    }
   }
  }
  return model;
 }

 public String[] getSentencesFromFile(String inputFile) {
  String data = FileUtils.getFileDataAsString(inputFile);
  return sentenceDetector.sentDetect(data);
 }

 public String[] getSentences(String data) {
  return sentenceDetector.sentDetect(data);
 }

}

public class Main {
 public static void main(String args[]) {
  SentenceDetectorUtil util = new SentenceDetectorUtil(
    "/Users/harikrishna_gurram/study1/OpenNLP/apache-opennlp-1.6.0/bin/models/en-sent.bin");

  String data = "We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents. So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately. The Java Programming Language solves all the above problems. The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.";

  String[] sentences = util.getSentences(data);

  for (String s : sentences)
   System.out.println(s +"\n");
 }
}


Output

We are living in an Environment, where multiple Hardware Architectures and Multiple platforms presents.

So it is very difficult to write, compile and link the same Application, for each platform and each Architecture separately.

The Java Programming Language solves all the above problems.

The Java programming language platform provides a portable, interpreted, high-performance, simple, object-oriented programming language and supporting run-time environment.
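
To store the result of Approach 2 in a file with one sentence per line, similar to what the command line tool writes, the array can be passed to Files.write. The following is a minimal sketch using only the JDK. SentenceFileWriter, writeSentences and the file name sentences.txt are hypothetical, and the array in main is a stand-in for the result of util.getSentences(data).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class SentenceFileWriter {
    // Writes each sentence on its own line, encoded as UTF-8.
    static void writeSentences(Path target, String[] sentences) throws IOException {
        Files.write(target, Arrays.asList(sentences), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the array returned by util.getSentences(data)
        String[] sentences = { "First sentence.", "Second sentence." };
        Path target = Paths.get("sentences.txt");
        writeSentences(target, sentences);
        System.out.println("Wrote " + sentences.length + " sentences to " + target);
    }
}
```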


