Programming for beginners: PDFBox: Get metadata of PDF document

Metadata is information that describes the document itself, such as its creation date, title, and author.

Accessing basic metadata using PDDocumentInformation

The PDDocumentInformation class provides getter methods that extract information about the PDF document.

Method	Description
public String getTitle()	Returns the title of the document, or null if no title is set.
public String getAuthor()	Returns the author of the document, or null if no author is set.
public String getSubject()	Returns the subject of the document, or null if no subject is set.
public String getKeywords()	Returns the keywords of the document, or null if no keywords are set.
public String getCreator()	Returns the creator of the document, or null if no creator is set.
public String getProducer()	Returns the producer of the document, or null if no producer is set.
public Calendar getCreationDate()	Returns the creation date of the document, or null if no creation date is set.
public Calendar getModificationDate()	Returns the modification date of the document, or null if no modification date is set.
public String getTrapped()	Returns the trapped value of the document, or null if no trapped value is set.

The following statements print the metadata of a PDF document.

// try-with-resources closes the document automatically when the block exits.
try (PDDocument pdDoc = PDDocument.load(new File("/Users/harikrishna_gurram/Downloads/Saurabh.pdf"))) {
 PDDocumentInformation info = pdDoc.getDocumentInformation();
 System.out.println( "Title=" + info.getTitle() );
 System.out.println( "Author=" + info.getAuthor() );
 System.out.println( "Subject=" + info.getSubject() );
 System.out.println( "Keywords=" + info.getKeywords() );
 System.out.println( "Creator=" + info.getCreator() );
 System.out.println( "Producer=" + info.getProducer() );
 System.out.println( "Creation Date=" + info.getCreationDate() );
 System.out.println( "Modification Date=" + info.getModificationDate());
 System.out.println( "Trapped=" + info.getTrapped() );
}
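The date getters return a java.util.Calendar, which may be null and whose default string form is long and hard to read. The sketch below shows one way to turn such a value into ISO-8601 text. MetadataDateFormatter and the "(not set)" placeholder are hypothetical names chosen for illustration; they are not part of PDFBox.

```java
import java.time.format.DateTimeFormatter;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class MetadataDateFormatter {

    // Hypothetical helper: converts a possibly-null Calendar into ISO-8601 text.
    static String format(Calendar date) {
        if (date == null) {
            return "(not set)";
        }
        // Use the time zone stored in the Calendar instead of the JVM default.
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(
                date.toInstant().atZone(date.getTimeZone().toZoneId()));
    }

    public static void main(String[] args) {
        // Sample value standing in for info.getCreationDate().
        Calendar sample = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        sample.clear();
        sample.set(2016, Calendar.MARCH, 7, 10, 30, 0);
        System.out.println("Creation Date=" + format(sample));
        System.out.println("Modification Date=" + format(null));
    }
}
```

With a helper like this, the print statements can pass info.getCreationDate() and info.getModificationDate() through format(...) before concatenating them.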

In addition to the methods above, the PDDocumentInformation class provides the getMetadataKeys method, which returns the keys of all metadata information fields in the document.

public Set<String> getMetadataKeys()

The following method uses getMetadataKeys to collect every metadata field of a document into a map.

public static Optional<Map<String, Object>> getDocumentBasicMetaData(final String fileName) {
 if (Objects.isNull(fileName)) {
  throw new NullPointerException("fileName shouldn't be null");
 }

 try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
  PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
  Set<String> keys = docInfo.getMetadataKeys();

  Map<String, Object> map = new HashMap<>();

  for (String key : keys) {
   map.put(key, docInfo.getPropertyStringValue(key));
  }

  return Optional.of(map);

 } catch (IOException e) {
  // Report the failure instead of swallowing it silently.
  System.out.println(e.getMessage());
  return Optional.empty();
 }
}
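A caller of getDocumentBasicMetaData has to handle two cases: an empty Optional when the file could not be read, and null values for keys that carry no printable value. A minimal sketch of such a caller follows. MetadataPrinter, describe, and the placeholder strings are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.TreeMap;

final class MetadataPrinter {

    // Hypothetical helper: renders the metadata map as sorted "key=value" lines.
    static List<String> describe(Optional<Map<String, Object>> metadata) {
        if (!metadata.isPresent()) {
            return Collections.singletonList("(document could not be read)");
        }
        List<String> lines = new ArrayList<>();
        // TreeMap sorts the keys so the output order is stable between runs.
        for (Map.Entry<String, Object> entry : new TreeMap<>(metadata.get()).entrySet()) {
            lines.add(entry.getKey() + "=" + Objects.toString(entry.getValue(), "(not set)"));
        }
        return lines;
    }
}
```

For example, describe(getDocumentBasicMetaData(fileName)) yields one line per metadata key, ready to be printed.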

PDF documents can have XML metadata associated with them. The following classes are used to extract the XML metadata.

PDDocumentCatalog

PDPage

PDXObject

PDICCBased

PDStream

The following snippet gets the catalog metadata from PDDocumentCatalog.

public static Optional<List<String>> getCatalogMetaData(final String fileName) {
 if (Objects.isNull(fileName)) {
  throw new NullPointerException("fileName shouldn't be null");
 }

 try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
  PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();
  PDMetadata metadata = catalog.getMetadata();
  return getMeatData(metadata);
 } catch (IOException e) {
  System.out.println(e.getMessage());
  return Optional.empty();
 }

}

private static Optional<List<String>> getDataFromStream(InputStream in) {

 try (BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
  List<String> data = new ArrayList<>();
  String str;

  while ((str = br.readLine()) != null) {
   data.add(str);
  }
  return Optional.of(data);
 } catch (IOException e) {
  System.out.println(e.getMessage());
  return Optional.empty();
 }

}

private static Optional<List<String>> getMeatData(PDMetadata metadata) {
 if (metadata == null) {
  System.out.println("There is no meta data associated");
  return Optional.empty();
 }

 try (InputStream in = metadata.createInputStream()) {
  return getDataFromStream(in);
 } catch (IOException e) {
  return Optional.empty();
 }
}
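The helpers above return the XML metadata as a list of raw lines. To read a single field, the lines can be joined and parsed with the JDK's XML parser. The sketch below assumes the stream is an XMP packet that stores the title in a Dublin Core dc:title element; XmpTitleReader and readTitle are hypothetical names, and real documents may store the title elsewhere or not at all.

```java
import java.io.StringReader;
import java.util.List;
import java.util.Optional;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmpTitleReader {

    // Namespace URI of the Dublin Core element set.
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Hypothetical helper: extracts the text of the first dc:title element, if any.
    static Optional<String> readTitle(List<String> xmpLines) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(String.join("\n", xmpLines))));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return Optional.empty();
            }
            String text = titles.item(0).getTextContent().trim();
            return text.isEmpty() ? Optional.empty() : Optional.of(text);
        } catch (Exception e) {
            // Malformed or non-XML input: treat it as "no title available".
            return Optional.empty();
        }
    }
}
```

A call such as getCatalogMetaData(fileName).flatMap(XmpTitleReader::readTitle) would then give the title when one is present.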

The following snippet gets the metadata of a PDF page.

public static Optional<List<String>> getPDPageMetaData(final String fileName, int pageIndex) {
 if (Objects.isNull(fileName)) {
  throw new NullPointerException("fileName shouldn't be null");
 }

 if (pageIndex < 1) {
  throw new IllegalArgumentException("pageIndex must be >= 1");
 }

 try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {

  if (pageIndex > pdDoc.getNumberOfPages()) {
   throw new IllegalArgumentException("pageIndex : " + pageIndex
     + " must be <= " + pdDoc.getNumberOfPages());
  }

  // pageIndex is 1-based here, but PDDocument.getPage expects a 0-based index.
  PDPage pdPage = pdDoc.getPage(pageIndex - 1);
  PDMetadata metadata = pdPage.getMetadata();
  return getMeatData(metadata);
 } catch (IOException e) {
  System.out.println(e.getMessage());
  return Optional.empty();
 }
}
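PDDocument.getPage takes a 0-based index, while the method above accepts page numbers starting at 1, so the range check and the conversion belong together. One way to keep them in a single place is sketched below. PageNumbers and toZeroBasedIndex are hypothetical names, not PDFBox API.

```java
final class PageNumbers {

    // Hypothetical helper: validates a 1-based page number against the page
    // count and converts it to the 0-based index used by PDDocument.getPage(int).
    static int toZeroBasedIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "pageNumber " + pageNumber + " must be between 1 and " + pageCount);
        }
        return pageNumber - 1;
    }
}
```

Inside the try block, the lookup could then read pdDoc.getPage(PageNumbers.toZeroBasedIndex(pageIndex, pdDoc.getNumberOfPages())).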

The following is the complete working application.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextStripperUtil {

 /**
  * @param fileName
  * @return complete file data as string
  * @throws NullPointerException
  *             if fileName is null
  */
 public static Optional<String> getDataAsString(final String fileName) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {

   PDFTextStripper stripper = new PDFTextStripper();
   stripper.setLineSeparator("\n");
   stripper.setAddMoreFormatting(true);

   return Optional.of(stripper.getText(pdDoc));

  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }

 }

 public static Optional<String> getDataAsString(final String fileName,
   final int startPage, final int endPage) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  if (startPage < 1 || endPage < 1 || endPage < startPage) {
   throw new IllegalArgumentException(
     "startPage, endPage must >= 1 and  endPage >= startPage");
  }

  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {

   PDFTextStripper stripper = new PDFTextStripper();
   stripper.setLineSeparator("\n");
   stripper.setAddMoreFormatting(true);
   stripper.setStartPage(startPage);
   stripper.setEndPage(endPage);

   return Optional.of(stripper.getText(pdDoc));

  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }
 }

 public static Optional<String> getDataAsStringFromStartPage(
   String fileName, int startPage) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  if (startPage < 1) {
   throw new IllegalArgumentException("startPage must >= 1");
  }

  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
   int noOfPages = pdDoc.getNumberOfPages();
   return getDataAsString(fileName, startPage, noOfPages);

  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }
 }

 public static Optional<String> getDataAsStringTillEndPage(String fileName,
   int endPage) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  if (endPage < 1) {
   throw new IllegalArgumentException("endPage must >= 1");
  }

  return getDataAsString(fileName, 1, endPage);
 }

 public static Optional<Integer> getNumberOfPages(String fileName) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }
  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
   return Optional.of(pdDoc.getNumberOfPages());
  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }
 }

 public static Optional<Map<String, Object>> getDocumentBasicMetaData(
   final String fileName) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
   PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
   Set<String> keys = docInfo.getMetadataKeys();

   Map<String, Object> map = new HashMap<>();

   for (String key : keys) {
    map.put(key, docInfo.getPropertyStringValue(key));
   }

   return Optional.of(map);

  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }
 }

 public static Optional<List<String>> getCatalogMetaData(
   final String fileName) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
   PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();
   PDMetadata metadata = catalog.getMetadata();
   return getMeatData(metadata);
  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }

 }

 private static Optional<List<String>> getDataFromStream(InputStream in) {

  try (BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
   List<String> data = new ArrayList<>();
   String str;

   while ((str = br.readLine()) != null) {
    data.add(str);
   }
   return Optional.of(data);
  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }

 }

 private static Optional<List<String>> getMeatData(PDMetadata metadata) {
  if (metadata == null) {
   System.out.println("There is no meta data associated");
   return Optional.empty();
  }

  try (InputStream in = metadata.createInputStream()) {
   return getDataFromStream(in);
  } catch (IOException e) {
   return Optional.empty();
  }
 }

 public static Optional<List<String>> getPDPageMetaData(
   final String fileName, int pageIndex) {
  if (Objects.isNull(fileName)) {
   throw new NullPointerException("fileName shouldn't be null");
  }

  if (pageIndex < 1) {
   throw new IllegalArgumentException("pageIndex must be >= 1");
  }

  try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {

   if (pageIndex > pdDoc.getNumberOfPages()) {
    throw new IllegalArgumentException("pageIndex : " + pageIndex
      + " must be <= " + pdDoc.getNumberOfPages());
   }

   // pageIndex is 1-based here, but PDDocument.getPage expects a 0-based index.
   PDPage pdPage = pdDoc.getPage(pageIndex - 1);
   PDMetadata metadata = pdPage.getMetadata();
   return getMeatData(metadata);
  } catch (IOException e) {
   System.out.println(e.getMessage());
   return Optional.empty();
  }
 }

}


Monday, 7 March 2016
