When building LLM (Large Language Model) applications, one of the first and most important steps is loading and parsing unstructured data from various document formats. LangChain4j simplifies this step with its DocumentParser interface and a set of prebuilt parsers for common file types. This blog post introduces Java developers to the DocumentParser implementations available in LangChain4j and shows how to use them to ingest data from PDF, DOCX, TXT, and other files.
What is a DocumentParser?
A DocumentParser in LangChain4j is an interface responsible for converting raw file content into a standardized Document object that the rest of the LangChain4j ecosystem can work with, including embedding, retrieval, and question answering.
public interface DocumentParser {
    Document parse(InputStream inputStream);
}
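To make the contract concrete, here is a minimal sketch of what implementing such a single-method parser could look like. So that it compiles without the library on the classpath, it uses stand-in types: SketchDocument, SketchDocumentParser, and Utf8SketchParser are hypothetical names invented for this illustration and are not LangChain4j classes. In a real project you would implement dev.langchain4j's DocumentParser and return its Document instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Stand-in for LangChain4j's Document, reduced to its text (assumption for illustration).
final class SketchDocument {
    private final String text;

    SketchDocument(String text) {
        this.text = text;
    }

    String text() {
        return text;
    }
}

// Stand-in that mirrors the single-method contract shown above.
interface SketchDocumentParser {
    SketchDocument parse(InputStream inputStream);
}

// Hypothetical parser: reads the whole stream and decodes it as UTF-8 text.
class Utf8SketchParser implements SketchDocumentParser {
    @Override
    public SketchDocument parse(InputStream inputStream) {
        try {
            byte[] bytes = inputStream.readAllBytes();
            return new SketchDocument(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // The contract has no checked exception, so wrap I/O failures.
            throw new UncheckedIOException(e);
        }
    }
}

class SketchParserDemo {
    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "hello parser".getBytes(StandardCharsets.UTF_8));
        System.out.println(new Utf8SketchParser().parse(in).text());
    }
}
```

The point of the sketch is that a parser only needs an InputStream, so the same implementation can serve files, classpath resources, or network responses.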
The following parsers are built in.
· TextDocumentParser: Parses .txt, .md, .html, and other plain-text files.
· ApachePdfBoxDocumentParser: Parses .pdf files. Add the dependency 'langchain4j-document-parser-apache-pdfbox' to use this parser.
· ApachePoiDocumentParser: Parses .doc, .docx, .ppt, .pptx, .xls, and .xlsx files. It handles MS Office documents using Apache POI. Add the dependency 'langchain4j-document-parser-apache-poi' to use this parser.
· ApacheTikaDocumentParser: Parses almost any file format. Add the dependency 'langchain4j-document-parser-apache-tika' to use this parser.
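The mapping in this list can be sketched as a small lookup from file extension to parser name. This is only an illustration of the selection logic: ParserChooser and parserFor are hypothetical names, the method returns the parser's class name as a string rather than an instance so the sketch runs without any LangChain4j dependency, and falling back to ApacheTikaDocumentParser for unknown extensions is an assumption based on its broad format coverage.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: maps a file name to the name of a suitable built-in parser.
class ParserChooser {
    private static final Set<String> TEXT = Set.of("txt", "md", "html");
    private static final Set<String> OFFICE =
            Set.of("doc", "docx", "ppt", "pptx", "xls", "xlsx");

    static String parserFor(String fileName) {
        // Extract the lower-cased extension; empty if the name has no dot.
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);

        if (TEXT.contains(ext)) {
            return "TextDocumentParser";
        }
        if ("pdf".equals(ext)) {
            return "ApachePdfBoxDocumentParser";
        }
        if (OFFICE.contains(ext)) {
            return "ApachePoiDocumentParser";
        }
        // Assumed fallback: Tika handles the widest range of formats.
        return "ApacheTikaDocumentParser";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"notes.md", "report.PDF", "slides.pptx", "book.epub"}) {
            System.out.println(name + " -> " + parserFor(name));
        }
    }
}
```

In an application you would replace the returned strings with the corresponding parser instances once the matching dependencies are on the classpath.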
The following is a complete working example that loads a text file with TextDocumentParser.
TextDocumentParserDemo.java
package com.sample.app.documentparsers;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;

public class TextDocumentParserDemo {

    public static void main(String[] args) {
        Document document = FileSystemDocumentLoader.loadDocument(
                "/Users/Shared/llm_docs/test.txt", new TextDocumentParser());
        System.out.println(document.text());
    }
}