LangChain4j is a Java framework for building LLM-powered applications. One of its most useful utilities is FileSystemDocumentLoader, which loads documents directly from the file system for further processing and indexing. In this post, we will explore how FileSystemDocumentLoader works, the methods it offers for loading documents, and how you can customize document parsing with your own DocumentParser.
What is FileSystemDocumentLoader?
The FileSystemDocumentLoader class offers static methods to load documents from a file path given as a String or a Path object. It reads the file content and parses it into LangChain4j's Document model.
You can load a single Document from a file path, optionally with a custom DocumentParser, using the following methods:
public static Document loadDocument(String filePath)
public static Document loadDocument(Path filePath)
public static Document loadDocument(Path filePath, DocumentParser documentParser)
public static Document loadDocument(String filePath, DocumentParser documentParser)
If you call loadDocument with just the file path, the loader uses the default DocumentParser. LangChain4j looks up the default parser through the Java SPI (Service Provider Interface) mechanism, so a parser module on the classpath can be picked up without modifying the loader code. If no parser is registered through SPI, recent versions fall back to TextDocumentParser, which reads the file as plain text.
You can specify your own DocumentParser implementation to handle file formats in a custom way.
How Document Parsing Works
At the core of parsing is the DocumentParser interface.
public interface DocumentParser {
    Document parse(InputStream inputStream);
}
It takes an InputStream opened on the file, parses the content, and returns a Document object.
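To make the interface concrete, here is a minimal sketch of a custom parser. WhitespaceNormalizingParser is a hypothetical class written for this post, not part of LangChain4j. It assumes the file is UTF-8 text, Java 11 or later, and the langchain4j core library on the classpath.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical parser for illustration: reads UTF-8 text and collapses runs of whitespace.
class WhitespaceNormalizingParser implements DocumentParser {

    @Override
    public Document parse(InputStream inputStream) {
        try {
            String raw = new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
            String normalized = raw.trim().replaceAll("\\s+", " ");
            // Document.from rejects blank text, so an empty file surfaces as an exception here.
            return Document.from(normalized);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small throwaway file so the example does not depend on a fixed path.
        Path file = Files.createTempFile("sample", ".txt");
        Files.writeString(file, "  Hello,\n\n   LangChain4j!  ");

        // Pass the custom parser explicitly instead of relying on the default one.
        Document document = FileSystemDocumentLoader.loadDocument(file, new WhitespaceNormalizingParser());
        System.out.println(document.text());     // Hello, LangChain4j!
        System.out.println(document.metadata());
    }
}
```

The same parser instance can be passed to any of the directory-level methods described in the following sections.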
Loading Multiple Documents from a Directory
LangChain4j’s FileSystemDocumentLoader also provides convenient methods to load all documents within a specified folder. This makes it easy to bulk ingest content from a directory without handling individual files yourself.
You can load all documents in a folder using the following overloaded methods:
public static List<Document> loadDocuments(Path directoryPath)
public static List<Document> loadDocuments(String directoryPath)
public static List<Document> loadDocuments(Path directoryPath, DocumentParser documentParser)
public static List<Document> loadDocuments(String directoryPath, DocumentParser documentParser)
Simply provide the directory path (as a String or Path) and the loader will process every file directly inside that folder using the default parser. These methods do not descend into subfolders, and files that fail to load are skipped rather than aborting the whole call. If you want to apply a custom parsing strategy to all files, pass your own DocumentParser instance to control how each document is read and converted.
Filtering Documents Using PathMatcher
Sometimes, you may want to load only a subset of documents from a folder, for example, only .txt files or files matching a specific naming pattern. The loader supports this by accepting a PathMatcher, which acts as a filter to select files that meet certain criteria.
public static List<Document> loadDocuments(Path directoryPath, PathMatcher pathMatcher)
public static List<Document> loadDocuments(String directoryPath, PathMatcher pathMatcher)
public static List<Document> loadDocuments(Path directoryPath, PathMatcher pathMatcher, DocumentParser documentParser)
public static List<Document> loadDocuments(String directoryPath, PathMatcher pathMatcher, DocumentParser documentParser)
You can also specify a DocumentParser to handle filtered files with your preferred parsing logic.
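A PathMatcher is a plain JDK object, so you can try out a pattern before handing it to the loader. The sketch below uses only java.nio.file and does not call LangChain4j. GlobMatcherSketch and its matches helper are hypothetical names for illustration. As I read the loader's Javadoc, the non-recursive methods test the matcher against the file name and the recursive ones against the path relative to the directory; treat that as an assumption to verify for your version.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper for experimenting with glob patterns (JDK only).
class GlobMatcherSketch {

    // Returns true when the glob pattern matches the given (relative) path.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // "*" stays within one path segment, so it fits file-name matching.
        System.out.println(matches("*.txt", "notes.txt"));       // true
        System.out.println(matches("*.txt", "notes.md"));        // false
        System.out.println(matches("*.txt", "sub/notes.txt"));   // false

        // "**" crosses directory boundaries, which matters for nested files.
        System.out.println(matches("**.txt", "sub/notes.txt"));  // true

        // Braces select among alternatives.
        System.out.println(matches("*.{txt,md}", "notes.md"));   // true
    }
}
```

A matcher built the same way, for example FileSystems.getDefault().getPathMatcher("glob:*.txt"), is what you would pass as the pathMatcher argument.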
Recursive Loading of Documents in Subdirectories
In many real-world scenarios, documents are organized into nested folders. LangChain4j’s loader lets you recursively load documents from a directory and all its subdirectories through a similar set of overloaded methods.
public static List<Document> loadDocumentsRecursively(Path directoryPath)
public static List<Document> loadDocumentsRecursively(String directoryPath)
public static List<Document> loadDocumentsRecursively(String directoryPath, DocumentParser documentParser)
public static List<Document> loadDocumentsRecursively(Path directoryPath, DocumentParser documentParser)
public static List<Document> loadDocumentsRecursively(Path directoryPath, PathMatcher pathMatcher)
public static List<Document> loadDocumentsRecursively(String directoryPath, PathMatcher pathMatcher)
public static List<Document> loadDocumentsRecursively(String directoryPath, PathMatcher pathMatcher, DocumentParser documentParser)
public static List<Document> loadDocumentsRecursively(Path directoryPath, PathMatcher pathMatcher, DocumentParser documentParser)
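To see which files a flat load and a recursive load would each reach, the following JDK-only sketch compares Files.list with Files.walk on a small temporary tree. It illustrates the difference in traversal and does not call LangChain4j or claim to mirror its internals. TraversalSketch and its methods are hypothetical names, and Java 11 or later is assumed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class: flat versus recursive directory traversal (JDK only).
class TraversalSketch {

    // Files directly inside dir; subfolders are not entered (Files.list).
    static List<String> flat(Path dir) {
        try (Stream<Path> paths = Files.list(dir)) {
            return relativeNames(dir, paths);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Files in dir and in every nested subfolder (Files.walk).
    static List<String> recursive(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return relativeNames(dir, paths);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Keeps regular files only and reports them relative to dir, sorted for stable output.
    private static List<String> relativeNames(Path dir, Stream<Path> paths) {
        return paths.filter(Files::isRegularFile)
                .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
    }

    // Builds a throwaway tree for the demo: a.txt and sub/b.txt.
    static Path sampleTree() {
        try {
            Path root = Files.createTempDirectory("llm_docs");
            Files.writeString(root.resolve("a.txt"), "top-level file");
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.writeString(sub.resolve("b.txt"), "nested file");
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = sampleTree();
        System.out.println(flat(root));      // [a.txt]
        System.out.println(recursive(root)); // [a.txt, sub/b.txt]
    }
}
```

On a tree like this, loadDocuments would be expected to return only the top-level file, while loadDocumentsRecursively would also pick up the nested one.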
Below is a complete working example.
FileSystemDocumentLoaderDemo.java
package com.sample.app.documentloaders;

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;

public class FileSystemDocumentLoaderDemo {

    public static void main(String[] args) {
        // Load every file directly inside the folder using the default parser.
        List<Document> documents = FileSystemDocumentLoader.loadDocuments("/Users/Shared/llm_docs/");

        // Print the metadata the loader attached to each document.
        for (Document document : documents) {
            System.out.println(document.metadata());
        }
    }
}
In summary, FileSystemDocumentLoader provides flexible methods for bulk-ingesting documents from folders. You can:
· Load all documents in a folder with or without a custom parser.
· Filter which documents to load by specifying a PathMatcher.
· Recursively traverse folders and subfolders to load nested documents.
· Combine filtering and custom parsing seamlessly during recursive loading.
Together, these methods give you the building blocks for document ingestion pipelines in LangChain4j applications.