Sunday, 10 August 2025

Introduction to DocumentTransformer in LangChain4j

In the world of LLM applications, how you prepare your documents often determines the quality of your results. Before feeding documents into an embedding model or a retriever, it’s essential to transform them into a more structured, optimized format. That’s where LangChain4j's DocumentTransformer interface comes in.

This interface is designed to help you apply various transformations to documents, making them cleaner, more relevant, and more informative for downstream tasks such as retrieval-augmented generation (RAG).

 

What Can DocumentTransformer Do?

·      Cleaning: Removes unnecessary or noisy content from the document text, such as boilerplate footers, HTML tags, and redundant line breaks. This reduces token usage and keeps the LLM focused on the content that matters.

·      Filtering: Allows you to programmatically discard documents that don't meet certain criteria, for example documents that are too short, outdated, or irrelevant to your domain.

·      Enriching: Adds extra metadata or context to documents, such as source labels, topic categories, or timestamps. This improves search relevance and enables better ranking or grouping during retrieval.

·      Summarizing: Generates a concise summary of the document and stores it in the document's metadata. This summary can later be attached to each TextSegment (explained later) to enhance context during retrieval.

In addition to transforming document text, DocumentTransformer implementations can add, modify, or remove metadata entries. This is important for setting up searchable tags, user-defined filters, or dynamic content scoring strategies.
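
To make this concrete, below is a minimal sketch of an enriching transformer. It keeps the document text unchanged and attaches a couple of metadata entries; the keys and values ("source", "category") are made up for illustration, and the sketch assumes the Metadata.put(...) and Document.from(text, metadata) methods available in recent LangChain4j versions.

EnrichingTransformerSketch.java

package com.sample.app.document.transformers;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;
import dev.langchain4j.data.document.Metadata;

public class EnrichingTransformerSketch {

	public static void main(String[] args) {

		// Enriching transformer: leaves the text as-is and adds metadata entries
		// that can later be used for filtering, ranking, or grouping during retrieval.
		// The key/value pairs below are hypothetical labels chosen for this example.
		DocumentTransformer enrichingTransformer = document -> {
			Metadata metadata = document.metadata();
			metadata.put("source", "internal-wiki");
			metadata.put("category", "howto");
			return Document.from(document.text(), metadata);
		};

		Document enriched = enrichingTransformer.transform(Document.from("Some document text."));
		System.out.println(enriched.metadata());
	}
}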

 

Here’s what the actual interface looks like.

public interface DocumentTransformer {
    Document transform(Document document);

    default List<Document> transformAll(List<Document> documents) {
        return documents.stream()
                .map(this::transform)
                .filter(Objects::nonNull)
                .collect(toList());
    }
}

·      transform(Document document): Transforms a single document.

·      transformAll(List<Document> documents): Applies the transformation across a batch of documents and filters out null results, which is useful for transformers that may exclude some documents (e.g., filters), as the sketch below illustrates.
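
For example, here is a minimal sketch of such a filtering transformer. The 50-character threshold and the sample documents are made up for illustration: the transformer returns null for documents that are too short, and the default transformAll(...) drops them from the result.

MinLengthFilterSketch.java

package com.sample.app.document.transformers;

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;

public class MinLengthFilterSketch {

	public static void main(String[] args) {

		// Filtering transformer: returns null for documents shorter than 50 characters,
		// so transformAll(...) silently removes them from the resulting list.
		DocumentTransformer minLengthFilter = document -> document.text().length() < 50 ? null : document;

		List<Document> kept = minLengthFilter.transformAll(List.of(
				Document.from("Too short."),
				Document.from("This document is long enough to pass the minimum-length check, so it is kept.")));

		System.out.println(kept.size()); // prints 1
	}
}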

 

Below is a complete working application.

DocumentTransformerDemo.java

package com.sample.app.document.transformers;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;

public class DocumentTransformerDemo {

	public static void main(String[] args) {

		Document doc = Document.from("""
				This is just a simple     example document.
				It is meant to demonstrate a few really basic transformations.
				There are some     unnecessary spaces and very redundant words.
				""");

		// Print the content before transformation
		System.out.println("Before Transformation:\n" + doc.text());

		// Transformer 1: Trim whitespace in each line
		DocumentTransformer trimWhitespaceTransformer = document -> {
			String transformed = Arrays.stream(document.text().split("\n")).map(String::trim)
					.collect(Collectors.joining("\n"));
			return Document.from(transformed);
		};

		// Transformer 2: Convert to lowercase
		DocumentTransformer lowercaseTransformer = document -> {
			String transformed = document.text().toLowerCase();
			return Document.from(transformed);
		};

		// Transformer 3: Remove filler words
		DocumentTransformer removeFillerWordsTransformer = document -> {
			List<String> fillerWords = List.of("just", "really", "very", "actually", "basically", "simply");
			String transformed = Arrays.stream(document.text().split("\\s+"))
					.filter(word -> !fillerWords.contains(word.toLowerCase())).collect(Collectors.joining(" "));
			return Document.from(transformed);
		};

		// Apply all transformations sequentially
		Document transformedDoc = doc;
		transformedDoc = trimWhitespaceTransformer.transform(transformedDoc);
		transformedDoc = lowercaseTransformer.transform(transformedDoc);
		transformedDoc = removeFillerWordsTransformer.transform(transformedDoc);

		// Print the content after transformation
		System.out.println("\nAfter Transformation:\n" + transformedDoc.text());
	}
}

Output

Before Transformation:
This is just a simple     example document.
It is meant to demonstrate a few really basic transformations.
There are some     unnecessary spaces and very redundant words.


After Transformation:
this is a simple example document. it is meant to demonstrate a few basic transformations. there are some unnecessary spaces and redundant words.

Currently, LangChain4j provides a built-in transformer specifically for HTML documents: the HtmlToTextDocumentTransformer, available in the langchain4j-document-transformer-jsoup module.

<dependency>
	<groupId>dev.langchain4j</groupId>
	<artifactId>langchain4j-document-transformer-jsoup</artifactId>
</dependency>

This transformer is designed to parse raw HTML, extract meaningful text content, and optionally extract metadata like the title, headings, etc. It uses the Jsoup library internally for parsing and cleaning up the HTML.

 

HtmlTransformerExample.java

package com.sample.app.document.transformers;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

public class HtmlTransformerExample {

    public static void main(String[] args) {
        String html = """
                <html>
                    <head><title>Welcome</title></head>
                    <body>
                        <h1>Main Heading</h1>
                        <p>This is a <b>sample</b> HTML document.</p>
                        <p>It contains multiple elements like <a href="#">links</a>, paragraphs, and headings.</p>
                    </body>
                </html>
                """;

        // Create a Document from raw HTML
        Document htmlDocument = Document.from(html);

        // Create an instance of the HTML to Text transformer
        HtmlToTextDocumentTransformer transformer = new HtmlToTextDocumentTransformer();

        // Apply the transformation
        Document transformedDocument = transformer.transform(htmlDocument);

        // Print the plain text result
        System.out.println("Extracted Text:\n" + transformedDocument.text());

        // Optionally, access metadata (if configured in transformer)
        System.out.println("\nMetadata:\n" + transformedDocument.metadata());
    }
}

 

Output

Extracted Text:
Welcome  
Main Heading
 
This is a sample HTML document.
 
It contains multiple elements like links, paragraphs, and headings.

Metadata:
Metadata { metadata = {} }

In summary, the DocumentTransformer interface in LangChain4j provides a modular and powerful way to prepare your documents before they are fed into the LLM pipeline. Whether you are cleaning up messy text, enriching content with metadata, or summarizing for better retrieval, these transformers help you get the most out of your documents and, ultimately, your AI system.

 
