Programming for beginners: Transforming TextSegments in LangChain4j: Enhance Retrieval Quality with Custom Transformers

When building retrieval-augmented generation (RAG) systems, it's not just how you chunk documents that matters, but also how you transform and enrich those chunks, known as TextSegments. LangChain4j provides the TextSegmentTransformer interface, which lets you customize these transformations. A custom transformer can improve retrieval relevance, reduce noise, and add useful context to each chunk (such as a title or summary).

This post shows how to implement a custom TextSegmentTransformer that enriches segments with the document title and filters out segments that are too short to be useful.

Why Use TextSegmentTransformer?

In LangChain4j, documents are split into smaller pieces called TextSegments. These are the units used for retrieval when querying your vector store. Before storing them, it's often beneficial to:

· Clean up text (e.g., remove boilerplate)

· Filter out irrelevant or tiny segments

· Enrich with metadata (e.g., document title or summary)

The TextSegmentTransformer interface makes all of these easy to do.
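
To make the three operations concrete before bringing in the library, here is a minimal sketch in plain Java. The Segment class and the helper methods (clean, dropIfShort, prependTitle, process) are hypothetical stand-ins invented for this illustration; they are not LangChain4j types, and the whitespace-collapsing cleanup is just one assumed example of "removing noise".

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Illustrative stand-in types only; these are NOT LangChain4j classes.
class SegmentPipelineSketch {

    // Minimal stand-in for a text segment: some text plus string metadata.
    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Clean up: collapse runs of whitespace and trim both ends.
    static Segment clean(Segment s) {
        return new Segment(s.text.replaceAll("\\s+", " ").trim(), s.metadata);
    }

    // Filter: return null when the text is shorter than minLength characters.
    static Segment dropIfShort(Segment s, int minLength) {
        return s.text.length() < minLength ? null : s;
    }

    // Enrich: prepend the "title" metadata value when one is present.
    static Segment prependTitle(Segment s) {
        String title = s.metadata.get("title");
        return title == null ? s : new Segment(title + ": " + s.text, s.metadata);
    }

    // Run clean -> filter -> enrich over every segment, discarding nulls.
    static List<Segment> process(List<Segment> segments, int minLength) {
        return segments.stream()
                .map(SegmentPipelineSketch::clean)
                .map(s -> dropIfShort(s, minLength))
                .filter(Objects::nonNull)
                .map(SegmentPipelineSketch::prependTitle)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Segment> input = List.of(
                new Segment("  RAG   pipelines retrieve text\nbefore generating.  ",
                        Map.of("title", "RAG Basics")),
                new Segment("   tiny   ", Map.of("title", "RAG Basics")));

        // Only the first segment survives the length filter.
        process(input, 20).forEach(s -> System.out.println(s.text));
    }
}
```

Cleaning runs before filtering in this sketch so that padding whitespace does not count toward the minimum length; that ordering is a design choice of the example, not something the library prescribes.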

The complete working application is shown below.

TextSegmentTransformerDemo.java

package com.sample.app.textsegment;

import java.util.List;

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.data.segment.TextSegmentTransformer;

public class TextSegmentTransformerDemo {

	public static class TitlePrependingSegmentTransformer implements TextSegmentTransformer {

		@Override
		public TextSegment transform(TextSegment segment) {
			// Filter out short segments: transformAll() drops any segment for which transform() returns null
			String originalText = segment.text();
			if (originalText == null || originalText.length() < 20) {
				return null;
			}

			// Get the document title from metadata, if available
			String title = segment.metadata().getString("title");
			if (title == null || title.isEmpty()) {
				title = "Untitled";
			}

			// Prepend the title to the text
			String enrichedText = title + ": " + originalText;

			// Return a new TextSegment with enriched text and same metadata
			return new TextSegment(enrichedText, segment.metadata());
		}
	}

	public static void main(String[] args) {
		List<TextSegment> originalSegments = List.of(
				new TextSegment("Intro to LangChain4j.", Metadata.from("title", "LangChain4j Overview")),
				new TextSegment("Short", Metadata.from("title", "LangChain4j Overview")),
				new TextSegment("It supports document parsing, segmentation, and transformation.",
						Metadata.from("title", "LangChain4j Overview")));

		TextSegmentTransformer transformer = new TitlePrependingSegmentTransformer();
		List<TextSegment> transformed = transformer.transformAll(originalSegments);

		transformed.forEach(segment -> System.out.println(segment.text()));
	}

}

Output

LangChain4j Overview: Intro to LangChain4j.
LangChain4j Overview: It supports document parsing, segmentation, and transformation.
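
The "Short" segment is missing from the output because transform returned null for it, while "Intro to LangChain4j." is 21 characters long and so just clears the 20-character threshold. The sketch below imitates that null-means-drop contract in plain Java. TextMapper and NullFilteringSketch are hypothetical names made up for this illustration and operate on plain strings; this is not the library's source code, only an assumed approximation of how a default transformAll method can discard nulls.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical stand-in interface, written for illustration only.
interface TextMapper {

    // Return the transformed text, or null to signal "drop this item".
    String transform(String text);

    // Apply transform to each item and discard the nulls.
    default List<String> transformAll(List<String> texts) {
        return texts.stream()
                .map(this::transform)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }
}

class NullFilteringSketch {

    // Same rule as the demo: keep text of 20+ characters and prepend a title.
    static final TextMapper TITLE_PREPENDER =
            text -> text.length() < 20 ? null : "LangChain4j Overview: " + text;

    public static void main(String[] args) {
        List<String> result = TITLE_PREPENDER.transformAll(
                List.of("Intro to LangChain4j.", "Short"));

        // Prints: LangChain4j Overview: Intro to LangChain4j.
        result.forEach(System.out::println);
    }
}
```

Because the interface has a single abstract method, a lambda is enough to implement it, and callers never see the nulls as long as they go through transformAll.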

In summary, LangChain4j gives you full control over how your data flows through a RAG system. With TextSegmentTransformer, you can fine-tune the quality of the segments used for retrieval by filtering, enriching, or reshaping them.

Friday, 29 August 2025
