When building retrieval-augmented generation (RAG) systems, how you chunk documents matters, and so does how you transform and enrich the resulting chunks, which LangChain4j calls TextSegments. LangChain4j provides the TextSegmentTransformer interface for customizing these transformations. Used well, it can improve search results, reduce noise, and add useful context to each chunk, such as a title or summary.
This post shows how to implement a custom TextSegmentTransformer that enriches segments with the document title and filters out segments that are too short to be useful.
Why Use TextSegmentTransformer?
In LangChain4j, documents are split into smaller pieces called TextSegments. These are the units used for retrieval when querying your vector store. Before storing them, it's often beneficial to:
· Clean up text (e.g., remove boilerplate)
· Filter out irrelevant or tiny segments
· Enrich with metadata (e.g., document title or summary)
The TextSegmentTransformer interface makes this easy to do.
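As a rough illustration of the clean-up step from the list above, here is a minimal JDK-only sketch of a helper that a transform method could call on segment.text() before building the new segment. The class name SegmentTextCleaner and the boilerplate prefixes are hypothetical and are not part of LangChain4j; adjust them to whatever your documents actually contain.

```java
import java.util.List;
import java.util.stream.Collectors;

class SegmentTextCleaner {

    // Hypothetical boilerplate markers, chosen only for this illustration.
    private static final List<String> BOILERPLATE_PREFIXES =
            List.of("Copyright", "All rights reserved", "Page ");

    /** Drops blank and boilerplate lines, then collapses whitespace runs into single spaces. */
    static String clean(String text) {
        if (text == null) {
            return "";
        }
        return text.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .filter(line -> BOILERPLATE_PREFIXES.stream().noneMatch(line::startsWith))
                .map(line -> line.replaceAll("\\s+", " "))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String raw = "Page 3\nLangChain4j   supports\tsegment transformation.\nCopyright 2024 Example Corp";
        // Prints: LangChain4j supports segment transformation.
        System.out.println(clean(raw));
    }
}
```

Keeping the cleaning logic in a plain static method makes it easy to unit test without any LangChain4j types; the transformer itself then only has to wrap the cleaned text in a new TextSegment.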
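Below is a complete working application. The transformer returns null for any segment shorter than 20 characters and prepends the document title to the rest.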
TextSegmentTransformerDemo.java
package com.sample.app.textsegment;

import java.util.List;

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.data.segment.TextSegmentTransformer;

public class TextSegmentTransformerDemo {

    public static class TitlePrependingSegmentTransformer implements TextSegmentTransformer {

        @Override
        public TextSegment transform(TextSegment segment) {
            // Filter out short segments
            String originalText = segment.text();
            if (originalText == null || originalText.length() < 20) {
                return null;
            }

            // Get the document title from metadata, if available
            String title = segment.metadata().getString("title");
            if (title == null || title.isEmpty()) {
                title = "Untitled";
            }

            // Prepend the title to the text
            String enrichedText = title + ": " + originalText;

            // Return a new TextSegment with enriched text and same metadata
            return new TextSegment(enrichedText, segment.metadata());
        }
    }

    public static void main(String[] args) {
        List<TextSegment> originalSegments = List.of(
                new TextSegment("Intro to LangChain4j.",
                        Metadata.from("title", "LangChain4j Overview")),
                new TextSegment("Short",
                        Metadata.from("title", "LangChain4j Overview")),
                new TextSegment("It supports document parsing, segmentation, and transformation.",
                        Metadata.from("title", "LangChain4j Overview")));

        TextSegmentTransformer transformer = new TitlePrependingSegmentTransformer();
        List<TextSegment> transformed = transformer.transformAll(originalSegments);

        transformed.forEach(segment -> System.out.println(segment.text()));
    }
}
Output
LangChain4j Overview: Intro to LangChain4j.
LangChain4j Overview: It supports document parsing, segmentation, and transformation.
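Only two of the three input segments appear in the output: "Short" is 5 characters long, so transform returns null for it, and the result of transformAll does not contain it. "Intro to LangChain4j." is 21 characters long, so it just clears the 20-character threshold. The sketch below reproduces that behavior with JDK types only, assuming the batch method simply drops null results, which is consistent with the output above. StringTransformer and NullFilteringSketch are hypothetical stand-ins for illustration, not LangChain4j classes.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class NullFilteringSketch {

    // Hypothetical stand-in for a segment transformer; NOT the LangChain4j interface.
    interface StringTransformer {
        String transform(String text);

        // Applies transform to every element and drops the null results.
        default List<String> transformAll(List<String> texts) {
            return texts.stream()
                    .map(this::transform)
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList());
        }
    }

    static final int MIN_LENGTH = 20;

    // Returns null for short texts (so they are dropped), otherwise prepends a fixed title.
    static List<String> run(List<String> texts) {
        StringTransformer transformer =
                text -> text.length() < MIN_LENGTH ? null : "LangChain4j Overview: " + text;
        return transformer.transformAll(texts);
    }

    public static void main(String[] args) {
        List<String> result = run(List.of(
                "Intro to LangChain4j.",
                "Short",
                "It supports document parsing, segmentation, and transformation."));
        // Prints two lines; the "Short" entry is absent.
        result.forEach(System.out::println);
    }
}
```

Returning null as a "drop this segment" signal keeps filtering and enrichment in one method, at the cost of callers of transform having to handle a null result themselves when they do not go through transformAll.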
In summary, LangChain4j gives you control over how your data flows through a RAG system. With TextSegmentTransformer, you can fine-tune the quality of the segments used for retrieval by filtering, enriching, or reshaping them before they are stored.