Thursday, 28 August 2025

Efficient Prompting with Text Segmentation in LangChain4j

When working with Large Language Models (LLMs), feeding the entire document or knowledge base into the prompt is not always practical or even possible. That's where text segmentation comes in. LangChain4j offers a powerful abstraction to break down large documents into smaller, manageable chunks using the TextSegment class.

Understanding how and why to segment your documents is important for building high-performance, cost-effective LLM applications, especially for search, retrieval-augmented generation (RAG), or document question-answering use cases.

 

1. What is a TextSegment in LangChain4j?

In LangChain4j, once you’ve loaded your documents using tools like FileSystemDocumentLoader, you’ll typically want to split these documents into smaller parts, known as text segments.
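
As a quick reminder, the loading step typically looks like the snippet below. This is a minimal sketch: the directory path is a placeholder, and the exact loader overloads vary slightly across LangChain4j versions.

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;

// Load every parseable file in the directory into Document objects.
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");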

 

LangChain4j provides an API called TextSegment to represent these chunks. A TextSegment encapsulates only textual content (not images or other rich media), optionally enriched with metadata, and is the unit that gets embedded, retrieved, and ultimately passed to the LLM.

 

TextSegmentDemo.java

package com.sample.app.textsegment;

import java.util.HashMap;
import java.util.Map;

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;

public class TextSegmentDemo {

	public static void main(String[] args) {
		Map<String, Object> metadata = new HashMap<>();
		metadata.put("title", "Demo of TextSegment");

		TextSegment segment = TextSegment.from("This is a chunk of my document", Metadata.from(metadata));

		System.out.println(segment.text());
		System.out.println(segment.metadata());

	}

}

Output

This is a chunk of my document
Metadata { metadata = {title=Demo of TextSegment} }

Why Segment Documents?

Here are some practical reasons why segmenting documents is essential.

 

·      LLM Context Window Limitations: LLMs can only process a limited number of tokens at once (e.g., 8K, 32K, 128K). Splitting allows you to fit relevant segments within this constraint.

·      Speed and Efficiency: A smaller prompt means faster processing. Why send 50 pages when 3 lines are enough?

·      Cost Control: Most LLM providers charge based on tokens. Segmenting allows you to reduce cost by sending only what’s necessary (see the rough token estimate after this list).

·      Improved Relevance: By filtering out irrelevant segments, you reduce noise and help the LLM focus on what matters, leading to better responses.

·      Traceability and Explainability: Smaller, focused segments make it easier to trace back how a specific answer was generated.
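
To make the token and cost points concrete, here is a rough, self-contained estimate. The 4-characters-per-token ratio is only a popular rule of thumb, not an exact tokenizer, and the page and line sizes are assumptions.

TokenEstimateDemo.java

package com.sample.app.textsegment;

public class TokenEstimateDemo {

	// Rough heuristic: English text averages about 4 characters per token.
	private static int estimateTokens(String text) {
		return (int) Math.ceil(text.length() / 4.0);
	}

	public static void main(String[] args) {
		String fiftyPages = "x".repeat(50 * 3000); // assume ~3000 characters per page
		String threeLines = "x".repeat(3 * 80); // assume ~80 characters per line

		System.out.println("Whole document : ~" + estimateTokens(fiftyPages) + " tokens");
		System.out.println("Relevant lines : ~" + estimateTokens(threeLines) + " tokens");
	}

}

Output

Whole document : ~37500 tokens
Relevant lines : ~60 tokens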

 

Let’s see a simple transformation from a full document to TextSegment objects using a naive splitter (you can later use more advanced strategies like recursive or semantic chunking).

 

Example

List<TextSegment> splitIntoSegments(Document document) {
	String[] paragraphs = document.text().split("\n");
	return Arrays.stream(paragraphs).map(TextSegment::from).collect(Collectors.toList());
}

 

DocumentSegmentation.java

package com.sample.app.textsegment;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.segment.TextSegment;

public class DocumentSegmentation {
	private static final String CONTENT = """
			LangChain4j is a powerful Java framework that helps developers integrate Large Language Models (LLMs) into their applications. It offers components like document loaders, text segmenters, retrievers, and chat interfaces, making it easy to build LLM-powered apps in Java.
			Text segmentation is a crucial step in building scalable and efficient LLM solutions. Rather than sending an entire document to the LLM, which might exceed token limits or introduce irrelevant noise, it's more practical to split the content into meaningful segments.
			For instance, if you’re building a document search engine, you want the LLM to search through relevant pieces instead of parsing a 100-page document at once. Segments can be paragraphs, sentences, or even semantically meaningful sections, depending on your use case.
			Let’s say you are analyzing a user manual. The introduction, setup instructions, troubleshooting steps, and safety guidelines can be treated as separate segments. This makes it easier to retrieve and serve only the part of the document that answers the user's query.
			LangChain4j supports different splitting strategies such as fixed-size chunking, recursive splitting, or custom logic based on punctuation or semantics. Choosing the right strategy impacts the quality of responses generated by the LLM.
			In summary, proper segmentation reduces cost, enhances response relevance, and ensures faster processing times. It also improves traceability, making it easier to understand why the LLM gave a certain response.
			""";

	private static List<TextSegment> splitIntoSegments(Document document) {
		String[] paragraphs = document.text().split("\n");
		return Arrays.stream(paragraphs).map(TextSegment::from).collect(Collectors.toList());
	}

	public static void main(String[] args) {
		Document doc = Document.from(CONTENT);

		List<TextSegment> segments = splitIntoSegments(doc);

		int sNo = 1;
		for (TextSegment textSegment : segments) {
			System.out.println(sNo + " : " + textSegment.text() + "\n");
			sNo++;
		}
	}

}

Output

1 : LangChain4j is a powerful Java framework that helps developers integrate Large Language Models (LLMs) into their applications. It offers components like document loaders, text segmenters, retrievers, and chat interfaces, making it easy to build LLM-powered apps in Java.

2 : Text segmentation is a crucial step in building scalable and efficient LLM solutions. Rather than sending an entire document to the LLM, which might exceed token limits or introduce irrelevant noise, it's more practical to split the content into meaningful segments.

3 : For instance, if you’re building a document search engine, you want the LLM to search through relevant pieces instead of parsing a 100-page document at once. Segments can be paragraphs, sentences, or even semantically meaningful sections, depending on your use case.

4 : Let’s say you are analyzing a user manual. The introduction, setup instructions, troubleshooting steps, and safety guidelines can be treated as separate segments. This makes it easier to retrieve and serve only the part of the document that answers the user's query.

5 : LangChain4j supports different splitting strategies such as fixed-size chunking, recursive splitting, or custom logic based on punctuation or semantics. Choosing the right strategy impacts the quality of responses generated by the LLM.

6 : In summary, proper segmentation reduces cost, enhances response relevance, and ensures faster processing times. It also improves traceability, making it easier to understand why the LLM gave a certain response.

 

Using DocumentSplitters for Smart Chunking

LangChain4j provides ready-made splitting utilities through the DocumentSplitters factory class.

List<TextSegment> segments = DocumentSplitters.recursive(200, 30).split(doc);

 

This recursive splitter chunks the document into segments of at most 200 characters, with up to 30 characters of overlap between consecutive segments, so context is preserved across segment boundaries.

 

Below is a complete working application.

 

DocumentSplittersDemo.java

package com.sample.app.textsegment;

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;

public class DocumentSplittersDemo {
  private static final String CONTENT = """
      LangChain4j is a powerful Java framework that helps developers integrate Large Language Models (LLMs) into their applications. It offers components like document loaders, text segmenters, retrievers, and chat interfaces, making it easy to build LLM-powered apps in Java.
      Text segmentation is a crucial step in building scalable and efficient LLM solutions. Rather than sending an entire document to the LLM, which might exceed token limits or introduce irrelevant noise, it's more practical to split the content into meaningful segments.
      For instance, if you’re building a document search engine, you want the LLM to search through relevant pieces instead of parsing a 100-page document at once. Segments can be paragraphs, sentences, or even semantically meaningful sections, depending on your use case.
      Let’s say you are analyzing a user manual. The introduction, setup instructions, troubleshooting steps, and safety guidelines can be treated as separate segments. This makes it easier to retrieve and serve only the part of the document that answers the user's query.
      LangChain4j supports different splitting strategies such as fixed-size chunking, recursive splitting, or custom logic based on punctuation or semantics. Choosing the right strategy impacts the quality of responses generated by the LLM.
      In summary, proper segmentation reduces cost, enhances response relevance, and ensures faster processing times. It also improves traceability, making it easier to understand why the LLM gave a certain response.
            """;

  public static void main(String[] args) {
    Document doc = Document.from(CONTENT);

    List<TextSegment> segments = DocumentSplitters.recursive(200, 30).split(doc);

    int sNo = 1;
    for (TextSegment textSegment : segments) {
      System.out.println(sNo + " : " + textSegment.text() + "\n");
      sNo++;
    }
  }

}

 

Output

1 : LangChain4j is a powerful Java framework that helps developers integrate Large Language Models (LLMs) into their applications.

2 : It offers components like document loaders, text segmenters, retrievers, and chat interfaces, making it easy to build LLM-powered apps in Java.

3 : Text segmentation is a crucial step in building scalable and efficient LLM solutions.

4 : Rather than sending an entire document to the LLM, which might exceed token limits or introduce irrelevant noise, it's more practical to split the content into meaningful segments.

5 : For instance, if you’re building a document search engine, you want the LLM to search through relevant pieces instead of parsing a 100-page document at once.

6 : Segments can be paragraphs, sentences, or even semantically meaningful sections, depending on your use case.

7 : Let’s say you are analyzing a user manual. The introduction, setup instructions, troubleshooting steps, and safety guidelines can be treated as separate segments.

8 : This makes it easier to retrieve and serve only the part of the document that answers the user's query.

9 : LangChain4j supports different splitting strategies such as fixed-size chunking, recursive splitting, or custom logic based on punctuation or semantics.

10 : Choosing the right strategy impacts the quality of responses generated by the LLM.

11 : In summary, proper segmentation reduces cost, enhances response relevance, and ensures faster processing times.

12 : It also improves traceability, making it easier to understand why the LLM gave a certain response.

2. How Big Should Segments Be?

It depends on the retrieval strategy and on the downstream use case, that is, how the retrieved segments will later be used, especially by the Large Language Model (LLM), to generate answers, summaries, or insights.

 

When integrating documents into a Retrieval-Augmented Generation (RAG) pipeline, determining the right segment size is a trade-off between context preservation and retrieval quality. Two main approaches are commonly used.

 

Approach 1: Treating the Entire Document as an Atomic Unit

In this strategy, each document (e.g., a PDF or a webpage) is treated as indivisible. The retrieval system indexes the full document as a single chunk and retrieves the top-N most relevant documents at query time. These complete documents are then passed to the LLM.

 

This approach is best for use cases where complete information is required, such as legal, scientific, or compliance scenarios, where missing a detail could result in incorrect outcomes.

 

Pros

·      No context is lost: The LLM receives the full document, preserving all relationships between sections.

·      Preserves structure: Useful when structure, flow, or interdependent content matters.

 

Cons

·      High token consumption: Long documents can quickly hit token limits, especially with shorter-context models.

·      Reduced relevance granularity: A document might contain multiple unrelated topics. Only a subset may be relevant to the query.

·      Poorer vector embedding quality: Compressing a large and diverse document into a single vector often leads to diluted semantic representation. This reduces the accuracy of similarity-based retrieval.
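
Here is a minimal sketch of Approach 1 using an in-memory embedding store. The embedding-model class (AllMiniLmL6V2EmbeddingModel, from the langchain4j-embeddings-all-minilm-l6-v2 module) and the findRelevant method are assumptions that vary across LangChain4j versions (newer releases expose a search(...) API instead), so treat this as illustrative rather than canonical.

WholeDocumentRetrievalSketch.java

package com.sample.app.textsegment;

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class WholeDocumentRetrievalSketch {

	public static void main(String[] args) {
		EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
		InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

		// Approach 1: the whole document becomes exactly ONE segment -- no splitting.
		Document doc = Document.from("Complete user manual text ...");
		TextSegment wholeDocument = TextSegment.from(doc.text());
		store.add(embeddingModel.embed(wholeDocument).content(), wholeDocument);

		// At query time, every match is an entire document, which is then
		// passed to the LLM in full.
		Embedding query = embeddingModel.embed("How do I reset the device?").content();
		List<EmbeddingMatch<TextSegment>> matches = store.findRelevant(query, 2);
		matches.forEach(m -> System.out.println(m.score() + " : " + m.embedded().text()));
	}

}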

 

Approach 2: Splitting Documents into Smaller Segments

Here, documents are segmented into smaller parts, typically at the paragraph or sentence level, often with a sliding-window overlap to preserve partial context. During retrieval, the system fetches the top-N most relevant segments rather than full documents.

 

It is best suited to broad knowledge-base applications, FAQs, chatbots, and scenarios where low token usage is a priority.

 

Pros

·      Improved vector search quality: Smaller chunks yield more precise embeddings, improving semantic matching.

·      Efficient use of tokens: Smaller inputs mean more results can be packed into the LLM context window.

·      Scalable: Works better with short-context LLMs.

 

Cons

·      Risk of missing context: Isolated segments might lack background needed for correct interpretation.

·      Possible hallucinations: LLMs may invent missing details if a segment is too sparse or ambiguous.

·      Index overhead: Overlapping chunks increase index size and retrieval time.
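
For contrast, here is a minimal sketch of Approach 2 against the same in-memory store, this time using LangChain4j's EmbeddingStoreIngestor together with the recursive splitter shown earlier. As before, the embedding-model class is an assumption, and builder or method names may differ slightly between versions.

SegmentedRetrievalSketch.java

package com.sample.app.textsegment;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class SegmentedRetrievalSketch {

	public static void main(String[] args) {
		EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
		InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

		// Approach 2: split into overlapping ~200-character segments, then
		// embed and index each segment individually.
		EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
				.documentSplitter(DocumentSplitters.recursive(200, 30))
				.embeddingModel(embeddingModel)
				.embeddingStore(store)
				.build();

		ingestor.ingest(Document.from("Complete user manual text ..."));
		// At query time, only the most relevant small segments are retrieved,
		// so far fewer tokens reach the LLM.
	}

}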

 

To mitigate the downsides of Approach 2, several strategies have emerged:

 

·      Sliding Window with Overlap: Chunks overlap by a few sentences to preserve context flow (see the sketch after this list).

·      Sentence Window Retrieval: Instead of fixed chunks, retrieval includes adjacent sentences around the matched content.

·      Auto-Merging Retrieval: Merges nearby retrieved segments at runtime to reconstruct richer context dynamically.

·      Parent Document Retrieval: Retrieves small segments for search accuracy but returns the full parent document or section to the LLM for context.
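
As an illustration of the first strategy, here is a plain-Java sketch of sliding-window chunking with a one-sentence overlap. The window and overlap sizes are arbitrary here; a real pipeline would tune them for its content.

SlidingWindowSketch.java

package com.sample.app.textsegment;

import java.util.ArrayList;
import java.util.List;

import dev.langchain4j.data.segment.TextSegment;

public class SlidingWindowSketch {

	// Groups sentences into windows of `windowSize`, stepping forward by
	// `windowSize - overlap` so consecutive windows share `overlap` sentences.
	// windowSize must be greater than overlap.
	static List<TextSegment> slidingWindow(List<String> sentences, int windowSize, int overlap) {
		List<TextSegment> segments = new ArrayList<>();
		int step = windowSize - overlap;
		for (int start = 0; start < sentences.size(); start += step) {
			int end = Math.min(start + windowSize, sentences.size());
			segments.add(TextSegment.from(String.join(" ", sentences.subList(start, end))));
			if (end == sentences.size()) {
				break; // last window reached
			}
		}
		return segments;
	}

	public static void main(String[] args) {
		List<String> sentences = List.of("S1.", "S2.", "S3.", "S4.", "S5.");
		// Windows of 3 sentences, overlapping by 1: [S1 S2 S3], [S3 S4 S5]
		slidingWindow(sentences, 3, 1).forEach(s -> System.out.println(s.text()));
	}

}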

 

In most modern RAG pipelines, Approach 2, combined with context-aware chunking and the retrieval-enhancement techniques above, strikes the best balance. However, for highly sensitive or tightly interconnected content, Approach 1 may be necessary.

 

 
