Friday, 12 September 2025

Document Ingestion with LangChain4j using EmbeddingStoreIngestor

In this blog post, we'll explore how LangChain4j’s EmbeddingStoreIngestor works under the hood and how you can configure it to transform, split, and embed documents into an EmbeddingStore. If you're building LLM-based applications and need to semantically search or analyze documents, mastering this ingestion pipeline is a must.

 

The EmbeddingStoreIngestor is a utility that:

·      Takes in Documents

·      Transforms them and optionally splits them into TextSegments

·      Embeds the segments using an EmbeddingModel

·      Stores the embeddings in an EmbeddingStore

 

This modular pipeline enables efficient vectorization and storage of text data, laying the foundation for semantic search, recommendations, and knowledge retrieval in your applications.
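To make the four stages concrete, here is a stripped-down, self-contained sketch of the same flow in plain Java. This is not LangChain4j's actual implementation: a toy character-frequency "embedding" and a HashMap stand in for the real EmbeddingModel and EmbeddingStore.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IngestionPipelineSketch {

  // Toy "embedding": a character-frequency vector over a-z.
  // A real EmbeddingModel would produce a dense semantic vector instead.
  static double[] embed(String text) {
    double[] vector = new double[26];
    for (char c : text.toLowerCase().toCharArray()) {
      if (c >= 'a' && c <= 'z') {
        vector[c - 'a']++;
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    List<String> documents =
        List.of("The stock market surged as tech companies reported strong earnings.");

    // Stand-in for an EmbeddingStore: segment text -> embedding vector.
    Map<String, double[]> embeddingStore = new HashMap<>();

    for (String document : documents) {
      // 1. Transform the document (here: upper-case, as in the example below).
      String transformed = document.toUpperCase();

      // 2. Split into segments (here: a naive split on sentence boundaries).
      String[] segments = transformed.split("(?<=\\.)\\s+");

      // 3 + 4. Embed each segment and store it.
      for (String segment : segments) {
        embeddingStore.put(segment, embed(segment));
      }
    }

    System.out.println("Stored segments: " + embeddingStore.size());
  }
}
```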

 

Example

    EmbeddingStoreIngestor ingestor =
        EmbeddingStoreIngestor.builder()
            .embeddingModel(new BgeSmallEnV15QuantizedEmbeddingModel())
            .documentTransformer(
                document -> Document.from(document.text().toUpperCase(), document.metadata()))
            .textSegmentTransformer(
                segment -> TextSegment.from(segment.text().toUpperCase(), segment.metadata()))
            .embeddingStore(embeddingStore)
            .build();
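If no splitter is configured, each document is embedded as a single segment. For longer documents you can also plug a DocumentSplitter into the builder; a sketch, assuming langchain4j's DocumentSplitters utility is on the classpath (check the exact signature in your version):

```java
    EmbeddingStoreIngestor ingestor =
        EmbeddingStoreIngestor.builder()
            .embeddingModel(new BgeSmallEnV15QuantizedEmbeddingModel())
            // Split each document into segments of at most 300 characters,
            // with 30 characters of overlap between adjacent segments.
            .documentSplitter(DocumentSplitters.recursive(300, 30))
            .embeddingStore(embeddingStore)
            .build();
```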

A complete working application is shown below.

EmbeddingStoreIngestorDemo.java

package com.sample.app.embeddings;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.TokenUsage;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.IngestionResult;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import java.util.ArrayList;
import java.util.List;

public class EmbeddingStoreIngestorDemo {

  public interface ChatAssistant {
    String chat(String userMessage);
  }

  public static void main(String[] args) {
    List<Document> documents = new ArrayList<>();
    Document doc1 =
        Document.from("The stock market surged as tech companies reported strong earnings.");
    Document doc2 =
        Document.from("Tech giants like Apple and Amazon saw record profits this quarter.");
    Document doc3 = Document.from("Heavy rain caused flooding in coastal towns yesterday.");
    documents.add(doc1);
    documents.add(doc2);
    documents.add(doc3);

    InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

    EmbeddingStoreIngestor ingestor =
        EmbeddingStoreIngestor.builder()
            .embeddingModel(new BgeSmallEnV15QuantizedEmbeddingModel())
            .documentTransformer(
                document -> Document.from(document.text().toUpperCase(), document.metadata()))
            .textSegmentTransformer(
                segment -> TextSegment.from(segment.text().toUpperCase(), segment.metadata()))
            .embeddingStore(embeddingStore)
            .build();

    IngestionResult ingestionResult = ingestor.ingest(documents);
    TokenUsage tokenUsage = ingestionResult.tokenUsage();
    System.out.println("Input Token Count : " + tokenUsage.inputTokenCount());
    System.out.println("Output Token Count : " + tokenUsage.outputTokenCount());
    System.out.println("Total Token Count : " + tokenUsage.totalTokenCount());

    OllamaChatModel chatModel =
        OllamaChatModel.builder().baseUrl("http://localhost:11434").modelName("llama3.2").build();

    ChatAssistant assistant =
        AiServices.builder(ChatAssistant.class)
            .chatModel(chatModel)
            .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
            .build();

    List<String> questionsToAsk = new ArrayList<>();
    questionsToAsk.add("Is tech companies stocks surged?");

    long time1 = System.currentTimeMillis();
    for (String question : questionsToAsk) {
      String answer = assistant.chat(question);
      System.out.println("----------------------------------------------------");
      System.out.println("Q: " + question);
      System.out.println("A : " + answer);
      System.out.println("----------------------------------------------------\n");
    }
    long time2 = System.currentTimeMillis();

    System.out.println("Total time taken is " + (time2 - time1));
  }
}

Output

Input Token Count : 32
Output Token Count : null
Total Token Count : 32
----------------------------------------------------
Q: Is tech companies stocks surged?
A : Yes, tech companies' stocks surged due to the strong earnings reported by these companies.
----------------------------------------------------

Total time taken is 1984
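For context on why the retriever surfaced the stock-market content rather than the flooding document: EmbeddingStoreContentRetriever embeds the question and ranks stored segments by vector similarity, typically cosine similarity. A self-contained sketch with made-up 3-dimensional vectors (real BGE-small embeddings have 384 dimensions):

```java
import java.util.Locale;

public class CosineSimilaritySketch {

  // Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // Made-up 3-dimensional embeddings; the axes loosely stand for
    // (finance, tech, weather).
    double[] question = {0.9, 0.8, 0.0}; // the tech-stocks question
    double[] stockDoc = {1.0, 0.7, 0.0}; // stock market / tech earnings
    double[] rainDoc = {0.0, 0.0, 1.0};  // flooding in coastal towns

    System.out.printf(Locale.ROOT, "question vs stockDoc: %.3f%n", cosine(question, stockDoc));
    System.out.printf(Locale.ROOT, "question vs rainDoc : %.3f%n", cosine(question, rainDoc));
    System.out.println(
        "stockDoc ranks higher: " + (cosine(question, stockDoc) > cosine(question, rainDoc)));
  }
}
```

The stock-market segment scores far higher than the weather segment, so the retriever passes it to the chat model as context.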

   
