In this post, I’ll walk you through building a RAG (Retrieval-Augmented Generation) application using LangChain4j’s Easy RAG module. We'll use sample data about a fictional company called ChronoCore Industries and demonstrate how to query that information effectively.
1. Introduction to RAG
RAG (Retrieval-Augmented Generation) is a technique that helps large language models (LLMs) give better answers by first finding useful information in your data and adding it to the prompt before it is sent to the model. This helps the model respond more accurately and reduces the chances of it making things up (known as hallucinations).
There are different ways to find this useful information:
· Full-text (keyword) search: Looks for documents that match the words in your question. It uses methods like TF-IDF or BM25 to rank how well each document matches.
· Vector (semantic) search: Converts documents and questions into numbers (vectors) and compares them to find the most similar meanings, not just matching words.
· Hybrid search: Combines both full-text and vector search to get better results.
Right now, this guide mainly focuses on vector search.
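To make "comparing vectors" concrete, here is a minimal, illustrative sketch of cosine similarity in plain Java. The class name and the toy numbers are invented for illustration; LangChain4j computes this for you internally, so you never have to write it yourself.

public class CosineSimilarityDemo {

    // Cosine similarity = (a . b) / (|a| * |b|); ranges from -1 (opposite) to 1 (same direction).
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions).
        double[] query = {0.9, 0.1, 0.3};
        double[] relevantChunk = {0.8, 0.2, 0.4};   // points in a similar direction -> high score
        double[] unrelatedChunk = {0.1, 0.9, 0.0};  // points elsewhere -> low score

        System.out.println(cosineSimilarity(query, relevantChunk));   // ~0.98
        System.out.println(cosineSimilarity(query, unrelatedChunk));  // ~0.21
    }
}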
2. RAG Stages
The Retrieval-Augmented Generation (RAG) process usually has two main steps: Indexing and Retrieval. LangChain4j provides helpful tools for both.
2.1 Indexing Stage
Indexing is the first step where documents are prepared so they can be quickly and accurately searched later. For vector-based search, this step includes:
· Cleaning the content to remove unnecessary or irrelevant parts
· Adding extra context or metadata if needed
· Splitting large documents into smaller parts (called chunks)
· Converting those chunks into numerical vectors using embeddings
· Saving them in a special database called an embedding store or vector database
This step is usually done in the background, not while users are interacting with the app. For example, a scheduled job (like a weekly cron job) might update the data during off-peak hours, such as weekends. Sometimes, a separate service takes care of managing the vector database.
In other cases, like when users upload their own documents, the indexing needs to happen right away (in real time) and be built into the app.
For this proof of concept (POC), documents related to the fictional company ChronoCore Industries are stored on my computer and loaded into memory. This simple setup works well for the demo.
Here is a simplified diagram of the indexing stage.
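In code, these indexing steps map onto LangChain4j's ingestion API roughly as in the sketch below. This is not what Easy RAG does by default: the chunk size, the overlap, and the nomic-embed-text Ollama embedding model are assumptions chosen for illustration.

// Sketch of an explicit indexing pipeline (Easy RAG configures sensible defaults for you).
// Assumes a local Ollama server with the "nomic-embed-text" embedding model pulled.
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/Users/Shared/llm_docs");

EmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
        .baseUrl("http://localhost:11434")
        .modelName("nomic-embed-text")
        .build();

InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(300, 30)) // max segment size / overlap in characters
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .build()
        .ingest(documents); // split -> embed -> store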
2.2 Retrieval Stage
The retrieval stage begins when a user submits a question. The system then searches through pre-indexed documents to identify the most relevant information.
For vector search, the user's query is converted into an embedding, a numerical representation of the query. The system compares this embedding to others in the embedding store to find the most similar document segments. These relevant chunks are then included in the prompt provided to the language model to help generate an accurate response.
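At the API level, this stage is what a ContentRetriever does. The sketch below assumes an embeddingStore that was already populated during indexing (as we will do later in this post) and shows the retrieval step in isolation; inside an AI Service, LangChain4j performs it for you on every call.

// Sketch: retrieve the most relevant segments for a user question.
ContentRetriever retriever = EmbeddingStoreContentRetriever.from(embeddingStore);

List<Content> relevantContent = retriever.retrieve(Query.from("What is the tag line of ChronoCore Industries?"));

for (Content content : relevantContent) {
    System.out.println(content.textSegment().text()); // chunks that will be added to the prompt
}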
3. RAG Variants in LangChain4j
LangChain4j provides three different ways to implement Retrieval-Augmented Generation (RAG), based on your experience level and the level of control you need. Each approach is designed to suit different use cases, from beginners to advanced developers.
3.1 Easy RAG: Zero Configuration, Fast Start
This is the simplest and most user-friendly way to get started with RAG. With Easy RAG, you don’t need to worry about the complex internals of how RAG works. It is best suited for beginners or quick prototypes:
· No need to choose or configure an embedding model
· No need to set up or connect to a vector database
· No need to manually split or chunk documents
· No need to understand how metadata or embeddings work
You simply provide your documents (PDFs, text files, web pages, and so on), and LangChain4j handles the entire pipeline automatically: document parsing, chunking, embedding, and storing. It’s a perfect starting point for experimenting with RAG without deep technical knowledge.
3.2 Naive RAG: Basic, Customizable Implementation
Naive RAG is well suited for intermediate users who want a bit more control. It gives you a clearer view of how RAG works behind the scenes. Here, you manually handle some parts of the process, such as:
· Ingesting and processing your documents
· Creating embeddings using a chosen model
· Storing them in a vector database or in-memory store
Once your documents are embedded and stored, you can use EmbeddingStoreContentRetriever to perform vector searches during the retrieval stage.
ContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
    .embeddingStore(embeddingStore)
    .embeddingModel(embeddingModel)
    .maxResults(5)
    .minScore(0.75)
    .build();

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(model)
    .contentRetriever(contentRetriever)
    .build();
It’s called "naive" because it follows a straightforward, linear RAG flow with no advanced logic such as query rewriting or re-ranking. This option is ideal for those who want to use their own embedding models or vector stores but still keep the process relatively simple.
3.3 Advanced RAG: Full Control, Maximum Flexibility
Advanced RAG is suited to experts and production-grade applications. It is a modular and highly customizable approach for building sophisticated RAG pipelines, giving you full control over each stage of the process and allowing you to:
· Apply query transformation (e.g., expand or rewrite the user's question)
· Retrieve content from multiple sources (e.g., combine results from PDFs, APIs, databases)
· Apply re-ranking or filtering to the retrieved results based on relevance or metadata
· Integrate custom logic or models at every step
This is the best option when you need more control over retrieval quality, performance, and behavior. It’s suitable for building enterprise-level applications that require precision, explainability, and modularity.
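As a rough illustration, an advanced pipeline is assembled with a RetrievalAugmentor. The sketch below assumes LangChain4j's DefaultRetrievalAugmentor and CompressingQueryTransformer classes (the transformer rewrites the user's question using the chat model) and pre-built chatModel and contentRetriever objects; exact builder options and constructors may vary by version, so treat this as a shape rather than a drop-in snippet.

// Sketch of an Advanced RAG setup (verify class names against the LangChain4j version you use).
RetrievalAugmentor retrievalAugmentor = DefaultRetrievalAugmentor.builder()
        .queryTransformer(new CompressingQueryTransformer(chatModel)) // rewrite/compress the user's question
        .contentRetriever(contentRetriever)                           // could also route across several retrievers
        .build();

Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .retrievalAugmentor(retrievalAugmentor) // replaces the simple contentRetriever(...) wiring
        .build();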
The following table summarizes the three RAG types.
Type         | Level        | Use case                         | Customization | Setup Effort
Easy RAG     | Beginner     | Quick demos, prototypes          | Minimal       | Easiest
Naive RAG    | Intermediate | Simple custom vector search      | Moderate      | Moderate
Advanced RAG | Expert       | Full control, production systems | Full          | Complex
Follow the step-by-step procedure below to build a working application using Easy RAG.
Step 1: Create a new Maven project named inmemory-rag.
Step 2: Add the langchain4j-easy-rag dependency to use Easy RAG.
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <version>1.0.1-beta6</version>
</dependency>
Update pom.xml with the Maven dependencies.
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <groupId>com.sample.app</groupId>
    <artifactId>inmemory-rag</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>21</maven.compiler.source>
        <maven.compiler.target>21</maven.compiler.target>
        <java.version>21</java.version>
    </properties>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>dev.langchain4j</groupId>
                <artifactId>langchain4j-bom</artifactId>
                <version>1.0.1</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j</artifactId>
        </dependency>

        <!-- https://mvnrepository.com/artifact/dev.langchain4j/langchain4j-ollama -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-ollama</artifactId>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-easy-rag</artifactId>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.24.3</version>
        </dependency>
    </dependencies>
</project>
Step 3: Define ChatAssistant interface.
package com.sample.app.assistants;

public interface ChatAssistant {

    String chat(String userMessage);
}
Step 4: Load the documents into the embedding store.
String resourcesFolderPath = "/Users/Shared/llm_docs";
List<Document> documents = FileSystemDocumentLoader.loadDocuments(resourcesFolderPath);
In this code snippet:
· resourcesFolderPath defines the path to a local folder containing documents (in this case, /Users/Shared/llm_docs). You can download the timemachine.pdf file from https://github.com/harikrishna553/java-libs/tree/master/langchain4j/inmemory-rag/src/main/resources and place it in resourcesFolderPath.
· FileSystemDocumentLoader.loadDocuments(resourcesFolderPath) is used to load and parse all supported documents from this folder, returning a List<Document> that can be used for further processing (e.g., indexing, retrieval, etc.).
Now, here's what happens under the hood: the Apache Tika library, known for its broad support of document formats, automatically detects file types and extracts their content. Since no specific DocumentParser is defined, FileSystemDocumentLoader uses Java's Service Provider Interface (SPI) mechanism to load the default implementation, ApacheTikaDocumentParser, which is made available via the langchain4j-easy-rag dependency.
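If you prefer to be explicit (or want to swap in a different parser later), you can pass the parser yourself. The sketch below uses the same ApacheTikaDocumentParser that the SPI mechanism picks up by default.

// Sketch: supply the document parser explicitly instead of relying on SPI discovery.
DocumentParser parser = new ApacheTikaDocumentParser();
List<Document> documents = FileSystemDocumentLoader.loadDocuments(resourcesFolderPath, parser);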
If you want to load documents from all subdirectories, you can use the loadDocumentsRecursively method.
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively(resourcesFolderPath);
Additionally, you can filter documents by using a glob or regex.
PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:*.pdf");
List<Document> documents = FileSystemDocumentLoader.loadDocuments(resourcesFolderPath, pathMatcher);
Now that we have successfully loaded and parsed our documents, the next step is to preprocess and store them in a specialized storage mechanism called an embedding store, also known as a vector database.
What is an Embedding Store?
An embedding store is a system designed to store and manage vector representations (also called embeddings) of data. These embeddings are numerical representations of textual content, where semantically similar pieces of text have vectors that are close together in the vector space. This makes it possible to perform fast and meaningful similarity searches, allowing the system to retrieve the most relevant content based on a user query.
Embedding stores are essential in Retrieval-Augmented Generation (RAG) pipelines because they enable efficient semantic search over large document collections, rather than relying solely on keyword matching.
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
In this example:
· We are using InMemoryEmbeddingStore, a simple in-memory implementation of an embedding store provided by LangChain4j. It’s suitable for prototyping or small-scale applications where persistence is not required.
· The EmbeddingStoreIngestor.ingest(...) method processes the list of documents, splits them into smaller units (typically called TextSegments), generates embeddings for each segment using a default or configured embedding model, and stores those embeddings in the provided embeddingStore.
Internally, InMemoryEmbeddingStore maintains a thread-safe list of entries using a CopyOnWriteArrayList. Each entry consists of:
· The embedding (i.e., the vector representation of a text segment).
· The original text segment (or other Embedded object).
· A unique ID for reference.
This data structure is safe for concurrent reads and writes but is not optimized for large-scale or high-frequency modifications due to the way it copies the array on every mutation.
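You can observe this entry structure by adding a single segment yourself. The sketch below assumes an already configured embeddingModel (for example, an Ollama embedding model); the sample text is made up.

// Sketch: manually add one entry (embedding + text segment) and get back its generated ID.
TextSegment segment = TextSegment.from("Example sentence about ChronoCore Industries.");
Embedding embedding = embeddingModel.embed(segment.text()).content();

String id = embeddingStore.add(embedding, segment);
System.out.println("Stored entry with id " + id);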
When a user performs a search query using InMemoryEmbeddingStore, the following steps occur internally:
· The user’s query (in natural language) is passed through an embedding model.
· This model generates a query embedding, a high-dimensional vector that captures the semantic meaning of the query.
· This vector will be compared to all the stored vectors to find similar meanings.
Internally, InMemoryEmbeddingStore performs a linear scan over its list of stored entries (the CopyOnWriteArrayList). For each entry (after applying the filter, if one is specified), it computes the cosine similarity between the stored embedding and the query embedding. Cosine similarity measures how closely the two vectors point in the same direction in the vector space; a higher similarity means the content is semantically closer to the query.
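You can see these similarity scores directly by querying the store yourself. This is a sketch that again assumes a configured embeddingModel; inside an AI Service, this search happens automatically on every call.

// Sketch: run a similarity search against the embedding store and print the scores.
Embedding queryEmbedding = embeddingModel.embed("What is the tag line of ChronoCore Industries?").content();

EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
        .queryEmbedding(queryEmbedding)
        .maxResults(3)
        .minScore(0.5) // drop weakly related segments
        .build();

EmbeddingSearchResult<TextSegment> result = embeddingStore.search(request);
for (EmbeddingMatch<TextSegment> match : result.matches()) {
    System.out.println(match.score() + " -> " + match.embedded().text());
}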
Is the Embedding Store Dependent on the Embedding Model?
Yes, the embedding store relies on an embedding model to transform text into vector form before storage. The embedding store itself is model-agnostic, as it simply stores and retrieves vectors.
LangChain4j allows you to plug in various embedding models, such as OpenAI, Hugging Face, Cohere, or even custom models. The embedding model must be configured appropriately, usually at application startup, so that when you ingest documents, embeddings are generated accordingly.
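One practical consequence: whichever model you pick, the same embedding model should be used both when ingesting documents and when embedding queries, because vectors produced by different models are not comparable. A sketch, with the Ollama model name as an assumption:

// Sketch: share one embedding model between ingestion and retrieval.
EmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
        .baseUrl("http://localhost:11434")
        .modelName("nomic-embed-text") // assumed model; any embedding model works
        .build();

// Used at indexing time ...
EmbeddingStoreIngestor.builder()
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .build()
        .ingest(documents);

// ... and at retrieval time, so query vectors live in the same vector space.
ContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
        .embeddingStore(embeddingStore)
        .embeddingModel(embeddingModel)
        .build();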
Are Embedding Vectors Created for Every Token or for a Group of Tokens?
The embedding vector is typically created for a group of tokens, not for each individual token, when generating document or query embeddings.
In search, retrieval, or RAG (Retrieval-Augmented Generation) systems, we do not generate embeddings for every token. Instead, we generate one embedding per text segment, where:
· A segment is a group of tokens, typically a few sentences, a paragraph, or a configurable chunk of text.
· Each segment is passed to an embedding model, which produces a single vector that captures the semantic meaning of the whole segment.
This approach is optimal for retrieval tasks because the goal is to compare semantic similarity between chunks and queries, not to analyze individual word meanings.
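To make the segment idea concrete, here is a small sketch that splits a document into chunks and embeds each chunk as a whole; the chunk size and overlap are arbitrary illustrative values, and embeddingModel is again assumed to be configured.

// Sketch: one embedding per segment, not per token.
DocumentSplitter splitter = DocumentSplitters.recursive(300, 30); // ~300-character chunks with 30-character overlap
List<TextSegment> segments = splitter.split(document);

List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
System.out.println(segments.size() + " segments -> " + embeddings.size() + " embeddings");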
Step 5: Get an instance of ChatAssistant.
The last step is to create an AI Service that will serve as our API to the LLM.
OllamaChatModel chatModel = OllamaChatModel.builder()
    .baseUrl("http://localhost:11434")
    .modelName("llama3.2")
    .build();

ChatAssistant assistant = AiServices.builder(ChatAssistant.class)
    .chatModel(chatModel)
    .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
    .build();
Now you can ask the chat assistant questions.
String answer = assistant.chat(question);
The complete working application is shown below.
InMemoryRag.java
package com.sample.app;

import java.util.ArrayList;
import java.util.List;

import com.sample.app.assistants.ChatAssistant;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class InMemoryRag {

    public static void main(String[] args) {

        // Load all the Documents
        String resourcesFolderPath = "/Users/Shared/llm_docs";
        System.out.println("Resources Folder Path is " + resourcesFolderPath);

        List<Document> documents = FileSystemDocumentLoader.loadDocuments(resourcesFolderPath);

        // Embed the documents and store them in an in-memory embedding store
        InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
        EmbeddingStoreIngestor.ingest(documents, embeddingStore);

        // Chat model served by a local Ollama instance
        OllamaChatModel chatModel = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .build();

        // AI Service that retrieves relevant segments and passes them to the LLM
        ChatAssistant assistant = AiServices.builder(ChatAssistant.class)
                .chatModel(chatModel)
                .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
                .build();

        List<String> questionsToAsk = new ArrayList<>();
        questionsToAsk.add("What is the tag line of ChronoCore Industries?");

        // Ask the questions and measure the total time taken
        long time1 = System.currentTimeMillis();

        for (String question : questionsToAsk) {
            String answer = assistant.chat(question);
            System.out.println("----------------------------------------------------");
            System.out.println("Q: " + question);
            System.out.println("A : " + answer);
            System.out.println("----------------------------------------------------\n");
        }

        long time2 = System.currentTimeMillis();
        System.out.println("Total time taken is " + (time2 - time1));
    }
}
Output
----------------------------------------------------
Q: What is the tag line of ChronoCore Industries?
A : The tagline for ChronoCore Industries is "Engineering Tomorrow. Preserving Yesterday. Living Today."
----------------------------------------------------

Total time taken is 3626
You can download the application from this link.