1. What is RAG (Retrieval-Augmented Generation)?
RAG, or Retrieval-Augmented Generation, is a method used to enhance the capabilities of Large Language Models (LLMs) by allowing them to retrieve relevant information from an external data source (like documents, databases, or knowledge bases) and use that information to generate more accurate and contextually relevant responses.
There are two key steps in RAG:
· Retrieval: The process of fetching relevant documents or pieces of information from an external source.
· Generation: The process where an LLM generates a response based on the retrieved documents or data.
1.1 Why is RAG needed?
LLMs, like GPT-3 or GPT-4, are trained on large datasets but have some limitations:
· They don't know what they don't know: They are restricted to the knowledge they were trained on. Once they are trained, they don't have the ability to learn new information unless retrained.
· They can sometimes produce hallucinated information: LLMs may generate responses that sound correct but are actually incorrect or fabricated. This is often due to the lack of up-to-date or specific knowledge beyond the model’s training data.
RAG addresses these limitations by augmenting the generation process with real-time retrieval of relevant information from a data source. This helps the model to provide more accurate, up-to-date, and contextually appropriate responses, reducing hallucinations and filling in gaps in knowledge.
1.2 RAG Process: Step-by-Step Explanation
1.2.1 Document Chunking and Storage into Vector Database
The first step is to take a large document corpus (knowledge base) and chunk it into smaller, manageable pieces. Each chunk could be a paragraph, sentence, or a small section of a document.
Each chunk of text is then converted into a vector representation using a method called embedding. These vector representations capture the meaning of the text in a form that a machine can understand and compare.
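For illustration, here is a minimal sketch of chunking and embedding using the sentence-transformers library (one of the packages installed later in this article). The model name "all-MiniLM-L6-v2" and the sample text are assumptions chosen only for this example.
from sentence_transformers import SentenceTransformer

# Sample document and a naive fixed-size character chunker (illustrative only).
document = "Time travel has fascinated scientists and writers for decades. " * 20
chunk_size = 200
chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

# Convert each chunk into a vector that captures its meaning.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)
print(len(chunks), embeddings.shape)  # e.g. (number of chunks, 384) for this model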
1.2.2 Storing in Vector Database
The resulting vectors (representations of the document chunks) are then stored in a vector database (like Faiss, Pinecone, or Chromadb). Vector databases are specialized systems that efficiently handle and retrieve high-dimensional vectors.
These databases index the vectors, allowing the system to quickly find and retrieve the most relevant chunks based on their similarity to a query.
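Continuing the sketch above, the chunk vectors can be stored in an in-memory Chroma collection (chromadb is installed later in this article); the collection name is an arbitrary choice.
import chromadb

# Create an in-memory Chroma client and a collection to hold the chunk vectors.
client = chromadb.Client()
collection = client.create_collection(name="knowledge-base")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],  # one unique id per chunk
    documents=chunks,                                # the raw chunk text
    embeddings=embeddings.tolist(),                  # vectors from the embedding model
)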
1.2.3 User Query and Retrieval
When a user asks a question or provides a query, the system converts the query into a vector representation using the same embedding process.
This vector is then sent to the vector database, which retrieves the most relevant document chunks by comparing the query vector to the stored document vectors.
The retrieved chunks are typically the most relevant pieces of information based on the user's query.
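Continuing the same sketch, the query is embedded with the same model and compared against the stored vectors; the question text is just an example.
# Embed the user's question with the same embedding model.
query = "What is the yearly dividend for Doom Industries?"
query_vector = model.encode([query]).tolist()

# Ask the collection for the chunks whose vectors are closest to the query vector.
results = collection.query(query_embeddings=query_vector, n_results=3)
retrieved_chunks = results["documents"][0]  # the top-3 most similar chunks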
1.2.4 Creating the Final Prompt
The system combines the retrieved chunks with the user’s original query to form a final prompt.
This final prompt now contains both the user’s query and the relevant context from the document corpus. The retrieved chunks are typically plain text.
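A minimal way to assemble such a prompt from the sketch above is plain string formatting:
# Combine the retrieved chunks and the original question into one prompt.
context = "\n\n".join(retrieved_chunks)
final_prompt = (
    "Answer the question based ONLY on the following context:\n"
    f"{context}\n\n"
    f"Question: {query}\n"
)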
1.2.5 Response Generation
The final prompt, which is a combination of the user’s query and the retrieved documents, is then sent to the LLM for generation. The LLM uses this context to generate a response that is accurate, relevant, and grounded in the documents retrieved.
The LLM generates a response that is informed by both its training data and the retrieved documents. This response is then presented to the user.
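To finish the sketch, the assembled prompt can be sent to a local model through the ollama library (the model name llama3.2 matches the one used later in this article). This is a simplified illustration, not the full program built below.
import ollama

# Send the final prompt (question + retrieved context) to the LLM.
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": final_prompt}],
)
print(response["message"]["content"])  # the grounded answer presented to the user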
2. LangChain Framework
LangChain is a powerful framework designed to simplify the development of applications using Large Language Models (LLMs). It provides tools and abstractions to streamline tasks such as loading, processing, and querying data with LLMs. We are going to use the LangChain framework to build a simple RAG application.
Following are the key features of LangChain:
· Load and Parse Documents: LangChain can efficiently load and parse documents from various formats (e.g., PDFs, text files, web pages) into a usable format for further processing.
· Split Documents: It includes utilities for splitting long documents into smaller chunks, ensuring LLMs can handle them within their token limits while maintaining context.
· Generate Embeddings: LangChain integrates with embedding models to convert document chunks into vector representations for storage in vector databases. This enables efficient similarity-based retrieval.
· Unified Abstraction for LLMs: LangChain provides a unified interface to work with multiple LLMs (e.g., OpenAI, Anthropic, Hugging Face). It abstracts the complexities of interacting with these models, making it easier to build applications.
· Build LLM Applications: With LangChain, you can quickly prototype and build advanced applications like chatbots, retrieval-augmented generation (RAG) systems, and document-based search solutions.
3. Install Necessary dependencies
3.1 Install Python modules or packages
requirements.txt
ollama
chromadb
pdfplumber
langchain
langchain-core
langchain-ollama
langchain-community
langchain_text_splitters
unstructured
unstructured[all-docs]
fastembed
sentence-transformers
elevenlabs
Execute the following command to install all the libraries that we need to build the application.
pip3 install -r requirements.txt
Overview of the libraries
· ollama: A Python library for interacting with locally running models such as llama3.2 through the Ollama runtime. It allows you to pull, create, query, and manage LLMs for tasks such as text generation or RAG-based systems.
· chromadb: An open-source vector database optimized for high-performance storage and retrieval of embeddings. Ideal for RAG systems, it supports similarity search and works seamlessly with LangChain.
· pdfplumber: A Python library for extracting text, tables, and metadata from PDF files. It provides detailed control over parsing PDF contents, including support for page layouts and bounding boxes.
· langchain: A robust framework for developing LLM-powered applications. It handles document processing, embedding generation, vector database integration, and creating LLM pipelines.
· langchain-core: The core components of LangChain, including chains, prompts, memory, and integrations with vector databases and LLMs. It forms the backbone of the LangChain framework.
· langchain-ollama: A LangChain integration for interacting with Ollama models, enabling seamless use of custom models like llama3.2 within LangChain pipelines.
· langchain-community: A collection of community-contributed extensions, tools, and integrations for LangChain, offering additional utilities to enhance its functionality.
· langchain_text_splitters: A LangChain module focused on splitting documents into manageable chunks, tailored for LLM processing. Supports custom splitting strategies like sentence, paragraph, or token limits.
· unstructured: A library for parsing and processing unstructured data from diverse document types (e.g., PDFs, HTML). It transforms messy content into structured formats for analysis.
· unstructured[all-docs]: A specialized installation option for the unstructured library that includes all parsers and dependencies required for handling a variety of document formats.
· fastembed: A library for generating embeddings quickly using pre-trained models, optimized for speed and efficiency in embedding generation for tasks like similarity search.
· sentence-transformers: A Python library for generating sentence or text embeddings using models like BERT or RoBERTa. It’s widely used in NLP tasks like semantic search and clustering.
· elevenlabs: A library for interacting with ElevenLabs' advanced text-to-speech models. It allows you to create realistic audio outputs from text, suitable for podcasts or voice assistants.
3.2 Install poppler
Poppler is an open-source software library that provides tools for working with PDF files. It includes utilities that allow you to extract text, images, and metadata from PDF documents, as well as convert PDFs to other formats like images. It is widely used by many programs that need to handle PDF files.
Installing Poppler
Poppler is not a Python package that you can install with pip. It's a system-level program, which means you have to install it on your operating system (Windows, macOS, or Linux). Here's how you can install Poppler depending on your system.
For Mac OS
brew install poppler
For Linux
sudo apt update
sudo apt install poppler-utils
How to verify poppler installation?
Open a terminal and run the command below.
pdfinfo -v
If Poppler is installed correctly, it should show the version number.
$ pdfinfo -v
pdfinfo version 25.01.0
Copyright 2005-2025 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011, 2022 Glyph & Cog, LLC
3.3 Install tesseract
Tesseract OCR is an open-source tool used for recognizing and extracting text from images or scanned documents. It's widely used for Optical Character Recognition (OCR) tasks in various programming applications.
For Mac
brew install tesseract
For Linux
sudo apt update
sudo apt install tesseract-ocr
How to verify tesseract installation?
Open a terminal and run the command below.
tesseract --version
If tesseract is installed correctly, it should show the version number.
$ tesseract --version
tesseract 5.5.0
 leptonica-1.85.0
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.45 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.5.0 : libopenjp2 2.5.3
 Found NEON
 Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
 Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.63.0
4. Final Program
timeTravelRag.py
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader
import nltk

nltk.download('punkt_tab')
# print(nltk.data.path)

documentPath = "./data/timeTravel.pdf"
model = "llama3.2"

if documentPath:
    loader = UnstructuredPDFLoader(file_path=documentPath)
    data = loader.load()
    #print("Done loading the pdf file")
else:
    print("Upload a PDF file")

# Preview first few lines of the pdf document
content = data[0].page_content
#print(content[:100])

# Extract the text from pdf file and split into small chunks
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Split Chunk
textSplitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=300)
chunks = textSplitter.split_documents(data)
#print("Done Splitting....")
#print(f"Total Chunks are {len(chunks)}")
#print(f"First Chunk {chunks[0]}")

# Add embedding model
import ollama
ollama.pull("nomic-embed-text")

vectorDb = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="simple-rag",
)
#print("done adding to vector database....")

## Retrieval
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Set the model
llm = ChatOllama(model=model)

queryPrompt = PromptTemplate(
    input_variables=["question"],
    template="""
    You are an AI language model assistant. Your task is to generate appropriate Answer
    to the question by retrieving relevant documents from a vector database.
    Original question: {question}
    """
)

retriever = MultiQueryRetriever.from_llm(
    vectorDb.as_retriever(), llm, prompt=queryPrompt
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# res = chain.invoke(input=("what is the document about?",))
# res = chain.invoke(
#     input=("what are the main points as a business owner I should be aware of?",)
# )

res = chain.invoke(input=("What is Yearly dividend for Doom industries?"))
print(res)
Output
The Yearly Dividend for Doom Industries is $12 per share.
You can download the timeTravel.pdf file from this location.
Explanation of the code snippet
import nltk
nltk.download('punkt_tab')
The above snippet imports the Natural Language Toolkit (nltk), a powerful Python library for working with human language data (text). It provides tools for tasks like tokenization, stemming, tagging, parsing, and more.
punkt_tab is an internal resource used by the Punkt tokenizer, which is part of the NLTK library. It contains language-specific tables and data necessary for the Punkt tokenizer to identify sentence boundaries accurately. nltk.download('punkt_tab') downloads the punkt_tab resource.
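As a quick illustrative check (the sample text is made up), the Punkt tokenizer can split text into sentences once the resource is downloaded:
from nltk.tokenize import sent_tokenize

# Punkt uses the downloaded tables to detect sentence boundaries.
sample = "Time travel is a popular theme. Many stories explore its paradoxes."
print(sent_tokenize(sample))
# ['Time travel is a popular theme.', 'Many stories explore its paradoxes.']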
if documentPath:
loader = UnstructuredPDFLoader(file_path=documentPath)
data = loader.load()
#print("Done loading the pdf file")
else:
print("Upload a PDF file")
The code checks if documentPath contains a valid file path. If valid, it creates an instance of UnstructuredPDFLoader to load and process the PDF file at the specified path, storing the extracted data in the data variable. If documentPath is invalid or empty, it prompts the user to upload a PDF file by printing a message.
content = data[0].page_content
print(content[:100])
This snippet reads the content of the first Document in the data variable (the loader returns a list of Document objects). It accesses the page_content attribute of the first element (data[0]) and prints the first 100 characters of the content for preview purposes.
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
textSplitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=300)
chunks = textSplitter.split_documents(data)
This code takes the text extracted from the PDF, splits it into smaller chunks for easier processing, and prepares it for embedding. It uses the RecursiveCharacterTextSplitter to divide the text into chunks of up to 600 characters each, with a 300-character overlap between consecutive chunks. The resulting chunks are stored in the chunks variable.
chunk_overlap is the number of characters that overlap between consecutive chunks when splitting text. It ensures that important context from one chunk is preserved in the next, helping to maintain consistency and avoid losing valuable information at the boundaries of chunks.
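The effect of chunk_overlap is easiest to see on a short string with smaller, made-up sizes than the ones used in the program:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small illustrative sizes so the overlap is visible.
sample = "The quick brown fox jumps over the lazy dog near the quiet river bank."
splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=10)
for piece in splitter.split_text(sample):
    print(repr(piece))
# Consecutive pieces repeat roughly the last 10 characters of the previous piece,
# so context at a chunk boundary is not lost.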
import ollama
ollama.pull("nomic-embed-text")
This code snippet uses the ollama library to pull a specific model named "nomic-embed-text". ‘nomic-embed-text’ is a high-performing open embedding model with a large token context window.
An embedding model converts text, images, or other data into vector representations (numerical arrays) that capture the semantic meaning of the input. These embeddings make it easier to compare, retrieve, or analyze data based on their meaning rather than their exact structure.
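As an illustrative example (the sentence is made up), you can embed a single string and inspect the resulting vector:
from langchain_ollama import OllamaEmbeddings

# Assumes the nomic-embed-text model has already been pulled via ollama.
embedder = OllamaEmbeddings(model="nomic-embed-text")
vector = embedder.embed_query("A time machine moves through the fourth dimension.")
print(len(vector))   # dimensionality of the embedding
print(vector[:5])    # first few numbers of the vector representation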
vectorDb = Chroma.from_documents(
documents=chunks,
embedding=OllamaEmbeddings(model="nomic-embed-text"),
collection_name="simple-rag",
)
This code snippet demonstrates how to store document chunks in a vector database using Chroma and Ollama Embeddings for semantic search or retrieval-augmented generation (RAG).
Chroma is a vector database that stores and manages document embeddings (vector representations). It's used to efficiently retrieve similar documents based on their vector embeddings. By default, Chroma can run in-memory, which means it doesn't require an external database, but all data will be lost when the application stops.
from_documents: This method is used to create a vector database by converting documents into vectors using a specified embedding model. Here, the documents (which are the chunks) are processed into vector embeddings.
· documents=chunks: The chunks variable contains the text of documents that have been split into smaller pieces. These chunks are what will be converted into vectors for storage in the vector database.
· embedding=OllamaEmbeddings(model="nomic-embed-text"): This line specifies the embedding model to be used for converting the document chunks into vectors. OllamaEmbeddings with the model "nomic-embed-text" will take each chunk and map it to a vector representation. The "nomic-embed-text" model is a high-performing open embedding model designed to capture semantic meaning and generate meaningful embeddings.
· collection_name="simple-rag": This specifies the name of the collection in the vector database where the embeddings will be stored. In this case, the collection is named "simple-rag". RAG (Retrieval-Augmented Generation) refers to a technique where a retrieval system (based on vectors) is used to retrieve relevant documents or information to augment the generation process.
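Once the collection is built, a quick way to sanity-check it (independent of the LLM) is a similarity search; the query string here is just an example:
# Retrieve the chunks most similar to an example query directly from Chroma.
docs = vectorDb.similarity_search("yearly dividend", k=3)
for doc in docs:
    print(doc.page_content[:120])  # preview of each retrieved chunk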
from langchain_ollama import ChatOllama
llm = ChatOllama(model=model)
ChatOllama is a class from the langchain_ollama library designed to facilitate the integration of Ollama's chat models into applications.
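As a minimal usage example, ChatOllama follows LangChain's chat model interface, so it can be invoked directly with a string (the prompt here is made up):
# Quick standalone call to the chat model; the reply text is on .content.
reply = llm.invoke("Say hello in one short sentence.")
print(reply.content)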
queryPrompt = PromptTemplate(
input_variables=["question"],
template="""
You are an AI language model assistant. Your task is to generate appropriate Answer
to the question by retrieving relevant documents from a vector database.
Original question: {question}
"""
)
This template is used to instruct the language model on how to retrieve relevant documents. It takes a question as input and generates a prompt to fetch related documents from the vector database.
retriever = MultiQueryRetriever.from_llm(
vectorDb.as_retriever(), llm, prompt=queryPrompt
)
· MultiQueryRetriever: This retrieves relevant documents based on the given query using a vector database.
· vectorDb.as_retriever(): Converts the vector database (vectorDb) into a retriever that can fetch documents.
· llm: The language model used by the retriever. Specifying the LLM in the retrieval process is crucial in the context of Retrieval-Augmented Generation (RAG) because it enables the retriever to work with the language model in a way that improves the relevance and quality of the documents being fetched.
In MultiQueryRetriever, multiple queries can be formulated and sent to the vector database to improve the search process. The LLM aids in generating these multiple queries, making it more flexible and intelligent. Without the LLM, the retriever might just generate a single query or rely on simpler search methods, which could limit its ability to gather relevant information.
· prompt=queryPrompt: The template we defined earlier, which helps the LLM know how to retrieve the relevant documents.
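To see what the retriever produces on its own, before wiring it into the chain, you can invoke it directly (the question is the same example used in the program):
# Run the retriever by itself and inspect the fetched chunks.
docs = retriever.invoke("What is Yearly dividend for Doom industries?")
print(len(docs))                   # number of unique chunks retrieved
print(docs[0].page_content[:200])  # preview of the top result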
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
This defines the actual prompt that will be used by the language model when generating answers. It asks the model to base its response only on the provided context.
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
res = chain.invoke(input=("Summarize about time machine and doom industries?"))
When you call chain.invoke(input=("Summarize about time machine and doom industries?")), the input undergoes the following steps:
The input question "Summarize about time machine and doom industries?" is passed to the chain via the invoke method.
Flow Through the Chain: The input flows through each of the components in the chain in the order they are defined:
· Retriever (retriever): The question is sent to the retriever component, which fetches relevant documents (based on the question) from the vector database. This happens before the question itself is passed to the LLM.
· Prompt: After retrieving the relevant documents, these documents are combined with the question and formatted using the prompt template. The template will structure the input (question + context) into a format the LLM can understand.
· LLM: The formatted prompt (question + context) is passed to the LLM, which generates an answer to the question based on the retrieved context. The LLM uses its training to understand the question and the provided documents to generate an informative answer.
· StrOutputParser: The generated answer from the LLM is passed to StrOutputParser, which extracts the message content and returns it as a plain string before the result is returned to the caller.
prompt vs queryPrompt variables in the code
queryPrompt is a template for the retrieval process. It is used by the retriever (MultiQueryRetriever) to fetch the most relevant documents from the vector database based on the input question.
The prompt is a template for the generation process. ‘prompt’ is used by the language model (LLM) to generate an answer based on the fetched context.