An Embedding Store is a storage mechanism for vector embeddings, typically used in machine learning, natural language processing (NLP), and retrieval-augmented generation (RAG) systems.
1. What is an Embedding?
An embedding is a high-dimensional numeric vector that represents data (like text, images, or code) in such a way that similar items are close together in vector space.
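To make "close together in vector space" concrete, here is a minimal sketch in plain Java of cosine similarity, the metric most embedding stores use. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions.

```java
public class CosineSimilarityDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings" for three concepts
        float[] cat = {0.9f, 0.1f, 0.0f};
        float[] kitten = {0.8f, 0.2f, 0.1f};
        float[] car = {0.0f, 0.1f, 0.9f};

        // Semantically related items score higher than unrelated ones
        System.out.printf("cat vs kitten: %.3f%n", cosineSimilarity(cat, kitten));
        System.out.printf("cat vs car:    %.3f%n", cosineSimilarity(cat, car));
    }
}
```

Running this prints a similarity near 1.0 for the related pair and near 0.0 for the unrelated pair, which is exactly the property a similarity search exploits.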
2. What is an Embedding Store?
An Embedding Store is a component (or system) that is responsible for:
· Storing embeddings
· Retrieving embeddings by ID
· Searching for similar embeddings (usually using cosine similarity or other distance metrics)
· Managing entries (adding, deleting, updating)
In real-world applications such as semantic search engines, RAG-based chatbots, recommendation systems, and clustering or classification tasks, you often need to store many embeddings and perform fast similarity searches over them. That is exactly what an embedding store is built for.
3. EmbeddingStore interface in LangChain4j
The EmbeddingStore interface in LangChain4j serves as an abstraction for a vector database, a specialized system designed to store and retrieve embeddings efficiently.
Embeddings are high-dimensional numeric representations of data (such as text), and the EmbeddingStore allows you to:
· Store these embeddings
· Search for embeddings that are similar (i.e., close in vector space)
· Manage embedding entries by ID
The following tables summarize the key methods of the EmbeddingStore interface.
Add Methods

Method | Description
String add(Embedding embedding) | Adds a single embedding and returns a randomly generated ID.
void add(String id, Embedding embedding) | Adds a single embedding with a specified custom ID.
String add(Embedding embedding, Embedded embedded) | Adds an embedding along with the original content (Embedded, e.g., TextSegment), and returns a generated ID.
List<String> addAll(List<Embedding> embeddings) | Adds multiple embeddings and returns a list of generated IDs.
List<String> addAll(List<Embedding> embeddings, List<Embedded> embedded) | Adds multiple embeddings with corresponding content and returns their generated IDs.
void addAll(List<String> ids, List<Embedding> embeddings, List<Embedded> embedded) | Adds multiple embeddings with provided IDs and associated content.
Search Method

Method | Description
EmbeddingSearchResult<Embedded> search(EmbeddingSearchRequest request) | Searches for the embeddings most similar to the queryEmbedding, applying the optional filter, minimum score, and maximum number of results.
EmbeddingSearchRequest properties:
· Embedding queryEmbedding: The embedding to compare against.
· int maxResults: Maximum number of similar results to return.
· double minScore: Minimum similarity score threshold.
· Filter filter: (Optional) Filter to restrict results based on metadata.
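A sketch of building such a request, again with a toy query embedding (the metadata key "topic" is a hypothetical example, not something defined by the library):

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;

import static dev.langchain4j.store.embedding.filter.MetadataFilterBuilder.metadataKey;

public class SearchRequestSketch {

    public static void main(String[] args) {
        // Toy query embedding; normally produced by an EmbeddingModel
        Embedding queryEmbedding = Embedding.from(new float[]{0.1f, 0.2f, 0.3f});

        EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
                .queryEmbedding(queryEmbedding)   // required: the vector to compare against
                .maxResults(5)                    // cap the number of matches returned
                .minScore(0.7)                    // drop weak matches below this score
                .filter(metadataKey("topic").isEqualTo("finance")) // optional metadata filter
                .build();

        System.out.println("maxResults = " + request.maxResults());
    }
}
```

Only queryEmbedding is mandatory; the other properties have sensible defaults if you omit them.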
Remove Methods

Method | Description
void remove(String id) | Removes a single embedding by its ID.
void removeAll(Collection<String> ids) | Removes all embeddings whose IDs are in the provided collection.
void removeAll(Filter filter) | Removes all embeddings that match the given filter.
void removeAll() | Clears all embeddings from the store.
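A sketch of the remove variants, using InMemoryEmbeddingStore (note that not every EmbeddingStore implementation necessarily supports every remove operation; check the documentation of the store you use):

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;

public class RemoveMethodsSketch {

    public static void main(String[] args) {
        EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

        String id1 = store.add(Embedding.from(new float[]{0.1f, 0.2f}));
        String id2 = store.add(Embedding.from(new float[]{0.3f, 0.4f}));

        // Remove a single entry by its ID
        store.remove(id1);

        // Remove a batch of entries by ID
        store.removeAll(List.of(id2));

        // Or wipe the store completely
        store.removeAll();
    }
}
```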
In summary, the EmbeddingStore interface lets you:
· Add individual or bulk embeddings (with or without IDs and content),
· Search for similar embeddings using flexible criteria,
· Remove embeddings selectively or entirely.
InMemoryEmbeddingStore
InMemoryEmbeddingStore is a concrete implementation of the EmbeddingStore interface in LangChain4j. As the name suggests, it stores embeddings in memory (i.e., in the JVM heap), making it simple and fast for small to medium-scale use cases.
Key Features Of InMemoryEmbeddingStore
1. In-Memory Storage: All embeddings are stored in memory (RAM). It is ideal for prototyping, demos, or applications with a limited dataset, and not suitable for very large-scale or distributed systems due to memory limitations.
2. Brute-Force Search: Similarity search is implemented using a brute-force approach by iterating through all stored embeddings. It calculates similarity (e.g., cosine similarity) between the query and each stored embedding, sorts and returns the most relevant results. While this is simple, it's not optimized for speed when storing thousands/millions of embeddings.
3. Persistence Support: You can serialize and deserialize the store's contents:
· serializeToJson(): Converts the in-memory store to a JSON string.
· serializeToFile(Path): Writes the store to a file on disk.
· fromJson(String): Loads the store from a JSON string.
· fromFile(Path): Loads the store from a file.
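A sketch of round-tripping the store through JSON and through a file, using the four methods listed above (the temp-file path is created just for the example):

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.nio.file.Files;
import java.nio.file.Path;

public class PersistenceSketch {

    public static void main(String[] args) throws Exception {
        InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
        store.add(Embedding.from(new float[]{0.1f, 0.2f}), TextSegment.from("hello"));

        // Round-trip through a JSON string
        String json = store.serializeToJson();
        InMemoryEmbeddingStore<TextSegment> fromJson = InMemoryEmbeddingStore.fromJson(json);

        // Round-trip through a file on disk
        Path file = Files.createTempFile("embedding-store", ".json");
        store.serializeToFile(file);
        InMemoryEmbeddingStore<TextSegment> fromFile = InMemoryEmbeddingStore.fromFile(file);

        System.out.println("Store restored from JSON and from file.");
    }
}
```

This makes InMemoryEmbeddingStore usable across restarts for small datasets, without needing an external vector database.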
The complete working application is shown below.
EmbeddingStoreDemo.java
package com.sample.app.embeddings;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class EmbeddingStoreDemo {

    public static void main(String[] args) {
        // Step 1: Initialize embedding model
        EmbeddingModel model = new BgeSmallEnV15QuantizedEmbeddingModel();

        // Step 2: Prepare text data
        String sentence1 = "The stock market surged as tech companies reported strong earnings.";
        String sentence2 = "Tech giants like Apple and Amazon saw record profits this quarter.";
        String sentence3 = "Heavy rain caused flooding in coastal towns yesterday.";

        try {
            // Step 3: Generate embeddings
            Embedding embedding1 = model.embed(sentence1).content();
            Embedding embedding2 = model.embed(sentence2).content();
            Embedding embedding3 = model.embed(sentence3).content();

            // Step 4: Initialize in-memory embedding store
            EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

            // Step 5: Add embeddings to the store
            embeddingStore.add(embedding1, TextSegment.from(sentence1));
            embeddingStore.add(embedding2, TextSegment.from(sentence2));
            embeddingStore.add(embedding3, TextSegment.from(sentence3));

            System.out.println("Embeddings added to the store successfully.");

            // Step 6: Perform a search
            Embedding queryEmbedding = model.embed("Which companies had strong earnings?").content();
            EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
                    .queryEmbedding(queryEmbedding)
                    .maxResults(2)
                    .minScore(0.5)
                    .build();
            EmbeddingSearchResult<TextSegment> embeddingSearchResult = embeddingStore.search(request);

            // Step 7: Display results
            System.out.println("\nTop relevant results:");
            embeddingSearchResult.matches().forEach(match -> System.out.printf(
                    "Score: %.3f | Text: %s%n",
                    match.score(),
                    match.embedded() != null ? match.embedded().text() : "[No TextSegment stored]"));
        } catch (Exception e) {
            System.err.println("An error occurred while embedding or storing data: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Output
Embeddings added to the store successfully.

Top relevant results:
Score: 0.910 | Text: The stock market surged as tech companies reported strong earnings.
Score: 0.846 | Text: Tech giants like Apple and Amazon saw record profits this quarter.
At the time of writing, LangChain4j supports 25+ embedding stores; you can find the full list at the link below.
https://docs.langchain4j.dev/integrations/embedding-stores/