Thursday, 4 September 2025

Getting Started with Embedding Models in LangChain4j: A Beginner's Guide to Text Representation

As Large Language Models (LLMs) become more accessible, many Java developers are exploring how to integrate them into their applications. One core concept in this space is text embeddings, a way to convert text into numbers so that machines can understand and compare it.

 

In this blog post, we’ll explore what embeddings are, why they matter, and how you can use embedding models in LangChain4j, the Java library inspired by LangChain, to build smart, searchable applications.

 

1. What is an Embedding?

In machine learning, an embedding is a way to represent real-world data as a list of numbers (also called a vector) in a multi-dimensional space. Think of it as translating complex data like a sentence or an image into a form that a computer can understand and compare easily.

 

Imagine you have a huge spreadsheet where each row is a paragraph from a news article, and each paragraph is converted into a vector of numbers. Now, similar paragraphs will have similar vectors, and those vectors will be close together in this abstract "embedding space".

 

 

Let’s say we have three news paragraphs:

 

·      "The stock market surged as tech companies reported strong earnings."

·      "Tech giants like Apple and Amazon saw record profits this quarter."

·      "Heavy rain caused flooding in coastal towns yesterday."

 

When converted into vectors (numbers), they might look like this (simplified for clarity):

·      Stock market news: (0.8, 0.6, 0.1)

·      Tech profits news: (0.7, 0.5, 0.2)

·      Weather news: (0.1, 0.2, 0.9)

 

Embedding Space

·      Think of it as a 3D map (in general, the space is N-dimensional, matching the vector's dimension) where each paragraph is a point.

·      Similar paragraphs (e.g., about tech earnings) will have close vector values, so they appear near each other.

·      Different topics (e.g., weather vs. stocks) will be far apart.
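The "closeness" described above is usually measured with cosine similarity. As a minimal sketch using the simplified 3-dimensional vectors from the example (the class and variable names here are illustrative):

```java
// Demonstrates how closeness in embedding space is measured with
// cosine similarity, using the simplified 3-dimensional vectors above.
public class EmbeddingSpaceDemo {

    // Cosine similarity = dot(a, b) / (|a| * |b|); ranges from -1 to 1,
    // where values near 1 mean the vectors point in a similar direction.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] stockNews = { 0.8, 0.6, 0.1 };
        double[] techNews = { 0.7, 0.5, 0.2 };
        double[] weatherNews = { 0.1, 0.2, 0.9 };

        System.out.printf("stock vs tech    : %.4f%n", cosineSimilarity(stockNews, techNews));
        System.out.printf("stock vs weather : %.4f%n", cosineSimilarity(stockNews, weatherNews));
    }
}
```

The two finance-related vectors score around 0.99, while the stock/weather pair scores around 0.31, matching the intuition that related topics sit close together in embedding space.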

 


The diagram above was drawn using https://www.geogebra.org/m/jcw38f5f

 

Why Does This Matter?

Machine learning models work with numbers, not raw text, images, or audio.

 

For example:

·      A word like “apple” needs to be turned into numbers before the model can work with it.

·      An image of a cat must be represented numerically so a model can learn what a cat looks like.

 

Embeddings solve this problem. They take things like words, sentences, images, users, or even abstract ideas (like political views) and convert them into vectors of numbers. These numbers are not random; they are learned representations that capture meaning or relationships.

 

2. Embedding Model

Embedding models are machine learning models that convert text (or other data like images, code, etc.) into numerical vectors (usually dense, fixed-size arrays of floating-point numbers).

 

Why do we need an Embedding Model?

Computers don't understand text the way humans do. For example, we know "dog" and "puppy" are related, but a computer just sees characters. So, we need to transform text into numbers in such a way that:

 

·      Similar meanings -> similar vectors

·      Different meanings ->  vectors far apart

 

This is where embeddings come in.

 

For example, let’s say we embed the sentence:

"The stock market surged as tech companies reported strong earnings."

 

The embedding model might convert it into a vector like the one below.

[0.033070844, 0.0153403785, -0.02272317, -0.03267831, 0.013239985, 0.013246246, 0.05294178, 0.035563156, 0.025776176, -0.04903232, 0.012191639, 0.042546302, 0.021109197, -0.008099615, 0.00165633, 0.0040932787, ..., 0.020203792, 0.032481972]

(truncated for readability; the full vector produced by this model has 384 dimensions)

 

How Do Embedding Models Work?

At a high level, these models:

 

·      Take text as input.

·      Tokenize it (break into words/subwords).

·      Process using neural networks (often Transformers).

·      Output a vector that captures the semantic meaning.
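The steps above can be sketched with a deliberately toy pipeline. This is not a real embedding model: step 2 simply hashes tokens into buckets instead of running a trained Transformer, but it shows the tokenize -> process -> fixed-size-vector shape of the pipeline.

```java
import java.util.Arrays;

// Toy illustration of the embedding pipeline shape (NOT a real model):
// a real embedding model replaces the hashing step with a trained
// neural network that captures semantic meaning.
public class ToyEmbeddingPipeline {

    static final int DIMENSION = 8; // real models use hundreds of dimensions

    static double[] embed(String text) {
        // 1. Tokenize: naive split on whitespace
        String[] tokens = text.toLowerCase().split("\\s+");

        // 2. "Process": map each token into a bucket of a fixed-size vector
        double[] vector = new double[DIMENSION];
        for (String token : tokens) {
            vector[Math.floorMod(token.hashCode(), DIMENSION)] += 1.0;
        }

        // 3. Output: normalize to unit length so vectors are comparable
        double norm = 0;
        for (double v : vector) {
            norm += v * v;
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < DIMENSION; i++) {
            vector[i] /= norm;
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(embed("the stock market surged")));
    }
}
```

Whatever the input length, the output is always a fixed-size, unit-length vector, which is exactly the contract a real embedding model provides.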

 

3. EmbeddingModel interface in LangChain4j

The EmbeddingModel interface in LangChain4j represents any model capable of converting text into embeddings, that is, a vector of numbers that captures the meaning of the text. Think of it as a plug-and-play adapter that lets your Java code talk to OpenAI, Hugging Face, or local embedding models through a unified interface.

public interface EmbeddingModel {

    default Response<Embedding> embed(String text) {
        return embed(TextSegment.from(text));
    }

    default Response<Embedding> embed(TextSegment textSegment) {
        Response<List<Embedding>> response = embedAll(singletonList(textSegment));
        ValidationUtils.ensureEq(response.content().size(), 1,
                "Expected a single embedding, but got %d", response.content().size());
        return Response.from(response.content().get(0), response.tokenUsage(), response.finishReason());
    }

    Response<List<Embedding>> embedAll(List<TextSegment> textSegments);

    default int dimension() {
        return embed("test").content().dimension();
    }
}

EmbeddingModel.embed(String text)

This method takes a plain text string as input and returns its corresponding embedding, a vector representation that captures the meaning of the text. Internally, it wraps the string into a TextSegment object and then delegates the processing to the embed(TextSegment) method. It's a convenient shortcut for cases where you don't need to attach metadata to the text.

 

EmbeddingModel.embed(TextSegment textSegment)

This method accepts a TextSegment, which is a more structured form of text that can optionally include metadata (like a title, category, or source). It embeds this segment into a vector by internally calling the embedAll(List<TextSegment>) method with a single-item list. It also ensures that the output contains exactly one embedding, providing an extra layer of safety and validation.

 

EmbeddingModel.embedAll(List<TextSegment> textSegments)

This is the core method of the interface. It accepts a list of TextSegment objects and returns a list of corresponding embeddings, one for each input segment, wrapped in a Response object.

 

EmbeddingModel.dimension()

This method returns the dimensionality (i.e., the length) of the vectors produced by the embedding model.
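As a sketch of how these methods fit together, the snippet below embeds a batch of TextSegments, each optionally carrying metadata. The class name EmbedAllDemo and the "category" metadata key are illustrative; the model is the same quantized BGE model used in the full demo that follows, so the same dependency is assumed.

```java
package com.sample.app.embeddings;

import java.util.List;

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.model.output.Response;

public class EmbedAllDemo {

    public static void main(String[] args) {
        EmbeddingModel model = new BgeSmallEnV15QuantizedEmbeddingModel();

        // TextSegments can carry optional metadata alongside the text
        TextSegment s1 = TextSegment.from(
                "The stock market surged as tech companies reported strong earnings.",
                Metadata.from("category", "finance"));
        TextSegment s2 = TextSegment.from(
                "Heavy rain caused flooding in coastal towns yesterday.",
                Metadata.from("category", "weather"));

        // embedAll embeds the whole batch in a single call
        Response<List<Embedding>> response = model.embedAll(List.of(s1, s2));
        List<Embedding> embeddings = response.content();

        System.out.println("Embeddings returned : " + embeddings.size());
        System.out.println("Vector dimension    : " + model.dimension());
    }
}
```

Batching segments through embedAll is generally preferable to calling embed in a loop, since a provider can process the whole batch in one request.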

 

Below is a complete working application.

 

EmbeddingModelDemo.java

 

package com.sample.app.embeddings;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.store.embedding.CosineSimilarity;

public class EmbeddingModelDemo {

    public static void main(String[] args) {
        // Initialize local embedding model
        EmbeddingModel model = new BgeSmallEnV15QuantizedEmbeddingModel();

        // Define sentences
        String sentence1 = "The stock market surged as tech companies reported strong earnings.";
        String sentence2 = "Tech giants like Apple and Amazon saw record profits this quarter.";
        String sentence3 = "Heavy rain caused flooding in coastal towns yesterday.";

        // Generate embeddings
        Embedding embedding1 = model.embed(sentence1).content();
        Embedding embedding2 = model.embed(sentence2).content();
        Embedding embedding3 = model.embed(sentence3).content();

        //System.out.println(embedding1.toString());

        // Compute cosine similarities
        double sim1vs2 = CosineSimilarity.between(embedding1, embedding2);
        double sim1vs3 = CosineSimilarity.between(embedding1, embedding3);
        double sim2vs3 = CosineSimilarity.between(embedding2, embedding3);

        // Display results
        System.out.printf("Similarity between sentence 1 and 2: %.4f%n", sim1vs2);
        System.out.printf("Similarity between sentence 1 and 3: %.4f%n", sim1vs3);
        System.out.printf("Similarity between sentence 2 and 3: %.4f%n", sim2vs3);
    }

}

 

Output

Similarity between sentence 1 and 2: 0.7657
Similarity between sentence 1 and 3: 0.4450
Similarity between sentence 2 and 3: 0.3622

Refer to the link below to see all embedding models supported by LangChain4j.

https://docs.langchain4j.dev/category/embedding-models

  
