If you're starting out in machine learning or AI, you'll frequently come across the term embedding. It sounds technical, but the idea is easy to grasp. This post walks you through what embeddings are, why they matter, and how they help machines understand complex data like text, images, and more.
1. What is an Embedding?
In machine learning, an embedding is a way to represent real-world data as a list of numbers (also called a vector) in a multi-dimensional space. Think of it as translating complex data like a sentence or an image into a form that a computer can understand and compare easily.
Imagine a huge spreadsheet where each row is a paragraph from a news article, and each paragraph has been converted into a vector of numbers. Similar paragraphs will have similar vectors, so they sit close together in this abstract "embedding space".
Let’s say we have three news paragraphs:
· "The stock market surged as tech companies reported strong earnings."
· "Tech giants like Apple and Amazon saw record profits this quarter."
· "Heavy rain caused flooding in coastal towns yesterday."
When converted into vectors (numbers), they might look like this (simplified for clarity):
· Stock market news: (0.8, 0.6, 0.1)
· Tech profits news: (0.7, 0.5, 0.2)
· Weather news: (0.1, 0.2, 0.9)
Embedding Space
· Think of it as a 3D map (more generally, an N-dimensional space, where N is the vector's dimension) in which each paragraph is a point.
· Similar paragraphs (e.g., about tech earnings) will have close vector values, so they appear near each other.
· Different topics (e.g., weather vs. stocks) will be far apart.
The diagram above was drawn using https://www.geogebra.org/m/jcw38f5f.
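To make "close together" concrete, here is a small plain-Java sketch using the simplified vectors above (remember, those numbers are illustrative, not real model output). It measures the straight-line distance between the paragraphs in the 3D embedding space:

```java
public class ToyEmbeddingDistance {

    // Euclidean (straight-line) distance between two points in the embedding space
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] stockNews = {0.8, 0.6, 0.1};
        double[] techNews  = {0.7, 0.5, 0.2};
        double[] weather   = {0.1, 0.2, 0.9};

        System.out.printf("stock vs tech:    %.3f%n", distance(stockNews, techNews)); // 0.173
        System.out.printf("stock vs weather: %.3f%n", distance(stockNews, weather));  // 1.136
    }
}
```

The two tech-related paragraphs are much closer to each other than either is to the weather paragraph, which is exactly what the map analogy predicts.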
Why Does This Matter?
Machine learning models work with numbers, not with raw text, images, or audio.
For example:
· A word like “apple” needs to be turned into numbers before the model can work with it.
· An image of a cat must be represented numerically so a model can learn what a cat looks like.
Embeddings solve this problem. They take things like words, sentences, images, users, or even abstract ideas (like political views) and convert them into vectors of numbers. These numbers are not random; they are learned representations that capture meaning or relationships.
2. Where Are Embeddings Used?
Embeddings are used in almost every modern AI application:
· Search engines: to match your query with the most relevant content.
· Recommendation systems: to suggest products, movies, or songs based on your preferences.
· Chatbots and language models: to understand the meaning of your inputs.
· Clustering: to group similar items (like emails, documents, or customers).
· Classification: to assign labels to data (e.g., spam or not spam).
3. Types of Vector Embeddings
Various types of vector embeddings are used across different applications. Here are some common examples.
3.1 Word Embeddings
Convert individual words into fixed-length vectors, preserving semantic (meaning) and syntactic (grammar) relationships.
Example
King – Man + Woman ≈ Queen (captures gender relationships).
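A toy sketch of this analogy, with hand-picked 3-dimensional vectors (real word embeddings have hundreds of dimensions and are learned from data, not chosen by hand as they are here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordAnalogySketch {

    // Made-up vectors; the dimensions loosely stand for (royalty, maleness, femaleness)
    static final double[] KING  = {0.9, 0.8, 0.1};
    static final double[] MAN   = {0.1, 0.9, 0.1};
    static final double[] WOMAN = {0.1, 0.1, 0.9};
    static final double[] QUEEN = {0.9, 0.0, 0.9};

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the word whose vector is nearest to the target vector
    static String closestTo(double[] target) {
        Map<String, double[]> candidates = new LinkedHashMap<>();
        candidates.put("king", KING);
        candidates.put("man", MAN);
        candidates.put("woman", WOMAN);
        candidates.put("queen", QUEEN);
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : candidates.entrySet()) {
            double d = distance(target, e.getValue());
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Compute king - man + woman, component by component
        double[] result = new double[3];
        for (int i = 0; i < 3; i++) {
            result[i] = KING[i] - MAN[i] + WOMAN[i];
        }
        System.out.println("king - man + woman is closest to: " + closestTo(result));
    }
}
```

With these toy numbers the resulting vector lands almost exactly on "queen", mirroring how learned embeddings capture the gender relationship.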
3.2 Sentence Embeddings
Encode entire sentences into vectors while preserving meaning, even if wording differs.
Example
"The cat sat on the mat" ≈ "A feline rested on the rug" (similar vectors).
3.3 Document Embeddings
Represent long-form text (articles, reports) as dense vectors. A dense vector in machine learning is a vector where most or all of its elements (features) are non-zero, in contrast to sparse representations such as one-hot encodings.
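A quick contrast between a sparse one-hot vector and a dense embedding (the numbers are made up for illustration):

```java
public class DenseVsSparse {

    // Counts how many elements of the vector are non-zero
    static long nonZeroCount(double[] v) {
        long count = 0;
        for (double x : v) {
            if (x != 0.0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One-hot: one slot per vocabulary word, a single 1, everything else 0
        double[] oneHot = {0, 0, 0, 1, 0, 0, 0, 0, 0, 0};
        // Dense embedding: every element carries some information
        double[] dense = {0.42, -0.17, 0.88, 0.05, -0.36};

        System.out.println("one-hot non-zeros: " + nonZeroCount(oneHot)); // 1 of 10
        System.out.println("dense non-zeros:   " + nonZeroCount(dense));  // 5 of 5
    }
}
```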
3.4 Image Embeddings
Convert images into vectors capturing visual features (edges, textures, objects).
3.5 User Embeddings
Encode user behavior (clicks, purchases) into vectors for personalization. Useful in applications like Netflix-style recommendations and fraud detection (spotting anomalous behavior).
3.6 Product Embeddings
Represent products (e-commerce items) for recommendation systems.
4. How Are Embeddings Created?
So far, we’ve learned:
· What embeddings are: A way to turn data (words, images, etc.) into numerical vectors (lists of numbers) that capture meaning.
· Where they’re used: Search engines, chatbots, recommendations, and more.
· What can be embedded: Words, sentences, images, users, products, and more.
How do we actually create these embeddings?
There are two main ways to generate embeddings:
· Using Pre-Trained Models (Quick & Easy)
· Training Your Own Model (Custom but Requires More Work)
Option 1: Pre-Trained Models (The Fast & Easy Way)
If you want to generate embeddings without training a model from scratch, pre-trained models are your best friend.
How Do Pre-Trained Models Work?
· These models have already been trained on massive datasets (millions of books, websites, images, etc.).
· You just feed in your data (e.g., a sentence), and they output an embedding (a vector of numbers).
Option 2: Train Your Own Model (The Custom Way)
If you have unique data (e.g., legal documents, rare languages, or specialized product catalogs), you might need to train your own embedding model.
How Does Training Work?
· Gather Data: You need a large dataset (e.g., thousands of product descriptions).
· Choose a Model: Like Word2Vec for words or a CNN for images.
· Train the Model: Training adjusts the model's weights so that similar items end up with similar vectors.
· Extract Embeddings: Use the trained model to generate vectors.
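The steps above can be miniaturized into a toy sketch. This is not a real training loop (no Word2Vec, no neural network); it just builds co-occurrence count vectors over a tiny four-sentence corpus to show the core idea: words that appear in similar contexts end up with similar vectors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyCooccurrenceEmbedding {

    static final String[] CORPUS = {
        "the cat sat on the mat",
        "the dog sat on the rug",
        "stocks rose after strong earnings",
        "markets rose after strong profits"
    };

    // Each word's "embedding" counts its immediate neighbours (window of 1)
    static Map<String, double[]> buildVectors(String[] corpus) {
        List<String> vocab = new ArrayList<>();
        for (String sentence : corpus) {
            for (String w : sentence.split(" ")) {
                if (!vocab.contains(w)) vocab.add(w);
            }
        }
        Map<String, double[]> vectors = new HashMap<>();
        for (String w : vocab) vectors.put(w, new double[vocab.size()]);
        for (String sentence : corpus) {
            String[] words = sentence.split(" ");
            for (int i = 0; i < words.length; i++) {
                if (i > 0) vectors.get(words[i])[vocab.indexOf(words[i - 1])]++;
                if (i < words.length - 1) vectors.get(words[i])[vocab.indexOf(words[i + 1])]++;
            }
        }
        return vectors;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = buildVectors(CORPUS);
        // "cat" and "dog" occur in near-identical contexts, so their
        // vectors come out similar; "stocks" does not.
        System.out.printf("cat vs dog:    %.3f%n", cosine(vectors.get("cat"), vectors.get("dog")));
        System.out.printf("cat vs stocks: %.3f%n", cosine(vectors.get("cat"), vectors.get("stocks")));
    }
}
```

Real training (Word2Vec, BERT, CNNs) replaces the raw counts with weights learned by optimization, but the goal is the same: similar items, similar vectors.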
5. Embeddings in LangChain4j
The Embedding class in LangChain4j models a semantic vector representation of text, commonly referred to as an embedding. These embeddings are typically generated by AI models and used in applications like semantic search, similarity comparison, or retrieval-augmented generation (RAG).
LocalEmbeddingExample.java
package com.sample.app.embeddings;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.*;

public class LocalEmbeddingExample {

    public static void main(String[] args) {
        // Step 1: Create the quantized embedding model
        EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();

        // Step 2: Define your input text
        String text = "LangChain4j simplifies working with LLMs in Java.";

        // Step 3: Generate the embedding
        Embedding embedding = embeddingModel.embed(text).content();

        // Step 4: Print the results
        System.out.println("Embedding vector:");
        System.out.println(embedding.vectorAsList());
        System.out.println("Embedding dimension: " + embedding.dimension());
    }
}
Output
Embedding vector: [-0.082744114, -0.023897879, -0.029047808, -0.037368014, -0.024763605, -0.009259008, -0.04551608, 0.023252701, 0.04100233, -0.08748646, -0.008455025, 0.026474712, 0.045002542, 0.019816818, 0.07184365, -0.012057282, -0.018909112, 0.044843785, -0.018666616, 0.008035629, 0.037856, -0.017876457, 0.019761454, -0.040591344, 0.017388204, 0.063495964, 0.03503259, -0.019071402, -0.051338263, -0.19808069, 0.033683438, -0.022472885, 0.034754787, 0.023875035, -0.08622, 0.018104564, -0.056176726, 0.009105614, -0.017272696, 0.024177276, -0.022674413, -0.012322063, 0.011683079, -0.04301081, -0.035067778, -0.08779675, 8.1929174E-4, -0.009406332, -0.029714983, -0.015589028, -0.024279878, -0.054651204, 0.030419536, 2.1614696E-4, 0.008121003, 0.04830378, 0.014498083, 0.08617197, -0.011710719, -0.0027238487, 0.040100526, 0.03201019, -0.084929995, 0.07765175, -0.050279368, 0.019215507, 0.01353482, 3.3818683E-4, 0.04166601, 0.07480434, 0.0123985475, 0.026133507, 0.011071286, 0.07086023, 0.0036411427, 0.02346627, -0.009587636, -0.03172558, 0.01936717, 0.0023019363, -0.081621476, -0.05905277, 0.05650085, -0.07157813, -0.06171017, 0.03248853, -0.025141746, -0.031956606, 0.0119370585, -0.007544868, -0.04889281, 0.01455014, 0.027499968, 0.034083057, -0.057719097, 0.0077606635, -0.0062166164, -0.0028239703, -0.029756702, 0.3838514, -0.0031927903, 0.0016062818, 0.023803858, -0.001999481, 0.012417851, 0.0116338115, -0.063581124, 0.0038245472, 0.020921463, -0.004941366, -0.04521869, -0.021060076, 0.028712468, 0.0073694955, -0.077902116, -0.013048149, 0.0046918746, -0.014669525, -0.053784493, 0.043703463, -0.0020053783, 0.009720397, -0.038737416, -0.01600671, 0.02863781, -0.03373163, -0.037642032, 0.044666875, 0.024651341, 0.060108446, -0.0070553003, -0.016101321, -0.080811486, 0.0018503749, 0.018867046, 0.013475371, -0.003415189, -0.056851164, 0.0285899, 0.025988495, -0.008051881, -0.013336573, -0.016253442, -0.043827143, -0.058982126, 0.074115045, 0.017398842, 0.010707802, 
-0.012107442, 0.011539475, 0.018395906, 0.070626296, -0.02532828, -0.055266514, 0.046766397, 0.042878896, 0.048386797, 0.016830266, -0.05771195, 0.0025546658, -0.035068333, -0.027377049, 0.034737103, 0.121355936, -0.05758517, -0.068853326, 0.03805245, 0.012676281, -0.0036938577, -0.05714334, -0.011884023, 0.030349717, -0.037901334, -0.007922751, 0.022437626, -0.02969601, -0.073003724, -0.01606871, -0.023695705, 0.03457529, 0.035057478, -0.01752255, -0.047709066, 0.022986384, -0.0043231035, 0.014622111, -0.0122830635, -0.024719968, 0.0045169704, -0.0040410534, 0.005944655, 0.047348704, -0.010411237, -0.0011211799, 0.0033639467, -0.046353135, -0.07325219, -0.013579352, 0.011084018, -0.036519393, 0.032407835, 0.07695445, 0.0156186605, 0.03867607, -0.004304201, -0.02989777, 0.048262917, -0.03376272, 0.0057669445, 0.039932843, -0.0902483, -0.0065555577, 0.055201802, -0.014687267, -0.029182624, 0.026725674, 0.017326785, 0.06439949, 0.018047946, -0.036152557, -0.09050463, 0.00933732, -0.027657067, -0.3154667, 0.011751174, 0.08639627, -0.0019587078, 0.0032839503, -0.031173106, -0.02624135, -0.019338517, 0.004841798, 0.04615466, 0.05328013, -0.007955703, -0.012157245, -0.001275557, 0.02130414, 0.06002365, -0.02416403, 0.006387204, -0.006832195, 0.041760307, -0.011332748, -0.006974836, 0.032819156, -0.095025875, 0.0033367847, 0.03712921, 0.11986668, -0.064719185, -0.02406231, -0.028371055, 0.083425336, 0.01929746, -0.008932217, -0.09721872, -0.009292892, 0.032427948, 5.336355E-4, 0.021204716, 0.035988733, 0.0069755083, -0.045157902, -0.06283432, -0.012702158, -0.11088262, -0.008053905, -0.033540864, -0.037196226, -0.042399302, -0.008969024, -0.021344073, -0.07823543, -0.019630905, 0.053345583, 0.06997609, -0.015087427, -0.009552916, 0.010396964, -0.045278974, 0.0159449, 0.027439563, 0.013914977, 0.021561569, 0.0018600357, 0.00447114, 0.0019598342, -0.023676097, -0.005051663, 0.04443559, 0.015230952, -0.07099633, -0.05789975, 0.03135458, -0.01850015, 0.009496254, 
-0.015460234, 0.08059632, -0.049303602, -0.016191155, -0.0472479, 0.014897245, 0.06851027, 0.06850619, 0.02263447, -0.012987456, -0.02651845, 0.07594378, -0.005779505, 0.053935032, -0.023069147, 0.028343616, -0.04819376, 0.008733487, 0.01594356, 0.029340087, 0.02133689, 0.017240321, -0.24886835, 0.009677923, -0.024411175, 0.015828667, -0.046145026, 0.0437849, 0.0051265643, -0.028683688, -0.05070976, 0.040816627, 0.034419015, 0.0123211825, 0.03256292, -0.046885267, 0.05567348, -0.0046847067, 0.033388674, 0.027819678, 0.0517406, -0.023178963, 3.9037797E-4, 0.0476578, 0.20756362, -0.02916805, -0.005396151, 0.14030437, 0.0015359628, 0.058372382, 0.03699338, 0.052839715, -0.03800914, 0.04677676, 0.11968715, 0.057695046, 0.013461054, 0.05130623, -0.05109387, 0.036541216, 0.032602523, 0.042032186, 8.194379E-4, 0.02742906, 0.02568754, -0.023322811, 0.075307235, 0.0422195, -0.006003252, -0.047988005, 0.018396173, 0.020890657, -0.033157352, -0.061069064, -0.021945799, 0.04263334, 0.010506472, 0.017822735, -0.023695739, 0.03148585, -0.019827662, -0.0042233444, 0.0020321715, -0.0549567, 0.040304437, -0.006758919, -0.013974651] Embedding dimension: 384
Using Cosine similarity to measure the relation between two embeddings
Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them. The value ranges from:
· +1.0 (vectors point in the same direction; very similar)
· 0 (vectors are orthogonal; unrelated)
· -1.0 (vectors point in opposite directions; rare in practice for text embeddings)
In NLP, it's a common technique for comparing the semantic similarity of texts.
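Before reaching for the library helper, note that the formula itself is short: the dot product of the two vectors divided by the product of their lengths. Here is a minimal plain-Java sketch, applied to the simplified news vectors from earlier in the post (illustrative numbers, not real model output):

```java
public class CosineSketch {

    // cosine similarity = dot(a, b) / (|a| * |b|)
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] stock   = {0.8, 0.6, 0.1};
        double[] tech    = {0.7, 0.5, 0.2};
        double[] weather = {0.1, 0.2, 0.9};

        System.out.printf("stock vs tech:    %.4f%n", cosineSimilarity(stock, tech));
        System.out.printf("stock vs weather: %.4f%n", cosineSimilarity(stock, weather));
    }
}
```

The two tech-related vectors score close to +1.0, while the weather vector scores much lower against both. LangChain4j ships the same computation as a utility, used in the example below.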
CosineSimilarityExample.java
package com.sample.app.embeddings;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.store.embedding.CosineSimilarity;

public class CosineSimilarityExample {

    public static void main(String[] args) {
        // Initialize local embedding model
        EmbeddingModel model = new BgeSmallEnV15QuantizedEmbeddingModel();

        // Define sentences
        String sentence1 = "The stock market surged as tech companies reported strong earnings.";
        String sentence2 = "Tech giants like Apple and Amazon saw record profits this quarter.";
        String sentence3 = "Heavy rain caused flooding in coastal towns yesterday.";

        // Generate embeddings
        Embedding embedding1 = model.embed(sentence1).content();
        Embedding embedding2 = model.embed(sentence2).content();
        Embedding embedding3 = model.embed(sentence3).content();

        // Compute cosine similarities
        double sim1vs2 = CosineSimilarity.between(embedding1, embedding2);
        double sim1vs3 = CosineSimilarity.between(embedding1, embedding3);
        double sim2vs3 = CosineSimilarity.between(embedding2, embedding3);

        // Display results
        System.out.printf("Similarity between sentence 1 and 2: %.4f%n", sim1vs2);
        System.out.printf("Similarity between sentence 1 and 3: %.4f%n", sim1vs3);
        System.out.printf("Similarity between sentence 2 and 3: %.4f%n", sim2vs3);
    }
}
Output
Similarity between sentence 1 and 2: 0.7657
Similarity between sentence 1 and 3: 0.4450
Similarity between sentence 2 and 3: 0.3622