What is an embedding in AI?

An embedding is a dense numerical vector — a list of floating-point numbers — that represents the semantic meaning of a piece of data (text, image, audio). Two embeddings that are close together in vector space represent inputs that are semantically similar. For example, the embeddings for "car" and "automobile" will be very close to each other, while the embedding for "banana" will be far away from both.

How are text embeddings created?

Text embeddings are created by transformer encoder models (like BERT or OpenAI's text-embedding-3 family). The model processes the input text through multiple attention layers and outputs a fixed-size vector, typically the pooled representation of the final hidden states. These models are trained on large datasets using contrastive learning objectives that pull similar items closer and push dissimilar items apart in vector space.

What is cosine similarity and why is it used for embeddings?

Cosine similarity measures the angle between two vectors rather than their raw distance. Two vectors pointing in the same direction score 1.0 (identical meaning), perpendicular vectors score 0.0 (unrelated), and opposite directions score -1.0. It is preferred over Euclidean distance for embeddings because the magnitude of the vector (how "activated" the model is) matters less than the direction (what the vector represents).

What is a vector database and why do I need one?

A vector database stores embeddings and provides fast approximate nearest neighbor (ANN) search over millions or billions of vectors. A plain SQL query that sorts by similarity cannot do this efficiently, because exact search compares the query vector against every stored vector: an O(n) scan that is fine for a few thousand records but becomes too slow for interactive use once the corpus reaches millions. Vector databases use indexing algorithms like HNSW and IVF to answer nearest-neighbor queries in milliseconds even at scale.

Do large context windows make embeddings obsolete?

No. Even with million-token context windows, embeddings remain essential for retrieving the right information before it enters the context. Stuffing an entire 10-million-document knowledge base into one prompt is not feasible: it would exceed any current context limit, and even a fraction of it would be prohibitively expensive per query. Embeddings let you retrieve only the relevant 5-10 chunks, keeping retrieval fast and cost-effective.

What Are Embeddings? Vector Search and Semantic AI Complete Guide (2026) | explainx.ai Blog

Why Embeddings Are the Primitive Every AI Engineer Needs to Understand

Whenever an AI product "understands" that "affordable car" and "cheap vehicle" mean the same thing, embeddings are doing that work. Whenever a RAG pipeline surfaces the right document chunk from a library of 500,000 pages, embeddings made the match. Whenever an AI agent recalls a relevant memory from a conversation that happened three weeks ago, an embedding was stored and retrieved.

Embeddings are arguably the most important low-level primitive in applied AI. Yet most tutorials cover them in a single paragraph on the way to a LangChain demo. This guide goes the other direction: starting from first principles, building up the math, and arriving at production-ready practices.

What an Embedding Actually Is

Start with the problem embeddings solve.

Computers work with numbers. Text, images, and audio are not numbers. Before neural networks could work with language, everything had to be converted to a numerical representation. The earliest approach was one-hot encoding: create a vector with one dimension per word in the vocabulary, and mark a single dimension as 1 while all others are 0.

With a 50,000-word vocabulary:

"cat" → [0, 0, 1, 0, 0, ..., 0] (1 at position 3, zeros everywhere else)
"kitten" → [0, 0, 0, 0, 0, ..., 1] (1 at position 50,000, the last slot)

This is a mathematical disaster. One-hot vectors are:

Enormous: 50,000 dimensions per word
Completely uninformative: the cosine similarity between any two distinct one-hot vectors is exactly 0, whether comparing "cat" and "kitten" or "cat" and "democracy"
Sparse: 49,999 zeros out of 50,000 values

An embedding replaces this sparse, meaningless representation with a dense, learned vector, typically a few hundred to a few thousand floating-point numbers (384 to 3072 for the models listed later in this guide), where the position of the vector in that high-dimensional space encodes semantic meaning.

# One-hot (bad): 50,000 dimensions, no meaning
"cat"    → [0, 0, 1, 0, 0, ..., 0]  # 49,999 zeros
"kitten" → [0, 0, 0, 0, 0, ..., 1]  # completely different

# Embedding (good): 768 dimensions, meaning encoded
"cat"    → [0.21, -0.54, 0.87, ..., 0.13]
"kitten" → [0.19, -0.51, 0.89, ..., 0.11]  # very similar numbers

The intuition: nearby points in embedding space = similar meaning. The model has learned, from training on vast amounts of text, that "cat" and "kitten" appear in similar contexts (cuddly, meow, pet, food bowl) and has placed their vectors close together in the 768-dimensional space.

This is not hand-crafted. Nobody programmed "cat is close to kitten." The model inferred it from patterns in the training data.
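
To make the contrast concrete, here is a minimal sketch in plain Python. The vocabulary indices and the four dense values are made up for illustration (they are not real model output); the point is only that distinct one-hot vectors always score 0, while nearby dense vectors score close to 1.

```python
import math

def one_hot(index, vocab_size):
    """Return a one-hot list with 1.0 at `index` (0-based) and 0.0 elsewhere."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

VOCAB_SIZE = 50_000
cat_one_hot = one_hot(2, VOCAB_SIZE)          # hypothetical index for "cat"
kitten_one_hot = one_hot(49_999, VOCAB_SIZE)  # hypothetical index for "kitten"

# Illustrative 4-dimensional dense vectors (assumed values, not model output).
cat_dense = [0.21, -0.54, 0.87, 0.13]
kitten_dense = [0.19, -0.51, 0.89, 0.11]

one_hot_score = cosine_similarity(cat_one_hot, kitten_one_hot)
dense_score = cosine_similarity(cat_dense, kitten_dense)

print(f"one-hot: {one_hot_score:.3f}")
print(f"dense:   {dense_score:.3f}")
```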

How Embeddings Are Created

Transformer Encoder Models

The dominant architecture for producing text embeddings in 2026 is the transformer encoder, the "E" in BERT (Bidirectional Encoder Representations from Transformers). Unlike the original encoder-decoder transformer, or the decoder-only models used for text generation, an embedding model of this kind needs only the encoder stack, because its job is to read text and produce a representation, not to generate new text.

The pipeline:

Tokenize the input text into subword tokens (e.g., WordPiece might split "embeddings" into pieces like ["embed", "##dings"]; the exact split depends on the model's vocabulary)
Embed each token into an initial vector using a token embedding table
Add positional encodings so the model knows the order of the tokens
Pass through N attention layers, where each layer refines the representation using multi-head self-attention — each token can attend to every other token
Pool the output — either take the [CLS] token's final hidden state, or mean-pool all token states — to get a single fixed-size vector for the entire input

The result is a single vector of shape (d_model,) — for example, (768,) for BERT-base or (3072,) for text-embedding-3-large.
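
The pooling step is the easiest part of the pipeline to show in isolation. The sketch below assumes the encoder has already produced one hidden-state vector per token; the tiny 4-dimensional `token_states` and the `attention_mask` are invented for illustration, and real models use hundreds of dimensions.

```python
def mean_pool(token_states, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_states[0])
    totals = [0.0] * dim
    count = 0
    for state, keep in zip(token_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(state):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]

def cls_pool(token_states):
    """Use the first token's state, which is [CLS] in BERT-style models."""
    return token_states[0]

# Hypothetical final hidden states for 3 real tokens plus 1 padding token.
token_states = [
    [0.2, 0.4, 0.6, 0.8],  # [CLS]
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, must not influence the result
]
attention_mask = [1, 1, 1, 0]

mean_vector = mean_pool(token_states, attention_mask)
cls_vector = cls_pool(token_states)
print(mean_vector)
print(cls_vector)
```

Either way the output has the same length as a single token state, which is why the final embedding size is fixed regardless of input length.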

Popular Embedding Models

Model	Provider	Dimensions	Notes
`text-embedding-3-small`	OpenAI	1536	Cost-efficient, strong multilingual
`text-embedding-3-large`	OpenAI	3072	Best quality from OpenAI
`all-MiniLM-L6-v2`	Sentence-Transformers	384	Tiny, fast, open-source
`all-mpnet-base-v2`	Sentence-Transformers	768	Excellent open-source baseline
`embed-v3`	Cohere	1024	Strong on retrieval benchmarks
`voyage-3-large`	Voyage AI	1024 (default; up to 2048)	Top MTEB scores as of 2026

Contrastive Learning: How the Model Learns to Place Vectors

Training an embedding model is not the same as using one. Training requires a contrastive objective — a loss function that explicitly teaches the model what "similar" means.

A common way to set this up is with triplets:

Anchor: "How do I fix a Python import error?"
Positive (similar): "Resolving module not found errors in Python"
Negative (dissimilar): "How to bake sourdough bread"

The loss function penalizes the model when the anchor and positive are far apart in vector space, or when the anchor and negative are close together. Over millions of training examples, the model learns a vector space where semantically similar inputs cluster together.

More advanced training uses in-batch negatives (treating all other batch items as negatives) and hard negatives (carefully selected examples that are superficially similar but semantically different, like "Python snake" vs "Python programming language").
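
As a rough illustration of the triplet idea, here is a margin-based loss over cosine distance in plain Python. The 2-dimensional vectors and the `margin=0.2` default are assumptions chosen for readability; production training uses a deep-learning framework and often a different loss formulation.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, so 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer than the negative by at least `margin`."""
    pos_dist = cosine_distance(anchor, positive)
    neg_dist = cosine_distance(anchor, negative)
    return max(0.0, pos_dist - neg_dist + margin)

# Toy vectors standing in for model outputs (assumed values).
anchor = [1.0, 0.0]      # "How do I fix a Python import error?"
positive = [0.9, 0.1]    # "Resolving module not found errors in Python"
negative = [0.0, 1.0]    # "How to bake sourdough bread"

good_loss = triplet_loss(anchor, positive, negative)
# Swapping the roles simulates a model that has the geometry backwards.
bad_loss = triplet_loss(anchor, negative, positive)

print(f"well-separated triplet: {good_loss:.3f}")
print(f"mis-ordered triplet:    {bad_loss:.3f}")
```

The loss is zero when the geometry is already right and grows when the negative sits closer to the anchor than the positive, which is the signal that pushes the vectors apart during training.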

Dimensionality: What It Means and What to Choose

When people say an embedding is "768-dimensional" or "1536-dimensional," they mean the embedding vector has that many numbers. Each dimension is a float32 (4 bytes), so:

768-dim embedding: 768 × 4 = 3,072 bytes (3 KB) per document
1536-dim embedding: 1536 × 4 = 6,144 bytes (6 KB) per document
3072-dim embedding: 3072 × 4 = 12,288 bytes (12 KB) per document

For one million documents, a 3072-dim embedding index takes approximately 12 GB in RAM before any indexing overhead.
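
The arithmetic above is easy to reproduce for other corpus sizes. This small helper assumes one float32 vector per document and ignores index overhead and metadata; `index_size_bytes` is an illustrative name, not part of any library.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for `num_vectors` vectors of `dims` float32 values each."""
    return num_vectors * dims * bytes_per_value

NUM_DOCS = 1_000_000
sizes = {dims: index_size_bytes(NUM_DOCS, dims) for dims in (384, 768, 1536, 3072)}

for dims, size in sizes.items():
    print(f"{dims:>5} dims: {size / 1e9:.2f} GB for {NUM_DOCS:,} vectors")
```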

Higher dimensions generally mean:

More expressive — the model has more capacity to encode nuance
Better performance on benchmarks like MTEB (Massive Text Embedding Benchmark)
Higher storage cost and slower search

Practical advice:

For prototyping: all-MiniLM-L6-v2 (384-dim, free, fast)
For production RAG at moderate scale: text-embedding-3-small (1536-dim, cheap)
For maximum quality, search-critical systems: voyage-3-large or text-embedding-3-large

Some models support Matryoshka embeddings (MRL — Matryoshka Representation Learning), where you can truncate the vector to a smaller dimension (e.g., take the first 512 dimensions of a 1536-dim embedding) with only a small quality loss. OpenAI's text-embedding-3 family supports this, letting you tune the storage/quality trade-off at inference time.
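
Client-side truncation can be sketched in a few lines: keep the leading dimensions, then re-normalize so cosine and dot-product scores stay comparable. This only makes sense for models trained with MRL; the 8-value vector below is a made-up stand-in for a real embedding, and `truncate_and_normalize` is a hypothetical helper name.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Stand-in for a full-size embedding (assumed values).
full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.05, -0.05]
short = truncate_and_normalize(full, 4)

print(short)
print(math.sqrt(sum(x * x for x in short)))  # length after re-normalizing
```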

How Similarity Search Works (With a Numeric Example)

Given two embedding vectors, how do you measure how similar they are? There are three main metrics.

Cosine Similarity

Cosine similarity measures the angle between two vectors, ignoring their magnitudes.

cos_sim(A, B) = (A · B) / (||A|| × ||B||)

Range: -1.0 (opposite) to +1.0 (identical direction)

Small numeric example (using 3-dimensional vectors for clarity):

A = [0.6, 0.8, 0.0]   # embedding for "machine learning"
B = [0.5, 0.7, 0.2]   # embedding for "deep learning"
C = [0.1, 0.0, 0.9]   # embedding for "baking bread"

dot(A, B) = 0.6×0.5 + 0.8×0.7 + 0.0×0.2 = 0.30 + 0.56 + 0.00 = 0.86
||A|| = sqrt(0.36 + 0.64 + 0.00) = 1.00
||B|| = sqrt(0.25 + 0.49 + 0.04) = sqrt(0.78) = 0.883

cos_sim(A, B) = 0.86 / (1.00 × 0.883) = 0.974  ← very similar

dot(A, C) = 0.6×0.1 + 0.8×0.0 + 0.0×0.9 = 0.06
||C|| = sqrt(0.01 + 0.00 + 0.81) = 0.906

cos_sim(A, C) = 0.06 / (1.00 × 0.906) = 0.066  ← very different
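The worked example can be checked with a few lines of plain Python. The helper name `cosine_similarity` is an illustrative choice, not part of any library.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [0.6, 0.8, 0.0]   # "machine learning"
B = [0.5, 0.7, 0.2]   # "deep learning"
C = [0.1, 0.0, 0.9]   # "baking bread"

sim_ab = cosine_similarity(A, B)
sim_ac = cosine_similarity(A, C)
print(round(sim_ab, 3), round(sim_ac, 3))
```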

Dot Product

Dot product is A · B without the normalization step. When embeddings are L2-normalized (unit length), dot product and cosine similarity give identical results. Many modern embedding APIs return normalized vectors for this reason — dot product is slightly cheaper to compute.

Euclidean Distance

L2 distance measures the straight-line distance between two points: sqrt(sum((a_i - b_i)^2)). Lower is more similar. This works but is more sensitive to vector magnitude than cosine similarity. Typically used when the magnitude of the embedding carries meaningful information.

For most text retrieval tasks, cosine similarity (or dot product over normalized vectors) is the standard choice.
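The relationship between the three metrics on unit-length vectors can be verified directly. This is a small sketch with made-up vectors, and the helper names are illustrative. On normalized vectors the squared Euclidean distance equals 2 - 2 × cosine similarity, so all three metrics produce the same ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

a = l2_normalize([0.5, 0.7, 0.2])
b = l2_normalize([0.1, 0.0, 0.9])

# On unit vectors, dot product equals cosine similarity ...
assert math.isclose(dot(a, b), cosine(a, b))
# ... and squared Euclidean distance is 2 - 2 * cosine, so rankings agree.
assert math.isclose(euclidean(a, b) ** 2, 2 - 2 * cosine(a, b))
```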

Vector Databases: Why You Cannot Do This in a SQL WHERE Clause

Given a query embedding and a corpus of 10 million document embeddings, finding the most similar document requires computing the cosine similarity between the query and every stored vector. In SQL:

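-- This would be catastrophically slow on 10M rows
-- (dot_product stands in for whatever similarity function your database provides;
--  on normalized vectors it ranks rows the same way cosine similarity does)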
SELECT document_id,
       dot_product(embedding, $query_embedding) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 10;

A brute-force scan over 10 million 1536-dimensional vectors means 10M × 1536 = ~15 billion multiply-add operations per query. At even a nanosecond per operation, that is 15 seconds per query.
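What the brute-force scan does can be sketched in plain Python, scaled down to a corpus that runs quickly. The function name `brute_force_top_k`, the corpus size, and the random vectors are assumptions made for the illustration; the point is that every query touches every stored vector.

```python
import heapq
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def brute_force_top_k(query, vectors, k=10):
    """Score every stored vector against the query: O(n * d) work per search."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), doc_id)
        for doc_id, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)

random.seed(0)
dims = 64
corpus = [normalize([random.gauss(0, 1) for _ in range(dims)]) for _ in range(2_000)]
query = corpus[42]   # query with a stored vector, so document 42 should rank first

top = brute_force_top_k(query, corpus, k=3)
multiply_adds = len(corpus) * dims   # grows linearly with corpus size
print(top[0][1], multiply_adds)
```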

Vector databases solve this with Approximate Nearest Neighbor (ANN) indexes, most commonly:

HNSW (Hierarchical Navigable Small World): A graph-based index that navigates through layers of nodes to find near-neighbors in O(log n) time instead of O(n). High recall, fast queries.
IVF (Inverted File Index): Partitions the vector space into clusters (Voronoi cells). At query time, only the clusters closest to the query are searched. Good for very large datasets.
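The IVF idea is simple enough to sketch as a toy. The class `ToyIVF` below is a hypothetical illustration, not the implementation used by any real vector database: centroids are supplied by hand instead of being learned with k-means, and `nprobe` (the number of cells scanned per query) is the only tuning knob.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVF:
    """Minimal inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = {i: [] for i in range(len(centroids))}

    def _nearest_cells(self, vector, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(vector, self.centroids[i]))
        return order[:n]

    def add(self, doc_id, vector):
        cell = self._nearest_cells(vector, 1)[0]
        self.cells[cell].append((doc_id, vector))

    def search(self, query, k=1, nprobe=1):
        # Only scan the nprobe cells closest to the query, not the whole corpus.
        candidates = []
        for cell in self._nearest_cells(query, nprobe):
            candidates.extend(self.cells[cell])
        candidates.sort(key=lambda item: sq_dist(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:k]]

index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [0.5, 0.2])
index.add("b", [0.1, 0.9])
index.add("c", [9.5, 10.2])
index.add("d", [4.9, 4.9])   # lands in cell 0, right next to the cell boundary

hits_one_probe = index.search([5.2, 5.2], k=1, nprobe=1)
hits_two_probes = index.search([5.2, 5.2], k=1, nprobe=2)
print(hits_one_probe, hits_two_probes)
```

With one probe the search misses the true nearest neighbor because it sits just across a cell boundary; probing both cells finds it. That is the "approximate" in ANN: scanning fewer cells is faster but can lower recall.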

Vector Database Options in 2026

Database	Deployment	Notes
Pinecone	Fully managed cloud	Simplest to get started, serverless tier
Weaviate	Self-hosted or cloud	Native hybrid search (vector + BM25), GraphQL API
Qdrant	Self-hosted or cloud	Rust-based, very fast, excellent filtering
pgvector	PostgreSQL extension	Great for existing Postgres stacks, HNSW support
Chroma	Self-hosted / in-process	Developer-friendly, popular for prototyping
Zvec	In-process (library)	Alibaba's open-source embedded vector DB

Zvec deserves particular attention. Rather than running as a separate server process, Zvec is a library that runs inside your application process — similar to SQLite vs PostgreSQL. It supports WAL persistence, HNSW indexing, hybrid (dense + sparse) search, and Python/Node.js/Flutter SDKs. For teams that want vector search without managing infrastructure, it is an emerging alternative. Read more in Zvec: Alibaba's In-Process Vector Database.

pgvector is the pragmatic choice for teams already running Postgres. The trade-off is that Postgres's general-purpose storage is less efficient than a purpose-built vector store, and very large indexes (hundreds of millions of vectors) require careful tuning.

The RAG Workflow: Embeddings in Practice

Embeddings are the engine that makes Retrieval-Augmented Generation (RAG) work. The RAG pattern solves the core limitation of LLMs: they are frozen at training time and cannot access your private documents or recent information.

RAG in practice: building a retrieval-augmented chatbot end-to-end — embeddings are the foundation.

The full RAG workflow:

Phase 1 — Indexing (offline, run once or on updates)

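from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = OpenAI()
qdrant = QdrantClient(":memory:")

# Create the collection before inserting; text-embedding-3-small returns 1536-dim vectors
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# 1. Load and chunk documents
# (documents and chunk_documents are placeholders for your own loader and chunker)
chunks = chunk_documents(documents, chunk_size=512, overlap=64)

# 2. Embed each chunk
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk.text for chunk in chunks]
).data

# 3. Store embeddings in vector database
qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=emb.embedding, payload={"text": chunks[i].text})
        for i, emb in enumerate(embeddings)
    ]
)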

Phase 2 — Retrieval (online, at query time)

def answer_question(user_query: str) -> str:
    # 4. Embed the user's query using the SAME model
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_query
    ).data[0].embedding

    # 5. Find the nearest document chunks
    results = qdrant.search(
        collection_name="docs",
        query_vector=query_embedding,
        limit=5
    )

    # 6. Pass retrieved context to the LLM
    context = "\n\n".join([r.payload["text"] for r in results])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": user_query}
        ]
    )
    return response.choices[0].message.content

The critical constraint: you must embed the query using the same model used to embed the documents. An embedding from text-embedding-3-small lives in a completely different vector space than an embedding from all-MiniLM-L6-v2. Mixing them produces garbage results.
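One way to make this constraint hard to violate is to record the model name alongside the index and check it on every write and query. The sketch below is a hypothetical in-memory wrapper (`GuardedIndex` is not part of Qdrant or any other library), and it uses 3-dimensional toy vectors in place of real embeddings.

```python
from dataclasses import dataclass, field

@dataclass
class GuardedIndex:
    """Toy in-memory index that remembers which model produced its vectors."""
    model_name: str
    dimensions: int
    vectors: dict = field(default_factory=dict)

    def _check(self, model_name, vector):
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dims, got {len(vector)}")

    def add(self, doc_id, vector, model_name):
        self._check(model_name, vector)
        self.vectors[doc_id] = vector

    def search(self, query_vector, model_name, k=5):
        self._check(model_name, query_vector)
        scored = sorted(
            self.vectors.items(),
            key=lambda item: sum(q * x for q, x in zip(query_vector, item[1])),
            reverse=True,
        )
        return [doc_id for doc_id, _ in scored[:k]]

# dimensions=3 is a toy value; the real model returns far larger vectors
index = GuardedIndex(model_name="text-embedding-3-small", dimensions=3)
index.add("doc-1", [0.1, 0.9, 0.0], model_name="text-embedding-3-small")
```

Querying this index with a vector tagged `all-MiniLM-L6-v2` raises a `ValueError` instead of silently returning meaningless neighbors.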

For a deeper treatment of how RAG compares to the Model Context Protocol and when to use each, see RAG vs MCP: The Complete Comparison. For a look at how agentic RAG challenges traditional vector-based retrieval, see RAG vs Agentic RAG.

Multimodal Embeddings: Images, Text, and Cross-Modal Search

The idea of embedding is not limited to text. CLIP (Contrastive Language-Image Pretraining), developed by OpenAI in 2021, showed that you can train a model to embed text and images into the same vector space.

The CLIP training objective is simple in concept: for a dataset of (image, caption) pairs, pull the image embedding and text embedding for matching pairs together, and push non-matching pairs apart. After training on hundreds of millions of image-text pairs, the model learns that:

The embedding for the image of a golden retriever running on a beach
Is very close to the embedding for the text "a golden retriever running on a beach"
And farther from the text "a cat sitting on a couch"

What this enables:

Text-to-image search: Embed a text query ("sunset over mountains") and find the most similar images in a corpus by comparing embeddings
Image-to-text search: Embed an image and find similar captions or documents
Zero-shot image classification: Embed the image, embed class names like "cat", "dog", "car", compare similarities — no task-specific training required
Cross-modal retrieval: A user can type a product description and find matching product images, or paste an image and find similar product listings
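Zero-shot classification reduces to a nearest-neighbor lookup over label embeddings. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real encoder outputs, so it shows only the comparison step; the function name `zero_shot_classify` and all the numbers are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )

# Hypothetical vectors; in practice these come from the text encoder.
label_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.1],
    "car": [0.0, 0.1, 0.9],
}
# Hypothetical output of the image encoder for a photo of a dog.
image_embedding = [0.2, 0.8, 0.1]

predicted = zero_shot_classify(image_embedding, label_embeddings)
print(predicted)
```

Adding a new class means embedding one more label string; no retraining is involved.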

Successors to CLIP such as SigLIP and CoCa refine the image-text training objective, while ImageBind (Meta, 2023) extends the idea to more modalities: it learns a single embedding space across six of them simultaneously, namely images, text, audio, depth, thermal, and IMU data. This means you can search a video archive with a sound clip, or find images that match an audio description.

The architectural pattern is always the same: a separate encoder for each modality (a vision transformer for images, a text transformer for text), both trained so their outputs land in the same shared vector space.

What Embeddings Cannot Do

Embeddings are powerful but have hard limits that matter for system design.

They Encode Meaning at Training Time

An embedding model trained on data through 2023 does not know about events in 2024. If you embed the query "What happened at the 2024 Paris Olympics?" the resulting vector will reflect the model's best guess at what that phrase means based on earlier Olympics coverage; it will not encode the actual 2024 events. The same cutoff applies to the LLM that generates answers, which is why RAG retrieves up-to-date documents and feeds them into the prompt instead of relying on what either model memorized during training.

They Struggle With Ambiguity

The word "bank" means both a financial institution and the side of a river. A static word-embedding model such as word2vec or GloVe produces a single vector for "bank" that sits somewhere between these meanings, because it has been averaged across every context in which "bank" appeared during training.

Contextualized embeddings (like BERT's output for a specific token in context) handle this better: they produce a different representation depending on the surrounding words. Most production embedding APIs return a single vector for an entire sentence or passage, so word-level ambiguity is partially resolved by the surrounding text. Very short queries, such as the single word "bank", give the model little context to work with and remain ambiguous.

They Do Not Update Dynamically

An embedding is computed once and stored. If the underlying document changes, the stored embedding does not update automatically — you need to re-embed the changed document and overwrite the stored vector. In high-change-rate systems (e.g., a live knowledge base with hundreds of daily edits), keeping the embedding index synchronized with the source documents requires explicit pipeline management.
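A common way to keep the index in sync is to store a content hash next to each vector and re-embed only the documents whose hash has changed. The sketch below shows the detection step; `find_stale` and the sample documents are hypothetical, and the actual re-embedding call is left out.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(documents: dict, stored_hashes: dict) -> list:
    """Return ids of documents that are new or whose text changed since the last run."""
    return [
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]

documents = {"faq": "Refunds take 5 days.", "policy": "Returns within 30 days."}
stored_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}

documents["faq"] = "Refunds take 3 days."     # edited after indexing
documents["shipping"] = "Ships in 2 days."    # added after indexing

stale = find_stale(documents, stored_hashes)
print(stale)
```

Only the documents in `stale` need a new embedding call, which keeps the cost of a sync run proportional to the number of edits instead of the size of the corpus.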

They Are Not Symbolic

Embeddings encode statistical associations, not logical rules. The embedding for "Paris is the capital of France" and "Rome is the capital of Italy" will be similar vectors (similar sentence structure, similar semantic role), but the model has not stored a fact table. If you ask an embedding-only system "What is the capital of France?" it will find documents about France and capitals — you still need an LLM to read those documents and extract the answer.

Practical Tips for Production Embedding Systems

Chunk Size Matters More Than Most Guides Admit

The unit you embed determines what gets retrieved. Common options:

Granularity	Typical Size	Best For
Sentence	15-40 tokens	Precise fact retrieval, QA
Paragraph	100-300 tokens	General document retrieval
Page chunk	300-600 tokens	Long-form content, context preservation
Full section	600-1500 tokens	Technical docs where section cohesion matters

Practical heuristic for 2026: chunk at paragraph boundaries (not arbitrary token counts), with a 10-15% overlap window to avoid losing information at chunk edges. Always store the surrounding context metadata (document title, section header) in the vector database payload so the LLM has orientation.
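A minimal sketch of paragraph-boundary chunking with overlap, under two simplifying assumptions: paragraphs are separated by blank lines, and whitespace-separated words stand in for tokens. The function name `chunk_paragraphs` and its parameters are illustrative. A single paragraph longer than `max_words` is kept whole in this version, so chunks can exceed the limit.

```python
def chunk_paragraphs(text, max_words=200, overlap_words=25):
    """Group whole paragraphs into chunks of roughly max_words,
    carrying the last overlap_words of each chunk into the next one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for paragraph in paragraphs:
        words = paragraph.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one
            current = current[-overlap_words:] if overlap_words else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "a1 a2 a3 a4\n\nb1 b2 b3 b4\n\nc1 c2 c3 c4"
chunks = chunk_paragraphs(text, max_words=8, overlap_words=2)
print(chunks)
```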

Hybrid Search: BM25 + Vector

Pure vector search is excellent for semantic similarity but struggles with exact keyword matches — product SKUs, names, error codes, version numbers. BM25 (the algorithm underlying most full-text search engines like Elasticsearch and Lucene) excels at exact keyword matching but cannot understand paraphrase or synonymy.

Hybrid search runs both in parallel and combines the results using reciprocal rank fusion (RRF) or a learned re-ranker:

# Pseudocode for hybrid search
vector_results = vector_db.search(query_embedding, top_k=20)
keyword_results = bm25_index.search(query_text, top_k=20)

# Reciprocal rank fusion
combined = rrf_merge(vector_results, keyword_results, k=60)
final_results = combined[:5]

Weaviate and Qdrant support hybrid search natively. For pgvector, you pair it with PostgreSQL's built-in full-text search (tsvector/tsquery).
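Reciprocal rank fusion itself is only a few lines. The sketch below is one possible `rrf_merge`, assuming each result list is an ordered list of document ids (best first); the document ids are made up. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed.

```python
from collections import defaultdict

def rrf_merge(*ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc-a", "doc-b", "doc-c"]
keyword_results = ["doc-c", "doc-a", "doc-d"]

combined = rrf_merge(vector_results, keyword_results, k=60)
print(combined)
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which live on different scales.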

When to Use Sentence-Level vs Paragraph-Level Embeddings

Question answering over structured data (FAQs, policies): embed at the sentence or QA-pair level — each unit has a single, retrievable answer
Technical documentation retrieval: embed at the paragraph or subsection level — the question often requires a few sentences of context to be useful
Long-form document search (research papers, legal contracts): embed at the section level, but also index a summary embedding of the full document; use the summary for coarse retrieval and the section embeddings for fine retrieval

Embeddings in 2026: Agent Memory and Long-Term Recall

Why embeddings and retrieval remain essential even as context windows grow.

The question arises regularly in 2026: if models now support context windows of 1 million tokens, why do we still need embeddings?

The answer comes down to three practical realities:

1. Cost: Processing 1 million tokens costs roughly 100× more than processing 10,000 tokens. If your knowledge base has 500,000 documents averaging 800 tokens each, that is 400 million tokens: far more than fits in any single prompt, and at per-token pricing it would cost thousands of dollars per query even if it did fit. Embeddings let you retrieve the relevant 5-10 documents (4,000-8,000 tokens) and pay only for those.

2. Latency: A 1 million token context window takes several seconds to process even on the fastest models. Retrieval + 8,000-token context takes milliseconds for the retrieval step and a fraction of a second for the LLM inference.

3. Beyond the window: Even a 1 million token window cannot fit a year's worth of Slack messages, 10 years of customer support tickets, or the full English-language Wikipedia. Embeddings enable retrieval from arbitrarily large corpora.

The domain where this matters most in 2026 is agent long-term memory. When an agent harness needs to remember what happened in a conversation three weeks ago, it cannot keep all previous conversations in the active context window. Instead, it embeds past interaction summaries, stores them in a vector database, and retrieves relevant memories at the start of each new session. This is the general approach behind memory layers such as Mem0; the long-term memory feature in ChatGPT addresses the same problem, although its implementation is not publicly documented in detail.

The pattern:

# At end of conversation
memory_text = summarize_conversation(messages)
memory_embedding = embed(memory_text)
memory_store.upsert(
    id=conversation_id,
    vector=memory_embedding,
    payload={"summary": memory_text, "date": today()}
)

# At start of new conversation
query_embedding = embed(f"User said: {new_message}")
relevant_memories = memory_store.search(query_embedding, top_k=3)
context = format_memories(relevant_memories) + new_message

This architecture — embedding-based retrieval as the memory substrate for autonomous agents — is increasingly the default pattern for production AI agents in 2026. The embedding is not replacing the LLM; it is providing the recall mechanism the LLM itself cannot replicate.

Evaluating Embedding Quality

Before committing to an embedding model for production, measure its performance on a retrieval task that represents your data. The standard public benchmarks are BEIR (Benchmarking IR), which focuses on retrieval, and the broader MTEB (Massive Text Embedding Benchmark), which in its original release spans 8 task types and 58 datasets across 112 languages.

For a quick internal evaluation:

Sample 100-200 representative queries from your use case
Annotate each query with 3-5 relevant documents from your corpus
Run each query through the embedding model and retrieve the top 10 results
Measure Recall@5 (what fraction of relevant docs appear in the top 5 results) and NDCG@10 (normalized discounted cumulative gain, which weights highly-ranked relevant results more)
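Both metrics are short enough to implement directly. The sketch below assumes binary relevance (a document is either relevant or not) and uses made-up document ids; the function names are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d1", "d7", "d3", "d9", "d2"]   # model output, best first
relevant = ["d1", "d2", "d4"]                # human annotations for this query

recall = recall_at_k(retrieved, relevant, k=5)
ndcg = ndcg_at_k(retrieved, relevant, k=5)
print(round(recall, 3), round(ndcg, 3))
```

Average both numbers over the full query set, and compare candidate models on the same queries and annotations.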

A model that scores well on MTEB may not score well on your domain-specific data. Medical, legal, code, and multilingual corpora frequently benefit from domain-specific fine-tuning or specialist models (e.g., BioLinkBERT for biomedical text, CodeBERT for code search).

Summary Table: Embedding System Design Decisions

Decision	Options	When to Use
Embedding model	OpenAI 3-small, voyage-3-large, MiniLM	Managed/cost-sensitive / quality-critical / offline/open-source
Dimensions	384 / 768 / 1536 / 3072	Prototype / balanced / production / max quality
Chunking	Sentence / paragraph / section	QA / general / technical docs
Vector DB	Zvec / pgvector / Qdrant / Pinecone	In-process / Postgres stack / self-hosted / managed
Search type	Vector only / hybrid BM25+vector	Semantic queries / mixed keyword+semantic
Memory system	In-context / embedding-based retrieval	Short sessions / long-term agents

Embeddings are not a detail to abstract away behind a framework. They are the mechanism through which meaning travels from human language into the numerical operations that power retrieval, memory, and semantic understanding. Understanding them at this level — not just how to call client.embeddings.create(), but why cosine similarity works, why dimensionality matters, and where embeddings fail — is what separates engineers who can debug retrieval problems from those who cannot.
