What Are Embeddings? Vector Search and Semantic AI Explained (2026 Guide)

Embeddings are dense numerical vectors that encode semantic meaning — the core technology behind RAG, vector search, and agent memory. This guide explains what embeddings are, how they are created, how similarity search works, and how to use them in production systems in 2026.

17 min read · Yash Thakker
Tags: Embeddings, Vector Search, RAG, Generative AI, AI Fundamentals

Why Embeddings Are the Primitive Every AI Engineer Needs to Understand

Whenever an AI product "understands" that "affordable car" and "cheap vehicle" mean the same thing, embeddings are doing that work. Whenever a RAG pipeline surfaces the right document chunk from a library of 500,000 pages, embeddings made the match. Whenever an AI agent recalls a relevant memory from a conversation that happened three weeks ago, an embedding was stored and retrieved.

Embeddings are the single most important low-level primitive in applied AI. Yet most tutorials skip past them in a paragraph on the way to showing you a LangChain tutorial. This guide goes the other direction — starting from first principles, building up the math, and arriving at production-ready practices.


What an Embedding Actually Is

Start with the problem embeddings solve.

Computers work with numbers. Text, images, and audio are not numbers. Before neural networks could work with language, everything had to be converted to a numerical representation. The earliest approach was one-hot encoding: create a vector with one dimension per word in the vocabulary, and mark a single dimension as 1 while all others are 0.

With a 50,000-word vocabulary:

  • "cat" → [0, 0, 1, 0, 0, ..., 0] (1 at position 3, zeros everywhere else)
  • "kitten" → [0, 0, 0, 0, 0, ..., 1] (1 at position 50,000, the last slot)

This is a mathematical disaster. One-hot vectors are:

  1. Enormous: 50,000 dimensions per word
  2. Completely uninformative: the cosine similarity between any two distinct one-hot vectors is exactly 0, whether comparing "cat" and "kitten" or "cat" and "democracy"
  3. Sparse: 49,999 zeros out of 50,000 values

An embedding replaces this sparse, meaningless representation with a dense, learned vector (typically 384 to 3072 floating-point numbers) where the position of the vector in that high-dimensional space encodes semantic meaning.

# One-hot (bad): 50,000 dimensions, no meaning
"cat"    → [0, 0, 1, 0, 0, ..., 0]  # 49,999 zeros
"kitten" → [0, 0, 0, 0, 0, ..., 1]  # completely different

# Embedding (good): 768 dimensions, meaning encoded
"cat"    → [0.21, -0.54, 0.87, ..., 0.13]
"kitten" → [0.19, -0.51, 0.89, ..., 0.11]  # very similar numbers

The intuition: nearby points in embedding space = similar meaning. The model has learned, from training on vast amounts of text, that "cat" and "kitten" appear in similar contexts (cuddly, meow, pet, food bowl) and has placed their vectors close together in the 768-dimensional space.

This is not hand-crafted. Nobody programmed "cat is close to kitten." The model inferred it from patterns in the training data.
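The contrast between the two representations can be checked numerically. The sketch below uses a tiny 8-word vocabulary and made-up 4-dimensional dense vectors (both are illustrative assumptions, not output from a real model) to show that distinct one-hot vectors always score 0, while dense vectors can express degrees of similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

VOCAB_SIZE = 8  # tiny stand-in for the 50,000-word vocabulary
cat_1hot = one_hot(2, VOCAB_SIZE)
kitten_1hot = one_hot(7, VOCAB_SIZE)

# Made-up 4-dim dense vectors; real embeddings have hundreds of dimensions.
cat_dense = [0.21, -0.54, 0.87, 0.13]
kitten_dense = [0.19, -0.51, 0.89, 0.11]
democracy_dense = [-0.70, 0.40, 0.05, 0.60]

one_hot_sim = cosine(cat_1hot, kitten_1hot)          # 0 for any two distinct words
dense_sim = cosine(cat_dense, kitten_dense)          # close to 1
unrelated_sim = cosine(cat_dense, democracy_dense)   # much lower

print(one_hot_sim, round(dense_sim, 3), round(unrelated_sim, 3))
```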


How Embeddings Are Created

Transformer Encoder Models

The classic architecture for producing text embeddings is the transformer encoder: the "E" in BERT (Bidirectional Encoder Representations from Transformers). Unlike the original encoder-decoder transformer, or the decoder-only models used for text generation, an encoder-based embedding model only reads text and produces a representation; it never generates new text. Some recent embedding models are adapted from decoder-only LLMs instead, but the encoder pipeline below remains the standard reference design.

The pipeline:

  1. Tokenize the input text into subword tokens (e.g., BERT's WordPiece tokenizer splits "embeddings" into ["em", "##bed", "##ding", "##s"])
  2. Embed each token into an initial vector using a token embedding table
  3. Add positional encodings so the model knows the order of the tokens
  4. Pass through N attention layers, where each layer refines the representation using multi-head self-attention — each token can attend to every other token
  5. Pool the output — either take the [CLS] token's final hidden state, or mean-pool all token states — to get a single fixed-size vector for the entire input

The result is a single vector of shape (d_model,) — for example, (768,) for BERT-base or (3072,) for text-embedding-3-large.
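The pooling step is the easiest part of the pipeline to get subtly wrong, because padding tokens must be excluded from the average. Here is a minimal sketch of both pooling strategies in plain Python; the hidden states and the attention mask are toy values chosen for illustration, and a real model would produce them as tensors.

```python
def mean_pool(token_states, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_states[0])
    totals = [0.0] * dim
    kept = 0
    for state, keep in zip(token_states, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(state):
                totals[i] += value
    return [t / kept for t in totals]

def cls_pool(token_states):
    """Use the first token's final hidden state as the sequence vector."""
    return list(token_states[0])

# Toy final hidden states for 4 positions with d_model = 3; the last is padding.
hidden = [
    [0.2, 0.4, 0.6],   # [CLS]
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.6],
    [9.0, 9.0, 9.0],   # padding, must not affect the mean
]
mask = [1, 1, 1, 0]

sentence_vec = mean_pool(hidden, mask)   # approximately [0.4, 0.8, 0.4]
cls_vec = cls_pool(hidden)
```

Either function returns one vector of length d_model regardless of how many tokens the input had, which is what makes the output usable as a fixed-size embedding.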

Popular Embedding Models

| Model | Provider | Dimensions | Notes |
| --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | Cost-efficient, strong multilingual |
| text-embedding-3-large | OpenAI | 3072 | Best quality from OpenAI |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | Tiny, fast, open-source |
| all-mpnet-base-v2 | Sentence-Transformers | 768 | Excellent open-source baseline |
| embed-v3 | Cohere | 1024 | Strong on retrieval benchmarks |
| voyage-3-large | Voyage AI | 1024 by default (256, 512, and 2048 also supported) | Strong results on retrieval benchmarks |

Contrastive Learning: How the Model Learns to Place Vectors

Training an embedding model is not the same as using one. Training requires a contrastive objective — a loss function that explicitly teaches the model what "similar" means.

The most common approach is contrastive learning with triplets:

  • Anchor: "How do I fix a Python import error?"
  • Positive (similar): "Resolving module not found errors in Python"
  • Negative (dissimilar): "How to bake sourdough bread"

The loss function penalizes the model when the anchor and positive are far apart in vector space, or when the anchor and negative are close together. Over millions of training examples, the model learns a vector space where semantically similar inputs cluster together.

More advanced training uses in-batch negatives (treating all other batch items as negatives) and hard negatives (carefully selected examples that are superficially similar but semantically different, like "Python snake" vs "Python programming language").
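To make the triplet objective concrete, here is a minimal sketch of a margin-based triplet loss on toy 2-dimensional vectors. The margin value and the vectors are illustrative assumptions; real training computes this over batches of model outputs and backpropagates through the encoder.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negative_far = [-1.0, 0.0]    # clearly unrelated: already far from the anchor
negative_close = [0.8, 0.2]   # "hard" negative: almost as close as the positive

loss_well_separated = triplet_loss(anchor, positive, negative_far)
loss_hard_negative = triplet_loss(anchor, positive, negative_close)

print(loss_well_separated, round(loss_hard_negative, 4))
```

The easy negative contributes zero loss because it is already farther away than the margin requires, while the hard negative produces a positive loss. That is why hard negatives carry most of the training signal.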


Dimensionality: What It Means and What to Choose

When people say an embedding is "768-dimensional" or "1536-dimensional," they mean the embedding vector has that many numbers. Each dimension is a float32 (4 bytes), so:

  • 768-dim embedding: 768 × 4 = 3,072 bytes (3 KB) per document
  • 1536-dim embedding: 1536 × 4 = 6,144 bytes (6 KB) per document
  • 3072-dim embedding: 3072 × 4 = 12,288 bytes (12 KB) per document

For one million documents, a 3072-dim embedding index takes approximately 12 GB in RAM before any indexing overhead.
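The same arithmetic as a small helper, useful for sizing an index before choosing a model. The function name and the `bytes_per_value` parameter are hypothetical; the default of 4 assumes float32 storage as described above.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in bytes, before any index overhead."""
    return num_vectors * dims * bytes_per_value

for dims in (384, 768, 1536, 3072):
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>4} dims x 1M vectors = {gb:.2f} GB")
```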

Higher dimensions generally mean:

  • More expressive — the model has more capacity to encode nuance
  • Better performance on benchmarks like MTEB (Massive Text Embedding Benchmark)
  • Higher storage cost and slower search

Practical advice:

  • For prototyping: all-MiniLM-L6-v2 (384-dim, free, fast)
  • For production RAG at moderate scale: text-embedding-3-small (1536-dim, cheap)
  • For maximum quality, search-critical systems: voyage-3-large or text-embedding-3-large

Some models support Matryoshka embeddings (MRL — Matryoshka Representation Learning), where you can truncate the vector to a smaller dimension (e.g., take the first 512 dimensions of a 1536-dim embedding) with only a small quality loss. OpenAI's text-embedding-3 family supports this, letting you tune the storage/quality trade-off at inference time.
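If you truncate a stored vector yourself, it should be re-normalized afterwards so that dot product and cosine similarity stay equivalent. A minimal sketch follows, using a toy 4-dimensional vector as a stand-in for a real embedding; this is only meaningful for models trained with MRL.

```python
import math

def truncate_and_normalize(vector, target_dims):
    """Keep the first `target_dims` values and rescale to unit length."""
    head = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]               # toy unit-length "embedding"
short = truncate_and_normalize(full, 2)   # first 2 dimensions, unit length again
```

With OpenAI's text-embedding-3 models you can also request the shorter vector directly by passing the `dimensions` parameter to the embeddings endpoint, which avoids storing the full-size vector at all.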


How Similarity Search Works (With a Numeric Example)

Given two embedding vectors, how do you measure how similar they are? There are three main metrics.

Cosine Similarity

Cosine similarity measures the angle between two vectors, ignoring their magnitudes.

cos_sim(A, B) = (A · B) / (||A|| × ||B||)

Range: -1.0 (opposite) to +1.0 (identical direction)

Small numeric example (using 3-dimensional vectors for clarity):

A = [0.6, 0.8, 0.0]   # embedding for "machine learning"
B = [0.5, 0.7, 0.2]   # embedding for "deep learning"
C = [0.1, 0.0, 0.9]   # embedding for "baking bread"

dot(A, B) = 0.6×0.5 + 0.8×0.7 + 0.0×0.2 = 0.30 + 0.56 + 0.00 = 0.86
||A|| = sqrt(0.36 + 0.64 + 0.00) = 1.00
||B|| = sqrt(0.25 + 0.49 + 0.04) = sqrt(0.78) = 0.883

cos_sim(A, B) = 0.86 / (1.00 × 0.883) = 0.974  ← very similar

dot(A, C) = 0.6×0.1 + 0.8×0.0 + 0.0×0.9 = 0.06
||C|| = sqrt(0.01 + 0.00 + 0.81) = 0.906

cos_sim(A, C) = 0.06 / (1.00 × 0.906) = 0.066  ← very different
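Hand arithmetic like this is easy to get wrong, so it is worth recomputing in code. The snippet below evaluates the same formula for the same three toy vectors using only the standard library.

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [0.6, 0.8, 0.0]   # "machine learning"
B = [0.5, 0.7, 0.2]   # "deep learning"
C = [0.1, 0.0, 0.9]   # "baking bread"

print(round(cos_sim(A, B), 3))
print(round(cos_sim(A, C), 3))
```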

Dot Product

Dot product is A · B without the normalization step. When embeddings are L2-normalized (unit length), dot product and cosine similarity give identical results. Many modern embedding APIs return normalized vectors for this reason — dot product is slightly cheaper to compute.

Euclidean Distance

L2 distance measures the straight-line distance between two points: sqrt(sum((a_i - b_i)^2)). Lower is more similar. This works but is more sensitive to vector magnitude than cosine similarity. Typically used when the magnitude of the embedding carries meaningful information.

For most text retrieval tasks, cosine similarity (or dot product over normalized vectors) is the standard choice.
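For unit-length vectors the three metrics are tightly linked: the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2·cos. A quick check on toy vectors (the values are illustrative):

```python
import math

def normalize(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = normalize([0.5, 0.7, 0.2])
b = normalize([0.1, 0.0, 0.9])

cosine = dot(a, b)                    # for unit vectors, dot product == cosine
squared_l2 = l2_distance(a, b) ** 2

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert math.isclose(squared_l2, 2 - 2 * cosine, rel_tol=1e-9)
```

Because of this identity, ranking by smallest L2 distance and ranking by largest cosine similarity give the same order once vectors are normalized, so the choice of metric then mostly comes down to what your index supports efficiently.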


Vector Databases: Why You Cannot Do This in a SQL WHERE Clause

Given a query embedding and a corpus of 10 million document embeddings, finding the most similar document requires computing the cosine similarity between the query and every stored vector. In SQL:

-- This would be catastrophically slow on 10M rows
SELECT document_id,
       dot_product(embedding, $query_embedding) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 10;

A brute-force scan over 10 million 1536-dimensional vectors means 10M × 1536 = ~15 billion multiply-add operations per query. At even a nanosecond per operation, that is 15 seconds per query.

Vector databases solve this with Approximate Nearest Neighbor (ANN) indexes, most commonly:

  • HNSW (Hierarchical Navigable Small World): A graph-based index that navigates through layers of nodes to find near-neighbors in roughly O(log n) time instead of O(n). High recall, fast queries.
  • IVF (Inverted File Index): Partitions the vector space into clusters (Voronoi cells). At query time, only the clusters closest to the query are searched. Good for very large datasets.
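The IVF idea fits in a few lines. The sketch below is a toy illustration, not how any particular database implements it: the centroids are fixed by hand (a real index would learn them with k-means), and `nprobe` is the usual name for the number of cells scanned per query.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector id to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        cells[nearest].append(vec_id)
    return cells

def ivf_search(query, vectors, centroids, cells, nprobe=1, top_k=2):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vec_id for c in by_distance[:nprobe] for vec_id in cells[c]]
    ranked = sorted(candidates, key=lambda i: l2(query, vectors[i]))
    return ranked[:top_k], len(candidates)

vectors = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # cluster near (1, 0)
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # cluster near (0, 1)
]
centroids = [[0.9, 0.1], [0.1, 0.9]]      # fixed by hand; normally from k-means
cells = build_ivf(vectors, centroids)

hits, scanned = ivf_search([0.97, 0.03], vectors, centroids, cells, nprobe=1)
print(hits, scanned)   # ids of the nearest vectors, and how many were compared
```

Only half of the vectors are compared against the query here. The approximation risk is that a true neighbor sits just across a cell boundary, which is why raising `nprobe` trades speed for recall.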

Vector Database Options in 2026

| Database | Deployment | Notes |
| --- | --- | --- |
| Pinecone | Fully managed cloud | Simplest to get started, serverless tier |
| Weaviate | Self-hosted or cloud | Native hybrid search (vector + BM25), GraphQL API |
| Qdrant | Self-hosted or cloud | Rust-based, very fast, excellent filtering |
| pgvector | PostgreSQL extension | Great for existing Postgres stacks, HNSW support |
| Chroma | Self-hosted / in-process | Developer-friendly, popular for prototyping |
| Zvec | In-process (library) | Alibaba's open-source embedded vector DB |

Zvec deserves particular attention. Rather than running as a separate server process, Zvec is a library that runs inside your application process — similar to SQLite vs PostgreSQL. It supports WAL persistence, HNSW indexing, hybrid (dense + sparse) search, and Python/Node.js/Flutter SDKs. For teams that want vector search without managing infrastructure, it is an emerging alternative. Read more in Zvec: Alibaba's In-Process Vector Database.

pgvector is the pragmatic choice for teams already running Postgres. The trade-off is that Postgres's general-purpose storage is less efficient than a purpose-built vector store, and very large indexes (hundreds of millions of vectors) require careful tuning.


The RAG Workflow: Embeddings in Practice

Embeddings are the engine that makes Retrieval-Augmented Generation (RAG) work. The RAG pattern solves the core limitation of LLMs: they are frozen at training time and cannot access your private documents or recent information.

RAG in practice: building a retrieval-augmented chatbot end-to-end — embeddings are the foundation.

The full RAG workflow:

Phase 1 — Indexing (offline, run once or on updates)

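from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(":memory:")  # in-process instance for local testing

# The collection must exist before upserting; size matches text-embedding-3-small
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# 1. Load and chunk documents
# (`documents` and `chunk_documents` are placeholders for your own loader and chunker)
chunks = chunk_documents(documents, chunk_size=512, overlap=64)

# 2. Embed each chunk (large corpora need to be sent in batches)
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk.text for chunk in chunks]
).data

# 3. Store embeddings in vector database
qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=emb.embedding, payload={"text": chunks[i].text})
        for i, emb in enumerate(embeddings)
    ]
)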

Phase 2 — Retrieval (online, at query time)

def answer_question(user_query: str) -> str:
    # 4. Embed the user's query using the SAME model
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_query
    ).data[0].embedding

    # 5. Find the nearest document chunks
    # (query_points supersedes the deprecated search() in recent qdrant-client releases)
    results = qdrant.query_points(
        collection_name="docs",
        query=query_embedding,
        limit=5
    ).points

    # 6. Pass retrieved context to the LLM
    context = "\n\n".join([r.payload["text"] for r in results])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": user_query}
        ]
    )
    return response.choices[0].message.content

The critical constraint: you must embed the query using the same model used to embed the documents. An embedding from text-embedding-3-small lives in a completely different vector space than an embedding from all-MiniLM-L6-v2. Mixing them produces garbage results.

For a deeper treatment of how RAG compares to the Model Context Protocol and when to use each, see RAG vs MCP: The Complete Comparison. For a look at how agentic RAG challenges traditional vector-based retrieval, see RAG vs Agentic RAG.


Multimodal Embeddings: Images, Text, and Cross-Modal Search

The idea of embedding is not limited to text. CLIP (Contrastive Language-Image Pretraining), developed by OpenAI in 2021, showed that you can train a model to embed text and images into the same vector space.

The CLIP training objective is simple in concept: for a dataset of (image, caption) pairs, pull the image embedding and text embedding for matching pairs together, and push non-matching pairs apart. After training on hundreds of millions of image-text pairs, the model learns that:

  • The embedding for the image of a golden retriever running on a beach
  • Is very close to the embedding for the text "a golden retriever running on a beach"
  • And farther from the text "a cat sitting on a couch"

What this enables:

  • Text-to-image search: Embed a text query ("sunset over mountains") and find the most similar images in a corpus by comparing embeddings
  • Image-to-text search: Embed an image and find similar captions or documents
  • Zero-shot image classification: Embed the image, embed class names like "cat", "dog", "car", compare similarities — no task-specific training required
  • Cross-modal retrieval: A user can type a product description and find matching product images, or paste an image and find similar product listings
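Zero-shot classification reduces to a similarity comparison followed by a softmax. The sketch below uses made-up 3-dimensional vectors in place of real CLIP outputs, and the temperature value is an illustrative assumption; in a real system the image and label vectors would come from the model's image and text encoders.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs, temperature=0.07):
    """Softmax over cosine similarities between one image and each label text."""
    sims = {label: cos_sim(image_vec, vec) for label, vec in label_vecs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp((s - top) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Stand-ins for encoder outputs (hypothetical values).
image_vec = [0.9, 0.1, 0.2]
label_vecs = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a photo of a car": [0.1, 0.2, 0.9],
}

probs = zero_shot_classify(image_vec, label_vecs)
best = max(probs, key=probs.get)
print(best)
```

No classifier is trained at any point: adding a new class means embedding one more label string.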

Successors to CLIP such as SigLIP and CoCa refine image-text training, while ImageBind (Meta, 2023) extends the idea to more modalities: it learns a single embedding space across six modalities simultaneously (images, text, audio, depth, thermal, and IMU data). This means you can search a video archive with a sound clip, or find images that match an audio description.

The architectural pattern is always the same: a separate encoder for each modality (a vision transformer for images, a text transformer for text), all trained so their outputs land in the same shared vector space.


What Embeddings Cannot Do

Embeddings are powerful but have hard limits that matter for system design.

They Encode Meaning at Training Time

An embedding model trained on data through 2023 does not know about events in 2024. If you embed the query "What happened at the 2024 Paris Olympics?" the resulting vector will be the model's best guess at what that phrase means based on prior Olympics context; it will not encode the actual 2024 events. Retrieval still works as long as the indexed documents contain the up-to-date facts, which is what RAG relies on: the knowledge lives in the retrieved documents, not in the embedding model. Names and terms coined after the cutoff, however, may be embedded less reliably.

They Struggle With Ambiguity

The word "bank" means both a financial institution and the side of a river. A static word embedding model (word2vec, GloVe) produces a single vector for "bank" that sits somewhere between these meanings, because it has been averaged across all the contexts in which "bank" appears during training.

Contextualized embeddings (like BERT's output for a specific token in context) handle this better: they produce a different representation depending on surrounding words. Most production embedding APIs return a single vector for an entire sentence or passage, so ambiguity at the word level is partially resolved by sentence context. The problem returns for very short inputs: a one-word query like "bank" gives the model nothing to disambiguate with.

They Do Not Update Dynamically

An embedding is computed once and stored. If the underlying document changes, the stored embedding does not update automatically — you need to re-embed the changed document and overwrite the stored vector. In high-change-rate systems (e.g., a live knowledge base with hundreds of daily edits), keeping the embedding index synchronized with the source documents requires explicit pipeline management.
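One common way to keep an index in sync without re-embedding everything is to store a content hash next to each vector and re-embed only documents whose hash has changed. A minimal sketch, assuming you keep a `stored_hashes` mapping of document id to hash alongside the index (the function and variable names are hypothetical):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(documents, stored_hashes):
    """Return (ids to embed or re-embed, ids to delete from the index)."""
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in documents]
    return to_embed, to_delete

# State recorded at the last indexing run.
stored_hashes = {
    "a": content_hash("refund policy v1"),
    "b": content_hash("shipping info"),
    "c": content_hash("old page"),
}
# Current state of the source documents.
documents = {"a": "refund policy v2", "b": "shipping info", "d": "new faq"}

to_embed, to_delete = plan_reembedding(documents, stored_hashes)
print(to_embed, to_delete)
```

Document "a" changed and "d" is new, so both need embedding; "b" is untouched and costs nothing; "c" no longer exists and should be removed from the index.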

They Are Not Symbolic

Embeddings encode statistical associations, not logical rules. The embedding for "Paris is the capital of France" and "Rome is the capital of Italy" will be similar vectors (similar sentence structure, similar semantic role), but the model has not stored a fact table. If you ask an embedding-only system "What is the capital of France?" it will find documents about France and capitals — you still need an LLM to read those documents and extract the answer.


Practical Tips for Production Embedding Systems

Chunk Size Matters More Than Most Guides Admit

The unit you embed determines what gets retrieved. Common options:

| Granularity | Typical Size | Best For |
| --- | --- | --- |
| Sentence | 15-40 tokens | Precise fact retrieval, QA |
| Paragraph | 100-300 tokens | General document retrieval |
| Page chunk | 300-600 tokens | Long-form content, context preservation |
| Full section | 600-1500 tokens | Technical docs where section cohesion matters |

Practical heuristic for 2026: chunk at paragraph boundaries (not arbitrary token counts), with a 10-15% overlap window to avoid losing information at chunk edges. Always store the surrounding context metadata (document title, section header) in the vector database payload so the LLM has orientation.
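A paragraph-boundary chunker with overlap can be sketched in a few lines. This version makes two simplifying assumptions that you would likely revisit: it counts whitespace-separated words as a proxy for tokens, and it overlaps by carrying whole paragraphs forward rather than a fixed percentage. The function name and parameters are hypothetical.

```python
def chunk_by_paragraph(text, max_words=120, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks of roughly `max_words` words.

    Each new chunk starts with the last `overlap_paragraphs` paragraphs of
    the previous one, so the limit is soft rather than strict.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Four 50-word paragraphs, separated by blank lines.
sample = "\n\n".join(
    f"Paragraph {n}: " + " ".join(["word"] * 48) for n in range(1, 5)
)
chunks = chunk_by_paragraph(sample, max_words=120)
print(len(chunks))
```

No paragraph is ever split mid-sentence, and each chunk after the first begins with the paragraph that ended the previous chunk, so a fact that straddles a boundary still appears intact in at least one chunk.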

Hybrid Search: BM25 + Vector

Pure vector search is excellent for semantic similarity but struggles with exact keyword matches: product SKUs, names, error codes, version numbers. BM25 (the default ranking function in Lucene-based full-text search engines such as Elasticsearch) excels at exact keyword matching but cannot understand paraphrase or synonymy.

Hybrid search runs both in parallel and combines the results using reciprocal rank fusion (RRF) or a learned re-ranker:

# Pseudocode for hybrid search
vector_results = vector_db.search(query_embedding, top_k=20)
keyword_results = bm25_index.search(query_text, top_k=20)

# Reciprocal rank fusion
combined = rrf_merge(vector_results, keyword_results, k=60)
final_results = combined[:5]

Weaviate and Qdrant support hybrid search natively. For pgvector, you pair it with PostgreSQL's built-in full-text search (tsvector/tsquery).
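Reciprocal rank fusion itself is only a few lines. Here is a minimal sketch of an `rrf_merge` helper that takes ranked lists of document ids; the function name and the id-list input format are assumptions for illustration, and k=60 is the constant commonly used with RRF.

```python
def rrf_merge(*ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]    # ranked by embedding similarity
keyword_results = ["doc_c", "doc_a", "doc_d"]   # ranked by BM25

combined = rrf_merge(vector_results, keyword_results, k=60)
print(combined)
```

RRF uses only rank positions, never the raw scores, which is why it can combine cosine similarities and BM25 scores even though they live on different scales. Documents that appear in both lists rise to the top.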

When to Use Sentence-Level vs Paragraph-Level Embeddings

  • Question answering over structured data (FAQs, policies): embed at the sentence or QA-pair level — each unit has a single, retrievable answer
  • Technical documentation retrieval: embed at the paragraph or subsection level — the question often requires a few sentences of context to be useful
  • Long-form document search (research papers, legal contracts): embed at the section level, but also index a summary embedding of the full document; use the summary for coarse retrieval and the section embeddings for fine retrieval

Embeddings in 2026: Agent Memory and Long-Term Recall

Why embeddings and retrieval remain essential even as context windows grow.

The question arises regularly in 2026: if models now support context windows of 1 million tokens, why do we still need embeddings?

The answer comes down to three practical realities:

1. Cost: With per-token pricing, processing 1 million tokens costs roughly 100× more than processing 10,000 tokens. If your knowledge base has 500,000 documents averaging 800 tokens each, that is 400 million tokens: far more than fits in a 1 million token window, and prohibitively expensive to send with every query even if it did fit. Embeddings let you retrieve the relevant 5-10 documents (4,000-8,000 tokens) and pay only for those.

2. Latency: A 1 million token context window takes several seconds to process even on the fastest models. Retrieval + 8,000-token context takes milliseconds for the retrieval step and a fraction of a second for the LLM inference.

3. Beyond the window: Even a 1 million token window cannot fit a year's worth of Slack messages, 10 years of customer support tickets, or the full English-language Wikipedia. Embeddings enable retrieval from arbitrarily large corpora.

The domain where this matters most in 2026 is agent long-term memory. When an agent harness needs to remember what happened in a conversation three weeks ago, it cannot keep all previous conversations in the active context window. Instead, it embeds past interaction summaries, stores them in a vector database, and retrieves relevant memories at the start of each new session. This is the pattern behind memory layers such as Mem0, and it addresses the same problem as the long-term memory features in consumer assistants like ChatGPT.

The pattern:

# At end of conversation
memory_text = summarize_conversation(messages)
memory_embedding = embed(memory_text)
memory_store.upsert(
    id=conversation_id,
    vector=memory_embedding,
    payload={"summary": memory_text, "date": today()}
)

# At start of new conversation
query_embedding = embed(f"User said: {new_message}")
relevant_memories = memory_store.search(query_embedding, top_k=3)
context = format_memories(relevant_memories) + new_message

This architecture — embedding-based retrieval as the memory substrate for autonomous agents — is increasingly the default pattern for production AI agents in 2026. The embedding is not replacing the LLM; it is providing the recall mechanism the LLM itself cannot replicate.


Evaluating Embedding Quality

Before committing to an embedding model for production, measure its performance on a retrieval task that represents your data. The standard public benchmarks are BEIR (Benchmarking IR) for retrieval and the broader MTEB (Massive Text Embedding Benchmark), whose original release spans 8 task types and 58 datasets across 112 languages.

For a quick internal evaluation:

  1. Sample 100-200 representative queries from your use case
  2. Annotate each query with 3-5 relevant documents from your corpus
  3. Run each query through the embedding model and retrieve the top 10 results
  4. Measure Recall@5 (what fraction of relevant docs appear in the top 5 results) and NDCG@10 (normalized discounted cumulative gain, which weights highly-ranked relevant results more)
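Both metrics are short enough to implement directly. The sketch below assumes binary relevance (a document is either relevant or not) and uses made-up document ids; graded relevance would need a gain value per document instead of a set.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# One annotated query: the model's top 10 and the 3 documents marked relevant.
retrieved = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant = {"d1", "d3", "d5"}

r5 = recall_at_k(retrieved, relevant, k=5)
n10 = ndcg_at_k(retrieved, relevant, k=10)
print(round(r5, 3), round(n10, 3))
```

Average each metric over all annotated queries to compare candidate models. Recall@5 tells you whether the right documents make it into the context at all, while NDCG@10 also rewards putting them near the top.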

A model that scores well on MTEB may not score well on your domain-specific data. Medical, legal, code, and multilingual corpora frequently benefit from domain-specific fine-tuning or specialist models (e.g., BioLinkBERT for biomedical text, CodeBERT for code search).


Summary Table: Embedding System Design Decisions

| Decision | Options | When to Use |
| --- | --- | --- |
| Embedding model | text-embedding-3-small / voyage-3-large / MiniLM | Managed, cost-sensitive / quality-critical / offline, open-source |
| Dimensions | 384 / 768 / 1536 / 3072 | Prototype / balanced / production / max quality |
| Chunking | Sentence / paragraph / section | QA / general / technical docs |
| Vector DB | Zvec / pgvector / Qdrant / Pinecone | In-process / Postgres stack / self-hosted / managed |
| Search type | Vector only / hybrid BM25+vector | Semantic queries / mixed keyword+semantic |
| Memory system | In-context / embedding-based retrieval | Short sessions / long-term agents |

Embeddings are not a detail to abstract away behind a framework. They are the mechanism through which meaning travels from human language into the numerical operations that power retrieval, memory, and semantic understanding. Understanding them at this level — not just how to call client.embeddings.create(), but why cosine similarity works, why dimensionality matters, and where embeddings fail — is what separates engineers who can debug retrieval problems from those who cannot.
