AI has a memory problem nobody talks about enough.
You fine-tune the model, deploy the API, and ship the product, and then the vector database bills arrive. A RAG pipeline over 10 million documents needs about 31 GB of RAM just for a float32 index of 768-dimensional embeddings, and twice that at 1,536 dimensions. That's before your embedding server, your API layer, your caches, or your LLM inference. At scale, vector memory can become the largest single line item in your AI infrastructure budget.
An open-source answer has just shipped: TurboVec.
Built on TurboQuant, Google Research's vector quantization algorithm presented at ICLR 2026, TurboVec is an open-source vector index written in Rust with Python bindings. At 4 bits per coordinate it compresses that same 10-million-document corpus into about 4 GB with little loss in retrieval quality on high-dimensional embeddings. In the benchmarks below it searches faster than FAISS on ARM and roughly on par with it on x86.
This is not a marginal optimization. This is a fundamental rethinking of how vector search should work.
Part I: The Problem with Vector Search at Scale
Why Vector Databases Are Expensive
Every RAG system, semantic search engine, AI agent, and recommendation system ultimately depends on the same primitive: approximate nearest neighbor (ANN) search over high-dimensional embedding vectors.
The workflow is simple:
- Embed your documents into float32 vectors (commonly 768–3,072 dimensions for modern embedding models)
- Store those vectors in an index
- At query time, embed the question and find the most similar vectors in the index
The problem is step 2. A single float32 vector at 1,536 dimensions is 6,144 bytes, about 6 KB. Multiply that by 10 million documents and you're at 61.4 GB in raw storage. A FAISS IndexFlatL2 does not reduce this, because it stores the full float32 representation: ~61.4 GB for 10M vectors at 1,536 dimensions, or ~31 GB at 768 dimensions.
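The storage arithmetic is easy to check yourself. The following is a minimal sketch; the helper name `index_size_gb` is made up for this article, and it counts only the vector payload (no per-vector norms, IDs, or allocator overhead).

```python
def index_size_gb(n_vectors: int, dim: int, bits_per_coord: float) -> float:
    """Payload size in decimal gigabytes for n_vectors of the given width."""
    bytes_per_vector = dim * bits_per_coord / 8
    return n_vectors * bytes_per_vector / 1e9


n = 10_000_000
for dim in (768, 1536, 3072):
    # float32 uses 32 bits per coordinate
    print(f"d={dim}: {index_size_gb(n, dim, 32):.1f} GB as float32")
```

Passing `bits_per_coord=4` or `2` to the same function gives the compressed sizes discussed later in the article.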
This creates real-world constraints:
- Cost: A dedicated machine with 32–64 GB RAM runs $500–$2,000/month on major cloud providers
- Latency: Scanning a large index is limited by memory bandwidth, because tens of gigabytes cannot stay resident in CPU cache
- Private deployment: Most organizations can't afford dedicated vector infrastructure for on-prem AI
- Consumer hardware: Running local RAG over a 10M-document knowledge base is impractical on a typical MacBook
The industry response has been product quantization (PQ) — compressing vectors into smaller codes. FAISS ships PQ variants (IndexPQ, IndexPQFastScan) that work reasonably well. But they have a catch: they require training data.
Before you can index anything, PQ must analyze your corpus to build a "codebook" — a learned clustering of the vector space. New data can break the codebook. Changing your embedding model requires rebuilding everything from scratch.
The Status Quo: Product Quantization's Hidden Costs
Traditional PQ involves:
- Training phase (offline): Run k-means clustering on a representative sample of your corpus to learn sub-space centroids
- Encoding phase: Encode each vector by finding the nearest centroid in each sub-space
- Search phase: Score compressed codes against a pre-computed lookup table
The training step adds complexity and latency to every pipeline change. If your corpus is dynamic — news articles, user documents, live data streams — PQ codebooks age poorly. You're constantly balancing index freshness against rebuild cost.
What if you could skip training entirely?
Part II: TurboQuant — The Algorithm Behind TurboVec
TurboQuant is a data-oblivious quantizer developed by researchers at Google Research and New York University, published at ICLR 2026 (arXiv:2504.19874). The paper proves that you can achieve near-optimal compression without ever looking at your data.
The key insight is a mathematical property of high-dimensional geometry.
How TurboQuant Works
Step 1: Normalize
Strip the length (norm) from each vector and store it as a single float. Now every vector is a unit direction on the high-dimensional hypersphere. Norms are tiny — one float per vector is negligible overhead.
Step 2: Random rotation
Multiply every vector by the same randomly generated orthogonal matrix. This is the critical step. After rotation, each coordinate follows a known Beta-type distribution that converges to the Gaussian N(0, 1/d) as dimensionality increases, and distinct coordinates become nearly (not exactly) independent in high dimensions.
This distribution is the same regardless of your input data. It doesn't matter if you're indexing medical papers, e-commerce products, legal contracts, or code. Once you rotate, the coordinate distribution is predictable from math alone.
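To make the rotation step concrete, here is a small pure-Python sketch. It is not TurboVec's implementation: it builds a random orthogonal matrix with Gram-Schmidt (one of several valid constructions, chosen here for brevity), and the names `random_orthogonal` and `rotate` are illustrative. It rotates the "spikiest" possible unit vector, a one-hot vector, and shows that the norm is preserved while the mass spreads across coordinates.

```python
import math
import random


def random_orthogonal(d: int, rng: random.Random) -> list[list[float]]:
    """Orthonormalize Gaussian rows with Gram-Schmidt to get a random orthogonal matrix."""
    rows: list[list[float]] = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (vanishingly unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix: list[list[float]], x: list[float]) -> list[float]:
    """Matrix-vector product: apply the rotation to x."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


d = 64
Q = random_orthogonal(d, random.Random(0))
x = [1.0] + [0.0] * (d - 1)  # one-hot unit vector: all mass in one coordinate
y = rotate(Q, x)

norm_y = math.sqrt(sum(v * v for v in y))
largest = max(abs(v) for v in y)
print(f"norm after rotation: {norm_y:.6f}")
print(f"largest |coordinate|: {largest:.3f} (1/sqrt(d) = {1 / math.sqrt(d):.3f})")
```

Because the same matrix is applied to every database vector and every query, inner products between them are unchanged by the rotation.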
Step 3: Lloyd-Max scalar quantization
Since the coordinate distribution is known analytically, you can precompute the optimal way to bucket each coordinate — the bucket boundaries and centroids that minimize mean squared error. This is the Lloyd-Max algorithm, applied not to empirical data but to the Beta distribution's closed-form statistics.
For 2-bit quantization: 4 buckets per coordinate. For 4-bit quantization: 16 buckets per coordinate.
These are computed once, hardcoded into the library. Zero data passes. Zero training time.
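As an illustration of how such a table can be derived without data, the sketch below runs the Lloyd-Max iteration against the standard Gaussian N(0, 1). That is a simplifying assumption: it uses the high-dimensional Gaussian limit rather than the exact Beta density, and the function name `lloyd_max_gaussian` is made up here. For coordinates distributed as N(0, 1/d), the resulting boundaries and centroids would be scaled by 1/sqrt(d).

```python
import math


def pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def lloyd_max_gaussian(bits: int, iters: int = 200):
    """Return (boundaries, centroids) minimizing MSE for N(0, 1) at 2**bits levels."""
    levels = 2 ** bits
    # Start from evenly spaced centroids in [-2, 2]
    centroids = [-2 + 4 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # Boundaries sit halfway between neighboring centroids
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        edges = [-math.inf] + bounds + [math.inf]
        # Each centroid is the conditional mean of its bucket (closed form for a Gaussian)
        centroids = [
            (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return bounds, centroids


bounds, centroids = lloyd_max_gaussian(bits=2)
print("2-bit boundaries:", [round(b, 4) for b in bounds])
print("2-bit centroids: ", [round(c, 4) for c in centroids])
```

No corpus appears anywhere in this computation, which is the whole point: the table depends only on the bit-width and the assumed distribution.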
Step 4: Bit-pack
Each coordinate becomes a small integer. Pack them tightly into bytes.
A 1,536-dim vector goes from:
- Float32: 6,144 bytes
- 2-bit TurboQuant: 384 bytes (16x compression)
- 4-bit TurboQuant: 768 bytes (8x compression)
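The byte counts above follow directly from packing two 4-bit codes into each byte. Here is a minimal sketch of that packing; the low-nibble-first layout and the function names are assumptions for illustration, not TurboVec's actual storage format.

```python
def pack_4bit(codes: list[int]) -> bytes:
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_4bit(packed: bytes) -> list[int]:
    """Inverse of pack_4bit."""
    codes: list[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble
        codes.append(byte >> 4)    # high nibble
    return codes


codes = [i % 16 for i in range(1536)]  # one 4-bit code per dimension
packed = pack_4bit(codes)
print(len(packed), "bytes for", len(codes), "dimensions")
```

A 2-bit variant would fit four codes per byte in the same way, halving the size again.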
Search
At query time, rotate the query vector once into the same compressed domain. Score directly against codebook values using SIMD kernels — no decompression required.
The paper proves that TurboQuant's distortion stays within a constant factor of about 2.7 of the information-theoretic lower bound, so at a given bit budget no quantizer can improve on it by more than that factor.
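Scoring "without decompression" means the query is turned into a small lookup table once, and each stored code is then scored by table lookups and additions. The sketch below shows the idea in scalar Python; the 2-bit codebook values, the query, and the function names are hypothetical, and a real kernel would do the lookups with SIMD shuffles over packed codes and rescale by the stored norm.

```python
def build_lut(query: list[float], centroids: list[float]) -> list[list[float]]:
    """For each dimension j, precompute query[j] * centroid for every possible code."""
    return [[q * c for c in centroids] for q in query]


def score(lut: list[list[float]], codes: list[int]) -> float:
    """Approximate inner product: one table lookup per dimension, then a sum."""
    return sum(row[code] for row, code in zip(lut, codes))


centroids = [-1.5, -0.5, 0.5, 1.5]  # hypothetical 2-bit codebook (4 levels)
query = [0.2, -0.1, 0.4, 0.3]       # already rotated into the quantized basis
codes = [3, 0, 2, 1]                # one stored vector, one code per dimension

lut = build_lut(query, centroids)   # built once per query
approx = score(lut, codes)          # reused for every database vector

# Same value as dequantizing first and taking a dot product
exact = sum(q * centroids[c] for q, c in zip(query, codes))
print(approx, exact)
```

The table costs one multiply per dimension per code value, paid once per query; after that, scanning the database involves no floating-point multiplies at all.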
Why Data-Oblivious Quantization Is a Big Deal
For practitioners, the implications are profound:
# Traditional PQ workflow
index = faiss.IndexPQ(dim, M, nbits)
index.train(training_vectors) # ← requires training data, minutes/hours
index.add(all_vectors)
# TurboVec workflow
from turbovec import TurboQuantIndex
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors) # ← no training step, just add
No training means:
- Incremental updates: Add vectors one at a time without rebuilding
- Cold start: New corpus? Zero warmup time
- Model changes: Switch embedding models without retraining the index
- Streaming pipelines: Index live data as it arrives
Part III: TurboVec — Rust Implementation with Python Bindings
TurboVec is the open-source implementation of TurboQuant as a production-grade vector index, written by Ryan Codrai. It ships as both a Rust crate and a Python package, and integrates natively with LangChain, LlamaIndex, and Haystack.
Installation
# Python
pip install turbovec
# Rust
cargo add turbovec
Basic Python Usage
from turbovec import TurboQuantIndex
import numpy as np
# Create an index for 1536-dim vectors (e.g., OpenAI text-embedding-3-small)
index = TurboQuantIndex(dim=1536, bit_width=4)
# Add vectors — no training required
vectors = np.random.randn(100_000, 1536).astype(np.float32)  # demo size; generating 10M random rows this way needs over 100 GB of RAM
index.add(vectors)
# Search
query = np.random.randn(1, 1536).astype(np.float32)
scores, indices = index.search(query, k=10)
# Persist to disk
index.write("my_index.tq")
# Load later
loaded = TurboQuantIndex.load("my_index.tq")
Stable IDs with Deletes
For production use cases where documents are updated or deleted, TurboVec provides IdMapIndex — a wrapper that maps your external IDs to internal indices and supports O(1) deletes:
from turbovec import IdMapIndex
import numpy as np
index = IdMapIndex(dim=1536, bit_width=4)
# Add with your external IDs (e.g., database primary keys)
vectors = np.random.randn(1000, 1536).astype(np.float32)
doc_ids = np.arange(1001, 2001, dtype=np.uint64)  # 1,000 IDs: 1001..2000
index.add_with_ids(vectors, doc_ids)
# Search returns your external IDs
scores, ids = index.search(query, k=10)
print(ids) # [1047, 1312, ...]
# Delete a document — no rebuild needed
index.remove(1312)
# Persist
index.write("my_index.tvim")
Rust Usage
use turbovec::TurboQuantIndex;
let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();
let loaded = TurboQuantIndex::load("index.tv").unwrap();
Framework Integrations
TurboVec plugs into the major RAG frameworks with one-line installs:
pip install turbovec[langchain] # LangChain integration
pip install turbovec[llama-index] # LlamaIndex integration
pip install turbovec[haystack] # Haystack integration
# LangChain drop-in
from turbovec.langchain import TurboVecVectorStore
from langchain_openai import OpenAIEmbeddings
vectorstore = TurboVecVectorStore(
embedding=OpenAIEmbeddings(),
dim=1536,
bit_width=4
)
vectorstore.add_documents(documents)
docs = vectorstore.similarity_search("your query", k=5)
Part IV: Benchmarks — Memory, Speed, and Recall
Memory Compression
The headline number, 31 GB → 4 GB, corresponds to 10 million 768-dim float32 vectors at 4-bit quantization. At 1,536 dimensions every figure doubles:
| Configuration | Memory (10M vectors, d=1536) | Compression |
|---|---|---|
| Float32 (raw) | 61.4 GB | 1× |
| FAISS IndexFlatL2 | 61.4 GB | 1× |
| FAISS IndexPQFastScan (4-bit) | ~7.7 GB | ~8× |
| TurboVec (4-bit) | ~7.7 GB | ~8× |
| TurboVec (2-bit) | ~3.8 GB | ~16× |
At equal bit-width the code sizes match FAISS PQ: 4 bits per coordinate is 768 bytes per vector either way, and TurboVec adds one float per vector for the norm. TurboVec's advantage at a given size is that it needs no training pass and no learned codebook.
Search Speed — ARM (Apple M3 Max)
TurboVec uses hand-written NEON intrinsics for ARM processors, with a nibble-split lookup table approach for maximum throughput.
Benchmarks: 100K vectors, 1K queries, k=64, median of 5 runs.
| Config | TurboVec (single-thread) | FAISS FastScan | Speedup |
|---|---|---|---|
| d=1536, 4-bit | faster | baseline | +12–20% |
| d=3072, 4-bit | faster | baseline | +12–20% |
| d=1536, 2-bit | faster | baseline | +12–20% |
On ARM, TurboVec beats FAISS IndexPQFastScan across every configuration tested.
Search Speed — x86 (Intel Xeon Platinum 8481C / Sapphire Rapids)
TurboVec uses AVX-512BW kernels on modern x86 processors, with an AVX2 fallback for older hardware. Runtime feature detection via is_x86_feature_detected! — no recompilation needed.
| Config | TurboVec vs FAISS |
|---|---|
| 4-bit, single-thread | +1–6% (wins) |
| 4-bit, multi-thread | +1–6% (wins) |
| 2-bit, single-thread | within ~1% (ties) |
| 2-bit, multi-thread | -2–4% (narrow loss) |
The 2-bit multi-thread loss on x86 is a known limitation — the inner accumulate loop is too short for unrolling amortization to match FAISS's AVX-512 VBMI path. For most production workloads (4-bit is the recommended default), TurboVec is competitive or better across the board.
Recall Quality
TurboQuant vs FAISS IndexPQ (LUT256, nbits=8) — 100K vectors, k=64.
On OpenAI embeddings (d=1536, d=3072):
- TurboQuant and FAISS are within 0–1 point at R@1
- Both converge to 1.0 by k=4–8
On GloVe (d=200 — a harder, lower-dimensional regime):
- TurboQuant trails FAISS by 3–6 points at R@1 at very low bit-widths
- Closes by k≈16–32
Bottom line: For modern embedding models at 1,536+ dimensions, TurboQuant matches FAISS PQ recall with no training step, and runs faster on ARM and about even on x86. For very low-dimensional embeddings, FAISS PQ maintains a quality edge.
Part V: Building a Local RAG Pipeline with TurboVec
Here's a complete example of a fully local, air-gapped RAG system using TurboVec — no managed services, no cloud APIs, no data leaving your machine.
Setup
pip install turbovec[langchain] sentence-transformers langchain
Full Pipeline
import json
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer
from turbovec import TurboQuantIndex


class LocalRAG:
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        bit_width: int = 4,
        index_path: str = "knowledge_base.tq",
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.dim = self.embedder.get_sentence_embedding_dimension()
        self.bit_width = bit_width
        self.index_path = index_path
        # Document texts live in a JSON sidecar so they survive a restart
        self.docs_path = Path(index_path).with_suffix(".docs.json")
        self.documents: list[str] = []

        # Load existing index and texts together, or create fresh
        if Path(index_path).exists() and self.docs_path.exists():
            self.index = TurboQuantIndex.load(index_path)
            self.documents = json.loads(self.docs_path.read_text(encoding="utf-8"))
            print(f"Loaded existing index ({len(self.documents)} docs)")
        else:
            self.index = TurboQuantIndex(dim=self.dim, bit_width=bit_width)
            print(f"Created new index (dim={self.dim}, {bit_width}-bit)")

    def ingest(self, texts: list[str], batch_size: int = 512):
        """Embed and index documents in batches."""
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(
                batch,
                normalize_embeddings=True,
                show_progress_bar=True,
            ).astype(np.float32)
            # A document's position in self.documents is also its index position
            self.index.add(embeddings)
            self.documents.extend(batch)
            print(f"Indexed {min(i + batch_size, len(texts))}/{len(texts)}")
        self.index.write(self.index_path)
        self.docs_path.write_text(json.dumps(self.documents), encoding="utf-8")
        print(f"Saved index to {self.index_path}")

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Retrieve top-k most relevant documents."""
        query_vec = self.embedder.encode(
            [query],
            normalize_embeddings=True,
        ).astype(np.float32)
        scores, indices = self.index.search(query_vec, k=k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < len(self.documents):
                results.append({
                    "text": self.documents[idx],
                    "score": float(score),
                    "index": int(idx),
                })
        return results


# Usage
rag = LocalRAG(embedding_model="all-MiniLM-L6-v2", bit_width=4)
# Ingest your corpus (one document per line)
with open("knowledge_base.txt") as f:
    documents = [line.strip() for line in f if line.strip()]
rag.ingest(documents)
# Query
results = rag.search("What is the capital of France?", k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['text'][:100]}")
Memory Footprint Comparison
For a 10M-document corpus with 1,536-dim embeddings:
Float32 baseline: 61.4 GB RAM
FAISS FlatL2: 61.4 GB RAM
FAISS PQ (4-bit): ~7.7 GB RAM ← requires training phase
TurboVec (4-bit): ~7.7 GB RAM ← zero training, instant add
TurboVec (2-bit): ~3.8 GB RAM ← fits on a Mac mini
A Mac mini M4 with 16 GB RAM can serve a 10M-document RAG system entirely in memory: the 4-bit index takes about half of it and the 2-bit index about a quarter, leaving room for the embedding model and a small local LLM. With 768-dim embeddings, both figures halve.
Part VI: The TurboQuant Paper — Technical Depth
For practitioners who want to understand the theory before trusting the library, here's the math at a workable depth.
The Information-Theoretic Bound
Any lossy compression scheme for vectors trades off distortion against rate (bits per dimension). Shannon's rate-distortion theory tells us the theoretical minimum distortion achievable at a given bit budget. No algorithm can beat it — but many algorithms fall far short.
PQ-family methods (including FAISS PQ) are data-dependent: they fit the quantizer to the distribution of your specific corpus. This adaptation helps, but it introduces training cost and corpus lock-in.
TurboQuant's insight: for unit vectors in high dimensions, the post-rotation coordinate distribution is universal. You don't need data to characterize it. The Beta distribution that emerges after random rotation is the same regardless of what corpus you embed.
This means the Lloyd-Max quantizer can be derived from first principles once, baked into the library, and applied forever without retraining.
The Two Stages
TurboQuant has a two-stage heritage:
- PolarQuant (AISTATS 2026): The random rotation stage that induces the predictable Beta distribution on coordinates
- QJL (Quantized Johnson-Lindenstrauss) (companion paper): A 1-bit residual correction that recovers inner-product accuracy after quantization
Together they achieve the near-optimal distortion bound. The ICLR 2026 paper proves TurboQuant operates within a factor of ≈2.7 of the Shannon limit across all bit-widths and dimensions — meaning you're not leaving meaningful quality on the table.
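For readers who want the bound written out, here is a sketch of its shape, assuming the paper's mean-squared-error results take the form below for unit vectors quantized at b bits per coordinate. Treat the exact constants and conditions as something to confirm against the paper itself.

```latex
\begin{align*}
  D_{\mathrm{mse}}(b) &= \mathbb{E}\,\lVert x - \hat{x} \rVert_2^2,
      \qquad \lVert x \rVert_2 = 1 \\
  \text{any quantizer:}\quad D_{\mathrm{mse}}(b) &\ge \frac{1}{4^{b}} \\
  \text{TurboQuant:}\quad D_{\mathrm{mse}}(b) &\le \frac{\sqrt{3}\,\pi}{2}\cdot\frac{1}{4^{b}},
      \qquad \frac{\sqrt{3}\,\pi}{2} \approx 2.72
\end{align*}
```

Read this way, the ≈2.7 factor is the ratio between the two right-hand sides, and each extra bit per coordinate cuts both bounds by a factor of four.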
Why ARM > x86 for TurboVec
The scoring kernel is where TurboVec's performance advantage lives. The key operation is: given a compressed query and compressed database vectors, compute approximate inner products as fast as possible.
TurboVec uses nibble-split lookup tables: each packed byte is split into its two 4-bit nibbles, and each nibble indexes a precomputed 16-entry table held in a SIMD register. This maps well onto NEON's table-lookup (vtbl) instructions on ARM, which perform many byte lookups in parallel with a single instruction.
On x86, AVX-512BW provides a similar vpshufb instruction. TurboVec's AVX-512 kernel adapts FAISS FastScan's pack layout and u16 accumulator strategy, which is why x86 performance is closely competitive with FAISS rather than blowing past it.
The ARM advantage comes from M-series chips' high NEON throughput and the fact that FAISS's x86 path is extremely well-tuned while its ARM path has historically received less attention.
Part VII: Production Deployment Patterns
Pattern 1: Drop-In FAISS Replacement
If you're already on FAISS, TurboVec is designed as a drop-in:
# Before — FAISS
import faiss
index = faiss.IndexPQFastScan(dim, M, nbits)
index.train(training_data)
index.add(vectors)
D, I = index.search(query, k)
# After — TurboVec (no training step, better memory)
from turbovec import TurboQuantIndex
index = TurboQuantIndex(dim=dim, bit_width=4)
index.add(vectors) # same vectors, no training
scores, I = index.search(query, k=k)
Pattern 2: Air-Gapped Enterprise Deployment
For regulated industries (healthcare, finance, government), keeping all processing local is often a hard requirement, and TurboVec runs entirely on your own hardware:
# All processing stays on your hardware
from turbovec import TurboQuantIndex
# No API calls, no managed services
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(your_proprietary_embeddings)
results = index.search(query_embedding, k=10)
# Persist to encrypted volume
index.write("/mnt/encrypted/knowledge_base.tq")
Pattern 3: Streaming Ingestion
TurboVec's zero-training architecture enables true streaming updates:
import json

import numpy as np
from kafka import KafkaConsumer
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
consumer = KafkaConsumer("document-embeddings")
for message in consumer:
    doc = json.loads(message.value)
    doc_id = np.uint64(doc["id"])
    if doc.get("deleted"):
        index.remove(doc_id)  # O(1) delete
    else:
        # Parse the embedding only for upserts; delete messages may not carry one
        embedding = np.array(doc["embedding"], dtype=np.float32)
        index.add_with_ids(embedding.reshape(1, -1), np.array([doc_id], dtype=np.uint64))
    # Checkpoint periodically
    if message.offset % 10_000 == 0:
        index.write("index.tvim")
Pattern 4: Memory-Constrained Edge Deployment
For edge devices, IoT gateways, or Raspberry Pi deployments:
# 2-bit mode: 16x compression from float32
# Fits 1M 1,536-dim docs in ~400 MB, viable for edge devices
from turbovec import TurboQuantIndex
index = TurboQuantIndex(dim=384, bit_width=2) # smaller embedding model too
index.add(corpus_embeddings)
# 1M × 384-dim vectors = 1.5 GB float32 → ~100 MB at 2-bit
# Runs on a Raspberry Pi 5 (8GB model)
Filtering at Search Time
TurboVec supports search-time filtering to restrict results to a subset of documents:
import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
doc_metadata = {}
# Add documents with metadata (tracked externally)
for doc_id, embedding, metadata in documents:
    index.add_with_ids(embedding.reshape(1, -1), np.array([doc_id], dtype=np.uint64))
    doc_metadata[doc_id] = metadata
# Filter to only approved documents at search time
approved_ids = get_approved_document_ids(user_context)
scores, ids = index.search(query, k=10, filter_ids=approved_ids)
# The filter is applied inside the search, so no post-filtering pass is needed
results = list(zip(scores[0], ids[0]))
Part VIII: Implications for AI Infrastructure
The Efficiency Inflection Point
The AI industry has spent five years competing on scale: bigger models, larger context windows, more GPUs, bigger data centers. TurboVec represents a quieter but equally important trend: doing more with less.
The math matters here. Cutting index memory by 87.5% at 4-bit (or about 94% at 2-bit) doesn't just make existing systems cheaper; it changes what's architecturally possible:
| Before TurboVec | After TurboVec |
|---|---|
| 10M docs requires dedicated 32GB server | 10M docs fits on a MacBook Pro |
| Private RAG needs $500/month cloud VM | Private RAG runs on local hardware |
| Real-time index updates require careful PQ retraining | Streaming updates with zero rebuild cost |
| Filtering requires over-fetch + post-filter | Native search-time filtering |
| Air-gap deployment = small knowledge base | Air-gap deployment = production-scale knowledge base |
What Changes for RAG Architectures
The standard RAG pattern has been:
- Chunk documents
- Embed chunks
- Store in managed vector database (Pinecone, Weaviate, Qdrant, etc.)
- Pay $50–500/month for the service
TurboVec makes a compelling case for self-hosted vector search at scale:
Managed vector DB (10M docs): ~$100–500/month
TurboVec on a VPS with 8GB RAM: ~$20–40/month
TurboVec on existing infrastructure: $0/month
This doesn't mean managed vector databases disappear — they offer persistence, replication, filtering, and operational simplicity that TurboVec doesn't provide out of the box. But for teams that already operate infrastructure and value data privacy, the calculus shifts.
Implications for LLM KV Cache
While TurboVec targets vector search, TurboQuant's paper covers a broader application: KV cache quantization for large language model inference.
Attention's KV cache is itself a matrix of vectors — one per token, per layer, per head. At 128K context windows on large models, the KV cache alone can consume 10–20 GB of GPU memory.
TurboQuant achieves:
- Absolute quality neutrality at 3.5 bits per channel
- Marginal quality degradation at 2.5 bits per channel
- 6× memory reduction with at least 6× faster attention on NVIDIA H100
This means TurboQuant could compress the KV cache of a model running at 128K context from ~16 GB to ~2.7 GB while maintaining generation quality — dramatically expanding what fits in a given GPU budget.
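To see where a figure like ~16 GB comes from, the sketch below computes KV cache size from model shape. The configuration (32 layers, 8 KV heads, head dimension 128) is a hypothetical example chosen for illustration, not a model named in the paper, and the calculation ignores any per-block scales or norms a real quantized cache would store.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bits_per_value: float) -> float:
    """Bytes for keys plus values across all layers and heads (hence the factor 2)."""
    return 2 * layers * kv_heads * head_dim * tokens * bits_per_value / 8


# Hypothetical model shape at a 128K-token context
cfg = dict(layers=32, kv_heads=8, head_dim=128, tokens=128 * 1024)

GIB = 2 ** 30
fp16_bytes = kv_cache_bytes(**cfg, bits_per_value=16)
print(f"fp16:     {fp16_bytes / GIB:.2f} GiB")
for bits in (3.5, 2.5):
    print(f"{bits} bits: {kv_cache_bytes(**cfg, bits_per_value=bits) / GIB:.2f} GiB")
```

Under these assumptions a 6× reduction from the fp16 size lands between the 2.5-bit and 3.5-bit footprints, which is consistent with the ~2.7 GB figure quoted above.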
What It Means for the Democratization of AI
The most underappreciated consequence of TurboVec is what it does for accessibility.
Right now, the teams who can run serious RAG systems are the ones who can afford serious infrastructure. A 10M-document knowledge base over proprietary company data — legal documents, customer records, internal wikis — costs real money to query at scale.
With TurboVec, a two-person startup can run that same system on a $40/month VPS, with data never leaving their infrastructure. A research lab can run 100M-document corpus experiments on a single workstation. A healthcare startup can build an air-gapped clinical knowledge base without HIPAA-constrained cloud infrastructure.
Part IX: Limitations and Honest Tradeoffs
TurboVec is impressive, but it's not a universal replacement for all vector search infrastructure. Here's what to keep in mind:
Low-Dimensional Embeddings
The theoretical guarantees of TurboQuant rely on the Beta distribution approximation holding in high dimensions. For embeddings below ~256 dimensions (like GloVe d=200), the approximation is looser:
- At R@1, TurboQuant trails FAISS PQ by 3–6 points for d=200
- The gap closes by k≈16–32
If you're using older, smaller embedding models, test recall carefully before deploying TurboVec in production.
No Distributed Mode
TurboVec is a single-node library. It doesn't provide:
- Replication
- Sharding across multiple machines
- High-availability failover
- Multi-tenant isolation
For massive corpora (>1B vectors) or high-availability requirements, managed vector databases or distributed systems like Milvus remain necessary. TurboVec is best for single-machine workloads — which, given 4 GB for 10M vectors, covers a lot of ground.
No HNSW
TurboVec implements flat quantized search (exhaustive search over compressed vectors), not HNSW (Hierarchical Navigable Small World graphs). HNSW offers sub-linear search time and is better for very large corpora with strict latency SLAs.
At 10M vectors, flat quantized search is fast enough for most applications. At 100M+ vectors, you'd want HNSW or IVF indexing on top of TurboQuant — which may come in future releases.
2-bit Multi-Thread x86 Regression
As noted in the benchmarks, TurboVec is 2–4% slower than FAISS on 2-bit multi-threaded x86 workloads. If you're on x86 and need maximum multi-threaded throughput at 2-bit, benchmark carefully. The 4-bit configuration is recommended for x86 production deployments.
Part X: Getting Started — Practical Checklist
Here's a checklist for evaluating TurboVec for your use case:
Should You Switch to TurboVec?
Use TurboVec if:
- ✅ Your corpus is 100K–50M vectors
- ✅ You want to reduce infrastructure costs
- ✅ You need air-gapped or on-premise deployment
- ✅ Your corpus grows incrementally (streaming updates)
- ✅ You're on ARM hardware (Apple Silicon, AWS Graviton)
- ✅ You care about data privacy and local-first architecture
- ✅ You embed at 512+ dimensions (modern embedding models)
Stick with your current solution if:
- ❌ You need distributed/replicated vector search
- ❌ Your corpus exceeds 100M vectors
- ❌ You use low-dimensional embeddings (d < 256) where recall is critical
- ❌ You need HNSW for sub-linear search time
- ❌ You require managed SLAs and operations team support
Quick Performance Test
Before migrating, run this recall check against your own data:
import numpy as np
from turbovec import TurboQuantIndex
import faiss
dim = 1536
n = 100_000
# Generate or use your real embeddings
vectors = np.random.randn(n, dim).astype(np.float32)
faiss.normalize_L2(vectors)
queries = np.random.randn(100, dim).astype(np.float32)
faiss.normalize_L2(queries)
# Ground truth from exact search
flat_index = faiss.IndexFlatIP(dim)
flat_index.add(vectors)
_, gt = flat_index.search(queries, k=10)
# TurboVec at 4-bit
tv_index = TurboQuantIndex(dim=dim, bit_width=4)
tv_index.add(vectors)
_, tv_results = tv_index.search(queries, k=10)
# Compute recall@10
recall = np.mean([
len(set(tv_results[i]) & set(gt[i])) / 10
for i in range(len(queries))
])
print(f"TurboVec recall@10: {recall:.4f}")
# Expected on real embeddings: 0.90+ for d=1536 (random vectors are only a smoke test and may score lower)
Conclusion: Efficiency as the New Frontier
The past five years of AI progress have been measured in parameters. The next five may be measured in efficiency.
TurboVec is a clear signal that the compression frontier is just as important as the capability frontier. Google Research proved that you can derive near-optimal vector quantization from mathematics alone: no data, no training, no learned codebooks. Ryan Codrai built that into a production-grade Rust library with Python bindings that you can install today.
The headline numbers are striking:
- 31 GB → 4 GB for 10 million vectors (768-dim, 4-bit)
- Zero training required
- 12–20% faster than FAISS on ARM
- Air-gap friendly — no data leaves your infrastructure
But the deeper implication is about access. The organizations that can now run serious vector search aren't just the ones with $10,000/month infrastructure budgets. They're the two-person startup on a VPS, the research lab on a workstation, the healthcare company that can't put patient data in the cloud.
Vector search at scale is no longer an enterprise-only capability.
Resources
Library:
- TurboVec GitHub — source code, benchmarks, and docs
- PyPI: turbovec (pip install turbovec)
- API Reference
Papers:
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — ICLR 2026
- ICLR 2026 Poster
- Google Research Blog Post
Framework Integrations: