Vector DB Tuning: Scaling to Millions of Embeddings with HNSW
Vector DB · Performance · PostgreSQL


Apr 15, 2026 · 14 min read

When you hit 1 million embeddings, brute-force flat search becomes too slow to serve queries interactively. You need an Approximate Nearest Neighbor (ANN) index. The gold standard for this is HNSW (Hierarchical Navigable Small World).

In this guide, we will look at the math and engineering behind scaling vector databases for production AI.

# How HNSW Works: The Intuition

Think of HNSW as a multi-layered graph. The top layers have fewer nodes and long-range connections (like an express train), while the bottom layers have all the nodes and short-range connections (like local streets). Search starts at the top and 'zooms in' as it descends.
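As a rough sketch of the math behind that shape: in the original HNSW paper, each inserted node draws its maximum layer from an exponentially decaying distribution, so each higher layer holds exponentially fewer nodes and a search descends in roughly logarithmic time. The assignment rule (with $m_L$ a normalization constant, commonly set to $1/\ln(M)$) is:

$$l = \lfloor -\ln(u) \cdot m_L \rfloor, \quad u \sim \mathrm{Uniform}(0, 1)$$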

# 1. Tuning the Index Parameters

There are three main 'knobs' you can turn in HNSW to optimize your performance:

M (Max Connections)

This defines the number of bi-directional links created for every new node.

  • **Higher M:** Better recall, but more memory usage and slower index build times (see the size-check sketch at the end of this section).
  • **Recommended:** 16 to 32 for standard RAG; 64+ for complex, high-dimensional datasets.
efConstruction

This controls the trade-off between index build speed and graph quality.

  • **Higher efConstruction:** Slower builds, but a much more accurate graph. This is only relevant when you are *inserting* data.
  • **Recommended:** 200 to 400.
efSearch

This is a query-time parameter. It defines how many 'neighbors' the search algorithm explores before stopping.

  • **Higher efSearch:** Slower queries, but higher recall (less chance of missing the true nearest neighbor).
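To make the memory cost of M concrete, you can build two indexes with different `m` values over the same column and compare their on-disk footprint. A minimal pgvector sketch, using the `items` table from the example in section 4 (the index names here are hypothetical):

```sql
-- Two HNSW graphs over the same column, differing only in m.
CREATE INDEX items_hnsw_m16 ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

CREATE INDEX items_hnsw_m64 ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 64, ef_construction = 200);

-- Compare on-disk size; the m = 64 graph stores far more links per node.
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE indexrelname LIKE 'items_hnsw_m%';
```

In practice you would keep only one of these; building both is just a way to measure the trade-off on your own data before committing.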
# 2. Quantization: Saving Memory (and Money)

Embeddings are large. A single 1536-dimensional float32 vector (OpenAI's `text-embedding-3-small`) takes up ~6KB of memory. Scaling to 10 million vectors requires ~60GB of RAM just for the raw vectors. This is where quantization comes in.

  • **Scalar Quantization (SQ):** Converts 32-bit floats into 8-bit integers. This reduces memory usage by 4x with minimal impact on recall (a pgvector sketch follows this list).
  • **Product Quantization (PQ):** Compresses the vector by breaking it into sub-vectors and mapping each one to an entry in a learned codebook. This can reduce memory by 10x-20x but has a higher impact on accuracy.
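pgvector does not expose 8-bit scalar quantization directly, but its `halfvec` type (pgvector 0.7.0+) applies the same idea at 16 bits, halving memory for both storage and the index. A minimal sketch, assuming the `items` table from section 4 holds 1536-dimensional embeddings:

```sql
-- Index over a half-precision (16-bit) cast of the embedding column.
CREATE INDEX ON items USING hnsw
    ((embedding::halfvec(1536)) halfvec_cosine_ops);

-- Query through the same cast so the planner can use the index.
SELECT id, content FROM items
ORDER BY embedding::halfvec(1536) <=> '[0.1, 0.2, ...]'
LIMIT 10;
```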
# 3. Filtering Performance: Pre-filtering vs. Post-filtering

Most RAG queries aren't just 'find similar vectors'. They are 'find similar vectors *where user_id = 123*'.

  • **Post-filtering:** You find the top 100 similar vectors and then filter out the ones that don't match the metadata. If only 2 of the 100 match, you return only 2 results. This is often poor UX.
  • **Pre-filtering:** The index itself understands the metadata and only searches within the subset of vectors that match. Modern databases like Milvus and Pinecone do this using 'Filtered HNSW' (a pgvector approximation is sketched after this list).
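Plain pgvector has no filtered HNSW, but a partial index approximates pre-filtering for predicates you know in advance: the graph is built over only the matching rows. A sketch, assuming a hypothetical `user_id` column on the `items` table:

```sql
-- The HNSW graph contains only this tenant's rows, so the filter is
-- applied before the graph search rather than after it.
CREATE INDEX items_user_123_hnsw ON items
USING hnsw (embedding vector_cosine_ops)
WHERE (user_id = 123);

-- The planner can use the partial index when the predicate matches.
SELECT id, content FROM items
WHERE user_id = 123
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```

This only scales to a handful of known predicates; for high-cardinality filters you still need post-filtering or a database with native filtered search.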
# 4. SQL Example: Tuning pgvector

If you are using PostgreSQL with `pgvector`, here is how you create and use a tuned HNSW index:

```sql
-- Create the index with specific m and ef_construction values
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);

-- At query time, you can dynamically adjust search depth
SET hnsw.ef_search = 100;

-- Efficiently retrieve the top 10 neighbors
SELECT id, content FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```
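One gotcha worth checking: if the planner decides not to use the index, the query silently falls back to an exact sequential scan with very different latency. A quick sanity check:

```sql
-- Confirm the HNSW index is used: the plan should show an Index Scan
-- on the hnsw index rather than a Seq Scan followed by a Sort.
EXPLAIN ANALYZE
SELECT id FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```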

# 5. Benchmarking Your Specific Distribution

Never trust generic leaderboards. Vector DB performance is highly dependent on your specific data distribution: a database that is fast for short text snippets might be slow for long technical documents.

I recommend using **ANN-Benchmarks**, an open-source harness for comparing ANN libraries and index configurations; you can plug in your actual production embeddings as a custom dataset.
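You can also sanity-check recall directly in Postgres, with no external tooling, by comparing the ANN answer against an exact scan for a sample of queries. A minimal sketch:

```sql
-- Exact top 10: disable index scans so Postgres does a full, exact scan.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, ...]' LIMIT 10;
COMMIT;

-- Approximate top 10 at the current hnsw.ef_search setting.
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, ...]' LIMIT 10;

-- Recall@10 = (number of ids appearing in both result sets) / 10.
```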

# Conclusion: The 95% Rule

Vector search is a trade-off. For 90% of business applications, I aim for **95% recall at 10ms latency**. Trying to squeeze out that last 4% of recall usually doubles your infrastructure cost for no perceptible benefit to the end user. Focus on your retrieval system's end-to-end quality, not just the nearest-neighbor math.

Key Takeaway

"Moving from demo to production requires shifting focus from prompt engineering to system engineering. The magic is in the retrieval loop."

Jayasoruban R

AI Full Stack Engineer · Chennai, India