When you hit 1 million embeddings, brute-force (flat) search stops being viable: every query scans every vector. You need an Approximate Nearest Neighbor (ANN) index, and the gold standard is HNSW (Hierarchical Navigable Small World) graphs.
In this 1,000+ word guide, we will look at the math and engineering behind scaling vector databases for production AI.
# How HNSW Works: The Intuition
Think of HNSW as a multi-layered graph. The top layers have fewer nodes and long-range connections (like an express train), while the bottom layers have all the nodes and short-range connections (like local streets). Search starts at the top and 'zooms in' as it descends.
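To make that intuition concrete, here is a toy sketch of the layered greedy descent in Python. The graph, vectors, and function names are all hypothetical; real HNSW assigns nodes to layers probabilistically and uses a beam search (of width `efSearch`) rather than pure greedy hops:

```python
def greedy_search_layer(graph, vectors, query, entry, dist):
    """Greedily hop to any neighbor closer to the query; stop at a local minimum."""
    current, improved = entry, True
    while improved:
        improved = False
        for neighbor in graph.get(current, []):
            if dist(vectors[neighbor], query) < dist(vectors[current], query):
                current, improved = neighbor, True
    return current

def hnsw_search(layers, vectors, query, entry, dist):
    """Descend layer by layer: sparse 'express' layers first, dense bottom layer last."""
    for graph in layers:
        entry = greedy_search_layer(graph, vectors, query, entry, dist)
    return entry

# Tiny 1-D example: 8 points on a line, two layers.
vectors = {i: float(i) for i in range(8)}
dist = lambda a, b: abs(a - b)
layers = [
    {0: [4], 4: [0, 7], 7: [4]},                            # top: few nodes, long-range links
    {i: [max(i - 1, 0), min(i + 1, 7)] for i in range(8)},  # bottom: all nodes, local links
]
nearest = hnsw_search(layers, vectors, 5.2, entry=0, dist=dist)
print(nearest)  # → 5
```

The top layer jumps from node 0 straight to node 4 (the 'express train'), and the bottom layer walks the last step to node 5 (the 'local streets').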
# 1. Tuning the Index Parameters
There are three main 'knobs' you can turn in HNSW to optimize your performance:
**M (Max Connections).** The number of bi-directional links created for every new node. A higher M gives a denser graph with better recall and more robust search, at the cost of more memory and slower inserts; values between 16 and 64 are common.

**efConstruction.** The size of the candidate list used while building the index. It controls the build-speed vs. index-quality trade-off: a higher value finds better neighbors for each inserted node but slows construction.

**efSearch.** A query-time parameter: the number of candidate 'neighbors' the search algorithm keeps exploring before stopping. Raising it improves recall and increases latency, and it must be at least as large as the `k` you request.
# 2. Quantization: Saving Memory (and Money)
Embeddings are large. A single 1536-dimensional vector (OpenAI's `text-embedding-3-small`) takes up ~6KB of memory. Scaling to 10 million vectors requires ~60GB of RAM just for the raw vectors. This is where Quantization comes in.
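Quantization trades a little precision for a lot of memory. The simplest form is scalar quantization: map each float32 component to an int8, shrinking every vector 4x. A minimal NumPy sketch (a single global scale factor for clarity; production engines such as FAISS use per-segment or trained quantizers):

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Symmetric scalar quantization: float32 -> int8, 4x smaller."""
    scale = np.abs(vectors).max() / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original floats."""
    return q.astype(np.float32) * scale

# 1,000 OpenAI-sized (1536-d) vectors: ~6 MB as float32, ~1.5 MB as int8.
vecs = np.random.default_rng(0).normal(size=(1000, 1536)).astype(np.float32)
q, scale = quantize_int8(vecs)
print(vecs.nbytes // 1024, "KB ->", q.nbytes // 1024, "KB")  # 6000 KB -> 1500 KB

mean_err = np.abs(dequantize(q, scale) - vecs).mean()  # rounding error <= scale / 2 per value
```

At 10 million vectors this is the difference between ~60GB and ~15GB of RAM, before any index overhead.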
# 3. Filtering Performance: Pre-filtering vs. Post-filtering
Most RAG queries aren't just 'find similar vectors'. They are 'find similar vectors *where user_id = 123*'. There are two strategies: **pre-filtering** applies the metadata filter first and runs the ANN search only over matching vectors, while **post-filtering** runs the ANN search first and discards non-matching results afterwards. With a selective filter, post-filtering can leave you with far fewer than the `k` results you asked for.
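A toy brute-force example makes the difference visible. Here `top_k` stands in for the ANN call, and the `user_ids` metadata and the target user `42` are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)
user_ids = rng.integers(0, 100, size=10_000)  # metadata: ~1% of rows match any one user
query = rng.normal(size=64).astype(np.float32)

def top_k(candidates: np.ndarray, k: int) -> np.ndarray:
    """Exact nearest neighbors by L2 distance over a candidate set (stand-in for ANN)."""
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

# Post-filtering: search first, filter second -> often far fewer than k results.
hits = top_k(np.arange(len(vectors)), k=10)
post = hits[user_ids[hits] == 42]

# Pre-filtering: restrict candidates first -> up to k genuine matches.
pre = top_k(np.where(user_ids == 42)[0], k=10)

print(len(post), len(pre))  # post is usually much smaller than pre (often empty)
```

The common workaround for post-filtering is over-fetching (ask for 10×k and filter down), but that multiplies query cost; pre-filtering avoids the starvation entirely when the engine supports it.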
# 4. SQL Example: Tuning pgvector
If you are using PostgreSQL with `pgvector`, here is how you create and use a tuned HNSW index:
```sql
-- Create the index with specific M and ef_construction
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);

-- At query time, you can dynamically adjust search depth
SET hnsw.ef_search = 100;

-- Efficiently retrieve the top 10 neighbors
SELECT id, content FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```
# 5. Benchmarking Your Specific Distribution
Never trust generic leaderboards. Vector DB performance is highly dependent on your specific 'data distribution'. A database that is fast for short snippets of text might be slow for long technical documents.
I recommend using **ANN-Benchmarks**, an open-source tool that allows you to test different databases and index configurations against your actual production embeddings.
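Even without a benchmarking harness, you can get a quick read on recall yourself: compute exact neighbors by brute force on a sample and compare them with what your index returns. A hedged sketch of recall@10 (random data stands in for your production embeddings, and the ground-truth matrix would normally be compared against your index's output rather than itself):

```python
import numpy as np

def recall_at_k(ground_truth: np.ndarray, approx: np.ndarray) -> float:
    """Fraction of the true top-k neighbors that the ANN index actually returned."""
    hits = sum(len(set(gt) & set(ap)) for gt, ap in zip(ground_truth, approx))
    return hits / ground_truth.size

rng = np.random.default_rng(7)
corpus = rng.normal(size=(2_000, 128)).astype(np.float32)
queries = rng.normal(size=(100, 128)).astype(np.float32)

# Brute-force ground truth: exact top-10 for every query.
dists = np.linalg.norm(corpus[None, :, :] - queries[:, None, :], axis=2)
truth = np.argsort(dists, axis=1)[:, :10]

# Substitute your index's results for `truth` here; comparing truth to itself gives 1.0.
recall = recall_at_k(truth, truth)
print(recall)  # → 1.0
```

Run this with your real embeddings and your real index, sweep `efSearch`, and you have the recall/latency curve that no generic leaderboard can give you.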
# Conclusion: The 95% Rule
Vector search is a trade-off. For most business applications, I aim for **95% recall at 10ms latency**. Squeezing out the last few points of recall, say pushing 95% toward 99%, usually doubles your infrastructure cost for no perceptible benefit to the end user. Focus on your retrieval system's end-to-end quality, not just the nearest-neighbor math.
> "Moving from demo to production requires shifting focus from prompt engineering to system engineering. The magic is in the retrieval loop."
