Vector DB Tuning: Scaling to Millions of Embeddings with HNSW
Vector DB · Performance · PostgreSQL


Apr 15, 2026 · 14 min read

When you hit 1 million embeddings, brute-force flat search becomes too slow to serve queries interactively. You need an Approximate Nearest Neighbor (ANN) index. The gold standard for this is HNSW (Hierarchical Navigable Small World).

In this guide, we will look at the math and engineering behind scaling vector databases for production AI.

# How HNSW Works: The Intuition

Think of HNSW as a multi-layered graph. The top layers have fewer nodes and long-range connections (like an express train), while the bottom layers have all the nodes and short-range connections (like local streets). Search starts at the top and 'zooms in' as it descends.
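As a rough sketch of the math behind that shape: in the original HNSW paper, each inserted node draws its maximum layer from an exponentially decaying distribution, so each higher layer holds exponentially fewer nodes and a search descends in roughly logarithmic time. The assignment rule (with $m_L$ a normalization constant, commonly set to $1/\ln(M)$) is:

$$l = \lfloor -\ln(u) \cdot m_L \rfloor, \quad u \sim \mathrm{Uniform}(0, 1)$$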

# 1. Tuning the Index Parameters

There are three main 'knobs' you can turn in HNSW to optimize your performance:

M (Max Connections)

This defines the number of bi-directional links created for every new node.

  • **Higher M:** Better recall, but more memory usage and slower index build times (see the size-check sketch at the end of this section).
  • **Recommended:** 16 to 32 for standard RAG; 64+ for complex, high-dimensional datasets.
efConstruction

This controls the trade-off between index build speed and graph quality.

  • **Higher efConstruction:** Slower builds, but a much more accurate graph. This is only relevant when you are *inserting* data.
  • **Recommended:** 200 to 400.
efSearch

This is a query-time parameter. It defines how many 'neighbors' the search algorithm explores before stopping.

  • **Higher efSearch:** Slower queries, but higher recall (less chance of missing the true nearest neighbor).
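To make the memory cost of M concrete, you can build two indexes with different `m` values over the same column and compare their on-disk footprint. A minimal pgvector sketch, using the `items` table from the example in section 4 (the index names here are hypothetical):

```sql
-- Two HNSW graphs over the same column, differing only in m.
CREATE INDEX items_hnsw_m16 ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

CREATE INDEX items_hnsw_m64 ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 64, ef_construction = 200);

-- Compare on-disk size; the m = 64 graph stores far more links per node.
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE indexrelname LIKE 'items_hnsw_m%';
```

In practice you would keep only one of these; building both is just a way to measure the trade-off on your own data before committing.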
# 2. Quantization: Saving Memory (and Money)

Embeddings are large. A single 1536-dimensional float32 vector (OpenAI's `text-embedding-3-small`) takes up ~6KB of memory. Scaling to 10 million vectors requires ~60GB of RAM just for the raw vectors. This is where quantization comes in.

  • **Scalar Quantization (SQ):** Converts 32-bit floats into 8-bit integers. This reduces memory usage by 4x with minimal impact on recall (a pgvector sketch follows this list).
  • **Product Quantization (PQ):** Compresses the vector by breaking it into sub-vectors and mapping each one to an entry in a learned codebook. This can reduce memory by 10x-20x but has a higher impact on accuracy.
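pgvector does not expose 8-bit scalar quantization directly, but its `halfvec` type (pgvector 0.7.0+) applies the same idea at 16 bits, halving memory for both storage and the index. A minimal sketch, assuming the `items` table from section 4 holds 1536-dimensional embeddings:

```sql
-- Index over a half-precision (16-bit) cast of the embedding column.
CREATE INDEX ON items USING hnsw
    ((embedding::halfvec(1536)) halfvec_cosine_ops);

-- Query through the same cast so the planner can use the index.
SELECT id, content FROM items
ORDER BY embedding::halfvec(1536) <=> '[0.1, 0.2, ...]'
LIMIT 10;
```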
# 3. Filtering Performance: Pre-filtering vs. Post-filtering

Most RAG queries aren't just 'find similar vectors'. They are 'find similar vectors *where user_id = 123*'.

  • **Post-filtering:** You find the top 100 similar vectors and then filter out the ones that don't match the metadata. If only 2 of the 100 match, you return only 2 results. This is often poor UX.
  • **Pre-filtering:** The index itself understands the metadata and only searches within the subset of vectors that match. Modern databases like Milvus and Pinecone do this using 'Filtered HNSW' (a pgvector approximation is sketched after this list).
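Plain pgvector has no filtered HNSW, but a partial index approximates pre-filtering for predicates you know in advance: the graph is built over only the matching rows. A sketch, assuming a hypothetical `user_id` column on the `items` table:

```sql
-- The HNSW graph contains only this tenant's rows, so the filter is
-- applied before the graph search rather than after it.
CREATE INDEX items_user_123_hnsw ON items
USING hnsw (embedding vector_cosine_ops)
WHERE (user_id = 123);

-- The planner can use the partial index when the predicate matches.
SELECT id, content FROM items
WHERE user_id = 123
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```

This only scales to a handful of known predicates; for high-cardinality filters you still need post-filtering or a database with native filtered search.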
# 4. SQL Example: Tuning pgvector

If you are using PostgreSQL with `pgvector`, here is how you create and use a tuned HNSW index:

```sql
-- Create the index with specific m and ef_construction values
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);

-- At query time, you can dynamically adjust search depth
SET hnsw.ef_search = 100;

-- Efficiently retrieve the top 10 neighbors
SELECT id, content FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```
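One gotcha worth checking: if the planner decides not to use the index, the query silently falls back to an exact sequential scan with very different latency. A quick sanity check:

```sql
-- Confirm the HNSW index is used: the plan should show an Index Scan
-- on the hnsw index rather than a Seq Scan followed by a Sort.
EXPLAIN ANALYZE
SELECT id FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;
```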

# 5. Benchmarking Your Specific Distribution

Never trust generic leaderboards. Vector DB performance is highly dependent on your specific data distribution: a database that is fast for short text snippets might be slow for long technical documents.

I recommend using **ANN-Benchmarks**, an open-source harness for comparing ANN libraries and index configurations; you can plug in your actual production embeddings as a custom dataset.
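You can also sanity-check recall directly in Postgres, with no external tooling, by comparing the ANN answer against an exact scan for a sample of queries. A minimal sketch:

```sql
-- Exact top 10: disable index scans so Postgres does a full, exact scan.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, ...]' LIMIT 10;
COMMIT;

-- Approximate top 10 at the current hnsw.ef_search setting.
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, ...]' LIMIT 10;

-- Recall@10 = (number of ids appearing in both result sets) / 10.
```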

# Conclusion: The 95% Rule

Vector search is a trade-off. For 90% of business applications, I aim for **95% recall at 10ms latency**. Trying to squeeze out that last 4% of recall usually doubles your infrastructure cost for no perceptible benefit to the end user. Focus on your retrieval system's end-to-end quality, not just the nearest-neighbor math.

Key Takeaway

"Moving from demo to production requires shifting focus from prompt engineering to system engineering. The magic is in the retrieval loop."

Jayasoruban R

AI Full Stack Engineer · Chennai, India