Vector Databases — HNSW, IVF, Filtering & Production Ops
Why Exact k-NN Does Not Scale
The simplest possible vector search is exact k-nearest-neighbour: iterate over every vector in the database, compute the distance to the query vector, and return the k smallest. This is O(n) per query, and for large collections it is effectively bound by memory bandwidth. At one million 1536-dimensional float32 vectors the raw data occupies roughly 6 GB (1,000,000 × 1536 × 4 bytes), and each query touches all of it.
A query for "top-5 similar documents" at one million vectors with a brute-force scan takes roughly 400 ms on a modern CPU. At ten million vectors it is four seconds. Neither is acceptable for a synchronous API.
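The brute-force baseline is trivial to implement, which is why it remains the right choice for small collections. A minimal pure-Python sketch (toy 2-dimensional vectors for clarity; real code would vectorise with NumPy or BLAS):

```python
import heapq

def knn_exact(query, vectors, k):
    """Exact k-NN by scanning every vector: O(n * d) per query."""
    def dist(a, b):
        # Squared Euclidean distance; monotonic with L2, so the ranking is identical.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # heapq.nsmallest keeps only k candidates in memory during the scan.
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: dist(query, vectors[i]))

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]]
print(knn_exact([0.0, 0.1], vectors, k=2))  # → [0, 2]
```

Every ANN structure in this section exists to avoid that full scan while returning (almost) the same top-k.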
Approximate Nearest Neighbour (ANN) algorithms trade a small amount of recall for orders-of-magnitude faster search. The two dominant families are graph-based (HNSW) and cluster-based (IVF).
HNSW — Hierarchical Navigable Small World
HNSW builds a multi-layer directed graph. Each vector becomes a node. At the top layer only a small fraction of nodes are present; each successive layer has more nodes. At the bottom layer (layer 0) every node appears.
When inserting a new vector:
- Randomly assign the node to a maximum layer using an exponential distribution — most nodes stay at layer 0, a few reach layer 1, fewer still reach layer 2, and so on.
- Starting from a fixed entry point at the top layer, greedily descend toward the new vector's position; at layers above the node's assigned maximum layer, only the single best candidate is carried down.
- From the node's maximum layer down to layer 0, run a beam search of width ef_construction at each layer to collect candidate neighbours.
- At each of those layers, add bidirectional edges to the M nearest candidates, pruning edges as needed to keep node degree bounded.
At query time, navigation starts at the top layer, greedily moves toward the query vector, then descends to layer 0 where it expands a beam of ef_search candidates before returning the top-k.
Key Parameters
M (default 16): the number of bidirectional edges per node. Higher M → more accurate, higher memory, slower build. The index memory footprint is approximately n × M × 2 × 4 bytes. At M=16 and one million vectors: ~128 MB overhead on top of the raw vectors.
ef_construction (default 200): the beam width during graph construction. Higher → more accurate graph structure, slower inserts. This is a build-time-only parameter.
ef_search (default 50): the beam width at query time. This is your primary knob for trading recall against latency at runtime. Increasing ef_search from 50 to 200 typically raises Recall@10 from ~95% to ~99% at the cost of 2–3× higher latency.
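These three knobs map directly onto the hnswlib API. The sketch below is illustrative rather than a tuned configuration (the corpus size and random data are made up; parameter names match hnswlib):

```python
import hnswlib
import numpy as np

dim = 1536
data = np.random.random((10_000, dim)).astype(np.float32)  # illustrative corpus

index = hnswlib.Index(space="cosine", dim=dim)
# M and ef_construction are fixed at build time.
index.init_index(max_elements=len(data), M=16, ef_construction=200)
index.add_items(data, np.arange(len(data)))

# ef_search is the runtime knob: raise it to trade latency for recall.
index.set_ef(128)
labels, distances = index.knn_query(data[:1], k=10)
```

Note that changing M or ef_construction later requires a full rebuild, while set_ef can be adjusted per deployment or even per query.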
IVF — Inverted File Index
IVF partitions the vector space into nlist clusters using k-means. Each vector is assigned to its nearest cluster centroid. To search, IVF examines only the nprobe nearest clusters rather than all clusters, dramatically reducing the number of distance computations.
nlist (number of clusters): rule of thumb is sqrt(n). For one million vectors: 1000 clusters. More clusters → finer partitioning → faster search but more memory for centroids.
nprobe (clusters to search): typical range is 10–50. Higher nprobe → higher recall → higher latency. At nprobe = nlist you get exact search.
IVF is typically built on top of quantised vectors for even further memory reduction. The canonical combination is IVF + PQ (Product Quantisation), written as IVFPQ in FAISS.
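In FAISS terms, the same two knobs appear as follows. This is a sketch, not a production configuration: dimensions, nlist, and the PQ sub-vector count are illustrative, and both indexes must be trained before use.

```python
import faiss
import numpy as np

d, nlist = 128, 100                      # illustrative; use nlist ≈ sqrt(n) in production
xb = np.random.random((10_000, d)).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)         # coarse quantizer holding the k-means centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                          # runs k-means over the corpus
index.add(xb)

index.nprobe = 32                        # clusters examined per query
D, I = index.search(xb[:1], 10)

# Memory-constrained variant: IVF + PQ with 16 sub-vectors of 8 bits each
# (also needs train/add before searching).
index_pq = faiss.IndexIVFPQ(faiss.IndexFlatL2(d), d, nlist, 16, 8)
```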
Recall-Latency-Cost Comparison
| Algorithm | Recall@10 | Query Latency (1M vecs) | Memory | Best For |
|-----------|-----------|------------------------|--------|----------|
| Brute Force | 100% | ~400 ms | 6 GB | <50K vectors |
| HNSW M=16, ef=64 | ~95% | 2–5 ms | 6.1 GB | General purpose |
| HNSW M=32, ef=128 | ~99% | 8–15 ms | 6.2 GB | High-recall needs |
| IVF nprobe=32 | ~92% | 1–3 ms | 6 GB | Batch workloads |
| IVFPQ | ~85% | 1–2 ms | 200 MB | Memory-constrained |
Quantisation — Reducing Memory Without Killing Recall
Scalar Quantisation (SQ8)
SQ8 compresses each float32 component to a uint8 integer by finding the min and max in each dimension and linearly mapping to [0, 255]. Four bytes become one byte — 4× compression. Recall loss at Recall@10 is typically under 1% because the relative ordering of distances is largely preserved by the linear mapping.
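A minimal sketch of the idea in pure Python (per-dimension min/max, linear map to 0–255; real implementations vectorise this and handle streaming statistics):

```python
def sq8_train(vectors):
    """Per-dimension min and max over the corpus."""
    dims = list(zip(*vectors))
    return [min(col) for col in dims], [max(col) for col in dims]

def sq8_encode(vec, lo, hi):
    """Map each float32 component linearly onto a uint8 in [0, 255]."""
    return bytes(
        round(255 * (x - l) / (h - l)) if h > l else 0
        for x, l, h in zip(vec, lo, hi)
    )

def sq8_decode(code, lo, hi):
    """Approximate reconstruction: the centre of each quantisation bucket."""
    return [l + (b / 255) * (h - l) for b, l, h in zip(code, lo, hi)]

corpus = [[0.0, 10.0], [1.0, 20.0], [0.5, 15.0]]
lo, hi = sq8_train(corpus)
code = sq8_encode([0.5, 15.0], lo, hi)
print(len(code))  # → 2 — one byte per dimension instead of four
```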
Product Quantisation (PQ)
PQ divides each d-dimensional vector into M sub-vectors of dimension d/M. Each sub-vector is independently quantised against a codebook of k* centroids (typically k* = 256). Each sub-vector is then stored as a single byte — its codebook index.
For a 1536-dimensional vector with M=48 sub-spaces: 1536/48 = 32 dimensions per sub-vector, each stored as one byte. Total storage: 48 bytes per vector instead of 6144 bytes — a 128× compression. The cost is higher recall loss (typically 5–15% at Recall@10) due to the coarser approximation.
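Given pre-trained codebooks (training is one k-means run per sub-space and is omitted here), encoding reduces to a nearest-centroid lookup per sub-vector. A toy sketch with made-up codebooks of two centroids each, where real PQ uses 256:

```python
def pq_encode(vec, codebooks):
    """codebooks[i] is the list of centroids for sub-space i.
    Returns one codebook index (one byte) per sub-vector."""
    m = len(codebooks)
    sub_dim = len(vec) // m
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * sub_dim:(i + 1) * sub_dim]
        # Nearest centroid by squared Euclidean distance.
        codes.append(min(range(len(book)),
                         key=lambda j: sum((a - b) ** 2 for a, b in zip(sub, book[j]))))
    return bytes(codes)

# Two sub-spaces of dimension 2, two centroids per codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
print(pq_encode([0.9, 1.1, 0.1, 0.9], codebooks))  # → b'\x01\x00'
```

Search over PQ codes then uses precomputed distance tables per codebook rather than decoding, which is where the speed comes from.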
Filtered Search — The Hard Problem
Filtered vector search means "find the k nearest neighbours to this query vector, but only among documents where category == 'legal'". This sounds simple but has sharp failure modes.
Pre-filter (Dangerous)
Restrict the candidate set to matching documents before ANN search, then search only within that subset. Problem: if the filter is selective (say 500 matches out of 1M) and the pre-filter is applied as a mask over the HNSW graph, traversal gets stranded: the surviving nodes are too sparsely connected to route to k=10 good results, and recall collapses. Pre-filter works only when the filter matches at least ~10% of the corpus.
Post-filter (Wasteful)
Run ANN search without filter, return top-100, then apply filter. If the filter discards 95% of results, you may have fewer than k=5 results after filtering — and you wasted the ANN compute.
Filtered Graph Traversal (Best Approach)
Filter predicates are evaluated during graph traversal itself. Only nodes that pass the filter are considered as valid neighbours. Qdrant implements this natively. The graph exploration path adapts to the filtered subset, maintaining high recall even with selective filters.
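A sketch of such a filtered query with qdrant-client (the collection name, payload field, and zero-filled query vector are illustrative; newer client versions also expose this via query_points):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
query_embedding = [0.0] * 1536  # stand-in for a real query embedding

hits = client.search(
    collection_name="docs",               # illustrative collection name
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[models.FieldCondition(
            key="category",
            match=models.MatchValue(value="legal"),
        )]
    ),
    limit=10,
)
```

Because the filter is part of the search call itself, Qdrant evaluates it during graph traversal rather than before or after it.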
Qdrant in Practice
Qdrant is the recommended production vector database for most RAG applications: it supports filtered HNSW natively, runs as a single binary or in Kubernetes, and has a clean Python client.
Weaviate — Hybrid Search Built-In
Weaviate ships with native BM25 + vector hybrid search. You do not need a separate keyword search engine.
Pinecone — Serverless vs Pod Architecture
Pinecone Serverless stores vectors on object storage and loads index pages on demand. It is cost-effective for sporadic workloads but has higher cold-start latency (~100–300 ms). Pod-based deployments keep the entire index in memory for consistent sub-10ms latency. For production RAG with >100 QPS, choose pods.
Multi-Tenancy Strategies
Per-tenant collections: each tenant gets its own Qdrant collection. Strong isolation — a bug in one tenant's query cannot leak into another's. Cost scales linearly with tenant count because each collection has its own HNSW graph overhead. Best for <100 tenants with large per-tenant data volumes.
Metadata filter: all tenants share one collection, each chunk carries a tenant_id payload field, all queries include a mandatory tenant_id filter. More cost-effective for hundreds or thousands of tenants. Risk: a missing filter in application code leaks cross-tenant data. Mitigate by wrapping all search calls in a TenantSearchClient that always injects the filter.
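One way to sketch that wrapper, kept database-agnostic: the backend object and the dict-shaped filter below are assumptions for illustration (with Qdrant the filter would be a models.Filter and the backend a QdrantClient):

```python
class TenantSearchClient:
    """Wraps a vector-search backend so every query carries the tenant filter.

    The backend is anything exposing search(query_vector, query_filter, limit);
    filters are modelled as {"must": [...]} dicts for this sketch.
    """

    def __init__(self, backend, tenant_id):
        self._backend = backend
        self._tenant_id = tenant_id

    def search(self, query_vector, query_filter=None, limit=10):
        # The tenant clause is always injected; caller filters are merged in.
        tenant_clause = {"key": "tenant_id", "match": {"value": self._tenant_id}}
        merged = {"must": [tenant_clause]}
        if query_filter:
            merged["must"].extend(query_filter.get("must", []))
        return self._backend.search(query_vector=query_vector,
                                    query_filter=merged, limit=limit)
```

Application code only ever sees a TenantSearchClient, so forgetting the filter becomes impossible rather than merely discouraged.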
Index Maintenance
When to re-index: whenever you change the embedding model. The new model produces vectors in a different geometric space — old vectors are incompatible with new queries. Strategy: build a new collection with the new model in parallel, run a canary test, cut over, delete the old collection.
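Qdrant's collection aliases make the cut-over step atomic from the application's point of view: queries target the alias, and the alias is repointed once the new collection passes the canary. A sketch with qdrant-client (collection names are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Repoint the alias the application queries at the newly built collection.
client.update_collection_aliases(
    change_aliases_operations=[
        models.DeleteAliasOperation(
            delete_alias=models.DeleteAlias(alias_name="docs")
        ),
        models.CreateAliasOperation(
            create_alias=models.CreateAlias(
                collection_name="docs_v2",   # collection built with the new model
                alias_name="docs",
            )
        ),
    ]
)
```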
Segment merging: Qdrant and similar databases organise data in segments. As you insert and delete, small segments accumulate and degrade query performance. Most databases trigger automatic background merging; in Qdrant you can also steer it explicitly through the collection's optimizer configuration.
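For example, via Qdrant's REST API (the optimizers_config field names below are an assumption; verify them against your Qdrant version's documentation before relying on them):

```shell
# Steer the background optimizer: target fewer, larger segments
# and vacuum tombstones sooner after bulk deletes.
curl -X PATCH 'http://localhost:6333/collections/docs' \
  -H 'Content-Type: application/json' \
  -d '{"optimizers_config": {"default_segment_number": 4, "deleted_threshold": 0.2}}'
```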
Tombstones and deletes: deleting a point marks it as a tombstone. It is still visited during graph traversal but filtered from results. High delete rates degrade recall because the graph still routes through tombstoned nodes. Compact the collection periodically after bulk deletes.
Key Takeaways
- Exact k-NN is O(n) and becomes unacceptable past ~50K vectors at interactive latency; use ANN algorithms for production.
- HNSW is the general-purpose choice: tune M for the recall-memory tradeoff, tune ef_search at runtime to balance recall and latency.
- IVF works well for batch workloads; choose nprobe = 10–50 and nlist ≈ sqrt(n).
- SQ8 cuts memory 4× with under 1% recall loss; PQ cuts it 32–128× but with meaningful recall cost — use PQ only when memory is the binding constraint.
- Pre-filtering is dangerous with selective filters; use a vector database that supports filtered graph traversal (Qdrant, Weaviate) to preserve recall.
- For multi-tenancy, wrap all search calls in a client that enforces the tenant filter — never trust callers to inject it themselves.
- Re-index from scratch whenever you change embedding models; the vector space is incompatible across models.
- Monitor segment health and tombstone accumulation; trigger compaction after bulk deletes.