Retrieval Quality
Dense embedding retrieval alone misses many relevant documents. Keyword-rich queries ("Python ImportError traceback") favour BM25 (sparse retrieval), while paraphrase queries ("why does my import fail") favour dense embeddings. Hybrid search combines both, and cross-encoder re-ranking reorders the combined results for maximum precision.
Retrieval Metrics
| Metric | Definition | Range | Target |
|---|---|---|---|
| Recall@k | % of relevant docs in top-k results | 0–1 | > 0.80 |
| Precision@k | % of top-k results that are relevant | 0–1 | > 0.60 |
| MRR | Mean Reciprocal Rank — position of first relevant result | 0–1 | > 0.70 |
| NDCG@k | Normalised Discounted Cumulative Gain | 0–1 | > 0.65 |
| Answer Faithfulness (RAGAS) | Is the answer supported by retrieved context? | 0–1 | > 0.85 |
Always measure Recall@k first — if relevant documents are not retrieved at all, no amount of re-ranking can fix it.
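The rank-based metrics above are straightforward to compute yourself. A minimal sketch (function names are illustrative, not from any particular library):

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(set(relevant) & top_k) / len(set(relevant))

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    rel = set(relevant)
    return sum(1 for doc in retrieved[:k] if doc in rel) / k

def reciprocal_rank(relevant, retrieved):
    """1/rank of the first relevant result; 0.0 if none was retrieved.
    Average this over a query set to get MRR."""
    rel = set(relevant)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in rel:
            return 1.0 / rank
    return 0.0
```

Run these over a labelled evaluation set before touching re-ranking: a low Recall@k with acceptable Precision@k points at the retriever, not the ranker.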
BM25 Sparse Retrieval
BM25 excels on exact-match queries and is very fast (no GPU required). It fails on synonym queries and paraphrases.
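In production you would use a library (e.g. `rank_bm25` or the BM25 built into Elasticsearch/OpenSearch), but the Okapi BM25 formula is small enough to sketch in pure Python, which makes its exact-match behaviour easy to see:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of the query against every tokenised document.
    k1 controls term-frequency saturation; b controls length normalisation."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    df = Counter()                        # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue                  # no exact match -> no contribution
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Note the failure mode in the inner loop: a synonym or paraphrase of a query term contributes exactly zero, which is why BM25 needs a dense counterpart.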
Hybrid Search with Reciprocal Rank Fusion
RRF is robust to score scale differences between dense and sparse retrievers. The constant k=60 is a standard default; experiment with values 30–100 if recall is suboptimal.
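RRF ignores raw scores entirely and combines only rank positions, which is what makes it robust to scale differences. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (each ordered best-first).
    Each list contributes 1 / (k + rank) per document; documents that
    rank highly in multiple lists accumulate the largest fused scores."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```

Because only ranks matter, you can fuse a cosine-similarity list (scores near 0–1) with a BM25 list (unbounded scores) without any normalisation step.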
Cross-Encoder Re-Ranking
Cross-encoders jointly encode the query and each candidate, producing a much more accurate relevance score than bi-encoder cosine similarity. The trade-off is cost: run cross-encoders only on a small candidate set (20–50 items), not the full corpus.
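The re-ranking stage itself is just "score each candidate with the expensive model, then sort". A sketch of that pattern with a pluggable scorer (`score_fn` stands in for something like a sentence-transformers `CrossEncoder.predict`; the toy overlap scorer below is purely illustrative):

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Re-rank a small pre-filtered candidate set with a (query, doc)
    relevance scorer and return the top_n documents, best first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_overlap_scorer(query, doc):
    """Stand-in scorer for demonstration only: shared-token count."""
    return len(set(query.split()) & set(doc.split()))
```

The key cost control is in the caller: `candidates` should be the 20–50 chunks surviving hybrid retrieval, never the full corpus, since the cross-encoder runs one forward pass per (query, candidate) pair.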
RAGAS Evaluation Framework
Run RAGAS on 100–200 representative questions. A context_recall below 0.75 means your retrieval is the bottleneck; in that case, look at chunking and embedding quality before tuning the LLM.
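RAGAS computes context_recall with an LLM judge that checks whether each ground-truth statement is supported by the retrieved context. Purely to illustrate the shape of the metric (not a substitute for RAGAS), here is a crude lexical proxy; the function name and 0.5 overlap threshold are assumptions of this sketch:

```python
def context_recall_proxy(ground_truth_statements, retrieved_contexts, threshold=0.5):
    """Crude lexical stand-in for RAGAS context_recall: the fraction of
    ground-truth statements whose tokens mostly appear in at least one
    retrieved chunk. Real RAGAS uses an LLM judge, not token overlap."""
    def supported(statement):
        tokens = set(statement.lower().split())
        for chunk in retrieved_contexts:
            chunk_tokens = set(chunk.lower().split())
            if tokens and len(tokens & chunk_tokens) / len(tokens) >= threshold:
                return True
        return False

    if not ground_truth_statements:
        return 0.0
    return sum(supported(s) for s in ground_truth_statements) / len(ground_truth_statements)
```

Even this toy version makes the diagnostic point: if the statements needed for the answer never appear in the retrieved chunks, no generation-side tuning can recover them.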
Summary
- Measure Recall@k, Precision@k, and RAGAS context recall before optimising anything else.
- BM25 sparse retrieval and dense embedding retrieval have complementary strengths; combine them with Reciprocal Rank Fusion.
- Cross-encoder re-ranking significantly improves precision but is expensive — apply it to a pre-filtered candidate set of 20–50 chunks.
- Use RAGAS to decompose end-to-end RAG quality into faithfulness (hallucination rate), answer relevancy, and context recall.
- If context_recall < 0.75, fix retrieval before tuning the LLM — generation quality is bounded by retrieval quality.