GadaaLabs
RAG Engineering
Lesson 5

Retrieval Quality

17 min

Dense embedding retrieval alone misses many relevant documents. Keyword-rich queries ("Python ImportError traceback") favour BM25 (sparse retrieval), while paraphrase queries ("why does my import fail") favour dense embeddings. Hybrid search combines both, and cross-encoder re-ranking reorders the combined results for maximum precision.

Retrieval Metrics

| Metric | Definition | Range | Target |
|---|---|---|---|
| Recall@k | % of relevant docs in top-k results | 0–1 | > 0.80 |
| Precision@k | % of top-k results that are relevant | 0–1 | > 0.60 |
| MRR | Mean Reciprocal Rank: position of first relevant result | 0–1 | > 0.70 |
| NDCG@k | Normalised Discounted Cumulative Gain | 0–1 | > 0.65 |
| Answer Faithfulness (RAGAS) | Is the answer supported by retrieved context? | 0–1 | > 0.85 |

Always measure Recall@k first — if relevant documents are not retrieved at all, no amount of re-ranking can fix it.
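The rank-based metrics are simple enough to compute yourself before reaching for an evaluation library. A minimal sketch (the function names `recall_at_k`, `precision_at_k`, and `mrr` are ours, not from any framework):

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant."""
    if k == 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: list[list[int]], all_relevant: list[set[int]]) -> float:
    """Mean over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, idx in enumerate(retrieved, start=1):
            if idx in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

For example, with `retrieved=[3, 1, 7, 2]` and `relevant={1, 2, 5}`, Recall@4 is 2/3 (docs 1 and 2 found, doc 5 missed) and Precision@4 is 2/4.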

BM25 Sparse Retrieval

```python
from rank_bm25 import BM25Okapi
import re

def tokenise(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

# corpus is the list of chunk strings produced by your chunking step
corpus_tokens = [tokenise(chunk) for chunk in corpus]
bm25 = BM25Okapi(corpus_tokens)

def bm25_retrieve(query: str, k: int = 20) -> list[tuple[int, float]]:
    tokens = tokenise(query)
    scores = bm25.get_scores(tokens)
    top_k  = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:k]
    return top_k  # list of (idx, score)
```

BM25 excels on exact-match queries and is very fast (no GPU required). It fails on synonym queries and paraphrases.
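For contrast, the dense side scores each chunk by cosine similarity between embedding vectors. A toy sketch with hand-written 3-d vectors standing in for real model embeddings (only the scoring and top-k logic is meant to carry over; note the `(idx, score)` return shape matches `bm25_retrieve`, so both feed the fusion step below):

```python
import math

# Hypothetical 3-d vectors standing in for real sentence embeddings;
# a production system would get these from an embedding model.
doc_vectors = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.8, 0.2],
    2: [0.0, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dense_retrieve(query_vec: list[float], k: int = 20) -> list[tuple[int, float]]:
    # Score every chunk against the query vector, return the top-k (idx, score).
    scores = [(idx, cosine(query_vec, vec)) for idx, vec in doc_vectors.items()]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]
```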

Hybrid Search with Reciprocal Rank Fusion

```python
def reciprocal_rank_fusion(
    dense_results: list[tuple[int, float]],
    sparse_results: list[tuple[int, float]],
    k: int = 60,
) -> list[tuple[int, float]]:
    """
    RRF score = Σ 1 / (k + rank_i)
    Fuses rankings without needing to normalise scores across systems.
    """
    scores: dict[int, float] = {}

    for rank, (idx, _) in enumerate(dense_results, start=1):
        scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank)

    for rank, (idx, _) in enumerate(sparse_results, start=1):
        scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

RRF is robust to score scale differences between dense and sparse retrievers. The constant k=60 is a standard default; experiment with values 30–100 if recall is suboptimal.
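To see that scale-invariance concretely, here is a toy run: the dense cosine scores live in [0, 1] while the BM25 scores are in the tens, yet RRF looks only at ranks. The function body is repeated (identical to the one above) so the snippet runs on its own; the doc ids and scores are made up for illustration:

```python
def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    # Same logic as the function above, inlined so this snippet is self-contained.
    scores: dict[int, float] = {}
    for results in (dense_results, sparse_results):
        for rank, (idx, _) in enumerate(results, start=1):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

dense  = [(4, 0.91), (2, 0.88), (7, 0.80)]  # cosine similarities in [0, 1]
sparse = [(2, 17.3), (9, 12.1), (4, 9.6)]   # BM25 scores on a different scale
fused = reciprocal_rank_fusion(dense, sparse)
print([idx for idx, _ in fused])  # → [2, 4, 9, 7]
```

Doc 2 (ranked 2nd dense, 1st sparse) edges out doc 4 (1st dense, 3rd sparse) because RRF rewards consistently high ranks across both lists; the raw score magnitudes never enter the computation.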

Cross-Encoder Re-Ranking

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(query: str, candidate_chunks: list[str], top_n: int = 5) -> list[str]:
    pairs  = [(query, chunk) for chunk in candidate_chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Cross-encoders jointly encode the query and each candidate, producing a much more accurate relevance score than bi-encoder cosine similarity. The trade-off is cost: run cross-encoders only on a small candidate set (20–50 items), not the full corpus.
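One way to wire the stages together, enforcing the small-candidate-set rule in code: retrieve widely with both systems, fuse, then hand only the top slice to the cross-encoder. This is a sketch with our own function and parameter names; the retrievers, fusion, and reranker are injected so it works with the snippets above or any drop-in replacements:

```python
def hybrid_rerank_pipeline(
    query: str,
    dense_retrieve,    # query -> list[(idx, score)]
    sparse_retrieve,   # query -> list[(idx, score)]
    fuse,              # (dense_results, sparse_results) -> list[(idx, score)]
    rerank_fn,         # (query, chunks, top_n) -> list[str]
    corpus: list[str],
    n_candidates: int = 30,  # keep the cross-encoder's input in the 20-50 range
    top_n: int = 5,
) -> list[str]:
    fused = fuse(dense_retrieve(query), sparse_retrieve(query))
    candidates = [corpus[idx] for idx, _ in fused[:n_candidates]]
    return rerank_fn(query, candidates, top_n)
```

In practice you would pass `bm25_retrieve`, a dense retriever, `reciprocal_rank_fusion`, and `rerank` from the earlier snippets; the point of the structure is that the expensive `rerank_fn` never sees more than `n_candidates` chunks.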

RAGAS Evaluation Framework

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

eval_dataset = Dataset.from_dict({
    "question":  ["What is the Gadaa system?"],
    "answer":    ["Gadaa is an Oromo democratic institution..."],
    "contexts":  [["The Gadaa system divides Oromo society into eight-year cycles..."]],
    "ground_truth": ["Gadaa is a democratic governance system of the Oromo people."],
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
# → {'faithfulness': 0.92, 'answer_relevancy': 0.87, 'context_recall': 0.80}
```

Run RAGAS on 100–200 representative questions. If context_recall falls below 0.75, retrieval is the bottleneck: look at chunking and embedding quality before tuning the LLM.

Summary

  • Measure Recall@k, Precision@k, and RAGAS context recall before optimising anything else.
  • BM25 sparse retrieval and dense embedding retrieval have complementary strengths; combine them with Reciprocal Rank Fusion.
  • Cross-encoder re-ranking significantly improves precision but is expensive — apply it to a pre-filtered candidate set of 20–50 chunks.
  • Use RAGAS to decompose end-to-end RAG quality into faithfulness (hallucination rate), answer relevancy, and context recall.
  • If context_recall < 0.75, fix retrieval before tuning the LLM — generation quality is bounded by retrieval quality.