Evaluating RAG output is harder than evaluating classification or regression. There is no single correct answer to "What causes Redis connection timeouts?" — multiple valid answers exist, phrased differently, with varying levels of detail. Traditional accuracy metrics do not apply.
The challenge has three dimensions:
Reference-free evaluation: you often cannot write a ground-truth answer for every production query. You need metrics that do not require a reference answer.
Multi-component failure modes: the answer can fail because the retrieved context was wrong, because the LLM ignored the context, or because the answer doesn't actually address the question. Each failure requires a different metric.
Evaluator quality: LLM-as-judge prompts are themselves imperfect — they can disagree with each other, be sensitive to prompt wording, and produce inconsistent scores.
The answer is not to wait for a perfect evaluation method. It is to address all three challenges at once with a small set of complementary, mostly reference-free metrics: the RAG evaluation triad.
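The evaluator-quality problem can itself be measured: run the same judge several times on the same input and see how much its scores move. A minimal sketch, assuming you already have the scores from repeated judge calls (the `judge_consistency` helper and the rule-of-thumb thresholds in the comment are illustrative, not from any library):

```python
import statistics

def judge_consistency(scores: list[int]) -> dict:
    """
    Summarize how stable an LLM judge is across repeated runs
    on the same (query, context, answer) triple.
    """
    mode = statistics.mode(scores)
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        # fraction of runs that agree with the most common score
        "agreement": scores.count(mode) / len(scores),
    }

# Five repeated judge calls on the same input, scored 1-5
print(judge_consistency([4, 4, 5, 4, 3]))
# A stdev much above ~0.5, or agreement below ~0.8, suggests the
# judge prompt needs tightening before you trust its absolute scores.
```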
The RAG Evaluation Triad
Context Relevance
Is the retrieved context relevant to the user's query? This catches retrieval failures: the system returned documents, but they don't contain the information needed to answer the question.
Faithfulness (Groundedness)
Does the generated answer use only information present in the retrieved context? This catches hallucinations: the LLM generated plausible-sounding statements that are not supported by the provided documents.
Answer Relevance
Does the answer actually address the user's question? This catches non-answers: the LLM found relevant context and stayed grounded, but still gave a vague or tangential response.
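The value of the triad is that the three scores together localize the failure. A minimal sketch of that decision logic, checked in the order retrieval → grounding → relevance (the `diagnose` helper and the 0.6 threshold are illustrative assumptions, not part of any framework):

```python
def diagnose(context_relevance: float, faithfulness: float,
             answer_relevance: float, threshold: float = 0.6) -> str:
    """Map triad scores (normalized to 0-1) to the most likely failure mode."""
    if context_relevance < threshold:
        return "retrieval failure: context lacks the needed information"
    if faithfulness < threshold:
        return "hallucination: answer makes claims unsupported by the context"
    if answer_relevance < threshold:
        return "non-answer: grounded but tangential to the question"
    return "ok"

# Good retrieval, grounded answer, but it dodged the question:
print(diagnose(0.9, 0.9, 0.3))
```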
Implementing LLM-as-Judge Functions
```python
import json

from groq import Groq

client = Groq()

CONTEXT_RELEVANCE_PROMPT = """You are evaluating a RAG system. Given a user query and retrieved context,
assess whether the context contains the information needed to answer the query.

USER QUERY: {query}

RETRIEVED CONTEXT:
{context}

Score the context relevance on a scale of 1-5:
1 = Context is completely irrelevant — does not relate to the query at all
2 = Context is marginally related but lacks the specific information needed
3 = Context partially addresses the query
4 = Context mostly addresses the query with minor gaps
5 = Context fully contains the information needed to answer the query

Respond with a JSON object: {{"score": <int>, "reasoning": "<one sentence>"}}
"""

FAITHFULNESS_PROMPT = """You are checking whether an AI-generated answer is faithful to its source context.
A faithful answer contains ONLY claims that are directly supported by the provided context.

USER QUERY: {query}

RETRIEVED CONTEXT:
{context}

AI ANSWER:
{answer}

For each claim in the answer, check whether it is entailed by the context.
Then give an overall faithfulness score:
1 = Answer contains major unsupported claims (hallucinations)
2 = Answer has several claims not in the context
3 = Answer has minor unsupported details
4 = Answer is mostly faithful with trivial additions
5 = Every claim in the answer is directly supported by the context

Respond with JSON: {{"score": <int>, "unsupported_claims": [<strings>], "reasoning": "<one sentence>"}}
"""

ANSWER_RELEVANCE_PROMPT = """You are evaluating whether an AI answer addresses the user's question.

USER QUERY: {query}

AI ANSWER:
{answer}

Score the answer relevance on a scale of 1-5:
1 = Answer completely ignores the question
2 = Answer is tangentially related but misses the point
3 = Answer partially addresses the question
4 = Answer mostly addresses the question with minor gaps
5 = Answer directly and completely addresses the question

Respond with JSON: {{"score": <int>, "reasoning": "<one sentence>"}}
"""

def evaluate_context_relevance(query: str, context: str) -> dict:
    """Score whether the retrieved context is relevant to the query."""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{
            "role": "user",
            "content": CONTEXT_RELEVANCE_PROMPT.format(query=query, context=context[:3000]),
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

def evaluate_faithfulness(query: str, context: str, answer: str) -> dict:
    """Score whether the answer is grounded in the provided context."""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_PROMPT.format(
                query=query, context=context[:3000], answer=answer
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

def evaluate_answer_relevance(query: str, answer: str) -> dict:
    """Score whether the answer addresses the user's question."""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{
            "role": "user",
            "content": ANSWER_RELEVANCE_PROMPT.format(query=query, answer=answer),
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
```
Hallucination Detection with NLI
Natural Language Inference (NLI) models classify whether one sentence entails, contradicts, or is neutral to another. You can use an NLI model to check each sentence in the generated answer against the retrieved context — a claim-level faithfulness check that does not require an LLM call.
```python
import re

from transformers import pipeline

# facebook/bart-large-mnli is a strong ~400 MB NLI model
nli_pipeline = pipeline(
    "text-classification",
    model="facebook/bart-large-mnli",
    device=-1,  # -1 = CPU; set to 0 for GPU
)

def check_claim_entailment(claim: str, context: str) -> dict:
    """
    Check if a single claim is entailed by the context using NLI.
    Returns entailment probability and label.
    """
    # The text-classification pipeline accepts a (premise, hypothesis) pair;
    # the model outputs scores for contradiction, neutral, and entailment.
    scores = nli_pipeline(
        {"text": context[:512], "text_pair": claim},
        top_k=None,  # return scores for all three labels
    )
    by_label = {s["label"].lower(): s["score"] for s in scores}
    return {
        "claim": claim,
        "entailment_score": by_label.get("entailment", 0.0),
        "label": max(by_label, key=by_label.get),
    }

def split_into_claims(answer: str) -> list[str]:
    """Split an answer into individual claim sentences."""
    # Simple sentence splitter — use spacy for production
    sentences = re.split(r'(?<=[.!?])\s+', answer.strip())
    return [s for s in sentences if len(s.split()) > 4]

def nli_faithfulness_score(answer: str, context: str) -> dict:
    """
    Compute faithfulness by checking NLI entailment for each sentence.
    Returns per-claim results and an overall fraction of supported claims.
    """
    claims = split_into_claims(answer)
    results = [check_claim_entailment(claim, context) for claim in claims]
    supported = sum(1 for r in results if "entail" in r["label"])
    faithfulness = supported / len(results) if results else 1.0
    return {
        "faithfulness": faithfulness,
        "supported_claims": supported,
        "total_claims": len(results),
        "claim_results": results,
    }
```
RAGAS — Framework-Level Evaluation
RAGAS is the de facto standard evaluation framework for RAG pipelines. It computes faithfulness, answer relevancy, context precision, and context recall from a dataset of (question, answer, contexts, ground_truth) tuples.
```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

def run_ragas_evaluation(eval_data: list[dict]) -> dict:
    """
    Run RAGAS evaluation on a batch of RAG outputs.

    Each item in eval_data must have:
    - question: str
    - answer: str (the generated answer)
    - contexts: list[str] (the retrieved chunks)
    - ground_truth: str (reference answer — required for context_recall)
    """
    dataset = Dataset.from_list(eval_data)
    results = evaluate(
        dataset,
        metrics=[
            faithfulness,        # are all answer claims supported by context?
            answer_relevancy,    # does the answer address the question?
            context_precision,   # are the retrieved contexts relevant?
            context_recall,      # does context cover the ground truth?
        ],
    )
    # results supports dict-style access with metric names as keys
    return {
        "faithfulness": results["faithfulness"],
        "answer_relevancy": results["answer_relevancy"],
        "context_precision": results["context_precision"],
        "context_recall": results["context_recall"],
    }

# Example usage
sample_eval_data = [
    {
        "question": "What is HNSW?",
        "answer": "HNSW is a graph-based ANN algorithm that builds a hierarchical navigable small world graph for fast approximate nearest neighbour search.",
        "contexts": ["HNSW (Hierarchical Navigable Small World) is a graph-based index that builds a multi-layer structure for approximate nearest neighbour search with sub-linear query time."],
        "ground_truth": "HNSW is a hierarchical graph structure for approximate nearest neighbour search.",
    }
]
```
The Evaluation Harness
Combine all evaluation approaches into a production harness:
```python
import datetime
import sqlite3
from dataclasses import dataclass

import pandas as pd

@dataclass
class EvalRecord:
    query: str
    context: str
    answer: str
    context_relevance: float
    faithfulness: float
    answer_relevance: float
    timestamp: str = ""

class EvalHarness:
    """Complete RAG evaluation harness with SQLite persistence."""

    def __init__(self, db_path: str = "rag_eval.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self) -> None:
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS evaluations (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    query TEXT NOT NULL,
                    context_relevance REAL,
                    faithfulness REAL,
                    answer_relevance REAL,
                    timestamp TEXT NOT NULL
                )
            """)

    def evaluate_single(self, query: str, context: str, answer: str) -> EvalRecord:
        """Evaluate a single RAG output and persist the result."""
        cr = evaluate_context_relevance(query, context)
        faith = evaluate_faithfulness(query, context, answer)
        ar = evaluate_answer_relevance(query, answer)
        record = EvalRecord(
            query=query,
            context=context,
            answer=answer,
            context_relevance=cr["score"] / 5.0,
            faithfulness=faith["score"] / 5.0,
            answer_relevance=ar["score"] / 5.0,
            timestamp=datetime.datetime.utcnow().isoformat(),
        )
        with sqlite3.connect(self.db_path) as conn:
            conn.execute(
                "INSERT INTO evaluations "
                "(query, context_relevance, faithfulness, answer_relevance, timestamp) "
                "VALUES (?, ?, ?, ?, ?)",
                (query, record.context_relevance, record.faithfulness,
                 record.answer_relevance, record.timestamp),
            )
        return record

    def run_batch(
        self,
        questions: list[str],
        contexts: list[str],
        answers: list[str],
    ) -> pd.DataFrame:
        """Evaluate a batch and return a DataFrame of scores."""
        records = []
        for q, ctx, ans in zip(questions, contexts, answers):
            try:
                record = self.evaluate_single(q, ctx, ans)
                records.append({
                    "query": q,
                    "context_relevance": record.context_relevance,
                    "faithfulness": record.faithfulness,
                    "answer_relevance": record.answer_relevance,
                })
            except Exception as e:
                print(f"Eval failed for query '{q[:50]}': {e}")
        return pd.DataFrame(records)

    def get_summary_stats(self) -> dict:
        """Retrieve aggregate metrics from all stored evaluations."""
        with sqlite3.connect(self.db_path) as conn:
            df = pd.read_sql("SELECT * FROM evaluations", conn)
        return {
            "n_evaluated": len(df),
            "avg_context_relevance": df["context_relevance"].mean(),
            "avg_faithfulness": df["faithfulness"].mean(),
            "avg_answer_relevance": df["answer_relevance"].mean(),
            "faithfulness_below_threshold": (df["faithfulness"] < 0.6).sum(),
        }
```
CI Integration
Run evaluation on every pull request that touches the RAG pipeline. Fail the build if faithfulness drops below threshold.
```python
# eval_ci.py — run in CI pipeline
import sys

def run_ci_evaluation(rag_system, golden_set: list[dict], thresholds: dict) -> bool:
    """
    Run evaluation on the golden set.
    Return True if all thresholds pass; exit non-zero otherwise.
    """
    harness = EvalHarness(db_path=":memory:")  # in-memory for CI
    questions = [item["question"] for item in golden_set]
    contexts = []
    answers = []
    for item in golden_set:
        ctx, ans = rag_system.query(item["question"])
        contexts.append(ctx)
        answers.append(ans)

    results = harness.run_batch(questions, contexts, answers)
    stats = results.mean(numeric_only=True)  # skip the string "query" column

    print(f"Faithfulness: {stats['faithfulness']:.3f} (threshold: {thresholds['faithfulness']})")
    print(f"Context Relevance: {stats['context_relevance']:.3f} (threshold: {thresholds['context_relevance']})")
    print(f"Answer Relevance: {stats['answer_relevance']:.3f} (threshold: {thresholds['answer_relevance']})")

    passed = (
        stats["faithfulness"] >= thresholds["faithfulness"]
        and stats["context_relevance"] >= thresholds["context_relevance"]
        and stats["answer_relevance"] >= thresholds["answer_relevance"]
    )
    if not passed:
        print("EVALUATION FAILED: one or more thresholds not met")
        sys.exit(1)
    print("EVALUATION PASSED")
    return True

# Example invocation
# run_ci_evaluation(my_rag, golden_set, {"faithfulness": 0.75, "context_relevance": 0.70, "answer_relevance": 0.70})
```
Key Takeaways
RAG evaluation requires three orthogonal metrics: context relevance (retrieval quality), faithfulness (no hallucinations), and answer relevance (actually answered the question).
LLM-as-judge with structured prompts returning JSON scores 1–5 is the most practical approach for all three metrics — use a 70B model for the judge.
NLI-based faithfulness checking (BART-MNLI) provides claim-level grounding verification without an LLM call — fast and deployable in CI.
RAGAS wraps all four metrics (faithfulness, answer relevancy, context precision, context recall) in a standard dataset API.
Build an evaluation harness that persists every score to SQLite — tracking trends over time reveals regressions before they reach users.
Set CI quality gates: fail the build if faithfulness drops below 0.75 or context relevance below 0.70 on your golden set.
Reference-free metrics (faithfulness, answer relevance) are more useful in production because they don't require ground-truth answers for every query.
Sample 10% of production queries for live evaluation rather than evaluating every query — that sample is sufficient to detect systemic regressions.
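A deterministic way to implement that sampling is to hash a stable request ID, so the same query always gets the same decision across replicas and replays. A minimal sketch (the `should_evaluate` name and the 10% default are assumptions for illustration):

```python
import hashlib

def should_evaluate(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically sample a fraction of requests for live evaluation."""
    # Hash the request ID into a number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Roughly 10% of requests are selected, and a given ID always
# gets the same decision:
sampled = sum(should_evaluate(f"req-{i}") for i in range(10_000))
print(sampled)
```

Hash-based sampling beats `random.random()` here because re-running the pipeline on the same traffic reproduces the same evaluation set, which keeps trend lines comparable across runs.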