Why RAG?
LLMs are remarkably good at generating fluent, confident text — including fluent, confident text that is completely wrong. This lesson explains why that happens, how retrieval-augmented generation addresses it, and when a simpler approach is preferable.
The Hallucination Problem
LLMs are trained to predict the next token given the previous ones. When asked a factual question, the model generates tokens that are statistically likely given the question context — not tokens that are provably true. Three failure modes are common:
| Failure mode | Example | Root cause |
|---|---|---|
| Fabricated facts | Incorrect citation author or date | Statistical plausibility ≠ factual accuracy |
| Stale knowledge | Answers reflect only pre-cutoff information | Knowledge frozen at training time |
| Out-of-domain gaps | Internal company policy not in training data | Private knowledge never entered the training corpus |
Fine-tuning addresses the third problem partially, but fine-tuning on facts does not reliably prevent hallucinations — it changes style, not factual fidelity.
RAG Architecture
A RAG system adds a retrieval step before generation: the query is encoded, the top-k most relevant chunks are fetched from a document index, and those chunks are inserted into the prompt alongside the question.
The LLM never needs to "remember" facts. It only needs to read and synthesise the retrieved context — a task LLMs are genuinely good at.
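The "read, don't remember" contract can be made explicit in the prompt itself. A minimal grounding template might look like the sketch below; the exact wording, including the refusal instruction, is an assumption for illustration, not a fixed part of the architecture.

```python
# A grounding prompt template (hypothetical wording): the instruction
# constrains the model to the retrieved context rather than its weights.
GROUNDED_PROMPT = """\
Use ONLY the context below to answer. If the context does not contain
the answer, reply "I don't know."

Context:
{context}

Question: {question}
Answer:"""

filled = GROUNDED_PROMPT.format(
    context="- The warranty period is 24 months.",
    question="How long is the warranty?",
)
# `filled` is what gets sent to the LLM; the model call itself is omitted.
```

The refusal clause matters: without it, a model that finds no answer in the context tends to fall back on its weights, reintroducing the hallucination problem.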
When NOT to Use RAG
RAG is not universally better than alternatives:
| Scenario | Better approach |
|---|---|
| Fewer than 20 short documents | Fit everything in the context window (no retrieval needed) |
| Stable knowledge, fewer than 5k docs | Fine-tuning |
| Complex multi-hop reasoning | Chain-of-thought or code execution |
| Real-time data (stock prices) | API tool call, not RAG |
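The first row of the table deserves emphasis because it is the most commonly ignored: for a tiny corpus, skipping retrieval entirely and stuffing every document into the prompt is simpler and often more accurate. A sketch, assuming a rough character budget as a stand-in for a real token limit (the function name and limit are illustrative, not from the lesson):

```python
def stuff_context(docs: list[str], question: str, max_chars: int = 12_000) -> str:
    # Concatenate the entire corpus into one prompt -- no index,
    # no embeddings, no retrieval step.
    body = "\n\n".join(docs)
    if len(body) > max_chars:
        # Past this point, context stuffing stops being viable
        # and retrieval starts to pay for itself.
        raise ValueError("Corpus too large for context stuffing; consider RAG.")
    return f"Documents:\n{body}\n\nQuestion: {question}"
```

The guard clause marks the crossover point: once the corpus outgrows the context budget, you are forced to choose *which* documents to include, and that selection step is exactly what retrieval automates.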
Minimal Working RAG
A minimal end-to-end loop has four steps: encode the query, retrieve the top-k chunks, assemble the prompt, and generate the response. The following lessons improve each component.
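The loop can be sketched in a few dozen lines. This is a toy, assumption-laden version: the bag-of-words "embedding" and cosine scorer stand in for a real embedding model, and the final LLM call is omitted entirely; only the retrieve-and-assemble skeleton matches the architecture described above.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # sentence-embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank every document by similarity to the query; return the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def assemble_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 14 days of a return.",
    "Our office is closed on public holidays.",
    "Returns must be initiated within 30 days of purchase.",
]
query = "How long do refunds take?"
prompt = assemble_prompt(query, retrieve(query, docs))
# `prompt` would now be sent to an LLM for generation.
```

Every later refinement, chunking, better embeddings, reranking, prompt design, is a drop-in replacement for one of these four functions.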
Summary
- LLMs hallucinate because they maximise token likelihood, not factual accuracy.
- RAG grounds every response in retrieved documents rather than model weights, which directly addresses the stale-knowledge and out-of-domain problems.
- Do not use RAG if your corpus fits in the context window, knowledge is stable, or reasoning requires multi-hop logic that retrieval cannot support.
- The core RAG loop is: encode query → retrieve top-k chunks → assemble prompt → generate response.