GadaaLabs
RAG Engineering
Lesson 1

Why RAG?

LLMs are remarkably good at generating fluent, confident text — including fluent, confident text that is completely wrong. This lesson explains why that happens, how retrieval-augmented generation addresses it, and when a simpler approach is preferable.

The Hallucination Problem

LLMs are trained to predict the next token given the previous ones. When asked a factual question, the model generates tokens that are statistically likely given the question context — not tokens that are provably true. Three failure modes are common:

| Failure mode | Example | Root cause |
|---|---|---|
| Fabricated facts | Incorrect citation author or date | Statistical plausibility ≠ factual accuracy |
| Stale knowledge | Answers from the training cutoff | Knowledge frozen at training time |
| Out-of-domain gaps | Internal company policy not in training data | Private knowledge never in training corpus |
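To make the mechanism concrete: decoding picks high-likelihood continuations, and likelihood is not truth. A toy sketch, where the candidate tokens and every probability are invented purely for illustration:

```python
# Hypothetical next-token distribution: all numbers here are made up.
next_token_probs = {"Ulm": 0.30, "Munich": 0.45, "Bern": 0.25}

# Greedy decoding picks the highest-probability token...
prediction = max(next_token_probs, key=next_token_probs.get)

# ...which in this invented distribution is "Munich", regardless of
# which continuation is actually correct. Likelihood wins, not truth.
print(prediction)  # → Munich
```

No amount of clever decoding fixes this: if the distribution itself encodes a wrong or outdated fact, the most probable output is wrong.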

Fine-tuning partially addresses the third problem, but fine-tuning on facts does not reliably prevent hallucinations: it tends to change style, not factual fidelity.

RAG Architecture

A RAG system adds a retrieval step before generation:

```
User query
    ↓
Query Encoder → dense vector
    ↓
Vector Database → top-k most similar document chunks
    ↓
Prompt Assembly: [system prompt] + [retrieved chunks] + [user query]
    ↓
LLM → response grounded in retrieved chunks
```

The LLM never needs to "remember" facts. It only needs to read and synthesise the retrieved context — a task LLMs are genuinely good at.
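The prompt-assembly step above can be sketched directly. The system instruction wording and the `[n]` chunk labels below are illustrative choices, not a required format:

```python
# Minimal prompt-assembly sketch. The instruction text and chunk
# numbering are assumptions, not a fixed convention.
SYSTEM = "Answer using only the provided context. If the answer is not in the context, say so."

def assemble_prompt(chunks: list[str], query: str) -> str:
    # Number each retrieved chunk so claims can be traced back to a source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = assemble_prompt(
    ["The Gadaa cycle lasts eight years."],
    "How long does a Gadaa cycle last?",
)
print(prompt)
```

Restricting the model to the supplied context is what turns open-ended recall into the much easier reading-comprehension task.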

When NOT to Use RAG

RAG is not universally better than alternatives:

```python
# Decision heuristic: rough rules of thumb, not hard thresholds
def choose_approach(knowledge_changes_daily: bool,
                    corpus_size_docs: int,
                    needs_reasoning: bool) -> str:
    if corpus_size_docs < 20:
        return "stuff all docs into system prompt"
    if not knowledge_changes_daily and corpus_size_docs < 5000:
        return "fine-tune on the corpus"
    if needs_reasoning and not knowledge_changes_daily:
        return "fine-tune + optional RAG for long-tail"
    return "RAG"

choose_approach(knowledge_changes_daily=True, corpus_size_docs=100_000,
                needs_reasoning=False)  # → "RAG"
```

| Scenario | Better approach |
|---|---|
| < 20 short documents | Fit in context window (no retrieval needed) |
| Stable knowledge, < 5k docs | Fine-tuning |
| Complex multi-hop reasoning | Chain-of-thought or code execution |
| Real-time data (stock prices) | API tool call, not RAG |

Minimal Working RAG

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The Gadaa system is an Oromo democracy originating in Ethiopia.",
    "The Gadaa cycle lasts eight years.",
    "Women in Gadaa hold a parallel institution called Siqqee.",
]

corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = corpus_embeddings @ q_emb        # cosine similarity (pre-normalised)
    top_k  = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top_k]

results = retrieve("How long does a Gadaa cycle last?")
context = "\n".join(results)
prompt  = f"Answer using only this context:\n{context}\n\nQuestion: How long does a Gadaa cycle last?"
# → pass `prompt` to your LLM of choice
```

This is the simplest possible implementation. The following lessons improve each component.

Summary

  • LLMs hallucinate because they maximise token likelihood, not factual accuracy.
  • RAG grounds every response in retrieved documents rather than model weights, directly mitigating the stale-knowledge and out-of-domain problems: the index can be updated without retraining.
  • Do not use RAG if your corpus fits in the context window, knowledge is stable, or reasoning requires multi-hop logic that retrieval cannot support.
  • The core RAG loop is: encode query → retrieve top-k chunks → assemble prompt → generate response.