GadaaLabs
Claude Code Superpowers: AI That Gets Smarter With Every Task
Lesson 7

NEXUS — RAG, Agents, and Prompts

22 min

Building AI applications differs from building regular software in one specific way: the failure modes are invisible. A broken database query throws an exception. A hallucinating RAG system returns a confident, plausible, wrong answer. No exception is raised. No 500 error. Just incorrect information delivered with the tone of authority.

The NEXUS skill is built around this reality. Every pattern it provides has a corresponding failure mode it prevents. Understanding the connection between the pattern and the failure is what makes the difference between a demo that works and a system that works reliably in production.

The System Type Assessment

When you invoke the NEXUS skill, it immediately assesses what kind of system you are building:

A) RAG / knowledge retrieval system
B) Autonomous agent / tool-using agent
C) Prompt engineering / LLM integration
D) LLM evaluation / benchmarking
E) Multi-agent system
F) Debugging a hallucination / quality problem
G) Cost/latency optimization

Each type maps to a different section of patterns. Type A loads the RAG architecture patterns. Type B loads the agent design patterns. Type F triggers hunter first, then returns with evidence about what the model hallucinated, on what input, under what conditions.

One question follows: "What model are you using and what's the context window limit?" This single answer determines chunking strategy, context budget, and retrieval approach. A 4k context window system requires completely different RAG design than a 128k context window system.

Building a Production RAG System

RAG (Retrieval-Augmented Generation) is the most commonly built AI system and the most commonly built wrong. The pattern is simple in description: retrieve relevant documents, pass them to the model, get a grounded response. The failure modes are in the implementation details.

Chunking strategy is the first decision and it affects everything downstream.

The choice depends on document type:

  • Technical documentation with code blocks: 512-1024 token chunks with 10-15% overlap (preserves code integrity)
  • Legal contracts: 256-512 tokens with 20% overlap (preserves clause boundaries)
  • Conversation transcripts: 256 tokens with 25% overlap (preserves turn-taking context)
  • Code files: split at function boundaries, no size limit (an AST-aware split is better than a token-count split)

Semantic chunking — using embedding similarity to find natural break points — consistently outperforms fixed-size chunking but costs more to compute. Use fixed-size for prototypes, semantic for production.
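A fixed-size chunker with overlap is only a few lines. The sketch below uses whitespace-separated words as a stand-in for tokens; a production system would count tokens with the tokenizer of your chosen embedding model.

```python
def chunk_fixed(text, chunk_size=512, overlap_pct=0.15):
    """Split text into fixed-size chunks with fractional overlap.

    Uses whitespace words as a stand-in for model tokens (assumption:
    swap in your embedding model's tokenizer for production use).
    """
    tokens = text.split()
    # each step advances by less than a full chunk, leaving the overlap
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=100` and `overlap_pct=0.10`, consecutive chunks share their last and first 10 tokens, which is what keeps a sentence split across a boundary retrievable from both sides.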

Embedding model selection is a cost-vs-quality trade-off. The patterns file provides the full decision matrix:

| Model | Dimension | Speed | Best For |
|-------|-----------|-------|----------|
| all-MiniLM-L6-v2 | 384 | Fast | General RAG, low latency |
| bge-large-en-v1.5 | 1024 | Medium | High-accuracy retrieval |
| text-embedding-3-large | 3072 | Slow | Enterprise, multi-lingual |

Start with bge-large-en-v1.5. It provides the best quality-to-cost ratio for most production RAG systems.

Retrieval method selection follows a decision tree:

Need exact keyword match?           → lexical (BM25)
Need semantic recall?               → dense retrieval (cosine similarity)
Need both precision AND recall?     → hybrid (BM25 + dense, RRF weighted)
Need diversity in results?          → MMR (Maximal Marginal Relevance)
Multi-hop questions?                → iterative retrieval

For most production systems, hybrid retrieval with Reciprocal Rank Fusion gives the best results. Pure semantic retrieval misses exact matches. Pure lexical retrieval misses semantic matches. Hybrid captures both.
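Reciprocal Rank Fusion itself is simple enough to sketch in pure Python: each retriever contributes 1/(k + rank) per document, and the fused ranking sorts by the summed score. The k=60 smoothing constant is the conventional default.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over ranked lists of document ids.

    rankings: a list of ranked id lists (e.g. one from BM25, one from
    dense retrieval), best result first in each list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # each list contributes 1/(k + rank); documents found by
            # both retrievers accumulate two contributions
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a document ranked second by both retrievers beats a document ranked first by only one: that is the "captures both" behavior the decision tree points at.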

Reranking adds a cross-encoder on top of the initial retrieval. You retrieve the top 20 candidates, then rerank to top 5 using a cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2). Cross-encoders are slower than bi-encoders but substantially more accurate — they see the query and document together, not separately.

Context budget management is the detail that determines whether the system actually works at the token limits of your chosen model:

token_budget = {
    'system_prompt':     500,
    'query':             100,
    'retrieved_context': 6000,
    'output_reserve':    1592,
}   # totals 8,192 tokens (an 8k context window)

If your retrieved context consistently exceeds 6000 tokens, you need map-reduce (process chunks separately, combine results) or prompt compression. Running out of context window silently truncates input — the model never tells you it is missing context.
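Enforcing the budget explicitly, before the prompt is assembled, is what turns silent truncation into a visible decision. A minimal sketch, assuming chunks arrive pre-sorted best-first and using a rough 4-characters-per-token estimate (swap in your model's tokenizer for real counts):

```python
def fit_context(chunks, budget=6000, count_tokens=lambda s: len(s) // 4):
    """Greedily keep the highest-ranked chunks that fit the token budget.

    chunks: retrieved chunks, best-first. count_tokens defaults to a
    crude 4-chars-per-token estimate (assumption: use your model's real
    tokenizer in production). Dropping chunks here is explicit and
    loggable; letting the context window overflow is not.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # skip oversized chunks but keep trying smaller ones
        kept.append(chunk)
        used += cost
    return kept, used
```

Logging which chunks were dropped, and how often, is the early-warning signal that you need map-reduce or compression.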

The most important rule: Test retrieval quality (precision/recall) before testing generation quality. A RAG system with excellent generation but poor retrieval will hallucinate from missing context. The generation cannot compensate for bad retrieval.

Agent Design Patterns

Agents have five design patterns, each appropriate for different problem shapes.

ReAct (Reason + Act) is the simplest and most versatile. The model alternates Thought → Action → Observation until it has enough information for a final answer. Maximum 10 iterations. Force an answer after the limit. ReAct is appropriate when the task is factual and the tools are clear.

Plan-and-Execute separates planning from execution. A planner LLM generates a multi-step plan. An executor LLM executes one step at a time. A synthesizer combines results. This pattern handles tasks with many steps that can be determined upfront. Limit to 5 plan regenerations (adaptive replanning).

Reflection/Self-Correction adds a quality loop: Generate → Critique → Revise → Output. A critic LLM reviews the generator's output for factual errors, logical inconsistencies, missing information, and unclear reasoning. A reviser fixes identified issues. Maximum 3 reflection cycles. Stop when the critic says "no issues."

This pattern is expensive (3 LLM calls per reflection cycle) but dramatically improves quality for tasks where correctness is critical: legal document review, medical information synthesis, financial analysis.
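The loop structure is the same regardless of model. In this sketch, `generate`, `critique`, and `revise` are placeholder callables standing in for the three LLM roles (an assumption, not a prescribed API); `critique` returns a list of issues, and an empty list is the "no issues" stop signal.

```python
def reflect(task, generate, critique, revise, max_cycles=3):
    """Generate -> Critique -> Revise loop with a hard cycle cap.

    generate/critique/revise are placeholders for LLM calls
    (assumption). critique returns a list of issue strings; an empty
    list means the critic found nothing to fix, so we stop early.
    """
    draft = generate(task)
    for _ in range(max_cycles):
        issues = critique(task, draft)
        if not issues:
            break  # critic approved the draft
        draft = revise(task, draft, issues)
    return draft
```

The cap matters here too: a critic that always finds something would otherwise revise forever, paying three LLM calls per lap.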

Multi-Agent Debate uses multiple agents arguing different positions, with a judge evaluating. Appropriate for high-stakes decisions where the right answer is genuinely uncertain and second-order effects matter. 3 debate rounds. Use sparingly — it is expensive and slow.

Tool Routing selects the right tool for each task using either prompt-based selection (a fast LLM reads tool descriptions and selects), embedding-based selection (encode the task and tool descriptions, find the nearest neighbor), or a fine-tuned classifier (94% accuracy, no per-request LLM cost). Fallback: ask the user for clarification if confidence < 0.7.

The universal rule for all agent patterns: Always set a maximum iteration limit. An agent without a limit will loop indefinitely when it encounters a tool error or a problem it cannot solve. The iteration limit is not a performance optimization — it is a safety mechanism.
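That rule is a loop shape, not a configuration flag. In the sketch below, `step` is a placeholder (an assumption) for one Thought → Action → Observation round; it returns an answer and a done flag, and hitting the cap forces a best-effort answer instead of another lap.

```python
def run_agent(task, step, max_iterations=10):
    """Run an agent loop under a hard iteration cap.

    step: placeholder callable (assumption) for one reasoning round;
    returns (answer, done). Without the cap, a single failing tool
    call can loop the agent indefinitely.
    """
    state = {"task": task, "history": []}
    for _ in range(max_iterations):
        answer, done = step(state)
        if done:
            return answer
    # limit hit: force an answer rather than keep looping
    last = state["history"][-1] if state["history"] else "no progress"
    return f"Best-effort answer after {max_iterations} iterations: {last}"
```

The forced-answer branch is the safety mechanism the rule describes: bounded cost and a response the caller can inspect, instead of a hung process.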

The Prompt Engineering Workflow

Prompt engineering is a process, not guesswork. The iterative refinement loop:

  1. Write initial prompt
  2. Test on 10 diverse examples — categorize every failure
  3. Add constraints or few-shot examples to address the most common failures
  4. Re-test until >90% success rate
  5. Add output format schema with retry logic

Never deploy a prompt tested on fewer than 10 diverse examples. A prompt that works on the 2 examples you thought of will often fail on the 3rd input a real user enters.

Few-shot example selection matters more than most engineers realize. Random selection gives random results. Select examples that:

  • Cover all query categories in your input space
  • Include known edge cases from step 2
  • Progress from easy to hard (helps the model calibrate its reasoning)

Output format enforcement prevents the second most common production failure: the model generates correct content but in a format your downstream system cannot parse. Enforce format with a JSON schema in the prompt, validate the output, retry up to 3 times with the validation error if it fails.
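A minimal version of that validate-and-retry loop, with `call_llm` as a placeholder (assumption) for your model call and a required-key check standing in for a full JSON Schema validator:

```python
import json

def generate_validated(prompt, call_llm, required_keys, max_retries=3):
    """Call an LLM, validate its JSON output, retry with the error.

    call_llm is a placeholder for your model call (assumption); a real
    system would validate against a full JSON Schema rather than a
    flat key list.
    """
    for attempt in range(max_retries):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            error = f"missing keys: {missing}"
        except json.JSONDecodeError as e:
            error = f"invalid JSON: {e}"
        # feed the validation error back so the model can self-correct
        prompt = (f"{prompt}\n\nYour last output failed validation "
                  f"({error}). Return valid JSON only.")
    raise ValueError(f"no valid output after {max_retries} attempts")
```

Including the concrete validation error in the retry prompt is what makes the retries converge; a bare "try again" rarely does.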

Prompt injection defense is non-negotiable for any system that processes untrusted input. The defenses:

  • Keyword filter: reject inputs containing "ignore previous instructions", "system prompt", "developer mode"
  • Context isolation: user input in a separate message, not concatenated into the system prompt
  • Output filtering: block responses that contain system prompt content
  • Monitoring: log and alert on inputs that trigger the keyword filter
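The keyword filter from the first bullet is a few lines; the sketch below returns the matched patterns so the monitoring layer from the last bullet has something to log. It is a first line of defense, not a complete one, and must be paired with the isolation and output-filtering defenses above.

```python
INJECTION_PATTERNS = [
    "ignore previous instructions",
    "system prompt",
    "developer mode",
]

def screen_input(user_input):
    """Return (allowed, matched_patterns) for a piece of user input.

    Case-insensitive substring match against known injection phrases;
    matched patterns should be logged and alerted on.
    """
    lowered = user_input.lower()
    matched = [p for p in INJECTION_PATTERNS if p in lowered]
    return (not matched, matched)
```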

LLM Evaluation Framework

Without measurement, you do not know if your AI system is getting better or worse.

Correctness metrics match to task type:

  • Exact match: factual QA, code generation (output must match reference exactly)
  • F1 score: extractive QA, summarization (token overlap with reference)
  • Semantic similarity: open-ended QA, chatbots (embedding cosine similarity > 0.8)
  • Rubric-based: complex analysis tasks (LLM evaluates against explicit rubric)
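The token-overlap F1 from the list above is the standard extractive-QA metric and is easy to implement exactly: precision and recall over the multiset of shared tokens, combined harmonically.

```python
def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    # multiset intersection: count shared tokens without double-counting
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A short correct answer against a longer reference scores perfect precision but partial recall, which is exactly the behavior you want from a summarization metric.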

Hallucination detection is the most important metric for knowledge-intensive systems. Three methods, in order of rigor:

Fact verification: extract atomic claims from the output, search the knowledge base for each, flag claims with no supporting evidence.

Self-consistency: generate 5 outputs for the same input at temperature > 0, extract key claims, check agreement across samples. < 50% agreement indicates likely hallucination.
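The agreement check at the end of that pipeline is simple set arithmetic. This sketch assumes claim extraction has already happened upstream (it takes one collection of extracted claims per sampled output) and flags any claim appearing in fewer than half the samples.

```python
def self_consistency_flag(samples_claims, threshold=0.5):
    """Flag claims that fail the self-consistency agreement check.

    samples_claims: one collection of extracted claims per sampled
    output (claim extraction is assumed done upstream). A claim seen
    in fewer than `threshold` of the samples is flagged as a likely
    hallucination.
    """
    n = len(samples_claims)
    counts = {}
    for claims in samples_claims:
        for claim in set(claims):  # count each claim once per sample
            counts[claim] = counts.get(claim, 0) + 1
    return {claim for claim, c in counts.items() if c / n < threshold}
```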

Uncertainty calibration: ask the model for its confidence score, bin predictions by confidence, verify actual accuracy matches claimed confidence within 5%.

Human preference evaluation is the gold standard for comparing two models or two prompt versions. Blind A/B comparison (evaluators do not know which is A or B), 3 evaluators per sample, 100 samples minimum covering all query categories. Win rate > 0.55 with p < 0.05 is the decision threshold.
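The decision threshold can be checked mechanically. This sketch uses a one-sided normal approximation to the binomial test against a null win rate of 0.5; a production analysis would use an exact binomial test and decide how ties between A and B are counted (both assumptions here).

```python
import math

def win_rate_significant(wins, total, threshold=0.55, alpha=0.05):
    """Apply the A/B decision rule: win rate > threshold with p < alpha.

    One-sided normal approximation to the binomial test against a null
    of 0.5 (assumption: use an exact test and a tie-handling policy in
    a real evaluation).
    """
    rate = wins / total
    # z-score of the observed win count under the 50/50 null
    z = (wins - total * 0.5) / math.sqrt(total * 0.25)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail p-value
    return rate > threshold and p < alpha
```

With the 100-sample minimum above, a 58/100 split clears neither bar comfortably; this is why the minimum sample count is part of the protocol, not a suggestion.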

Cost and latency budgets must be set before production deployment, not discovered in production:

  • P99 latency: 5 seconds maximum for interactive systems
  • Cost per 1,000 requests: $10.00 maximum
  • Set up cost alerts immediately after deployment

The Final Checklist

Before calling any AI engineering work complete:

  • [ ] RAG retrieval quality tested (precision/recall on test set)
  • [ ] Agent has explicit max iteration limits
  • [ ] Prompt injection testing passed
  • [ ] Hallucination detection configured and running
  • [ ] Cost monitoring and alerts active
  • [ ] Latency budgets defined and P99 measured
  • [ ] Output format validation with retry implemented
  • [ ] Human evaluation plan defined (or A/B test configured)
  • [ ] Rollback plan documented

The hallucination monitoring and cost alerts items are the ones most commonly skipped. Hallucination monitoring catches quality degradation before users report it. Cost alerts prevent the scenario where your open-ended agent costs $10k in a weekend because someone found a way to make it loop.

Key Takeaway

AI systems fail invisibly — confident, plausible, wrong. The NEXUS skill provides the patterns that make these failures visible and preventable. Test retrieval before generation. Set iteration limits on every agent. Never deploy a prompt tested on fewer than 10 examples. Measure hallucination rates in production. Set cost alerts before you need them. The patterns exist because each one corresponds to a failure mode that costs teams real time, money, or user trust.