HUNTER: Root Cause, Not Guessing
The most common way engineers use AI for debugging is also the least effective: paste an error message, ask for a fix, apply the fix, see if it worked. This approach treats debugging as guessing with autocomplete. Sometimes it works. More often it leads to a succession of patches that address symptoms without reaching the root cause.
The systematic debugging skill changes the approach entirely. Evidence before hypothesis. Narrow before fix. Verify before close.
The Core Failure Mode of AI-Assisted Debugging
When you paste an error and ask Claude "what is wrong?", Claude does what language models do: it generates the most probable explanation given the text it received. If the error is common, the explanation is usually right. If the error is unusual, or if the root cause is context-specific, Claude generates a confident-sounding explanation that addresses the surface pattern, not your actual problem.
You apply the fix. The error changes form. You paste the new error. Claude generates another explanation. Four iterations later you have four patches and a codebase that is harder to understand than when you started.
The fix is to change the question. Instead of "what is wrong?", ask "what do I know, and what do I need to find out?" Evidence collection before hypothesis formation is the discipline that prevents the guessing loop.
Phase 1: Reproduce Reliably
Before any analysis, before any hypothesis, reproduce the bug under controlled conditions.
A bug you can reproduce reliably is 80% debugged. You can observe it. You can change one variable at a time. You can verify when it is fixed. A bug you cannot reproduce is almost impossible to debug — you are chasing a ghost.
Reproduction requirements:
- Minimal trigger: the smallest input or action that consistently causes the bug
- Environment specification: OS, language version, framework version, any relevant configuration
- Frequency: always, intermittent (what percentage?), only under load, only after N operations
Tell Claude the reproduction steps before asking anything else. If you cannot produce minimal steps, that investigation is part of the debugging process.
An observation like "it only happens when qty is 0" is already a clue, and you only find clues like that by specifying reproduction conditions precisely.
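As an illustration, a minimal reproduction for a qty-is-0 bug might be a few lines of code that always trigger the failure. The discount function and the ZeroDivisionError here are hypothetical, not from any real codebase:

```python
# Hypothetical minimal reproduction: a checkout calculation that crashes
# only when qty is 0.
def per_unit_discount(total_discount: float, qty: int) -> float:
    # Suspected bug site: divides by qty without guarding against 0.
    return total_discount / qty

def reproduce():
    """Smallest input that consistently triggers the failure."""
    per_unit_discount(5.0, 0)  # raises ZeroDivisionError every time

if __name__ == "__main__":
    reproduce()
```

A script like this satisfies all three requirements at once: it is the minimal trigger, it pins the environment (whatever runs the script), and it establishes frequency (always).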
Phase 2: Collect Evidence
Evidence is observable fact. Hypothesis is interpretation. Collect evidence first.
Evidence to collect before asking Claude anything:
Error information:
- Exact error message (copy-paste, do not paraphrase)
- Full stack trace with line numbers
- HTTP status codes and response bodies for API errors
- Log output from around the time of the error
State information:
- What was the state of the system immediately before the error?
- What inputs led to this state?
- Did any data change recently (migration, config change, deployment)?
Change information:
- When did this bug first appear? What changed around that time?
- Run git log --oneline -20 and look for recent commits near the symptom area
- Run git bisect if the change is not obvious
Isolation information:
- Does the bug occur in a minimal reproduction without other components?
- Does it occur in staging but not local? (Environment difference)
- Does it occur for all users or specific users? (Data difference)
Present all of this to Claude at once. Not incrementally. The quality of Claude's analysis is proportional to the completeness of evidence you provide. A half-evidence conversation leads to half-correct hypotheses.
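One way to package that evidence into a single message is a fixed structure whose headings mirror the categories above. This is a sketch, not the skill's canonical format; the placeholders are yours to fill in:

```text
BUG REPORT (evidence only, no hypotheses yet)

Error:
- Exact message: <paste verbatim>
- Stack trace: <full trace with line numbers>
- Logs around the failure: <paste>

State:
- System state immediately before the error: <describe>
- Inputs that led to this state: <list>
- Recent data changes (migration / config / deploy): <yes or no, details>

Change history:
- First observed: <date or commit>
- git log --oneline -20 output: <paste>

Isolation:
- Reproduces in a minimal setup? <yes or no>
- Staging vs local? <where it occurs>
- All users or specific users? <which>
```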
Phase 3: Form Hypotheses
With evidence collected, ask Claude for hypotheses — plural. Not "what is wrong" but "given this evidence, what are the top 3 most likely root causes, ranked by probability?"
Asking for multiple hypotheses does two things. It prevents anchoring on the first explanation (Claude will generate a confident-sounding explanation regardless of certainty — asking for three forces acknowledgment of uncertainty). It gives you a ranked list to test.
Good hypothesis prompt:
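One possible phrasing, as a sketch rather than the skill's canonical text:

```text
Here is the evidence: [reproduction steps, stack trace, logs, recent changes].

Given this evidence, what are the top 3 most likely root causes, ranked by
probability? For each one, state what evidence would confirm it and what
evidence would rule it out.
```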
The "what evidence would confirm or rule it out" clause is the most important part. It turns Claude's analysis into a testable prediction, not just an opinion.
Phase 4: Bisect to Root Cause
Take the highest-probability hypothesis and test it. Bisection means narrowing the problem space by half with each test.
Binary search through code: If the bug is somewhere in a 200-line function, add a log at line 100 and observe whether the bug manifests before or after. Repeat. You find the exact line in about log2(n) iterations, roughly 8 for 200 lines, so even large functions narrow quickly.
Binary search through commits: If the bug appeared recently, git bisect automates the search:
Binary search through configuration: If the bug is environment-specific, isolate variables. Production vs staging differs on: environment variables, database data, network topology, service versions. Change one at a time.
Tell Claude the result of each bisection test. It adjusts hypothesis ranking based on your observations. This iterative narrowing is how root causes get found — not by the first guess being right, but by systematic elimination.
Phase 5: Fix, Test, Document
Fix only the root cause. Not the symptoms you observed along the way.
Applying the fix:
- Write a failing test that reproduces the bug (this goes in your test suite permanently)
- Apply the minimal fix to address the root cause
- Run the reproduction test — it should now pass
- Run the full test suite — it should still pass
- If you are not certain the fix is right, invoke SENTINEL
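As a sketch, the failing-then-passing regression test for the hypothetical qty-is-0 bug could look like this. The function name and the chosen fix (define the discount as 0 for zero units) are illustrative:

```python
def per_unit_discount(total_discount: float, qty: int) -> float:
    # Root-cause fix: a zero-unit order has no per-unit discount,
    # instead of dividing by zero.
    if qty == 0:
        return 0.0
    return total_discount / qty

def test_zero_qty_regression():
    # Before the fix this raised ZeroDivisionError; it now guards the bug forever.
    assert per_unit_discount(5.0, 0) == 0.0  # the exact input that crashed
    assert per_unit_discount(6.0, 3) == 2.0  # behavior elsewhere is unchanged
```

The first assertion encodes the exact reproduction from Phase 1; the second confirms the fix is minimal and did not change surrounding behavior.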
The failing test in step 1 is not optional. Without it, the same bug can reappear after a future refactor and you will have no early warning.
Documenting the pattern: After a successful fix, spend 2 minutes creating a pattern entry. Format:
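A minimal sketch of what an entry could contain; the fields and the webhook example are illustrative, not a prescribed schema:

```text
Pattern: webhook-timeout-under-retry-load
Symptom: webhook deliveries time out intermittently under load
Root cause: <one sentence describing the actual cause, not the symptom>
Evidence that pointed there: <the observation that confirmed the hypothesis>
Fix: <commit or PR link>  Regression test: <test name>
Date / author: <fill in>
```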
Pattern entries are searchable in future debugging sessions. The third time someone hits a webhook timeout, they find the pattern in 10 seconds instead of debugging for 3 hours.
The Canary Test Pattern
For bugs that cannot be reliably reproduced (race conditions, time-dependent failures, load-dependent failures), use the canary test pattern.
A canary test is a test that will fail if the bug recurs, even if you cannot reproduce it on demand right now. You write it based on your hypothesis about the root cause, deploy it, and monitor.
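For example, a canary for a suspected lost-update race might look like the following. The counter and the missing-lock hypothesis are fabricated for illustration; the point is that the test hammers the suspected condition hard enough that a recurrence is likely to surface:

```python
import threading

class SafeCounter:
    """Counter whose hypothesized race (a missing lock) has been fixed."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def increment(self) -> None:
        with self._lock:  # remove this lock and the canary below can fail
            self.value += 1

def test_canary_no_lost_updates():
    # If the race recurs, some increments are lost and the count comes up short.
    counter = SafeCounter()

    def worker():
        for _ in range(1000):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert counter.value == 8 * 1000
```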
If a canary fails in CI six months later, you have caught a regression before it hit production. That is the value.
Knowing When to Escalate
Not every bug is a debugging problem. Some bugs are architecture problems wearing bug costumes.
Escalate to architecture review when:
- You have debugged the same area three times in the past month
- The fix requires touching more than 5 files
- The root cause is "the system was not designed for this case"
- Multiple people have tried to fix this and each fix created new bugs
Escalate to code review when:
- The fix changes behavior in ways that are hard to test
- You are not confident the fix does not have side effects
The systematic debugging skill integrates with tribunal for exactly this case. After Phase 5, if the confidence gate scores LOW or MEDIUM, code review is automatically invoked.
Full Debugging Session Template
For reference, here is the template for a complete debugging session with Claude:
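A sketch of what that template could contain; the structure follows the five phases above, and the exact wording is an assumption rather than the skill's canonical text:

```text
DEBUGGING SESSION: <one-line bug summary>

1. Reproduction
   - Minimal trigger: <steps>
   - Environment: <OS, versions, relevant config>
   - Frequency: <always / N% / only under load>

2. Evidence
   - Error: <exact message and full stack trace>
   - State: <system state and inputs immediately before failure>
   - Changes: <git log and deploys around first occurrence>
   - Isolation: <minimal repro? environment-specific? user-specific?>

3. Request
   Given this evidence, list the top 3 most likely root causes ranked by
   probability, and for each, what evidence would confirm or rule it out.
```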
Starting a debugging session with this template — even for bugs you think are simple — dramatically reduces the time to root cause. It forces you to collect evidence before asking questions, which is the entire discipline.
Key Takeaway
Effective AI-assisted debugging is evidence-first, not guess-first. The 5 phases: reproduce reliably, collect evidence, form multiple hypotheses (ranked), bisect to root cause, fix and document. Write a failing test before applying any fix — it becomes a permanent regression guard. Store the pattern after solving — it becomes a searchable asset for future debugging sessions. The canary test pattern handles non-reproducible bugs. Escalate when the problem is architectural, not just buggy.