GadaaLabs
Claude Code Superpowers: AI That Gets Smarter With Every Task
Lesson 4

FORGE — Test-Driven Development With AI

20 min

Test-driven development is counterintuitive the first time you encounter it: you write a test for code that does not yet exist, watch it fail, then write the code to make it pass. The constraint of writing the test first forces you to understand what the code should do before you think about how it does it. That clarity is the entire point.

With AI assistance, TDD becomes even more important — and the reason is the opposite of what most people expect.

Why AI Makes TDD More Important, Not Less

The common assumption is that AI-generated code is correct enough to skip formal verification. This assumption is wrong in a specific way: AI code is often locally correct but contextually wrong.

Locally correct means the code does what it looks like it does. A function that sorts a list will sort the list. A function that validates an email will perform the regex check. You can read the code and it is fine.

Contextually wrong means the code does not do what the system needs. The sort uses the wrong comparison function for your data structure. The email validator accepts formats your downstream system cannot handle. The auth middleware checks the wrong header. These failures are invisible to a reader who does not know the full context.

Tests — written by you, before Claude generates the implementation — encode your contextual knowledge. They say "in this system, under these conditions, with this input, the output must be exactly this." When Claude's implementation passes your tests, it has satisfied your context, not just its own interpretation of the task.

The second reason TDD matters more with AI: AI can generate wrong implementations with high confidence. Unlike a human engineer who slows down when uncertain, Claude generates text at the same rate regardless of confidence. A test suite catches wrong implementations before they are merged. Without one, the only signal that something is wrong is a production incident.

The Red-Green-Refactor Loop With Claude

The classic TDD loop has three phases. With Claude, each phase has a specific mechanic.

Red: Write the failing test first.

You write the test. Not Claude. This is the discipline that makes TDD work with AI. If you ask Claude to write both the test and the implementation, it tends to write an implementation first in its head and then write tests that pass it — defeating the entire purpose.

A failing test with Claude should look like this:

python
# tests/test_rate_limiter.py
import time

import pytest

from rate_limiter import RateLimiter

def test_allows_requests_under_limit():
    limiter = RateLimiter(max_requests=10, window_seconds=60)
    for _ in range(10):
        assert limiter.is_allowed(user_id="user_1")

def test_blocks_requests_over_limit():
    limiter = RateLimiter(max_requests=10, window_seconds=60)
    for _ in range(10):
        limiter.is_allowed(user_id="user_1")  # use up the limit
    assert not limiter.is_allowed(user_id="user_1")

def test_different_users_have_independent_limits():
    limiter = RateLimiter(max_requests=2, window_seconds=60)
    limiter.is_allowed(user_id="user_1")
    limiter.is_allowed(user_id="user_1")
    # user_1 is at limit, but user_2 should still be allowed
    assert limiter.is_allowed(user_id="user_2")

def test_window_resets_after_expiry():
    limiter = RateLimiter(max_requests=2, window_seconds=1)
    limiter.is_allowed(user_id="user_1")
    limiter.is_allowed(user_id="user_1")
    # Both slots used. Wait for the window to reset.
    time.sleep(1.1)
    assert limiter.is_allowed(user_id="user_1")

Notice what these tests encode: the class name, the constructor parameters, the method signature, the behavior across multiple users, and the time-based reset. This is the specification. Claude does not need to make decisions — it just needs to make these tests pass.

Run the tests now:

bash
pytest tests/test_rate_limiter.py

They all fail with ModuleNotFoundError. That is correct. Red phase complete.

Green: Tell Claude to make the tests pass.

Give Claude the test file and this exact instruction:

"Make these tests pass. Do not change the tests. Implement rate_limiter.py from scratch."

This constraint — "do not change the tests" — is important. Without it, Claude will occasionally modify the tests to make them easier to pass rather than implementing what you actually specified. The tests are the spec. They are inviolable.

Claude generates an implementation. Run the tests:

bash
pytest tests/test_rate_limiter.py

If all tests pass, green phase complete. If some fail, show Claude the test output:

"Tests still failing. Here is the output: [paste output]. Fix the implementation only."
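For reference, an implementation that satisfies this spec might look like the following — a sliding-window sketch, one of several designs Claude could plausibly produce (the class name, constructor parameters, and method signature come from the tests; the deque-per-user layout is a choice, not a requirement):

```python
# rate_limiter.py — sliding-window sketch satisfying the tests above.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)  # user_id -> recent request times

    def is_allowed(self, user_id):
        now = time.monotonic()
        window = self._timestamps[user_id]
        # Evict timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True
```

Note that this version is not thread-safe — it is exactly the kind of implementation the concurrency edge case in the next section catches.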

Refactor: Clean up without breaking tests.

Once all tests are green, ask Claude to refactor for clarity, performance, or style — with the tests as the regression guard:

"The tests all pass. Now refactor the implementation for clarity. Run tests after each change."

The tests are your safety net. Any refactoring that breaks a test has introduced a regression.
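One refactor worth considering at this stage: the expiry test burns real wall-clock time in time.sleep, which slows the suite as it grows. Injecting the clock fixes that. The sketch below is self-contained; the time_fn parameter is a hypothetical addition, not part of the spec above, and since it changes the constructor, updating the tests to match is a deliberate decision you make yourself — not something Claude does to get to green.

```python
import time
from collections import defaultdict, deque

# Minimal limiter with an injectable clock. `time_fn` is a hypothetical
# constructor parameter, not part of the spec encoded in the earlier tests.
class RateLimiter:
    def __init__(self, max_requests, window_seconds, time_fn=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.time_fn = time_fn
        self._timestamps = defaultdict(deque)

    def is_allowed(self, user_id):
        now = self.time_fn()
        window = self._timestamps[user_id]
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True

def test_window_resets_without_sleeping():
    fake_now = [0.0]
    limiter = RateLimiter(max_requests=2, window_seconds=1,
                          time_fn=lambda: fake_now[0])
    limiter.is_allowed(user_id="user_1")
    limiter.is_allowed(user_id="user_1")
    fake_now[0] = 1.1  # advance the fake clock past the window; no real sleep
    assert limiter.is_allowed(user_id="user_1")
```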

Edge Cases: The Hidden Value of TDD with AI

The most valuable tests to write are edge cases — the inputs that reveal incorrect assumptions.

Common edge cases that Claude implementations miss without explicit tests:

python
def test_zero_max_requests_blocks_everything():
    # Degenerate case: what if the limit is 0?
    limiter = RateLimiter(max_requests=0, window_seconds=60)
    assert not limiter.is_allowed(user_id="user_1")

def test_negative_window_raises_error():
    # Invalid input: should fail loudly, not silently
    with pytest.raises(ValueError, match="window_seconds must be positive"):
        RateLimiter(max_requests=10, window_seconds=-1)

def test_concurrent_requests_respect_limit():
    # Race condition: two requests simultaneously at the limit
    import threading
    limiter = RateLimiter(max_requests=1, window_seconds=60)
    results = []

    def make_request():
        results.append(limiter.is_allowed(user_id="user_1"))

    t1 = threading.Thread(target=make_request)
    t2 = threading.Thread(target=make_request)
    t1.start(); t2.start()
    t1.join(); t2.join()

    # Exactly one should be allowed
    assert results.count(True) == 1
    assert results.count(False) == 1

The concurrency test is the one Claude almost never gets right on the first try unless it is explicitly tested. Individual dictionary operations are atomic under CPython's GIL, but the check-then-update sequence inside is_allowed is not: two threads can both pass the limit check before either records its request. The test forces Claude to implement locking.
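A lock-guarded sketch of the fix, assuming the same sliding-window design as before — the lock makes the evict/check/append sequence atomic:

```python
import threading
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)
        self._lock = threading.Lock()  # serializes check-then-append

    def is_allowed(self, user_id):
        with self._lock:
            now = time.monotonic()
            window = self._timestamps[user_id]
            while window and now - window[0] >= self.window_seconds:
                window.popleft()
            if len(window) >= self.max_requests:
                return False
            window.append(now)
            return True
```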

The Confidence Gate

After the green phase, the FORGE skill applies a confidence scoring gate before considering the task complete.

Ask yourself four questions:

  1. Do the tests cover the full specification? Not just the happy path — edge cases, error conditions, concurrent scenarios.
  2. Are the tests independent? Each test should be able to run in any order without depending on state from a previous test.
  3. Do I understand every line of the implementation? If Claude generated code you cannot explain, that is a risk signal.
  4. Would I be comfortable if this went to production tomorrow? Honest answer.

Score the answers:

  • YES to all four → HIGH confidence; proceed.
  • NO to any security/auth question → ALWAYS invoke code review, regardless of the other answers.
  • NO to 1-2 non-security questions → MEDIUM confidence; invoke code review.
  • NO to 3+ → LOW confidence; run systematic debugging before proceeding.

The confidence gate prevents the most dangerous outcome of AI-assisted development: shipping code that passes tests but has non-obvious problems the tests do not cover.
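The gate is simple enough to express as a helper — a sketch with invented names (this is not a published FORGE API), scoring YES/NO answers to the four questions per the rules above:

```python
def confidence_gate(answers, security_related=False):
    """Map the four gate answers (True = YES) to an action.

    `answers` and `security_related` are illustrative names chosen for
    this sketch, not part of any official interface.
    """
    nos = sum(1 for yes in answers.values() if not yes)
    if security_related and nos > 0:
        return "invoke-code-review"      # security NO: always review
    if nos == 0:
        return "proceed"                 # HIGH confidence
    if nos <= 2:
        return "invoke-code-review"      # MEDIUM confidence
    return "systematic-debugging"        # LOW confidence
```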

TDD Across Languages

The pattern is the same across all languages — only the syntax changes.

TypeScript:

typescript
// tests/rateLimiter.test.ts
import { RateLimiter } from '../src/rateLimiter';

describe('RateLimiter', () => {
  it('allows requests under the limit', () => {
    const limiter = new RateLimiter({ maxRequests: 10, windowSeconds: 60 });
    for (let i = 0; i < 10; i++) {
      expect(limiter.isAllowed('user_1')).toBe(true);
    }
  });

  it('blocks requests over the limit', () => {
    const limiter = new RateLimiter({ maxRequests: 2, windowSeconds: 60 });
    limiter.isAllowed('user_1');
    limiter.isAllowed('user_1');
    expect(limiter.isAllowed('user_1')).toBe(false);
  });
});

Go:

go
// rate_limiter_test.go
func TestAllowsRequestsUnderLimit(t *testing.T) {
    limiter := NewRateLimiter(10, 60*time.Second)
    for i := 0; i < 10; i++ {
        if !limiter.IsAllowed("user_1") {
            t.Errorf("Expected request %d to be allowed", i+1)
        }
    }
}

func TestBlocksRequestsOverLimit(t *testing.T) {
    limiter := NewRateLimiter(2, 60*time.Second)
    limiter.IsAllowed("user_1")
    limiter.IsAllowed("user_1")
    if limiter.IsAllowed("user_1") {
        t.Error("Expected third request to be blocked")
    }
}

C (for embedded systems):

c
// tests/test_rate_limiter.c (using Unity test framework)
void test_allows_under_limit(void) {
    rate_limiter_t limiter;
    rate_limiter_init(&limiter, /*max_requests=*/10, /*window_ms=*/60000);

    for (int i = 0; i < 10; i++) {
        TEST_ASSERT_TRUE(rate_limiter_is_allowed(&limiter, USER_ID_1));
    }
}

void test_blocks_over_limit(void) {
    rate_limiter_t limiter;
    rate_limiter_init(&limiter, 2, 60000);
    rate_limiter_is_allowed(&limiter, USER_ID_1);
    rate_limiter_is_allowed(&limiter, USER_ID_1);
    TEST_ASSERT_FALSE(rate_limiter_is_allowed(&limiter, USER_ID_1));
}

When to Skip TDD (and When Not To)

TDD has a setup cost. For very small changes, the cost exceeds the benefit.

Skip TDD for:

  • Correcting a single typo in a string constant
  • Changing a config value
  • Adding a missing log statement
  • Renaming a variable in one file

Never skip TDD for:

  • Any new function or method
  • Any change to business logic
  • Any security-related code (auth, permissions, encryption)
  • Any code that handles money or user data
  • Any public API or interface change

The decision rule: if the task's complexity score is above 3, write the tests first.

Integration With the Broader System

The FORGE skill connects to three other skills:

SENTINEL uses the confidence gate output. If confidence is MEDIUM or LOW, it invokes code review automatically.

systematic-debugging takes over when tests still fail after multiple implementation attempts. If you have spent more than 20 minutes trying to get to green, stop patching the implementation and run systematic debugging to understand why.

requesting-code-review receives the test coverage report and implementation as input. Good coverage makes code review faster — the reviewer can focus on logic and architecture rather than hunting for untested paths.

Key Takeaway

Write the test before the implementation — always. This constraint encodes your contextual knowledge and prevents Claude from optimizing for local correctness at the expense of system correctness. The Red-Green-Refactor loop with Claude: you write the failing test, Claude writes the implementation, you refactor together with tests as the guard. The confidence gate determines whether to proceed or invoke code review. Edge cases are where AI implementations fail most often — test them explicitly.