FORGE: Test-Driven Development With AI
Test-driven development is counterintuitive the first time you encounter it: you write a test for code that does not yet exist, watch it fail, then write the code to make it pass. The constraint of writing the test first forces you to understand what the code should do before you think about how it does it. That clarity is the entire point.
With AI assistance, TDD becomes even more important — and the reason is the opposite of what most people expect.
Why AI Makes TDD More Important, Not Less
The common assumption is that AI-generated code is correct enough to skip formal verification. This assumption is wrong in a specific way: AI code is often locally correct but contextually wrong.
Locally correct means the code does what it looks like it does. A function that sorts a list will sort the list. A function that validates an email will perform the regex check. You can read the code and it is fine.
Contextually wrong means the code does not do what the system needs. The sort uses the wrong comparison function for your data structure. The email validator accepts formats your downstream system cannot handle. The auth middleware checks the wrong header. These failures are invisible to a reader who does not know the full context.
Tests — written by you, before Claude generates the implementation — encode your contextual knowledge. They say "in this system, under these conditions, with this input, the output must be exactly this." When Claude's implementation passes your tests, it has satisfied your context, not just its own interpretation of the task.
The second reason TDD matters more with AI: AI can generate wrong implementations with high confidence. Unlike a human engineer who slows down when uncertain, Claude generates text at the same rate regardless of confidence. A test suite catches wrong implementations before they are merged. Without one, the only signal that something is wrong is a production incident.
The Red-Green-Refactor Loop With Claude
The classic TDD loop has three phases. With Claude, each phase has a specific mechanic.
Red: Write the failing test first.
You write the test. Not Claude. This is the discipline that makes TDD work with AI. If you ask Claude to write both the test and the implementation, it tends to write an implementation first in its head and then write tests that pass it — defeating the entire purpose.
A failing test with Claude should look like this:
Notice what these tests encode: the class name, the constructor parameters, the method signature, the behavior across multiple users, and the time-based reset. This is the specification. Claude does not need to make decisions — it just needs to make these tests pass.
Run the tests now: pytest test_rate_limiter.py. They all fail with ModuleNotFoundError, because rate_limiter.py does not exist yet. That is correct. Red phase complete.
Green: Tell Claude to make the tests pass.
Give Claude the test file and this exact instruction:
"Make these tests pass. Do not change the tests. Implement rate_limiter.py from scratch."
This constraint — "do not change the tests" — is important. Without it, Claude will occasionally modify the tests to make them easier to pass rather than implementing what you actually specified. The tests are the spec. They are inviolable.
Claude generates an implementation. Run the tests again: pytest test_rate_limiter.py.
If all tests pass, green phase complete. If some fail, show Claude the test output:
"Tests still failing. Here is the output: [paste output]. Fix the implementation only."
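For the rate-limiter spec above, a green-phase implementation might look like this. It is a sketch only: Claude's actual output will vary, and the lock is one way (not the only way) to make the limiter safe under concurrent calls:

```python
# rate_limiter.py: one implementation that satisfies the spec.
import threading
import time

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps: dict[str, list[float]] = {}
        # A bare dict check-then-update is not atomic; serialize access.
        self._lock = threading.Lock()

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        with self._lock:
            recent = [t for t in self._timestamps.get(user_id, [])
                      if now - t < self.window_seconds]
            allowed = len(recent) < self.max_requests
            if allowed:
                recent.append(now)
            self._timestamps[user_id] = recent
            return allowed
```

Using time.monotonic() rather than time.time() keeps the window immune to wall-clock adjustments.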
Refactor: Clean up without breaking tests.
Once all tests are green, ask Claude to refactor for clarity, performance, or style — with the tests as the regression guard:
"The tests all pass. Now refactor the implementation for clarity. Run tests after each change."
The tests are your safety net. Any refactoring that breaks a test has introduced a regression.
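One refactor the green tests can guard, assuming the RateLimiter interface from the tests: store each user's timestamps in a deque and evict expired entries incrementally, so memory stays bounded without rebuilding a list on every call. A sketch:

```python
# rate_limiter.py after refactoring: same observable behavior, but expired
# timestamps are evicted incrementally from the front of a deque.
import threading
import time
from collections import deque

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict[str, deque] = {}
        self._lock = threading.Lock()

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        with self._lock:
            hits = self._hits.setdefault(user_id, deque())
            while hits and now - hits[0] >= self.window_seconds:
                hits.popleft()  # evict entries outside the window
            if len(hits) < self.max_requests:
                hits.append(now)
                return True
            return False
```

If every test stays green after this change, the refactor preserved behavior; if any goes red, it introduced a regression.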
Edge Cases: The Hidden Value of TDD with AI
The most valuable tests to write are edge cases — the inputs that reveal incorrect assumptions.
Common edge cases that Claude implementations miss without explicit tests:
- The request that lands exactly at the limit (request N allowed, request N+1 denied)
- The first request from a user the limiter has never seen
- A request arriving exactly at the window boundary
- A limit of zero (should every request be denied?)
- Concurrent requests from the same user arriving on multiple threads
The concurrent-request test is the one Claude almost never gets right on the first try. The check-then-update sequence on a shared in-memory dictionary is not atomic in Python, even under the GIL. The test forces Claude to add locking.
The Confidence Gate
After the green phase, the FORGE skill applies a confidence scoring gate before considering the task complete.
Ask yourself four questions:
- Do the tests cover the full specification? Not just the happy path — edge cases, error conditions, concurrent scenarios.
- Are the tests independent? Each test should be able to run in any order without depending on state from a previous test.
- Do I understand every line of the implementation? If Claude generated code you cannot explain, that is a risk signal.
- Would I be comfortable if this went to production tomorrow? Honest answer.
Score: YES to all four → HIGH confidence, proceed. NO to any security/auth question → ALWAYS invoke code review regardless of other answers. NO to 1-2 non-security questions → MEDIUM, invoke code review. NO to 3+ → LOW, run systematic debugging before proceeding.
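The scoring rule can be written out as a small decision function. This is illustrative only: FORGE applies the gate as a checklist, and the function and parameter names here are invented for the sketch:

```python
def confidence_gate(answers: dict[str, bool], security_related: bool = False) -> str:
    """Map the four yes/no gate answers to an action.

    answers: question -> True (yes) / False (no).
    security_related: True if the task touches auth, permissions, or crypto.
    """
    noes = sum(1 for ok in answers.values() if not ok)
    if security_related and noes > 0:
        return "REVIEW"   # any security-related NO always triggers code review
    if noes == 0:
        return "HIGH"     # proceed
    if noes <= 2:
        return "MEDIUM"   # invoke code review
    return "LOW"          # run systematic debugging before proceeding
```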
The confidence gate prevents the most dangerous outcome of AI-assisted development: shipping code that passes tests but has non-obvious problems the tests do not cover.
TDD Across Languages
The pattern is the same across all languages — only the syntax changes.
TypeScript: the same spec expressed with Jest or Vitest, using describe/it blocks and expect assertions; the red phase fails on an unresolved import.
Go: test functions in the standard testing package (func TestRateLimiter(t *testing.T)); the red phase is a compile error until the type exists.
C (for embedded systems): a harness such as Unity or CppUTest; the red phase is typically a link error against the missing implementation.
When to Skip TDD (and When Not To)
TDD has a setup cost. For very small changes, the cost exceeds the benefit.
Skip TDD for:
- Correcting a single typo in a string constant
- Changing a config value
- Adding a missing log statement
- Renaming a variable in one file
Never skip TDD for:
- Any new function or method
- Any change to business logic
- Any security-related code (auth, permissions, encryption)
- Any code that handles money or user data
- Any public API or interface change
The decision rule: if the task's complexity score is above 3, write the tests first.
Integration With the Broader System
The FORGE skill connects to three other skills:
SENTINEL uses the confidence gate output. If confidence is MEDIUM or LOW, it invokes code review automatically.
systematic-debugging takes over when tests still fail after multiple implementation attempts. If you have been at the red phase for more than 20 minutes, stop trying to fix the implementation and run systematic debugging to understand why.
requesting-code-review receives the test coverage report and implementation as input. Good coverage makes code review faster — the reviewer can focus on logic and architecture rather than hunting for untested paths.
Key Takeaway
Write the test before the implementation — always. This constraint encodes your contextual knowledge and prevents Claude from optimizing for local correctness at the expense of system correctness. The Red-Green-Refactor loop with Claude: you write the failing test, Claude writes the implementation, you refactor together with tests as the guard. The confidence gate determines whether to proceed or invoke code review. Edge cases are where AI implementations fail most often — test them explicitly.