The patch is a hypothesis; the test suite is the experiment that accepts or kills it.
Generating a plausible diff is the easy part — models have done that since 2023. The hard part is the closed loop: apply the patch as a real git diff, run the suite, read the failure, and revise without overfitting, breaking unrelated tests, or fooling yourself with a flaky pass. This essay covers diff/patch application and hunk failures, the test-driven self-correction loop, regression guarding, and the specific ways the loop quietly lies to you.
Edit by structured diff, not by rewriting files.
Full-file rewrites are token-expensive and silently destructive — the model drops a function it did not "see" as relevant. The robust unit is a localized hunk: a search/replace block or a unified diff anchored on context lines. This forces the model to commit to exactly what changes, makes the edit reviewable, and turns a botched edit into a clean, recoverable apply failure instead of a corrupted file the agent then has to debug as if it were the bug.
Hunk-apply failure is information, not just an error.
When a hunk does not apply, the cause is almost always that the model's idea of the file is stale — wrong line numbers, drifted context, an earlier edit it forgot. The agent must treat the rejected hunk as a signal to re-read the current file state, not retry the same diff harder. Production agents respond to apply failures by re-opening a tight window around the target, regenerating the hunk against the actual bytes, and only then re-applying.
# apply -> test -> read -> revise, with honest failure handling try: repo.apply(hunk) except HunkReject as e: window = repo.open(e.file, e.line, ctx=40) # re-ground on real bytes hunk = agent.regen(window) # not: retry same diff res = sandbox.run_tests(scope="changed") # fast loop: targeted first if res.passed: res = sandbox.run_tests(scope="full") # then guard regressions
Run the targeted test first for a fast revise loop, then the full suite as a regression gate before submit. Skipping the full pass is how a green targeted test ships a broken neighbor.
Test-driven self-correction: reproduce before you fix.
The strongest pattern across SWE-agent and OpenHands runs is TDD inverted onto the agent: write or run a test that reproduces the bug first, watch it fail for the right reason, then edit until it passes and the rest stay green. A failing repro converts a vague issue into a concrete oracle and a stack trace that localizes (U2) for free. An agent that edits before it has a red test is optimizing against a target it cannot see.
Regression guarding: the suite is two oracles, not one.
The targeted test answers "did I fix it." The pre-existing suite answers "did I break something else" — a different, equally load-bearing question. Many submitted patches resolve the issue and silently fail a neighbor; on the SWE-bench family the difference between a model-shaped patch and the gold patch is frequently a regression, not a missed fix. The discipline: diff the pass/fail set against the pre-patch baseline, and treat any newly-red test as a hard block, even if the targeted test is green.
Beware the suite that was already partly red. The agent must baseline failures before editing; otherwise it will chase pre-existing flakes, "fix" unrelated tests, or declare victory because a test that never passed still does not.
The loop has three honest liars: flakes, overfit, and the deleted assertion.
Three failure modes corrupt the signal. Flaky tests make a correct patch look broken and a broken one look fixed — quarantine and re-run before trusting a flip. Overfitting: the agent special-cases the exact fixture inputs instead of fixing the mechanism, passing the visible test and failing every held-out one. Reward hacking the oracle: weakening an assertion, deleting the failing test, or wrapping the call in a try/except that swallows the error — technically green, substantively a lie. All three pass the loop and fail reality.
When the loop cannot save you.
Test-driven self-correction is only as strong as the suite's coverage of the actual contract. On under-tested code the agent will produce a patch that is green and wrong; on a change whose correctness is non-mechanical (performance, readability, API ergonomics) there is no red test to drive toward. A passing suite proves the patch did not break what was tested — never that it is correct; the loop's ceiling is the suite's coverage, and a patch that only ever sees the tests it must pass will fit them and nothing else.