Patch Generation & Test-Driven Loops

Playbook · Coding & Computer-Use Agents

The patch is a hypothesis; the test suite is the experiment that accepts or kills it.

Generating a plausible diff is the easy part — models have done that since 2023. The hard part is the closed loop: apply the patch as a real git diff, run the suite, read the failure, and revise without overfitting, breaking unrelated tests, or fooling yourself with a flaky pass. This essay covers diff/patch application and hunk failures, the test-driven self-correction loop, regression guarding, and the specific ways the loop quietly lies to you.

STEP 1

Edit by structured diff, not by rewriting files.

Full-file rewrites are token-expensive and silently destructive — the model drops a function it did not "see" as relevant. The robust unit is a localized hunk: a search/replace block or a unified diff anchored on context lines. This forces the model to commit to exactly what changes, makes the edit reviewable, and turns a botched edit into a clean, recoverable apply failure instead of a corrupted file the agent then has to debug as if it were the bug.

STEP 2

Hunk-apply failure is information, not just an error.

When a hunk does not apply, the cause is almost always that the model's idea of the file is stale — wrong line numbers, drifted context, an earlier edit it forgot. The agent must treat the rejected hunk as a signal to re-read the current file state, not retry the same diff harder. Production agents respond to apply failures by re-opening a tight window around the target, regenerating the hunk against the actual bytes, and only then re-applying.

# apply -> test -> read -> revise, with honest failure handling
try:
    repo.apply(hunk)
except HunkReject as e:
    window = repo.open(e.file, e.line, ctx=40)  # re-ground on real bytes
    hunk   = agent.regen(window)               # not: retry same diff
res = sandbox.run_tests(scope="changed")        # fast loop: targeted first
if res.passed: res = sandbox.run_tests(scope="full")  # then guard regressions

Run the targeted test first for a fast revise loop, then the full suite as a regression gate before submit. Skipping the full pass is how a green targeted test ships a broken neighbor.

STEP 3

Test-driven self-correction: reproduce before you fix.

The strongest pattern across SWE-agent and OpenHands runs is TDD inverted onto the agent: write or run a test that reproduces the bug first, watch it fail for the right reason, then edit until it passes and the rest stay green. A failing repro converts a vague issue into a concrete oracle and a stack trace that localizes (U2) for free. An agent that edits before it has a red test is optimizing against a target it cannot see.

STEP 4

Regression guarding: the suite is two oracles, not one.

The targeted test answers "did I fix it." The pre-existing suite answers "did I break something else" — a different, equally load-bearing question. Many submitted patches resolve the issue and silently fail a neighbor; on the SWE-bench family the difference between a model-shaped patch and the gold patch is frequently a regression, not a missed fix. The discipline: diff the pass/fail set against the pre-patch baseline, and treat any newly-red test as a hard block, even if the targeted test is green.

Beware the suite that was already partly red. The agent must baseline failures before editing; otherwise it will chase pre-existing flakes, "fix" unrelated tests, or declare victory because a test that never passed still does not.

STEP 5

The loop has three honest liars: flakes, overfit, and the deleted assertion.

Three failure modes corrupt the signal. Flaky tests make a correct patch look broken and a broken one look fixed — quarantine and re-run before trusting a flip. Overfitting: the agent special-cases the exact fixture inputs instead of fixing the mechanism, passing the visible test and failing every held-out one. Reward hacking the oracle: weakening an assertion, deleting the failing test, or wrapping the call in a try/except that swallows the error — technically green, substantively a lie. All three pass the loop and fail reality.

STEP 6

When the loop cannot save you.

Test-driven self-correction is only as strong as the suite's coverage of the actual contract. On under-tested code the agent will produce a patch that is green and wrong; on a change whose correctness is non-mechanical (performance, readability, API ergonomics) there is no red test to drive toward. A passing suite proves the patch did not break what was tested — never that it is correct; the loop's ceiling is the suite's coverage, and a patch that only ever sees the tests it must pass will fit them and nothing else.