Coding Agent Architecture

Playbook · Coding & Computer-Use Agents

A coding agent is a localize–edit–verify loop wrapped around a model, not a code generator.

The thing that distinguishes a 2025-era coding agent — SWE-agent, OpenHands, Claude Code, Devin-class systems — from a code-completion model is not the model: it is the loop. The agent reads a repository, localizes a change, edits files, runs the test suite, reads the failing output, and corrects itself, autonomously, until a checkable condition holds. This essay is the anatomy of that loop: the agent–computer interface, why agentic beats pipeline coding, where the loop fails, and what it costs.

STEP 1

The unit of work is a verifiable diff, not a token stream.

A completion model emits text; a coding agent must produce a change to a repository that survives pytest. That reframing changes everything downstream. The agent does not need to be right in one shot — it needs to localize where the change goes, edit there, and verify against an external oracle (tests, type checker, build, linter). The model's job is reduced from "know the answer" to "drive a search whose fitness function is the test suite." Most of the engineering is in making that fitness function cheap, observable, and trustworthy — not in the prompt.
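As a minimal sketch of the verify step, assume the oracle is any command whose exit code signals pass or fail. The names `Observation` and `run_oracle` are hypothetical, and the 20-line output tail and 300-second timeout are arbitrary illustrative defaults, not values taken from any particular agent.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Observation:
    passed: bool      # did the check exit with status 0?
    exit_code: int    # raw exit status, or -1 on timeout
    tail: str         # last few lines of combined output, kept short on purpose


def run_oracle(cmd, cwd=None, timeout_s=300, tail_lines=20):
    """Run a check command (tests, type checker, build, linter) and summarize it."""
    try:
        proc = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so tracebacks land in one stream
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung check is treated as a failure, not as "no information".
        return Observation(False, -1, f"timed out after {timeout_s}s")
    lines = proc.stdout.splitlines()
    return Observation(proc.returncode == 0, proc.returncode, "\n".join(lines[-tail_lines:]))


if __name__ == "__main__":
    # In a real repo the command might be [sys.executable, "-m", "pytest", "-x", "-q"].
    obs = run_oracle([sys.executable, "-c", "assert 1 + 1 == 3, 'arithmetic is broken'"])
    print(obs.passed, obs.exit_code)
    print(obs.tail)
```

Returning only the tail of the output is a deliberate choice in this sketch: the failing assertion usually sits at the end of a traceback, and a bounded observation keeps the oracle cheap to read on every iteration.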

STEP 2

The agent–computer interface is the real product surface.

SWE-agent's central finding was that a model's effectiveness depends heavily on the interface it acts through, not only on raw capability. A shell with raw sed and 2000-line file dumps wastes context and produces malformed edits; a purpose-built agent–computer interface (ACI), with a search that returns ranked locations, an open with a windowed viewport, and an edit that re-lints and shows the result, turns the same model into a far stronger agent. The lesson generalizes: every tool should return a compact, structured observation that closes the perceive–act gap, not a raw firehose.

# the loop, stripped to its skeleton
loc   = agent.localize(issue, repo)        # where does the change go?
patch = agent.edit(loc)                   # structured edit, not raw write
obs   = sandbox.run_tests(patch)          # external oracle
while not obs.passed():
    patch = agent.revise(patch, obs.failures)  # read output, self-correct
    obs   = sandbox.run_tests(patch)

Design every tool's return value as carefully as its arguments. The agent sees only what the observation tells it: an edit tool that silently succeeds teaches the agent nothing, while one that echoes the post-edit window with a lint delta shows it exactly what changed and whether the change is well-formed.
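One way to picture such an edit tool is the sketch below. It assumes the target file is Python and uses `ast.parse` as a stand-in for a real linter, so it reports syntax errors only rather than a full lint delta. The function name `edit_lines` and the three-line context window are hypothetical choices for illustration.

```python
import ast


def edit_lines(source, start, end, replacement, context=3):
    """Replace 1-indexed lines start..end (inclusive) and report what happened.

    Returns (new_source, observation). If the edited text no longer parses as
    Python, the edit is rejected and the original source is returned unchanged.
    """
    lines = source.splitlines()
    if not (1 <= start <= end <= len(lines)):
        return source, f"REJECTED: range {start}-{end} is outside 1-{len(lines)}"

    new_lines = replacement.splitlines()
    candidate = lines[: start - 1] + new_lines + lines[end:]
    candidate_src = "\n".join(candidate) + "\n"

    try:
        ast.parse(candidate_src)  # syntax check only; a real tool would run a linter
    except SyntaxError as exc:
        return source, f"REJECTED: syntax error at line {exc.lineno}: {exc.msg}"

    # Echo a small numbered window around the edit so the result is visible in place.
    lo = max(1, start - context)
    hi = min(len(candidate), start + len(new_lines) - 1 + context)
    window = "\n".join(f"{n:>4} | {candidate[n - 1]}" for n in range(lo, hi + 1))
    return candidate_src, f"APPLIED: lines {start}-{end} replaced\n{window}"


if __name__ == "__main__":
    src = "def add(a, b):\n    return a - b\n"
    _, obs = edit_lines(src, 2, 2, "    return a + b")
    print(obs)
    _, obs = edit_lines(src, 2, 2, "    return (a +")
    print(obs)
```

Both outcomes produce an observation the agent can act on: an applied edit shows the surrounding lines, and a rejected edit names the error while leaving the file untouched.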

STEP 3

Agentic beats pipeline because the world talks back.

A pipeline coder (retrieve context, generate the whole patch, apply) is a single forward pass with no feedback. An agentic coder interleaves action and observation, so a wrong hypothesis is refuted by a stack trace within one step instead of being shipped. The gap shows up across the SWE-bench family: feedback-driven agents resolve issues that one-shot generation cannot, precisely because real bugs require discovering the failing mechanism, not just expressing the fix. Pipeline still wins for narrow, well-specified, single-file transforms where there is nothing to discover; that is the honest boundary, not a universal verdict.

STEP 4

Localization is the step that decides the outcome.

Across SWE-agent and OpenHands ablations, the dominant failure is not bad code — it is editing the wrong place. If the agent localizes correctly, a capable model usually produces a passable patch; if it localizes wrong, no amount of revision saves it because the test failure points away from the real defect. This makes repo navigation (U2) the load-bearing subsystem and argues for spending step budget on grounding the change — reproducing the bug, reading the call site, confirming the hypothesis — before the first edit, not after the third failed one.
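To make "localize" concrete, here is a deliberately simple lexical sketch that ranks candidate lines by how many distinct terms from the issue text they contain. The function `rank_locations` and the in-memory `files` mapping are hypothetical. A real agent would more likely combine this kind of search with symbol lookup, stack-trace parsing, or a reproduction of the bug.

```python
import re

_IDENT = re.compile(r"[A-Za-z_]\w*")


def rank_locations(files, query, top_k=5):
    """Rank (path, line_no, text) hits by how many distinct query terms a line contains.

    `files` maps a path to its text. Matching is case-insensitive and works on
    whole identifiers, so "total" matches `total` but not `subtotal`.
    """
    terms = {t.lower() for t in _IDENT.findall(query)}
    hits = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            words = {w.lower() for w in _IDENT.findall(line)}
            score = len(terms & words)
            if score:
                hits.append((score, path, line_no, line.strip()))
    # Highest score first; ties broken by path and line number for determinism.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return [(path, line_no, line) for _, path, line_no, line in hits[:top_k]]


if __name__ == "__main__":
    repo = {
        "billing/invoice.py": "def total(items):\n    return sum(i.price for i in items)\n",
        "billing/tax.py": "def apply_tax(total, rate):\n    return total * rate\n",
    }
    for hit in rank_locations(repo, "apply_tax returns wrong total"):
        print(hit)
```

The output is a short ranked list of places to read, not a decision about where to edit. Confirming the top hit by reproducing the failure remains a separate step.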

STEP 5

The loop needs a budget, a stop condition, and a memory.

An unbounded loop is a money fire and a thrash machine. Production coding agents impose three controls: a step/token budget (and a cost ceiling per task), an explicit stop condition (tests green, or "submit best attempt," or "ask the human"), and a working memory that survives context compaction so the agent does not re-derive the repo layout every turn. The hardest of these is the stop condition: an agent that cannot tell "done" from "stuck" will either quit on a flaky test or grind forever on an unsolvable one.
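The three controls can be sketched as a small driver. Everything here is illustrative: the names `Budget`, `Memory`, `Stop`, and `run_loop` are hypothetical, the default limits are arbitrary, and "stuck" is approximated crudely as the same failure signature repeating several times in a row.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stop(Enum):
    TESTS_GREEN = "tests green"
    BUDGET_EXHAUSTED = "submit best attempt"
    STUCK = "ask the human"


@dataclass
class Budget:
    max_steps: int = 20
    max_cost_usd: float = 2.0
    stuck_after: int = 3   # identical failures in a row before escalating


@dataclass
class Memory:
    # Facts meant to outlive context compaction, e.g. repo layout or the repro command.
    notes: dict = field(default_factory=dict)


def run_loop(step, budget, memory):
    """Drive `step(memory) -> (passed, failure_signature, cost_usd)` under the budget.

    Returns (stop_reason, steps_taken, total_cost_usd).
    """
    spent, last_sig, repeats = 0.0, None, 0
    for n in range(1, budget.max_steps + 1):
        passed, sig, cost = step(memory)
        spent += cost
        if passed:
            return Stop.TESTS_GREEN, n, spent
        # The same failure repeating suggests thrash rather than progress.
        repeats = repeats + 1 if sig == last_sig else 1
        last_sig = sig
        if repeats >= budget.stuck_after:
            return Stop.STUCK, n, spent
        if spent >= budget.max_cost_usd:
            return Stop.BUDGET_EXHAUSTED, n, spent
    return Stop.BUDGET_EXHAUSTED, budget.max_steps, spent
```

Keeping the stop reasons as an explicit enum forces the caller to handle "stuck" and "out of budget" differently from "done", which is the distinction the stop condition exists to make.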

Green tests are necessary, not sufficient. An agent optimizing against the suite can reach green by deleting the failing test, weakening an assertion, or special-casing the fixture. The oracle is only as trustworthy as it is tamper-resistant, so treat suite mutation as a first-class failure mode, not an edge case.

STEP 6

When NOT to reach for an agent.

The loop's power is proportional to the quality of its oracle. On a repo with no tests, slow or non-deterministic tests, or a change whose correctness is not mechanically checkable (a refactor's "is this cleaner?", a UX judgement), the verify step is blind and the agent degenerates into expensive guessing. An agentic coder is only as good as the cheapest trustworthy signal it can get per loop iteration. No oracle, no agent: reach for a reviewed pipeline edit instead.
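Before committing to an agent, the oracle itself can be probed. The following sketch runs a check several times against unchanged code and reports whether the results agree and how slow the slowest run was. The function name `oracle_report`, the three-run default, and the 60-second threshold are all hypothetical, and agreement across a few runs is evidence of determinism, not proof.

```python
import time


def oracle_report(check, runs=3, max_seconds=60.0):
    """Call `check() -> bool` several times on unchanged code and summarize the signal.

    Disagreement between runs means the signal is non-deterministic; a slow
    run means every loop iteration will be expensive.
    """
    results, slowest = [], 0.0
    for _ in range(runs):
        started = time.perf_counter()
        results.append(bool(check()))
        slowest = max(slowest, time.perf_counter() - started)
    deterministic = len(set(results)) == 1
    fast_enough = slowest <= max_seconds
    return {
        "deterministic": deterministic,
        "fast_enough": fast_enough,
        "slowest_seconds": slowest,
        "usable": deterministic and fast_enough,
    }
```

If `usable` comes back false, the text's advice applies: fix or replace the oracle first, or fall back to a reviewed pipeline edit.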