Evaluating Coding Agents

Playbook · Coding & Computer-Use Agents

A SWE-bench number is a measurement of a harness on a contaminated benchmark — read it like one.

“Resolves 70% of SWE-bench” is one of the most-quoted and least-understood numbers in the field. It bundles together the model, the scaffold, the harness, the choice between pass@k and resolve rate, and a benchmark with documented contamination. This essay covers the SWE-bench family, what the metrics actually mean, harness sensitivity, the contamination problem, and how to build the private eval set that yields the only number you should trust.

STEP 1

The SWE-bench family measures issue resolution, not code quality.

SWE-bench pairs real GitHub issues with their merged fixes and test diffs; an agent “resolves” an instance if its patch makes the hidden fail-to-pass tests pass without breaking the pass-to-pass tests. SWE-bench Verified is the human-filtered, 500-task Python subset that became the headline. SWE-bench Pro is the larger, harder, multi-language, contamination-resistant successor that OpenAI moved to after auditing Verified. The metric is binary per task: the tests pass or they do not. It says nothing about whether the patch is good code.
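
The resolve criterion described above can be sketched as a small predicate. This is a minimal illustration, not the official SWE-bench harness: the function name `is_resolved`, the shape of the `results` mapping, and the rule that a missing test counts as a failure are all assumptions made for the example.

```python
from typing import Dict, Iterable


def is_resolved(results: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Binary per-task verdict after the agent's patch has been applied.

    `results` maps a test id to True (passed) or False (failed or errored).
    A test id missing from `results` is treated as a failure (assumption).
    """
    # Every test that failed before the fix must now pass.
    fixed = all(results.get(test, False) for test in fail_to_pass)
    # Every test that passed before the fix must still pass.
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken
```

Note what the predicate never looks at: the patch itself. Two patches of very different quality get the same verdict as long as the same tests go green.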

STEP 2

pass@k flatters; resolve rate at k=1 is the honest number.

pass@k counts a task as solved if any of k attempts works. That is useful for measuring headroom but dishonest as a deployment number, because production rarely gets to silently try ten times and pick the winner: there is no oracle to pick by. Always check which metric is reported: a glossy “pass@5” and a sober “resolve@1” on the same system can differ by twenty-plus points. Worse still is a number whose metric is not stated at all; assume the larger advertised number is the more generous metric until proven otherwise.

# the same system, three legitimate, very different numbers (N = number of tasks)
resolve_at_1 = solved_in_one_attempt / N   # the deployment number
pass_at_5    = solved_in_any_of_5 / N      # headroom, not capability
pass_caret_5 = solved_in_all_of_5 / N      # pass^5: reliability under repetition
# quote resolve@1; pass@k without k and the selector is marketing
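
The three ratios above can be computed from one table of per-attempt outcomes. The sketch below assumes a hypothetical `attempts` mapping from task id to a list of booleans in run order, and it uses the plain empirical fractions over the first k attempts rather than the unbiased pass@k estimator some papers report. The function name `summarize` is illustrative.

```python
from typing import Dict, List


def summarize(attempts: Dict[str, List[bool]], k: int) -> Dict[str, float]:
    """Compute resolve@1, pass@k and pass^k from per-task attempt outcomes."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("need at least one task")
    if any(len(outcomes) < k for outcomes in attempts.values()):
        raise ValueError("every task needs at least k attempts")

    n = len(attempts)
    # Solved on the first attempt: what a single production run would see.
    first = sum(outcomes[0] for outcomes in attempts.values())
    # Solved by at least one of the first k attempts.
    any_k = sum(any(outcomes[:k]) for outcomes in attempts.values())
    # Solved by every one of the first k attempts.
    all_k = sum(all(outcomes[:k]) for outcomes in attempts.values())
    return {
        "resolve@1": first / n,
        f"pass@{k}": any_k / n,
        f"pass^{k}": all_k / n,
    }
```

On any such table pass^k is at most resolve@1, which is at most pass@k, so the spread between the three is itself a rough read on how repeatable the system is.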

A leaderboard row is model + scaffold + harness + retry policy + which test subset, scored on a public set. Comparing two rows that differ in any of those is comparing nothing — the agent framework often moves the number more than the model does.

STEP 3

Harness sensitivity: the scaffold is a confound, not a constant.

The same model under SWE-agent, OpenHands, and a bespoke loop produces materially different resolve rates: tool design, retry budget, localization strategy, and prompt often move the number more than a model version bump does. This is the U1 lesson restated as a measurement hazard: a benchmark number attributes to “the model” an outcome that is mostly “the harness someone built around it.” Hold the scaffold fixed, or you are A/B-testing two different products and calling it a model comparison.
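
One way to enforce "hold the scaffold fixed" is to record every run's configuration and refuse to call something a model comparison when anything other than the model differs. The sketch below is illustrative only: the `RunConfig` fields mirror the leaderboard-row ingredients listed earlier, and all names and values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RunConfig:
    model: str
    scaffold: str
    harness: str
    retry_budget: int
    test_subset: str


def confounds(a: RunConfig, b: RunConfig) -> List[str]:
    """Return the non-model fields on which two runs differ.

    An empty list means the two runs differ, at most, in the model,
    so the comparison can fairly be attributed to the model.
    """
    return [
        f.name
        for f in fields(RunConfig)
        if f.name != "model" and getattr(a, f.name) != getattr(b, f.name)
    ]
```

If `confounds(run_a, run_b)` is non-empty, the honest description of the result is "product A versus product B", with the listed fields named alongside the scores.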

STEP 4

Contamination is not hypothetical — it is documented.

SWE-bench Verified instances are pre-cutoff public GitHub issues with public merged fixes. Audits found that frontier models can reproduce gold patches near-verbatim on some tasks, and the maintainers of SWE-bench Verified themselves stopped treating it as a clean signal. The field moved toward Pro and toward continuously refreshed sets (the SWE-rebench line) for exactly this reason. Treat any score on an old public benchmark as an upper bound inflated by memorization, not a measurement of generalization.
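
A crude screen for near-verbatim reproduction is to compare each generated patch with the gold patch line by line. The sketch below uses Python's standard `difflib`; it is not the method used in any published audit, and the function names and the 0.9 threshold are assumptions chosen for illustration. A high score is a reason to look closer, not proof of memorization, since a small fix may have only one natural form.

```python
import difflib
from typing import List


def _normalized_lines(patch: str) -> List[str]:
    # Drop blank lines and edge whitespace so formatting noise is ignored.
    return [line.strip() for line in patch.splitlines() if line.strip()]


def patch_similarity(candidate: str, gold: str) -> float:
    """Line-level similarity in [0, 1] between two patches."""
    matcher = difflib.SequenceMatcher(
        None, _normalized_lines(candidate), _normalized_lines(gold),
        autojunk=False,
    )
    return matcher.ratio()


def looks_memorized(candidate: str, gold: str, threshold: float = 0.9) -> bool:
    """Flag a patch for manual inspection; the threshold is arbitrary."""
    return patch_similarity(candidate, gold) >= threshold
```

Running this over a full benchmark and reading the flagged instances by hand gives a rough sense of how much of a headline score could be recall.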

STEP 5

The only number you should trust is your own private set.

Public benchmarks rank systems; they do not predict performance on your codebase, which has your conventions, your test idioms, and your build. Build a private eval from your own resolved issues: each instance is the diff, the fail-to-pass tests, and the repo state, dated and held out of any training surface. Refresh it from new issues so it cannot rot or leak, and report resolve@1 under your fixed harness. That number, on your code, post-cutoff, is the only one that forecasts production.
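
A private eval instance as described above could be recorded along these lines. This is a sketch under assumptions: the `EvalInstance` fields and the `held_out` helper are hypothetical names, and `cutoff` stands for whatever training-data cutoff you attribute to the model under test.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class EvalInstance:
    issue_id: str
    base_commit: str                 # repo state before the fix
    gold_diff: str                   # the merged fix, kept out of agent context
    fail_to_pass: Tuple[str, ...]    # tests the fix must turn green
    resolved_on: date                # when the fix was merged


def held_out(instances: List[EvalInstance], cutoff: date) -> List[EvalInstance]:
    """Keep instances resolved strictly after `cutoff` that have tests to check."""
    return [
        inst for inst in instances
        if inst.resolved_on > cutoff and inst.fail_to_pass
    ]
```

Re-running `held_out` with a later cutoff whenever the model changes is one simple way to keep the set post-cutoff as new issues are added.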

STEP 6

What the number still cannot tell you.

Even a clean private resolve@1 measures “made the tests pass.” It does not measure “wrote code a senior would merge,” maintainability, or whether the agent weakened an assertion to get green (U3). It is necessary, not sufficient; pair it with human review of a sample and trajectory inspection. Trust no coding-agent number whose harness, k, selector, and contamination status you cannot name. The public leaderboard ranks scaffolds on a leaked test, and the only honest score is resolve@1 on your own post-cutoff issues.
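
A cheap first filter when choosing which patches to review by hand is to flag any patch that edits test files, since that is where a weakened assertion would live. The sketch below assumes you already have the list of paths a patch touches (for example from `git diff --name-only`) and that tests follow common Python naming conventions. Both the conventions and the function names are assumptions, and a flag means "look at this", not "this patch cheated".

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def is_test_path(path: str) -> bool:
    """Heuristic: the file sits in a test/tests directory or is named like a test."""
    p = PurePosixPath(path)
    in_test_dir = any(part in ("test", "tests") for part in p.parts[:-1])
    named_like_test = p.name.startswith("test_") or p.name.endswith("_test.py")
    return in_test_dir or named_like_test


def flag_for_review(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that look like test files, in input order."""
    return [path for path in changed_paths if is_test_path(path)]
```

Patches that come back with a non-empty list are good candidates for the human-review sample, alongside a random draw of unflagged ones.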