Reading Agent Benchmarks Critically

Reading agent benchmarks critically — what the leaderboard is actually measuring.

"State of the art on SWE-bench" is a marketing sentence, not an engineering fact about your system. Every agent benchmark measures one narrow operationalization of capability under one harness, on tasks that may be contaminated, with a score that may not survive a prompt change — and almost certainly does not predict performance on your task. This essay dissects what the major benchmarks actually test, the failure modes that make leaderboard rank misleading, and why the only number that matters is the one from the small eval set you build yourself.

STEP 1

What each major benchmark actually measures.

  • SWE-bench — resolve a real GitHub issue so the repo's hidden test suite passes. Measures localize-and-patch on mature Python repos with strong test coverage. Does not measure greenfield design, languages without good test harnesses, or anything where "the tests pass" is not the spec. SWE-bench Verified strips ambiguous/broken tasks — a cleaner but easier subset.
  • GAIA — real-world assistant questions needing multi-hop web browsing, file handling, and multimodal reasoning, with a single unambiguous answer. Measures tool-use orchestration and retrieval, not code or long-horizon autonomy.
  • τ-bench — tool-agent in a simulated domain (retail/airline) with a policy document and a simulated user, scored on database end-state plus a pass^k consistency metric. The rare benchmark that measures reliability, not just peak capability — and agents that look strong elsewhere fall apart on its pass^8.
  • WebArena — long-horizon tasks in self-hosted web apps (an e-commerce site, a CMS, a forum). Measures realistic multi-step web operation with functional success checks. Scores here are low and that is the honest signal: real web agency is hard.
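The pass^k metric mentioned for τ-bench is worth making concrete, because it behaves very differently from the more familiar pass@k: pass^k asks whether all k trials of a task succeed, while pass@k asks whether at least one does. A minimal sketch, assuming n independent trials per task of which c succeeded; the function names are illustrative, and the combinatorial estimators shown are the standard ones rather than any benchmark's official scoring code:

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), i.e. pass^k (reliability)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds), i.e. pass@k (peak)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 8 trials, 6 of them successful.
n, c = 8, 6
for k in (1, 2, 4, 8):
    print(f"k={k}  pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
```

With these made-up counts the two metrics agree at k=1 and then diverge: pass@k climbs toward 1 while pass^k falls toward 0. That is the sense in which an agent can look strong on peak capability and weak on reliability.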

Read the operationalization, not the headline. Each benchmark answers a precisely scoped question; the leaderboard collapses it into one number and the press release drops the scope entirely.

STEP 2

Contamination: the benchmark may be in the weights.

Public benchmarks built from public data leak into pretraining. SWE-bench tasks are GitHub issues with public PRs; GAIA questions are searchable. A frontier model may have seen the issue, the fixing commit, and the discussion. You cannot always tell, from the outside, whether a high score is capability or recall.

Heuristics for suspicion: a sharp jump exactly at a model's training-cutoff boundary; performance that craters when entities are renamed or the task is paraphrased; near-perfect scores on a benchmark everyone else finds hard. Contamination inflates the number most on exactly the tasks the model has memorized — precisely where it tells you least about generalization.

Decontaminated, held-out, and freshly-authored variants (and time-sliced "tasks created after cutoff" splits) exist because of this. Prefer them, and treat any benchmark published before the model's training cutoff as a memorization test until proven otherwise.
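A time-sliced check can be sketched in a few lines. This assumes you have per-task results with a creation date and a believable training cutoff for the model; the `Result` record, its fields, and the sample data are hypothetical, not any benchmark's format:

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Result:
    task_id: str
    created: date  # when the task (issue, question) became public
    passed: bool


def _rate(flags: List[bool]) -> Optional[float]:
    """Pass rate of a slice, or None if the slice is empty."""
    return sum(flags) / len(flags) if flags else None


def time_sliced_pass_rates(
    results: Iterable[Result], cutoff: date
) -> Tuple[Optional[float], Optional[float]]:
    """Return (pre_cutoff_rate, post_cutoff_rate) for the given results."""
    results = list(results)
    pre = [r.passed for r in results if r.created <= cutoff]
    post = [r.passed for r in results if r.created > cutoff]
    return _rate(pre), _rate(post)


# Hypothetical results and cutoff, for illustration only.
cutoff = date(2025, 1, 1)
results = [
    Result("a", date(2024, 3, 1), True),
    Result("b", date(2024, 9, 9), True),
    Result("c", date(2024, 11, 20), True),
    Result("d", date(2024, 12, 5), False),
    Result("e", date(2025, 2, 14), True),
    Result("f", date(2025, 4, 2), False),
    Result("g", date(2025, 6, 30), False),
    Result("h", date(2025, 7, 7), False),
]
pre_rate, post_rate = time_sliced_pass_rates(results, cutoff)
print(f"pre-cutoff: {pre_rate:.2f}  post-cutoff: {post_rate:.2f}")
```

A large gap between the two slices is a reason for suspicion, not proof of contamination: newer tasks may simply be harder, and small slices are noisy.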

STEP 3

Harness sensitivity: same model, different number.

A benchmark score is a property of the model and the scaffold, never the model alone. The same base model on SWE-bench can swing 15+ points across agent harnesses that differ in prompt, retrieval, tool set, retry budget, and parsing. Three implications that invalidate most leaderboard comparisons:

  • Cross-row comparisons are confounded. Row A vs row B may be a harness difference, not a model difference, unless the harness is held fixed — which it usually is not.
  • The reported number is an upper-ish bound under an optimized scaffold the authors tuned on this benchmark. Your scaffold is different and untuned for it; expect a lower score.
  • Parsing/format failures masquerade as reasoning failures. A chunk of "wrong" answers are correct solutions the harness failed to extract — an artifact, not a capability ceiling.
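The parsing point in the list above can be checked directly on your own runs. The sketch below assumes a harness that only accepts a line starting with `FINAL ANSWER:`; that format, the lenient fallback rules, and the function names are all hypothetical choices for illustration, not the behavior of any particular benchmark harness:

```python
import re
from collections import Counter
from typing import Optional

# Assumed strict format: a line that starts with "FINAL ANSWER:".
STRICT = re.compile(r"^FINAL ANSWER:[ \t]*(.+)$", re.MULTILINE)


def strict_extract(output: str) -> Optional[str]:
    match = STRICT.search(output)
    return match.group(1).strip() if match else None


def lenient_extract(output: str) -> Optional[str]:
    """Fallback: take the last non-empty line and peel off common wrappers."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1].strip(" .*`\"'")
    last = re.sub(r"^(?:final answer|answer)\s*[:\-]\s*", "", last, flags=re.IGNORECASE)
    return last.strip()


def _norm(text: str) -> str:
    return text.casefold().strip()


def classify(output: str, gold: str) -> str:
    """Return 'correct', 'format_failure' (right answer, unparsed), or 'wrong'."""
    strict = strict_extract(output)
    if strict is not None:
        return "correct" if _norm(strict) == _norm(gold) else "wrong"
    lenient = lenient_extract(output)
    if lenient is not None and _norm(lenient) == _norm(gold):
        return "format_failure"
    return "wrong"


# Hypothetical model outputs paired with gold answers.
runs = [
    ("Reasoning...\nFINAL ANSWER: 42", "42"),
    ("Reasoning...\n**Answer: 42**", "42"),
    ("Reasoning...\nThe answer is 41", "42"),
]
tally = Counter(classify(output, gold) for output, gold in runs)
print(dict(tally))
```

If the `format_failure` count is a meaningful share of the misses, the headline score is measuring the extractor as much as the model.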

STEP 4

Why leaderboard rank ≠ your task.

Even a perfectly clean, harness-controlled benchmark answers its question, not yours. SWE-bench rank predicts almost nothing about an agent that triages support tickets against your internal API. The distribution shift is total: your tools, your domain, your error modes, your definition of success, your latency and cost ceiling. A benchmark is evidence of a capability class and, at best, a weak prior on your specific deployment.

Use public benchmarks for what they are good for: model triage (which 2–3 models are worth your eval budget) and capability sanity checks. Never use them as the acceptance test for shipping. The leaderboard narrows the field; it does not pick your model.

STEP 5

Build the small custom eval set that actually decides.

A set of 50–200 tasks drawn from your traffic, with your tools and your success predicate, outpredicts any public leaderboard for your decision. It does not need to be big. It needs to be representative, decontaminated by construction (you wrote it, post-cutoff), and stratified across the failure modes you actually see.

# a custom task: real env, executable check, frozen + dated
task = {
  "id": "sup-triage-014",
  "created": "2026-05-10",        # post-cutoff: contamination-safe
  "prompt": "Customer says webhook stopped firing after the v3 upgrade.",
  "tools": ["logs.search", "kb.lookup", "ticket.update"],
  "check": lambda e: e.ticket.tag == "webhook-v3-regression"
                    and "rotate secret" in e.ticket.reply,
  "forbidden": ["ticket.close", "refund.issue"]
}
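One way such a task might be scored, as a sketch only: the episode object with a `ticket` and a `tool_calls` list is an assumed shape for whatever your agent runtime records, and `score` is a hypothetical helper. The relevant task fields are restated so the example runs on its own:

```python
from types import SimpleNamespace

# Restated subset of the task above so this sketch is self-contained.
task = {
    "id": "sup-triage-014",
    "check": lambda e: e.ticket.tag == "webhook-v3-regression"
                       and "rotate secret" in e.ticket.reply,
    "forbidden": ["ticket.close", "refund.issue"],
}


def score(task: dict, episode) -> dict:
    """Pass only if the end-state check holds AND no forbidden tool was called."""
    violations = [name for name in episode.tool_calls if name in task["forbidden"]]
    try:
        check_ok = bool(task["check"](episode))
    except AttributeError:
        check_ok = False  # a malformed end state counts as a failure
    return {
        "id": task["id"],
        "passed": check_ok and not violations,
        "violations": violations,
    }


# Hypothetical recorded episodes.
good = SimpleNamespace(
    ticket=SimpleNamespace(
        tag="webhook-v3-regression",
        reply="Please rotate secret in the dashboard and retry.",
    ),
    tool_calls=["logs.search", "kb.lookup", "ticket.update"],
)
bad = SimpleNamespace(
    ticket=good.ticket,
    tool_calls=["logs.search", "ticket.update", "ticket.close"],
)
print(score(task, good))
print(score(task, bad))
```

Treating a forbidden call as a hard failure, even when the end state looks right, keeps "got the answer by doing something it should not" out of the pass column.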

Stratify by failure mode, not by topic: a bucket each for ambiguous requests, tool-error recovery, multi-hop retrieval, and "should refuse / escalate." Twenty tasks that each isolate a known failure mode beat two hundred that all exercise the happy path and never move your number.
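Reporting per bucket rather than one aggregate is what makes the stratification pay off. A minimal sketch, assuming each result is a `(bucket, passed)` pair; the bucket names and sample results are invented for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_bucket(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, Tuple[int, int, float]]:
    """Map each bucket to (passed, total, pass_rate)."""
    counts = defaultdict(lambda: [0, 0])
    for bucket, passed in results:
        counts[bucket][0] += int(passed)
        counts[bucket][1] += 1
    return {b: (p, n, p / n) for b, (p, n) in sorted(counts.items())}


# Hypothetical results: (failure-mode bucket, did the task pass?)
results = [
    ("ambiguous", True), ("ambiguous", True), ("ambiguous", False),
    ("tool_error_recovery", False), ("tool_error_recovery", False),
    ("multi_hop", True), ("multi_hop", True),
    ("refuse_or_escalate", True),
]
report = pass_rate_by_bucket(results)
overall = sum(passed for _, passed in results) / len(results)
print(f"overall: {overall:.2f}")
for bucket, (passed, total, rate) in report.items():
    print(f"{bucket:22s} {passed}/{total}  {rate:.2f}")
```

In this made-up run the overall rate looks tolerable while one bucket fails every task, which is the kind of signal a single aggregate hides.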

STEP 6

The honest tradeoff.

Public benchmarks are comparable, reproducible, and cheap to read — and that comparability is bought with narrow operationalization, contamination risk, and harness confounds that make the rank a weak signal for you. A custom eval set is the opposite: incomparable across orgs, expensive to maintain, but the only number that actually predicts your deployment. Use leaderboards to shortlist models in an afternoon; trust only the small, dated, decontaminated eval set you built from your own traffic to decide what ships.