Why evaluating agents is hard — and why your test suite lies to you.
Evaluating a single model call is a solved-enough problem: fix the prompt, sample once, score the string. Agents break every assumption that made that easy. The output is non-deterministic, the task is multi-step so errors compound, there is rarely one gold answer, success is path-dependent, every eval run costs real money and minutes, and the dataset you trusted last quarter has quietly rotted. This essay is the honest catalogue of why agent eval is hard, so the next five essays can attack each difficulty deliberately.
Non-determinism makes a single run uninformative.
A model with temperature > 0 samples a different trajectory each run; even at temperature = 0, dynamic batching, non-associative floating-point arithmetic on GPUs, and provider-side routing produce run-to-run drift. For a one-shot QA prompt this is noise around a stable mean. For an agent it is structural: one sampled token at step 3 sends the agent to a different tool, into a different observation, onto a different branch of the task. The variance is not a small band around the answer; it is a different answer reached by a different path.
The consequence is brutal and frequently ignored: a single pass@1 number from a single run is not a measurement, it is one sample from a distribution you have not characterized. An agent that "passed the eval" may pass only 40% of the time. You must run each task k times and report a distribution: pass^k (the fraction of tasks that pass all k runs, the reliability number) is usually more honest than pass@k (the fraction that pass at least once, the capability number).
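The gap between the two numbers is easy to see on a toy result set. The sketch below is illustrative only: the function names and the `results` data are made up, and the closing arithmetic assumes runs are independent with a fixed per-run pass rate, which real agents only approximate.

```python
def pass_at_k(runs_per_task):
    """Fraction of tasks that passed in at least one of their k runs (capability)."""
    return sum(any(runs) for runs in runs_per_task) / len(runs_per_task)

def pass_hat_k(runs_per_task):
    """Fraction of tasks that passed in every one of their k runs (reliability)."""
    return sum(all(runs) for runs in runs_per_task) / len(runs_per_task)

# Hypothetical results: 4 tasks, k = 5 runs each, True = the run passed.
results = [
    [True, True, True, True, True],
    [True, False, True, True, False],
    [False, False, True, False, False],
    [False, False, False, False, False],
]
print(pass_at_k(results), pass_hat_k(results))  # 0.75 0.25

# Under the independence assumption, an agent that passes 40% of runs:
p, k = 0.4, 5
print(round(1 - (1 - p) ** k, 3), round(p ** k, 3))  # 0.922 0.01
```

The same agent looks like a 92% system or a 1% system depending on which number is reported, which is why the choice has to be stated alongside the figure.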
If your CI runs each eval task once, your green checkmark has a confidence interval wide enough to drive a regression through. The flake you blame on "the model being weird today" is the measurement, not noise to suppress.
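To put a rough number on that interval, one option is a Wilson score interval on the observed pass rate. This is a sketch, not a prescription: the helper name is hypothetical, z = 1.96 is the usual approximation for 95% coverage, and the interval treats runs as independent trials.

```python
import math

def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if runs == 0:
        return (0.0, 1.0)  # no data: the pass rate could be anything
    p_hat = passes / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# All-green results at different sample sizes.
for passes, runs in [(1, 1), (5, 5), (20, 20)]:
    lo, hi = wilson_interval(passes, runs)
    print(f"{passes}/{runs} passed -> pass rate plausibly in [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions a single green run is consistent with a true pass rate as low as roughly 21%; it takes about twenty consecutive passes before the lower bound climbs above 80%.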
Errors compound multiplicatively across steps.
The arithmetic is unforgiving. If each step of an agent is independently 95% reliable, a 10-step task succeeds at 0.95^10 ≈ 0.60; a 20-step task at 0.36. Per-step accuracy that would be excellent for a classifier is a coin flip for a workflow. This is why agent demos look magical at 5 steps and collapse at 30, and why "the model got better at the subtask" can leave end-to-end success unchanged or worse.
    # the tyranny of the exponent: per-step p -> task success
    for p in (0.99, 0.95, 0.90):
        for n in (5, 10, 20):
            print(p, n, round(p ** n, 2))
    # 0.99 20 -> 0.82
    # 0.95 20 -> 0.36
    # 0.90 20 -> 0.12
Two implications for eval design. First, an aggregate task-success number tells you the system is broken but not where; you need per-step instrumentation (this is the through-line to the tracing essay). Second, errors are not independent: a wrong step that does not fail outright still poisons the context for every step after it, so later steps run below their measured per-step reliability and real curves decay faster than the naive product. Recovery behavior (can the agent notice and correct its own mistake?) is a first-class thing to measure, not an afterthought.
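Both measurements fall out of per-step traces with very little code. The sketch below assumes a hypothetical trace shape, a list of per-step pass/fail flags plus an end-to-end verdict; real traces will be richer, and deciding what counts as a failed step is itself a judgment call.

```python
from collections import Counter

def first_failure_histogram(traces):
    """Count, per step index, how often that step was the first one to fail."""
    hist = Counter()
    for steps, _ in traces:
        if False in steps:
            hist[steps.index(False)] += 1
    return hist

def recovery_rate(traces):
    """Among trajectories with at least one failed step, the share that still succeeded."""
    stumbled = [task_ok for steps, task_ok in traces if False in steps]
    return sum(stumbled) / len(stumbled) if stumbled else None

# Hypothetical traces: (per-step outcomes, end-to-end success).
traces = [
    ([True, True, True, True], True),
    ([True, False, True, True], True),      # stumbled at step 1, recovered
    ([True, False, False, False], False),   # stumbled at step 1, never recovered
    ([True, True, True, False], False),     # failed only at the last step
]
print(first_failure_histogram(traces))  # Counter({1: 2, 3: 1})
print(recovery_rate(traces))            # one of three stumbling runs recovered
```

The histogram says where to look first; the recovery rate says whether a stumble is survivable or terminal.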
There is rarely a single gold answer.
For "what is the capital of France" there is a string to match. For "refactor this module to remove the circular import" or "book me a reasonable flight" there is a set of acceptable outcomes, possibly infinite, with no clean membership test. Exact-match scoring is impossible; the task is defined by a predicate over end states ("the tests pass and no import cycle remains"), not by an answer key.
- Open-ended generation: many correct phrasings; string match is hopeless, embedding similarity is a weak proxy that rewards fluent wrongness.
- Constraint satisfaction: the spec is "all of these properties hold," gradeable by an executable checker; the right kind of task to build evals around.
- Genuinely subjective: "is this summary good?" No checker exists; this is where LLM-as-judge enters, with all the bias caveats covered in its own essay.
The design move is to push tasks toward the second category: prefer tasks with a verifiable end state, because a checker you can run is worth ten judges you have to calibrate.
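As an illustration, the circular-import task can be graded by a predicate over the end state rather than by comparison with a reference solution. This is a minimal sketch: the `end_state` shape (a test verdict plus a module-to-imports mapping) and the function names are assumptions, and a real harness would have to extract that mapping from the repository itself.

```python
def has_import_cycle(imports):
    """Detect a cycle in a module -> imported-modules mapping via depth-first search."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(module):
        state[module] = IN_PROGRESS
        for dep in imports.get(module, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:  # back edge: dep is on the current DFS path
                return True
            if dep_state == UNSEEN and visit(dep):
                return True
        state[module] = DONE
        return False

    return any(state.get(m, UNSEEN) == UNSEEN and visit(m) for m in imports)

def task_passed(end_state):
    """The task predicate: the tests pass and no import cycle remains."""
    return end_state["tests_passed"] and not has_import_cycle(end_state["imports"])

# Two different refactors, both acceptable; the untouched repo is not.
before = {"tests_passed": True, "imports": {"a": ["b"], "b": ["a"]}}
fix_one = {"tests_passed": True, "imports": {"a": ["b"], "b": []}}
fix_two = {"tests_passed": True, "imports": {"a": ["c"], "b": ["c"], "c": []}}
print(task_passed(before), task_passed(fix_one), task_passed(fix_two))  # False True True
```

The checker accepts any member of the set of valid outcomes without enumerating it, which is exactly what an answer key cannot do.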
Success is path-dependent, so the same answer can be right or wrong.
Two agents return the identical final answer. One read the docs, called the API correctly, and verified the result. The other guessed, got lucky, and would fail on the next input. Outcome alone cannot distinguish competence from a fortunate coin flip — and an eval that only checks the final state will score them identically and ship the lucky one.
Worse, the path can be unacceptable even when the outcome is correct: the agent deleted a production table, then restored it from backup, then reported success. Final state is fine; the trajectory is a fireable offense. This is the entire motivation for the outcome-vs-trajectory essay: for agents, "how" is often part of "whether".
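One simple way to make the path count is to grade the trajectory alongside the outcome. The sketch below is deliberately crude and rests on assumptions: the tool names are hypothetical, the trajectory is taken to be a list of tool-call records, and a denylist is only the bluntest possible trajectory check.

```python
DESTRUCTIVE_TOOLS = {"drop_table", "delete_branch"}  # hypothetical tool names

def grade(outcome_ok, trajectory, forbidden=DESTRUCTIVE_TOOLS):
    """Pass only if the end state is correct AND no forbidden tool was called on the way."""
    violations = [call["tool"] for call in trajectory if call["tool"] in forbidden]
    return {
        "passed": outcome_ok and not violations,
        "outcome_ok": outcome_ok,
        "violations": violations,
    }

careful = [{"tool": "read_docs"}, {"tool": "run_query"}, {"tool": "verify_rows"}]
reckless = [{"tool": "drop_table"}, {"tool": "restore_backup"}, {"tool": "run_query"}]

print(grade(True, careful))   # passed: True
print(grade(True, reckless))  # passed: False, violations: ['drop_table']
```

Both runs reach the same final state; only the second verdict records that the route there was unacceptable.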
Eval is expensive, and the dataset rots.
Every eval run is full agent execution: many model calls, real tool latency, sometimes real API spend and side effects. A 500-task suite run k=5 times is 2,500 multi-minute trajectories — you cannot run it on every commit by reflex. Eval cost is a budget you design against: a fast small gate on every push, the full suite nightly, the expensive judge-graded slice weekly.
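The budget is worth writing down as arithmetic before arguing about it. Every number in this sketch other than the 500-task, k=5 suite is a placeholder assumption (minutes per run, parallelism, tier sizes), so substitute measured values from your own harness.

```python
def suite_cost(tasks, k, minutes_per_run, parallelism):
    """Return (total trajectories, wall-clock hours) for one pass over a suite."""
    runs = tasks * k
    wall_clock_hours = runs * minutes_per_run / parallelism / 60
    return runs, wall_clock_hours

# Hypothetical tiers: name -> (number of tasks, runs per task).
TIERS = {
    "every_push": (25, 1),
    "nightly_full": (500, 5),
    "weekly_judge_slice": (100, 3),
}

for name, (tasks, k) in TIERS.items():
    runs, hours = suite_cost(tasks, k, minutes_per_run=3, parallelism=20)
    print(f"{name}: {runs} trajectories, about {hours:.2f} h wall clock")
```

With these placeholder figures the full suite is 2,500 trajectories and over six hours even at 20-way parallelism, which is why it belongs in a nightly job rather than a commit hook.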
- Contamination. Public benchmarks leak into pretraining; a frontier model may have seen SWE-bench solutions. Yesterday's hard benchmark is today's memorized answer key. Held-out, private, freshly-authored tasks are the only durable signal.
- Eval-set rot. Your tasks reference APIs that changed, websites that redesigned, tickets that were closed. A passing eval can mean the system works or that the task no longer tests anything. Stale tasks decay silently into false green.
- Overfitting to the suite. Once an eval is a CI gate, every change is implicitly tuned to it. The suite stops measuring capability and starts measuring conformance to itself. Rotate held-out tasks the team never sees.
Treat the eval set as code with an expiry date. Pin tool/API versions, snapshot web targets, date-stamp every task, and schedule a re-validation pass — a task that hasn't been confirmed to still test what it claims is technical debt accruing as false confidence.
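In practice this can be as small as a few metadata fields per task and a staleness query. The sketch below is one possible shape, not a standard: the field names, the 90-day window, and the example tasks are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class EvalTask:
    task_id: str
    authored_on: date
    last_validated: date  # last time someone confirmed the task still tests its claim
    pinned_versions: dict = field(default_factory=dict)  # tool/API name -> version

def stale_tasks(tasks, today, max_age=timedelta(days=90)):
    """Return the ids of tasks whose last validation is older than max_age."""
    return [t.task_id for t in tasks if today - t.last_validated > max_age]

# Hypothetical tasks and dates.
tasks = [
    EvalTask("refund-flow", date(2024, 1, 10), date(2024, 5, 1), {"billing_api": "v3"}),
    EvalTask("flight-search", date(2023, 9, 2), date(2023, 11, 20), {"search_api": "v1"}),
]
print(stale_tasks(tasks, today=date(2024, 6, 1)))  # ['flight-search']
```

A scheduled job that fails when this list is non-empty turns silent rot into a visible ticket.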
The honest tradeoff.
You cannot make agent eval cheap, deterministic, and definitive — pick which two you will fake and be loud about it. The mature posture is not "we have an eval that says 87%"; it is "we sample a characterized distribution on a versioned, decontaminated, deliberately-rotated set, and we report variance and path, not just a headline number." An agent eval that yields one clean number you fully trust is not rigorous — it is hiding all six of these problems behind a green checkmark.