Eval-driven agent development — the eval is the spec, the ratchet is the discipline.
An agent has no compiler and no type system to tell you a change broke something; the only thing standing between a prompt tweak and a silent 15-point regression is an eval you run before merge. This essay assembles the previous five into a development loop: evals as a CI regression gate, golden trajectories, the offline/online split, sampling production traffic into the eval set, the feedback loop that keeps it honest, and the no-regression ratchet that makes progress monotone instead of a random walk.
The eval is the spec.
For an agent there is no other executable specification. A prompt does not declare its contract; a tool loop does not type-check its behavior. The eval set is the definition of "working" — which means a behavior not covered by an eval is, operationally, undefined and unprotected. The development question stops being "did I improve the prompt" and becomes "did the eval move, and which tasks flipped."
Corollary: write the eval task before the fix. A bug with no failing eval is a bug you will reintroduce. The reproduction is the regression test — this is TDD with the agent's environment as the harness.
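A minimal sketch of "the reproduction is the regression test", under stated assumptions: `EvalTask`, the refund scenario, and the `refunds_issued` field are all hypothetical and only illustrate the shape of a task written before the fix.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical eval task: a prompt plus a check on the final environment state."""
    task_id: str
    prompt: str
    check: Callable[[dict], bool]


def refund_issued_exactly_once(final_state: dict) -> bool:
    # Outcome check only: it does not care which path the agent took.
    return final_state.get("refunds_issued", 0) == 1


# Written when the bug is reported, before anyone touches the prompt.
BUG_REPRO = EvalTask(
    task_id="refund-double-issue",
    prompt="Customer asks twice for a refund on the same order.",
    check=refund_issued_exactly_once,
)

# The state seen in the buggy run must fail the check; the intended state must pass.
buggy_state = {"refunds_issued": 2}
fixed_state = {"refunds_issued": 1}
print(BUG_REPRO.check(buggy_state), BUG_REPRO.check(fixed_state))
```

The task is only useful if it fails against the current build; a reproduction that already passes is not guarding anything.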
Evals as a CI regression gate, tiered by cost.
Evals are expensive (E1), so running the full suite on every commit is infeasible. Tier the suite along the cost/coverage curve so that every change hits a fast gate and the expensive signals still run often enough to catch drift.
- Per-commit (seconds–minutes) — deterministic tool-call assertions and a small smoke set of outcome checks. Cheap, no judge, blocks the merge.
- Per-PR / nightly (minutes–hours) — the full custom eval set, each task run k times, distribution reported (pass^k, not pass@1).
- Weekly (hours) — judge-graded subjective slice and replayed production-trace regression set.
    # CI gate: block merge on regression OR on absolute floor breach
    res = run_suite(build, k=5)
    assert res.tool_assertions.all_pass()        # hard, deterministic
    assert res.pass_k >= baseline.pass_k - EPS   # no-regression ratchet
    assert res.pass_k >= FLOOR                   # absolute bar
    assert not res.new_failures(baseline)        # no task regressed
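The gate above leaves `pass_k` and `new_failures` undefined. One possible reading, sketched here as an assumption rather than the essay's method: estimate pass^k per task with the combinatorial estimator C(c, k) / C(n, k) for c successes in n runs, average across tasks, and flag any task that was fully reliable at baseline and no longer is. The function names and the `task id -> list of run outcomes` layout are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated probability that k runs of one task all pass: C(c, k) / C(n, k)."""
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return comb(successes, k) / comb(trials, k)


def suite_pass_k(results: dict[str, list[bool]], k: int) -> float:
    """Mean pass^k over tasks; `results` maps task id -> per-run outcomes."""
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


def new_failures(candidate: dict[str, list[bool]],
                 baseline: dict[str, list[bool]], k: int) -> list[str]:
    """Tasks that were fully reliable at baseline but are not in the candidate."""
    return sorted(
        task for task, runs in candidate.items()
        if task in baseline
        and pass_hat_k(sum(baseline[task]), len(baseline[task]), k) == 1.0
        and pass_hat_k(sum(runs), len(runs), k) < 1.0
    )


# Illustrative data: the aggregate is identical, yet one task flipped.
baseline = {"a": [True] * 5, "b": [True, True, True, True, False]}
candidate = {"a": [True, True, True, False, True], "b": [True] * 5}
print(suite_pass_k(baseline, 5), suite_pass_k(candidate, 5))
print(new_failures(candidate, baseline, 5))
```

In this made-up data the suite score is the same for both builds while task "a" regressed, which is why the gate checks per-task flips in addition to the aggregate.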
Golden trajectories: freeze the run, not just the answer.
A golden trajectory is a known-good full trace (E5) frozen with its environment: inputs, tool observations, the decision sequence, and the outcome. It is the regression artifact a final-answer fixture cannot be — you can replay a candidate against it without live tools and diff decisions, not just the end string.
- Outcome-golden — assert the candidate still reaches the verified end state. Robust to alternative paths; the default.
- Trajectory-golden — assert the safety/efficiency invariants still hold (no forbidden tool, step budget). Assert invariants, never exact-match the path (E2's mistake).
- Curate, do not hoard. A golden trajectory is maintained code: when the API or policy legitimately changes, re-bless it deliberately — an un-reviewed re-bless is how a bug becomes the new "correct."
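The two golden modes above can be sketched as a single check, assuming a trajectory is a list of tool calls plus a final state. `Step`, `Trajectory`, `check_against_golden`, the tool names, and the `step_slack` multiplier are hypothetical; the point is only that the check compares outcomes and invariants and never the exact path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    args: dict


@dataclass(frozen=True)
class Trajectory:
    steps: tuple[Step, ...]
    final_state: dict


def check_against_golden(candidate: Trajectory, golden: Trajectory,
                         forbidden_tools: set[str],
                         step_slack: float = 1.5) -> list[str]:
    """Return a list of violations; an empty list means the candidate passes."""
    violations = []
    # Outcome-golden: same verified end state, any path.
    if candidate.final_state != golden.final_state:
        violations.append("outcome: final state differs from golden")
    # Trajectory-golden: invariants only, no exact-match on the step sequence.
    used = {s.tool for s in candidate.steps}
    for tool in sorted(used & forbidden_tools):
        violations.append(f"invariant: forbidden tool used: {tool}")
    budget = int(len(golden.steps) * step_slack)
    if len(candidate.steps) > budget:
        violations.append(
            f"invariant: {len(candidate.steps)} steps exceeds budget {budget}")
    return violations


golden = Trajectory(
    steps=(Step("search", {}), Step("get_order", {}), Step("refund", {})),
    final_state={"refunded": True},
)
# A different ordering that reaches the same end state is accepted.
reordered = Trajectory(
    steps=(Step("get_order", {}), Step("search", {}), Step("refund", {})),
    final_state={"refunded": True},
)
print(check_against_golden(reordered, golden, {"delete_account"}))
```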
Offline vs online eval — you need both, they answer different questions.
Offline eval (the CI suite, replayed traces) is reproducible, gates merges, and answers "did this change regress a known case." It cannot see what you did not anticipate. Online eval — measured on live traffic — answers "is it actually working on the real distribution," and it is the only place the unknown unknowns surface.
- Offline catches regressions; online discovers new failure modes. A green offline suite on a distribution that has shifted is false confidence — the production world moved and your frozen set did not.
- Online signals — implicit (task completion, retries, human takeover, thumbs, downstream undo of the agent's action) and sampled explicit judge grading on live traces.
- Ship behind a flag; compare online. Canary the new build, judge a sample of its real traces against the incumbent's before full rollout.
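For the canary comparison, one simple option (an assumption, not something the essay prescribes) is a one-sided two-proportion z-test on judged success counts from the incumbent and the canary. The function names and the counts below are hypothetical; `-1.645` is the usual one-sided 5% critical value.

```python
from math import sqrt


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for (rate_b - rate_a) using the pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both samples are all-pass or all-fail; no evidence either way
    return (p_b - p_a) / se


def canary_is_worse(inc_success: int, inc_n: int,
                    can_success: int, can_n: int,
                    z_threshold: float = -1.645) -> bool:
    """True if the canary's success rate is significantly below the incumbent's."""
    return two_proportion_z(inc_success, inc_n, can_success, can_n) < z_threshold


# Illustrative counts: incumbent 900/1000 judged successes.
print(canary_is_worse(900, 1000, 80, 100))  # canary at 80%
print(canary_is_worse(900, 1000, 88, 100))  # canary at 88%
```

A small canary sample makes this test weak: a modest drop can go undetected, so the flag should stay partial until enough traces have been judged.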
Close the loop: production traffic becomes tomorrow's eval set.
The flywheel that makes the whole discipline compound: sample production traces, surface the failures and near-misses, freeze the instructive ones (with environment) into replayable eval tasks, fix, and the fix is now permanently guarded. Without this, your eval set is a fixed snapshot decaying against a moving world (E1's eval-set rot); with it, the eval set tracks reality and every incident becomes a permanent immune-system entry.
    # production -> eval flywheel
    for tr in sample(prod_traces, stratify="failure_mode"):
        if tr.failed or tr.human_took_over or tr.low_judge_score:
            case = freeze(tr)        # trace + env, replayable
            eval_set.add(case)       # dated, decontaminated by origin
    # fix is not done until this case passes AND nothing else regressed
Bias the sampler toward the tails, not the mean: failures, human takeovers, low judge scores, high-cost runs. A thousand happy-path traces teach the eval set nothing; the twenty weird ones are the entire value, and they are exactly what uniform sampling drowns.
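A tail-biased sampler could look like the sketch below, assuming each trace carries the signals named above. The `Trace` fields, the thresholds, and the `tail_cap` / `happy_rate` knobs are hypothetical; the design choice it illustrates is to keep tail traces at a much higher rate than happy-path ones instead of sampling uniformly.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    failed: bool = False
    human_took_over: bool = False
    judge_score: float = 1.0
    cost_usd: float = 0.0


def is_tail(tr: Trace, judge_floor: float = 0.5, cost_ceiling: float = 1.0) -> bool:
    """A trace is 'tail' if any of the failure or cost signals fires."""
    return (tr.failed or tr.human_took_over
            or tr.judge_score < judge_floor or tr.cost_usd > cost_ceiling)


def sample_for_eval(traces: list[Trace], tail_cap: int,
                    happy_rate: float, seed: int = 0) -> list[Trace]:
    """Keep up to `tail_cap` tail traces and only a small fraction of the rest."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    tails = [t for t in traces if is_tail(t)]
    happy = [t for t in traces if not is_tail(t)]
    picked_tails = rng.sample(tails, min(tail_cap, len(tails)))
    picked_happy = rng.sample(happy, int(len(happy) * happy_rate))
    return picked_tails + picked_happy


# Illustrative traffic: 1000 happy-path traces and 20 weird ones.
traffic = [Trace(f"ok-{i}") for i in range(1000)]
traffic += [Trace(f"bad-{i}", failed=True) for i in range(20)]
picked = sample_for_eval(traffic, tail_cap=50, happy_rate=0.01)
print(len(picked), sum(is_tail(t) for t in picked))
```

With these made-up numbers a uniform 3% sample would be expected to contain fewer than one of the twenty tail traces, while this sampler keeps all of them.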
The ratchet, and its honest cost.
The discipline that makes all of this add up is the no-regression ratchet: a change merges only if no covered task regressed and the absolute floor holds, so the eval score moves up or stays put, never silently down. This converts agent development from a random walk (every prompt tweak fixes one thing and breaks two unseen others) into monotone progress. The honest cost is threefold: the ratchet is exactly as good as its coverage; it can ossify into overfitting to the suite (rotate held-out tasks the ratchet never sees, per E1/E4); and a too-tight epsilon will block real improvements whose apparent regressions are only run-to-run noise.
The eval is the only spec an agent has. Gate every merge on a no-regression ratchet over a deliberately rotated set fed by sampled production failures. Progress that is not ratcheted is not progress; it is a walk that happens to be green today.
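On the epsilon trade-off: one hedged way to pick the tolerance is to measure the suite's own noise by rerunning an unchanged build several times and setting epsilon to a multiple of the observed spread. The function names, the `width` multiplier, and the scores below are hypothetical illustrations, not measured results.

```python
from statistics import stdev


def epsilon_from_reruns(scores: list[float], width: float = 2.0) -> float:
    """Tolerance derived from the spread of suite scores on an unchanged build."""
    if len(scores) < 2:
        raise ValueError("need at least two reruns to estimate noise")
    return width * stdev(scores)


def ratchet_allows(candidate_score: float, baseline_score: float,
                   eps: float, floor: float) -> bool:
    """No-regression ratchet: within eps of baseline AND above the absolute floor."""
    return candidate_score >= baseline_score - eps and candidate_score >= floor


# Made-up scores from five reruns of the same build.
rerun_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
eps = epsilon_from_reruns(rerun_scores)
print(round(eps, 4))
print(ratchet_allows(0.78, 0.80, eps, floor=0.70))  # inside the noise band
print(ratchet_allows(0.75, 0.80, eps, floor=0.70))  # outside it
```

An epsilon set this way widens with noisy suites, which is itself a signal: if the band is large enough to hide real regressions, the remedy is more runs per task, not a looser gate.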