Eval-Driven Agent Development

Eval-driven agent development — the eval is the spec, the ratchet is the discipline.

An agent has no compiler and no type system to tell you a change broke something; the only thing standing between a prompt tweak and a silent 15-point regression is an eval you run before merge. This essay assembles the previous five into a development loop: evals as a CI regression gate, golden trajectories, the offline/online split, sampling production traffic into the eval set, the feedback loop that keeps it honest, and the no-regression ratchet that makes progress monotone instead of a random walk.

STEP 1

The eval is the spec.

For an agent there is no other executable specification. A prompt does not declare its contract; a tool loop does not type-check its behavior. The eval set is the definition of "working" — which means a behavior not covered by an eval is, operationally, undefined and unprotected. The development question stops being "did I improve the prompt" and becomes "did the eval move, and which tasks flipped."

Corollary: write the eval task before the fix. A bug with no failing eval is a bug you will reintroduce. The reproduction is the regression test — this is TDD with the agent's environment as the harness.
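The "reproduction is the regression test" idea can be sketched as follows. This is a minimal illustration, not the essay's harness: `EvalTask`, `run_task`, and the two toy agents are hypothetical names, and the environment is reduced to a plain dictionary. The task is written first, fails against the buggy agent, and stays in the suite once the fix lands.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical types for illustration; a real harness will have its own.
State = Dict[str, str]
Agent = Callable[[str, State], State]


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    prompt: str
    initial_state: State
    check: Callable[[State], bool]  # verifies the end state, not the wording


def run_task(agent: Agent, task: EvalTask) -> bool:
    """Run the agent on a copy of the environment and check the outcome."""
    final_state = agent(task.prompt, dict(task.initial_state))
    return task.check(final_state)


# Capture the bug report as a task BEFORE touching the prompt.
refund_task = EvalTask(
    task_id="refund-closes-ticket",
    prompt="Refund order 1001 and close the ticket.",
    initial_state={"order_1001": "paid", "ticket": "open"},
    check=lambda s: s["order_1001"] == "refunded" and s["ticket"] == "closed",
)


def buggy_agent(prompt: str, state: State) -> State:
    state["order_1001"] = "refunded"  # forgets to close the ticket
    return state


def fixed_agent(prompt: str, state: State) -> State:
    state["order_1001"] = "refunded"
    state["ticket"] = "closed"
    return state


print(run_task(buggy_agent, refund_task))  # False: the bug is reproduced
print(run_task(fixed_agent, refund_task))  # True: the fix is now guarded
```

The check inspects the end state rather than the agent's wording, so the task survives later prompt rewrites that change phrasing but not behavior.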

STEP 2

Evals as a CI regression gate, tiered by cost.

Evaluation is expensive (E1), so running the full suite on every commit is infeasible. Tier it by the cost/coverage curve so every change hits a fast gate and the expensive signal still runs often enough to catch drift.

Per-commit (seconds–minutes) — deterministic tool-call assertions and a small smoke set of outcome checks. Cheap, no judge, blocks the merge.
Per-PR / nightly (minutes–hours) — the full custom eval set, each task run k times, distribution reported (pass^k, not pass@1).
Weekly (hours) — judge-graded subjective slice and replayed production-trace regression set.
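To make the pass^k versus pass@1 distinction concrete, here is a small sketch. It assumes pass^k means "all k runs of a task succeeded" and pass@1 means the mean single-run success rate; the `runs` layout and task names are hypothetical.

```python
from typing import Dict, List


def pass_at_1(runs: Dict[str, List[bool]]) -> float:
    """Mean single-run success rate, averaged over tasks."""
    return sum(sum(r) / len(r) for r in runs.values()) / len(runs)


def pass_hat_k(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose k runs ALL succeeded."""
    return sum(all(r) for r in runs.values()) / len(runs)


# Hypothetical results: each task was run k=5 times.
runs = {
    "book-flight": [True, True, True, True, True],
    "cancel-order": [True, True, False, True, True],      # flaky
    "update-address": [True, False, True, False, True],   # flakier
    "refund": [True, True, True, True, True],
}

print(f"pass@1 = {pass_at_1(runs):.2f}")   # 0.85
print(f"pass^k = {pass_hat_k(runs):.2f}")  # 0.50
```

The two flaky tasks barely dent pass@1 but halve pass^k, which is why the stricter number is the one worth gating on when reliability matters.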

# CI gate: block merge on regression OR on absolute floor breach
res = run_suite(build, k=5)
assert res.tool_assertions.all_pass()          # hard, deterministic
assert res.pass_k >= baseline.pass_k - EPS     # no-regression ratchet
assert res.pass_k >= FLOOR                     # absolute bar
assert not res.new_failures(baseline)          # no task regressed
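The gate above can be sketched as a self-contained function that reports why a merge is blocked instead of raising on the first failed assertion. Everything here is an assumption made for illustration: `SuiteResult`, `gate`, and the default `eps` and `floor` values are hypothetical, not part of any particular harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SuiteResult:
    tool_assertions_ok: bool
    task_passed: Dict[str, bool]  # task_id -> passed all k runs

    @property
    def pass_k(self) -> float:
        return sum(self.task_passed.values()) / len(self.task_passed)

    def new_failures(self, baseline: "SuiteResult") -> List[str]:
        """Tasks the baseline passed that this build fails."""
        return sorted(
            t for t, ok in baseline.task_passed.items()
            if ok and not self.task_passed.get(t, False)
        )


def gate(res: SuiteResult, baseline: SuiteResult,
         eps: float = 0.02, floor: float = 0.70) -> List[str]:
    """Return the reasons to block the merge; an empty list means pass."""
    reasons = []
    if not res.tool_assertions_ok:
        reasons.append("deterministic tool-call assertion failed")
    if res.pass_k < baseline.pass_k - eps:
        reasons.append("aggregate pass^k regressed beyond eps")
    if res.pass_k < floor:
        reasons.append("pass^k below absolute floor")
    flipped = res.new_failures(baseline)
    if flipped:
        reasons.append("tasks regressed: " + ", ".join(flipped))
    return reasons


baseline = SuiteResult(True, {"a": True, "b": True, "c": False, "d": True})
candidate = SuiteResult(True, {"a": True, "b": False, "c": True, "d": True})

# Same aggregate score (0.75), but task "b" flipped, so the merge is blocked.
print(gate(candidate, baseline))  # ['tasks regressed: b']
```

The example shows why the per-task check matters: the aggregate score is unchanged, yet one previously passing task now fails. In CI, a non-empty list would typically map to a non-zero exit code.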

STEP 3

Golden trajectories: freeze the run, not just the answer.

A golden trajectory is a known-good full trace (E5) frozen with its environment: inputs, tool observations, the decision sequence, and the outcome. It is the regression artifact a final-answer fixture cannot be — you can replay a candidate against it without live tools and diff decisions, not just the end string.

Outcome-golden — assert the candidate still reaches the verified end state. Robust to alternative paths; the default.
Trajectory-golden — assert the safety/efficiency invariants still hold (no forbidden tool, step budget). Assert invariants, never exact-match the path (E2's mistake).
Curate, do not hoard. A golden trajectory is maintained code: when the API or policy legitimately changes, re-bless it deliberately — an un-reviewed re-bless is how a bug becomes the new "correct."
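A minimal sketch of "assert invariants, never exact-match the path", assuming a trajectory is simply an ordered list of tool calls. The `Step` type, the tool names, and `check_invariants` are hypothetical; a real trace format will carry more fields.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Step:
    tool: str
    args: Dict[str, str]


def check_invariants(steps: List[Step], forbidden: FrozenSet[str],
                     max_steps: int) -> List[str]:
    """Return invariant violations; an empty list means the run is acceptable."""
    violations = []
    if len(steps) > max_steps:
        violations.append(f"step budget exceeded: {len(steps)} > {max_steps}")
    for i, step in enumerate(steps):
        if step.tool in forbidden:
            violations.append(f"forbidden tool at step {i}: {step.tool}")
    return violations


golden = [Step("search_orders", {"id": "1001"}), Step("issue_refund", {"id": "1001"})]

# A different but legitimate path: an extra lookup before the refund.
alternative = [
    Step("search_orders", {"id": "1001"}),
    Step("get_policy", {"topic": "refunds"}),
    Step("issue_refund", {"id": "1001"}),
]
# A path that must fail regardless of its final answer.
unsafe = [Step("search_orders", {"id": "1001"}), Step("delete_customer", {"id": "7"})]

FORBIDDEN = frozenset({"delete_customer"})
for name, traj in [("golden", golden), ("alternative", alternative), ("unsafe", unsafe)]:
    print(name, check_invariants(traj, FORBIDDEN, max_steps=4))
```

An exact-match comparison against `golden` would reject `alternative` even though it is a valid route; the invariant check accepts it and rejects only `unsafe`.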

STEP 4

Offline vs online eval — you need both, they answer different questions.

Offline eval (the CI suite, replayed traces) is reproducible, gates merges, and answers "did this change regress a known case." It cannot see what you did not anticipate. Online eval — measured on live traffic — answers "is it actually working on the real distribution," and it is the only place the unknown unknowns surface.

Offline catches regressions; online discovers new failure modes. A green offline suite on a distribution that has shifted is false confidence — the production world moved and your frozen set did not.
Online signals. Implicit: task completion, retries, human takeover, thumbs up/down, and downstream undo of the agent's action. Explicit: judge grading on a sample of live traces.
Ship behind a flag; compare online. Canary the new build, judge a sample of its real traces against the incumbent's before full rollout.
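One way to compare a canary against the incumbent on judged samples is a difference in pass rates with a confidence interval. The sketch below uses a normal approximation, which is only reasonable for samples of a few hundred traces with rates away from 0 or 1. The verdict lists and the decision rule are hypothetical, chosen for illustration.

```python
import math
from typing import List, Tuple


def rate_diff_ci(canary: List[bool], incumbent: List[bool],
                 z: float = 1.96) -> Tuple[float, float, float]:
    """Difference in judged pass rate (canary - incumbent) with a
    normal-approximation confidence interval (z=1.96 is roughly 95%)."""
    n1, n2 = len(canary), len(incumbent)
    p1, p2 = sum(canary) / n1, sum(incumbent) / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Hypothetical judge verdicts on sampled live traces.
canary_verdicts = [True] * 172 + [False] * 28      # 86% of 200
incumbent_verdicts = [True] * 160 + [False] * 40   # 80% of 200

diff, lo, hi = rate_diff_ci(canary_verdicts, incumbent_verdicts)
print(f"diff={diff:+.3f}, CI=({lo:+.3f}, {hi:+.3f})")
if lo > 0:
    print("canary is better on this sample; proceed with rollout")
elif hi < 0:
    print("canary is worse; roll back")
else:
    print("inconclusive; keep sampling before full rollout")
```

With these numbers the six-point gap is not yet distinguishable from noise, so the interval straddles zero. That is the usual argument for sizing the canary sample before reading anything into the difference.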

STEP 5

Close the loop: production traffic becomes tomorrow's eval set.

The flywheel that makes the whole discipline compound: sample production traces, surface the failures and near-misses, freeze the instructive ones (with their environment) into replayable eval tasks, and fix them; each fix is then permanently guarded. Without this, your eval set is a fixed snapshot decaying against a moving world (E1's eval-set rot). With it, the eval set tracks reality and every incident becomes a permanent immune-system entry.

# production -> eval flywheel
for tr in sample(prod_traces, stratify="failure_mode"):
    if tr.failed or tr.human_took_over or tr.low_judge_score:
        case = freeze(tr)              # trace + env, replayable
        eval_set.add(case)             # dated, decontaminated by origin
        # fix is not done until this case passes AND nothing else regressed

Bias the sampler toward the tails, not the mean: failures, human takeovers, low judge scores, high-cost runs. A thousand happy-path traces teach the eval set nothing; the twenty weird ones are the entire value, and they are exactly what uniform sampling drowns.
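A tail-biased sampler can be sketched as a simple stratified draw: spend most of the review budget on flagged traces and keep a small uniform slice of happy-path traces. The `Trace` fields, the thresholds, and the 10% happy-path share are assumptions for illustration, not recommended values.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trace:
    trace_id: str
    failed: bool = False
    human_took_over: bool = False
    judge_score: float = 1.0
    cost_usd: float = 0.01


def is_tail(tr: Trace, low_score: float = 0.5, high_cost: float = 1.0) -> bool:
    return (tr.failed or tr.human_took_over
            or tr.judge_score < low_score or tr.cost_usd > high_cost)


def tail_biased_sample(traces: List[Trace], budget: int,
                       happy_fraction: float = 0.1, seed: int = 0) -> List[Trace]:
    """Spend most of the review budget on tail traces; keep a small uniform
    slice of happy-path traces as a sanity check on the filters themselves."""
    rng = random.Random(seed)
    tails = [t for t in traces if is_tail(t)]
    happy = [t for t in traces if not is_tail(t)]
    n_happy = min(len(happy), int(budget * happy_fraction))
    n_tail = min(len(tails), budget - n_happy)
    return rng.sample(tails, n_tail) + rng.sample(happy, n_happy)


traces = [Trace(f"ok-{i}") for i in range(1000)]
traces += [Trace(f"fail-{i}", failed=True) for i in range(12)]
traces += [Trace(f"takeover-{i}", human_took_over=True) for i in range(5)]
traces += [Trace(f"pricey-{i}", cost_usd=3.0) for i in range(3)]

picked = tail_biased_sample(traces, budget=30)
print(sum(is_tail(t) for t in picked), "tail traces out of", len(picked))
```

Here all 20 tail traces are selected alongside 3 happy-path ones. A uniform draw of 30 from the same 1,020 traces would be expected to contain fewer than one tail trace.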

STEP 6

The ratchet, and its honest cost.

The discipline that makes all of this add up is the no-regression ratchet: a change merges only if no covered task regressed and the absolute floor holds. The eval score moves up or stays level, never silently down. This converts agent development from a random walk (every prompt tweak fixes one thing and breaks two unseen others) into monotone progress.

The honest cost has three parts. The ratchet is exactly as good as its coverage. It can ossify into overfitting to the suite, so rotate held-out tasks the ratchet never sees (per E1/E4). And an epsilon set too tight will block real improvements whose apparent regressions are only run-to-run noise.

The eval is the only spec an agent has. Gate every merge on a no-regression ratchet over a deliberately rotated set fed by sampled production failures. Progress that is not ratcheted is not progress; it is a walk that happens to be green today.
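The rotation of held-out tasks can be sketched with a deterministic hash split. This is one possible scheme, not a prescribed one: `split_for_epoch`, the epoch labels, and the 20% hold-out share are hypothetical. Held-out tasks are scored and reported but never fed to the gate.

```python
import hashlib
from typing import Iterable, List, Tuple


def split_for_epoch(task_ids: Iterable[str], epoch: str,
                    holdout_pct: int = 20) -> Tuple[List[str], List[str]]:
    """Deterministically split tasks into (gated, held_out) for one epoch.

    Hashing the task id together with the epoch label reshuffles
    membership whenever the epoch label changes.
    """
    gated, held_out = [], []
    for task_id in sorted(task_ids):
        digest = hashlib.sha256(f"{epoch}:{task_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        (held_out if bucket < holdout_pct else gated).append(task_id)
    return gated, held_out


tasks = [f"task-{i:03d}" for i in range(200)]
gated_q1, held_q1 = split_for_epoch(tasks, epoch="2025-Q1")
gated_q2, held_q2 = split_for_epoch(tasks, epoch="2025-Q2")

print(len(gated_q1), len(held_q1))
print("held out in both epochs:", len(set(held_q1) & set(held_q2)))
```

Because the split depends only on the task id and the epoch label, every CI run in the same epoch sees the same partition without any shared state. A widening gap between gated and held-out scores is a hint that changes are being tuned to the suite rather than to the behavior.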