Evaluation & Observability

Measuring agents that don't have a single right answer — outcome vs trajectory evals, LLM-as-judge, traces, benchmarks.

Why Evaluating Agents Is Hard

Non-determinism, compounding multi-step error, no single gold answer, path-dependence, eval cost, and dataset rot: the six reasons one clean number is a lie.

Outcome vs Trajectory Evaluation

End-state predicates vs grading the decision sequence: when each is right, partial credit, and tool-call assertions as the highest-leverage safety check.

LLM-as-Judge for Agents

Rubric design, pairwise vs pointwise, the biases that invert verdicts, calibrating against human labels, and the cases where you must not use a judge.

Reading Agent Benchmarks Critically

What SWE-bench, GAIA, τ-bench and WebArena actually measure, why contamination and harness sensitivity make rank a weak signal, and the small custom set that really decides.

Tracing & Observability for Agents

The trace is the data structure, not a log: what to record per step, spans and OpenTelemetry GenAI conventions, and trajectory replay as the bridge to eval.

Eval-Driven Agent Development

The eval is the only spec an agent has: tiered CI gates, golden trajectories, offline vs online, the production-to-eval flywheel, and the no-regression ratchet.