Evaluation & Observability
Measuring agents that don't have a single right answer: outcome vs trajectory evals, LLM-as-judge, traces, and benchmarks.
- Why Evaluating Agents Is Hard
  Non-determinism, compounding multi-step error, no single gold answer, path-dependence, eval cost, and dataset rot: the six reasons one clean number is a lie.
- Outcome vs Trajectory Evaluation
  End-state predicates vs grading the decision sequence: when each is right, partial credit, and tool-call assertions as the highest-leverage safety check.
- LLM-as-Judge for Agents
  Rubric design, pairwise vs pointwise, the biases that invert verdicts, calibrating against human labels, and the cases where you must not use a judge.
- Reading Agent Benchmarks Critically
  What SWE-bench, GAIA, τ-bench and WebArena actually measure, why contamination and harness sensitivity make rank a weak signal, and the small custom set that really decides.
- Tracing & Observability for Agents
  The trace is the data structure, not a log: what to record per step, spans and OpenTelemetry GenAI conventions, and trajectory replay as the bridge to eval.
- Eval-Driven Agent Development
  The eval is the only spec an agent has: tiered CI gates, golden trajectories, offline vs online, the production-to-eval flywheel, and the no-regression ratchet.