Evaluation & Observability

Measuring agents that don't have a single right answer — outcome vs trajectory evals, LLM-as-judge, traces, benchmarks.

  1. Why Evaluating Agents Is Hard
    Non-determinism, compounding multi-step error, no single gold answer, path-dependence, eval cost, and dataset rot — the six reasons one clean number is a lie.
  2. Outcome vs Trajectory Evaluation
    End-state predicates vs grading the decision sequence: when each is right, partial credit, and tool-call assertions as the highest-leverage safety check.
  3. LLM-as-Judge for Agents
    Rubric design, pairwise vs pointwise, the biases that invert verdicts, calibrating against human labels, and the cases where you must not use a judge.
  4. Reading Agent Benchmarks Critically
    What SWE-bench, GAIA, τ-bench and WebArena actually measure, why contamination and harness sensitivity make rank a weak signal, and why a small custom eval set is what really decides.
  5. Tracing & Observability for Agents
    The trace is the data structure, not a log: what to record per step, spans and OpenTelemetry GenAI conventions, and trajectory replay as the bridge to eval.
  6. Eval-Driven Agent Development
    The eval is the only spec an agent has: tiered CI gates, golden trajectories, offline vs online, the production-to-eval flywheel, and the no-regression ratchet.