Outcome vs trajectory evaluation — grading the destination, the route, or both.
There are two fundamentally different questions you can ask of an agent run: did it reach an acceptable end state (outcome), and did it get there acceptably (trajectory)? They catch different bugs, cost different amounts, and are right in different situations. Conflating them, or defaulting to outcome-only because it is cheaper, ships agents that are correct by luck and dangerous in process. This essay defines both, shows when each is the right tool, and explains how to combine them with partial credit and tool-call assertions.
Outcome eval: a predicate over the final world state.
Outcome (or end-state) evaluation ignores everything the agent did and asks one question of the world afterward: is the post-condition satisfied? The gold standard is an executable checker, not a string match: run the test suite, query the database, hit the API, diff the filesystem.
# outcome check: never compare the agent's prose, check the world
def check_outcome(env) -> bool:
    return (
        env.run_tests() == "pass"
        and env.db.query("select count(*) from orders where state='shipped'") == 1
        and not env.fs.exists("/tmp/scratch.lock")
    )
Strengths: objective, cheap to score, robust to the infinite ways a correct agent can phrase or path its way to the goal. Blind spots: it cannot see how the goal was reached. It passes the lucky guesser, passes the agent that achieved the state via a catastrophic-then-reverted side effect, and gives you a single bit when the agent got 90% of the way and you'd like to know that.
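To show what "check the world" can look like when it actually runs, here is a minimal sketch that assumes the world is an in-memory SQLite database plus a scratch directory. The `orders` table layout, the `scratch.lock` file name, and the two-argument `check_outcome` signature are stand-ins chosen for illustration, not part of any real harness.

```python
import sqlite3
import tempfile
from pathlib import Path


def check_outcome(conn: sqlite3.Connection, workdir: Path) -> bool:
    """Post-condition: exactly one shipped order and no leftover lock file."""
    (shipped,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE state = 'shipped'"
    ).fetchone()
    return shipped == 1 and not (workdir / "scratch.lock").exists()


# Build a tiny stand-in world, as if the agent had just finished its run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO orders (state) VALUES ('shipped')")

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    clean_run = check_outcome(conn, workdir)   # post-condition holds
    (workdir / "scratch.lock").touch()         # simulate a leftover lock file
    dirty_run = check_outcome(conn, workdir)   # post-condition violated
```

Note that nothing in the checker reads the agent's transcript: two agents that took completely different routes get the same verdict if they leave the same world behind, which is both the strength and the blind spot described above.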
Trajectory eval: grading the sequence of decisions.
Trajectory evaluation scores the sequence — the steps, tool calls, arguments, observations, and the reasoning that connected them. It answers questions outcome eval structurally cannot: did the agent take a forbidden action, call the right tool with the right arguments, recover from the injected error, avoid an irreversible operation, finish in a sane number of steps?
- Reference-trajectory match: compare against one or more known-good paths. Brittle: punishes valid alternative routes. Use only when the path genuinely is the spec.
- Property assertions over the trace: not "match this path" but "these invariants held": never called delete_* without a prior confirm, never sent PII to the external tool, retried at most 3 times. This is the robust form and the default you should reach for.
- Step-wise judging: an LLM judge scores each decision in context ("was calling search here reasonable given the state?"). Expensive and noisy; reserve for diagnosis, not the CI gate.
Exact reference-path matching is the classic trajectory-eval mistake: it conflates "did something different" with "did something wrong" and trains your agent to be a brittle path-replayer. Assert invariants and forbidden actions, not the one true sequence.
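The contrast is easy to see on a toy trace. The sketch below assumes a trace is a list of tool-call records; the `Call` type, the tool names, and the reading of "retry" as an identical call repeated back-to-back are all illustrative assumptions. The agent takes a valid alternative route (it lists the directory first), so exact-path matching rejects it while the invariant checks accept it.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Call:
    """One tool call in a trace (hypothetical record shape)."""
    tool: str
    args: dict = field(default_factory=dict)


def matches_reference(trace, reference):
    """Brittle check: the tool sequence must equal the reference exactly."""
    return [c.tool for c in trace] == reference


def delete_always_confirmed(trace):
    """Invariant: every delete_* call has its own confirm somewhere before it."""
    confirmed = False
    for call in trace:
        if call.tool == "confirm":
            confirmed = True
        elif fnmatchcase(call.tool, "delete_*"):
            if not confirmed:
                return False
            confirmed = False  # the next delete needs a fresh confirm
    return True


def retries_within(trace, limit=3):
    """Invariant: no identical call is repeated back-to-back more than `limit` times."""
    retries, prev = 0, None
    for call in trace:
        key = (call.tool, sorted(call.args.items()))
        retries = retries + 1 if key == prev else 0
        if retries > limit:
            return False
        prev = key
    return True


reference = ["search", "confirm", "delete_file"]
# A valid alternative route: the agent lists the directory before searching.
trace = [Call("list_dir"), Call("search"), Call("confirm"), Call("delete_file")]

path_ok = matches_reference(trace, reference)
invariants_ok = delete_always_confirmed(trace) and retries_within(trace)
```

Here `path_ok` is false and `invariants_ok` is true: the agent did something different, not something wrong, and only the invariant form can tell those apart.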
When each is the right tool.
- Outcome-only is right when the post-condition is fully verifiable, the path is irrelevant to value, and there are no irreversible side effects (or they're sandboxed): code that must make tests pass, a query that must return the right rows.
- Trajectory matters when actions have real-world consequences (money moved, emails sent, prod data touched), when "right answer, wrong reason" must be caught, when you are debugging why the outcome failed, or when the task has no clean checkable end state and the process is the only thing you can inspect.
- Both, always, for anything you'd actually deploy. Outcome is the headline pass/fail; trajectory is the safety and quality gate layered on top. An agent that moves money must satisfy both "the transfer arrived" and "it never attempted a transfer to an unverified account."
The decision rule: outcome eval tells you whether to celebrate; trajectory eval tells you whether to trust it again tomorrow. Production agents need the second far more than a leaderboard does.
Partial credit: when one bit is too coarse.
Binary outcome scoring throws away the difference between "did nothing useful" and "completed 7 of 8 subgoals then stumbled." For long multi-stage tasks that destroys your signal — every iteration reads as 0% until the day it reads 100%, and you cannot tell improvement from noise. Decompose the task into independently checkable sub-goals and score the fraction achieved, ideally with a dependency-aware weighting so unlocking later stages counts for more.
# partial credit over checkable subgoals (weighted)
SUBGOALS = [
    ("repo cloned", 0.1, lambda e: e.fs.exists("/work/.git")),
    ("bug reproduced", 0.2, lambda e: e.ran("pytest -k repro")),
    ("fix applied", 0.3, lambda e: e.diff_touches("core/")),
    ("tests pass", 0.4, lambda e: e.run_tests() == "pass"),
]
score = sum(w for _, w, ok in SUBGOALS if ok(env))
Keep one strict binary on top of the partial-credit score. Partial credit is for tracking progress between releases; the binary "fully solved" is what you report and gate on. Optimizing the partial score alone breeds agents that ace the easy subgoals and never close the hard one.
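One way to keep both numbers side by side is to have the grader return them together. This is a sketch under simplifying assumptions: `state` is a plain dict standing in for the real environment, and the `Subgoal` type, the dict keys, and the `grade` function are hypothetical names.

```python
from typing import Callable, NamedTuple, Tuple


class Subgoal(NamedTuple):
    name: str
    weight: float
    check: Callable[[dict], bool]


# Later stages carry more weight, mirroring the weighting discussed above.
SUBGOALS = [
    Subgoal("repo cloned", 0.1, lambda s: s.get("cloned", False)),
    Subgoal("bug reproduced", 0.2, lambda s: s.get("reproduced", False)),
    Subgoal("fix applied", 0.3, lambda s: s.get("patched", False)),
    Subgoal("tests pass", 0.4, lambda s: s.get("tests") == "pass"),
]


def grade(state: dict) -> Tuple[float, bool]:
    """Return (partial score in [0, 1], strict 'fully solved' flag)."""
    passed = [g.check(state) for g in SUBGOALS]
    total = sum(g.weight for g in SUBGOALS)
    earned = sum(g.weight for g, ok in zip(SUBGOALS, passed) if ok)
    return round(earned / total, 3), all(passed)


almost = {"cloned": True, "reproduced": True, "patched": True, "tests": "fail"}
done = {**almost, "tests": "pass"}

partial_almost, solved_almost = grade(almost)  # progress signal, but not solved
partial_done, solved_done = grade(done)        # full score and solved
```

The release gate would read only the boolean; the float goes on the dashboard. Returning them as a pair makes it harder to report the partial score by accident where the binary was meant.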
Tool-call assertions: the highest-leverage trajectory check.
Most production-relevant trajectory properties reduce to assertions over the tool-call log, and these are cheap, deterministic, and exactly the bugs that hurt in production. They are the trajectory equivalent of unit tests — run them in CI on every change.
- Presence/absence: required tool was called; forbidden tool was never called.
- Argument constraints: transfer(amount) never exceeded the cap; the destructive call always carried dry_run=True on the eval env.
- Ordering & preconditions: confirm preceded every delete; auth happened before any data read.
- Budget: total tool calls / tokens / wall-clock under the ceiling; no pathological retry loop.
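Each of the four kinds of assertion above fits in a few lines. The following is a rough sketch that assumes the tool-call log is a list of dicts in call order; the log format, the tool names, the cap value, and the helper names are all hypothetical, and the budget check covers call count only.

```python
from fnmatch import fnmatchcase

# Hypothetical log format: one dict per tool call, in call order.
LOG = [
    {"tool": "auth", "args": {}},
    {"tool": "read_account", "args": {"id": 7}},
    {"tool": "transfer", "args": {"amount": 40}},
]
CAP = 100  # hypothetical transfer cap


def called(log, pattern):
    """Presence: at least one call matches the glob pattern."""
    return any(fnmatchcase(c["tool"], pattern) for c in log)


def never_called(log, pattern):
    """Absence: no call matches the glob pattern."""
    return not called(log, pattern)


def args_satisfy(log, tool, predicate):
    """Argument constraint: every call to `tool` has args accepted by `predicate`."""
    return all(predicate(c["args"]) for c in log if c["tool"] == tool)


def precedes_all(log, first, pattern):
    """Ordering: `first` is called before the earliest call matching `pattern`."""
    for c in log:
        if c["tool"] == first:
            return True
        if fnmatchcase(c["tool"], pattern):
            return False
    return True  # no matching call at all, so nothing was violated


def within_budget(log, max_calls):
    """Budget: the total number of tool calls does not exceed the ceiling."""
    return len(log) <= max_calls


results = {
    "called auth": called(LOG, "auth"),
    "never called delete_*": never_called(LOG, "delete_*"),
    "transfer amount <= cap": args_satisfy(LOG, "transfer", lambda a: a["amount"] <= CAP),
    "auth before any read_*": precedes_all(LOG, "auth", "read_*"),
    "calls <= 15": within_budget(LOG, 15),
}
```

Because every helper is a pure function of the log, the checks are deterministic and can run against recorded traces without re-executing the agent.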
trajectory assertions (task: refund-flow, 1 run)
  outcome: refund recorded               PASS
  assert:  called verify_identity        PASS
  assert:  never called delete_*         PASS
  assert:  refund.amount <= order.total  FAIL  ← refunded 120 on an 80 order
  assert:  steps <= 15                   PASS  (9)
  verdict: OUTCOME-PASS / TRAJECTORY-FAIL -> do not ship
The FAIL row is the whole argument for this essay: an outcome-only harness reports a clean pass and ships an agent that over-refunds. The trajectory assertion is one cheap line, and it is the line that saves you.
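A harness that produces this kind of verdict can be quite small. The sketch below uses a shortened, made-up log for the refund flow (three calls rather than the nine steps in the report), and `OUTCOME_OK` stands in for whatever executable end-state check confirms the refund was recorded; the log format and the `verdict` function are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical tool-call log for one refund-flow run.
LOG = [
    {"tool": "verify_identity", "args": {"user": "u1"}},
    {"tool": "lookup_order", "args": {"order": "o9"}},
    {"tool": "refund", "args": {"order": "o9", "amount": 120}},
]
ORDER_TOTAL = 80
OUTCOME_OK = True  # stand-in for an executable "refund recorded" check

ASSERTIONS = [
    ("called verify_identity",
     lambda log: any(c["tool"] == "verify_identity" for c in log)),
    ("never called delete_*",
     lambda log: not any(fnmatchcase(c["tool"], "delete_*") for c in log)),
    ("refund.amount <= order.total",
     lambda log: all(c["args"]["amount"] <= ORDER_TOTAL
                     for c in log if c["tool"] == "refund")),
    ("steps <= 15", lambda log: len(log) <= 15),
]


def verdict(outcome_ok, log):
    """Return (verdict label, names of failed assertions, ship decision)."""
    failed = [name for name, check in ASSERTIONS if not check(log)]
    label = "OUTCOME-{} / TRAJECTORY-{}".format(
        "PASS" if outcome_ok else "FAIL",
        "FAIL" if failed else "PASS",
    )
    return label, failed, outcome_ok and not failed


label, failed, ship = verdict(OUTCOME_OK, LOG)
print(label, "->", "ship" if ship else "do not ship", failed)
```

The ship decision is a conjunction: a passing outcome cannot compensate for a failed assertion, and a clean trajectory cannot compensate for a failed outcome.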
The honest tradeoff.
Trajectory eval is more informative than outcome eval, but it is also more expensive to maintain, more brittle, and more opinionated: every assertion is a judgment call you now own, and over-specifying the path turns your eval into a straitjacket that fails good agents for being creative. Grade outcomes to know if the agent works; assert trajectory invariants to know if it is safe to let it run. Ship only when both pass, and never let the cheaper of the two stand in for the other.