Tracing & observability for agents — the trace is the data structure.
You cannot evaluate, debug, or improve what you cannot see, and an agent is the least observable kind of software: a non-deterministic loop of model calls and tool effects with the decisive state living inside a context window you did not log. This essay argues for one idea — the execution trace is not telemetry bolted on afterward, it is the central data structure of an agent system — and shows what to record per step, how to model it as spans, the OpenTelemetry GenAI conventions, and why trajectory replay is what turns observability into eval.
The trace is the data structure, not a log.
The instinct is to add logging to an agent the way you'd add it to a web service: a line here, a counter there. That is backwards. For an agent, the full structured trace of a run — every input, decision, tool call, observation, and token — is the object the rest of the system operates on. Your eval harness scores traces. Your debugger reads traces. Your regression gate diffs traces. Your fine-tuning set is filtered traces. Your judge grades traces. If the trace is lossy, every one of those is operating on corrupted input.
Design rule: a run must be fully reconstructable from its trace alone. If you have to re-run the agent to understand what it did, your trace is incomplete — and re-running a non-deterministic agent gives you a different run, so the information is gone for good.
What to record per step — all of it.
Each agent step is one unit of work and must capture enough to replay and to assert against. Logging only the final answer is the single most common observability failure and it makes every downstream essay impossible.
- Inputs — the exact rendered prompt/messages sent to the model (post-templating, post-retrieval), system prompt version, model id, decoding params. Not the template — the bytes that went over the wire.
- Decision — raw model output including reasoning/thinking content, the parsed tool selection, and arguments.
- Tool call — tool name, full arguments, latency, success/error, and the full observation returned (truncating the observation is discarding the agent's actual input to the next step).
- Accounting — prompt/completion tokens, cost, wall-clock, retry count, cache hit/miss.
- Linkage — ids tying the step to its run, parent step, session, and (for multi-agent) the agent that emitted it.
# the minimum step record; anything less is not replayable step = { "run_id": rid, "step": i, "parent": i - 1, "model": "<id>", "prompt_version": "sys@7", "input_messages": msgs, # exact bytes sent "output": raw, "tool": name, "args": args, "observation": obs_full, # NOT truncated "tokens_in": ti, "tokens_out": to, "latency_ms": dt, "error": err, }
Model it as spans: a run is a trace tree.
The right shape is the distributed-tracing model. A run is a trace; each step, model call, and tool call is a span with start/end, parent, attributes, and status. Spans nest: a step span contains a model-call span and a tool-call span; a sub-agent is a child trace linked by context propagation. This is not a metaphor — it is the same primitive APM uses, which means agent traces drop into existing tracing backends instead of a bespoke pile of JSON.
- Spans give you the timeline for free — where latency went, what ran in parallel, which tool blocked.
- The parent/child tree is the trajectory — trajectory eval (E2) is literally assertions over this tree.
- Context propagation across agents means a multi-agent run is one connected trace, not N orphaned logs you cannot stitch.
Use the OpenTelemetry GenAI conventions, not a bespoke schema.
OpenTelemetry has semantic conventions for GenAI/agent spans: standardized attributes like gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens/output_tokens, tool-call span structure, and conventions for capturing prompt/response content. Adopting them is a high-leverage, low-cost decision.
# OTel GenAI semantic conventions on an agent span with tracer.start_as_current_span("invoke_agent") as s: s.set_attribute("gen_ai.operation.name", "invoke_agent") s.set_attribute("gen_ai.request.model", model_id) s.set_attribute("gen_ai.usage.input_tokens", ti) s.set_attribute("gen_ai.usage.output_tokens", to) # tool calls become child spans: gen_ai.operation.name=execute_tool
Standard conventions are not bureaucracy — they are why your traces work in any OTel backend, why two teams' agents are comparable, and why a vendor-neutral exporter means you are never locked into one observability tool. A bespoke trace schema is a migration you will pay for later with interest.
Trajectory replay: where observability becomes eval.
A complete trace is replayable: feed the recorded inputs and observations back through a candidate prompt/policy and see if it makes the same or better decisions — without re-hitting live tools. This is the bridge between this essay and all the others.
- Counterfactual debugging — "would prompt v8 have avoided the bad call at step 12?" Replay step 12 with the recorded context; you get an answer in seconds, not a re-run.
- Regression eval from real traffic — replay a sampled set of production traces against the new build; a decision that flipped from good to bad is a regression, caught before deploy (this is E6's gate).
- Eval-set construction — the most informative eval tasks are real failed traces, frozen with their environment, turned into replayable cases.
replay: prod trace #4471 vs build candidate-92
step 3 tool=search same ok
step 7 tool=db.write args DIFFER ← candidate adds dry_run
step 12 tool=delete_user BLOCKED ← candidate refuses, prod did it
verdict: candidate FIXES the step-12 incident; ship behind flag
That replay turned a production incident into a regression test and a fix verification in one pass — impossible if the trace had been a few log lines and a final answer.
The honest tradeoff.
Full-fidelity tracing is not free: storage grows fast, prompt/observation content can contain PII and secrets (redact at the boundary, never log raw credentials), and naive synchronous export adds latency to the hot path (export async, sample by policy, keep 100% of errors). But the alternative — thin logs and a final answer — makes evaluation, debugging, and improvement structurally impossible, not merely harder. Pay the tracing cost deliberately and treat the trace as the system's primary data structure; an agent you cannot fully replay is an agent you cannot actually evaluate.