Three layers of evals: which to run when, and why none alone is enough.
Chapter 3.1 taught the discipline of evaluating before shipping. Chapter 3.3 taught the grading mechanism that lets you grade at scale. This chapter teaches the architecture connecting them: evals come in three distinct layers, each with its own cadence, scope, cost, and the failures it catches. Most production agents need all three; treating any one as a substitute for the others is how teams discover regressions in production they thought their evals would have caught. By the end you'll know which checks belong at which layer, what each layer is good and bad at, and how to compose the layers into a feedback system that catches what matters quickly and cheaply.
Why evals come in three layers.
The testing-pyramid metaphor from software engineering — unit tests at the base, integration tests in the middle, end-to-end tests at the top — partially applies to agent evals. The shape is similar: cheap fast tests at the base run on every change; expensive slow tests at the top run rarely but cover ground the cheap tests can't. The structure rhymes.
Where the metaphor breaks: agent outputs are non-deterministic, ground truth is often subjective, and production reveals failures that no offline test can predict. So the agent version of the pyramid has a layer at the top that traditional software testing lacks: production observability as a first-class eval layer, not just monitoring. You don't replace testing with observability; you do both, with one informing the other.
The three layers, at a glance
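The structural picture: Layer 1 is a set of deterministic checks that run on every commit; Layer 2 is an offline, judge-graded eval suite that runs on selected PRs and on a release cadence; Layer 3 is continuous measurement of production traffic. Each layer has its own discipline: what belongs there, what its cadence looks like, what kind of failure it specializes in catching.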
The four properties that distinguish the layers
Four axes capture how the layers differ: cadence, scope, cost, and the failures each layer catches. Understanding each axis helps you decide where to put any particular check.
The pattern that emerges from the axes: as you go up the layers, cost and signal-quality both increase, while feedback speed decreases. You want to catch failures at the lowest layer they're detectable at — both because it's cheaper and because the feedback loop is faster.
Each layer's specialization
Different failure modes show up at different layers, and treating the layers as substitutes leaves gaps.
Layer 1 catches: output schema violations, missing citations, tool calls that don't match the schema, budget overruns, refusal patterns matching prohibited responses. Things that have a deterministic right/wrong. A regression that breaks one of these gets caught in seconds, in CI, before merge. Cheap to catch, cheap to fix.
Layer 2 catches: behavioral changes that don't violate any deterministic check but produce worse answers. A new prompt that's syntactically fine but reduces accuracy by 4 points; a model swap that produces correctly-formatted but less helpful responses; a tool description change that subtly reroutes the agent. These are detectable only by running the agent end-to-end and grading the output quality. Layer 1 can't see them because nothing structural is wrong.
Layer 3 catches: drift over time, novel failure modes that didn't exist in your eval set, production-only conditions (real user queries, real network latency, real tool failures, real adversarial inputs). The most important class of Layer 3 detection: the queries your real users send that your eval set didn't anticipate. Layer 2 grades against a fixed curated set; Layer 3 reveals what's missing from that set.
Why each layer is non-substitutable
The temptation, especially early in a project, is to invest in one layer and treat it as enough. Three predictable failure modes follow.
"We have Layer 2; we don't need Layer 1." Layer 2 is slow and expensive, so running it on every commit is impractical. Commits merge first and face Layer 2 only on a delayed schedule. A basic-correctness failure (the output no longer validates as JSON) does get caught, but hours later, after merge and possibly after deployment. Layer 1 catches it in 5 seconds on the PR. Skipping Layer 1 makes Layer 2's feedback loop slower than it should be.
"We have Layer 1; that's enough for now." Layer 1 is fast and cheap, so it's a tempting place to stop. But Layer 1 by definition only catches deterministic failures — and most quality regressions in agents aren't deterministic failures. The agent produces well-formed output that's worse; Layer 1 happily passes it. Teams that stop at Layer 1 ship quality regressions consistently because their tests don't detect them.
"We have Layer 1 and Layer 2; production monitoring is for ops, not eng." The reverse mistake. Without Layer 3, you have no signal about how your agent performs on the actual queries users send — which inevitably differ from what's in your curated eval set. Real users hit edge cases the eval set didn't include. Production traffic patterns shift over time. The agent's quality silently degrades on segments of traffic you weren't measuring. Layer 3 is what catches this — and feeds the gaps back into Layer 2's curation, closing the loop.
The pyramid shape: more at the base, less at the top
The proportion of checks across the layers, for a healthy production agent, follows the pyramid pattern:
- Layer 1: dozens to hundreds of individual deterministic checks. Each runs in milliseconds. Every commit runs all of them.
- Layer 2: an eval suite of 30-100 curated queries. Each run is a full agent invocation. Runs on selected PRs and on a release cadence.
- Layer 3: continuous metrics on 100% of production traffic. Sampling for deep analysis. Always-on dashboards.
Reading the proportions: many cheap fast checks at Layer 1, fewer expensive thorough checks at Layer 2, broad statistical observation at Layer 3. Building the system in this order — Layer 1 first, then Layer 2, then Layer 3 — is the natural progression, because each layer is foundational to the one above it. Layer 2's evals build on the structured outputs Layer 1 ensures. Layer 3's metrics rely on the instrumentation patterns that Layer 2 makes routine.
A common pattern across teams: most have Layer 2 only. The narrative goes: "we need to evaluate our agent, let's build a test set with judges." That gets you a Layer 2 system. Layers 1 and 3 are easy to skip because Layer 2 already produces visible numbers.
The diagnostic: do you have CI that fails on a schema regression in seconds, or does that bug only get caught when you run the full eval suite? That's the Layer 1 gap. Do you have production dashboards that would alert you if the agent's cost-per-query or user-thumbs-down rate spiked tomorrow? That's the Layer 3 gap.
The two layers people typically miss are the cheapest ones to set up. The discipline is to recognize them as their own concerns, not afterthoughts on the eval suite.
This chapter and chapter 3.1 are complementary, not redundant. Chapter 3.1 was about the rhythm: when to run evals, how to interpret them, how to predict-then-measure. This chapter is about the substrate: what kinds of evals exist, and which kind serves which purpose. The eval cadence diagram in 3.1 mostly described Layer 2 (the offline curated-query suite). This chapter says: that's one of three layers, and you need the others too.
The two chapters together give you both pieces — the rhythm of the practice (3.1) and the architecture of the underlying system (3.2). A team can do 3.1's rhythm on top of any of the three layers; the rhythm is the same, but what you're measuring depends on which layer you're working with.
The structure generalizes — deterministic checks + offline benchmarks + production monitoring is a pattern across ML systems (classifiers, recommendation systems, etc.). What's specific to LLM agents is the relative importance of each layer.
For a classifier, Layer 2 (offline benchmarks against labeled test sets) does most of the work; deterministic checks are minor. For an LLM agent, Layer 1 (schema, format, structural) catches a much higher fraction of total failures because the output space is richer and there's more that can be structurally wrong. And Layer 3 matters more because agents face open-ended inputs (a classifier's eval set is a reasonable proxy for production; an agent's curated eval set never is).
So the layers are the same shape; the weights are different. For agents specifically: more Layer 1 than you'd expect, robust Layer 2, and Layer 3 that drives most of your iteration loop after the first few releases.
Layer 1: deterministic checks.
This is the layer everyone underinvests in, because it doesn't feel like "evaluation" — it feels like "tests." That's exactly the right framing. Layer 1 evals are unit tests for your agent: small focused checks that run in milliseconds, catch specific regressions, and run on every commit. They cost essentially nothing and prevent a meaningful fraction of bugs that would otherwise need expensive Layer 2 runs to detect.
The trap people fall into: assuming agents can't have unit-test-style checks because their outputs are non-deterministic. Their outputs are non-deterministic; their contracts are not. An agent's tool calls should always match the schema. Its citations should always reference real sources. Its budget should never be exceeded. These are deterministic properties that hold on every run, regardless of what the model decides — and that means they're cheaply testable.
The categories that belong at Layer 1
Six concrete categories, each with examples of what to check:
Schema validation. The agent's structured outputs match their declared schemas. Tool calls have the right argument types. Findings objects have all required fields. JSON outputs validate. Run a schema check on every output your agent produces; flag any violation.
Citation existence and format. Every citation in the agent's output references a source URL or document ID that's been seen in the conversation. No fabricated URLs. No malformed citation tokens. Run regex / structural checks against the citation format, and lookup checks against the actual sources the agent had access to.
Tool-call correctness. Tool names exist in the toolkit. Required arguments are present. Argument values pass the schema's enum / range / pattern constraints. Tool-use IDs match between request and response. This is the chapter 0.3 protocol-level discipline, automated.
Budget compliance. The agent stays within its tool-call budget, token budget, time budget. A run that hits the budget cap is not necessarily a failure — but a run that exceeds the cap is a bug in your enforcement code (chapter 1.1's step budget).
Forbidden-action checks. The agent never calls tools from a blocklist. Never accesses paths outside its sandbox. Never produces output matching prohibited patterns (e.g., leaking API keys, raw PII). These are negative assertions — things that should never appear — and they're cheap to check.
Refusal patterns. When the agent declines a request, it does so via the structured refusal path your system supports, not by producing free-form text that looks like an answer but isn't. A "yes-but-actually-no" refusal is the worst of both worlds; checking for the structured form catches this.
What Layer 1 checks look like, concretely
Layer 1 checks are plain code, not LLM-based. The point is to be fast, cheap, and unambiguous. A reasonable Layer 1 suite is a directory of test functions that each assert one property of an agent run.
```python
# tests/layer1/test_output_contracts.py
import pytest
from jsonschema import validate

# collect_seen_urls is assumed to be a project helper that returns the set of
# URLs the run actually fetched; adjust the import to wherever it lives.
from agent import collect_seen_urls, run_agent
from agent.schemas import ResearchOutput

# The tests are coroutines, so they need an async plugin.
# This marker assumes pytest-asyncio is installed.
pytestmark = pytest.mark.asyncio


# Property: every agent output validates against its declared schema.
@pytest.mark.parametrize("query", [
    "What is the capital of France?",           # trivial
    "Summarize this PDF in three points.",      # typical
    "Find all bugs in this commit's changes.",  # complex
])
async def test_output_validates_against_schema(query):
    result = await run_agent(query)
    # Passes if the agent's output is structurally valid.
    # Fails immediately if any field is the wrong type or missing.
    validate(instance=result, schema=ResearchOutput.schema())


# Property: every cited URL appears in the conversation's source list.
async def test_no_fabricated_citations():
    query = "Summarize recent EV battery progress with citations."
    result = await run_agent(query)
    cited_urls = {c["source_url"] for c in result["claims"]}
    seen_urls = collect_seen_urls(result["trace"])
    fabricated = cited_urls - seen_urls
    assert not fabricated, f"Agent cited URLs it never accessed: {fabricated}"


# Property: tool calls never exceed the configured budget.
async def test_budget_enforcement():
    result = await run_agent("Complex multi-step research query", tool_budget=15)
    assert result["tool_calls_made"] <= 15


# Property: the agent never calls tools from the blocklist.
FORBIDDEN_TOOLS = {"send_email", "delete_record", "transfer_funds"}


async def test_no_forbidden_tools():
    result = await run_agent("Read-only research request")
    used = {c["tool_name"] for c in result["trace"]["tool_calls"]}
    assert not (used & FORBIDDEN_TOOLS), \
        f"Agent used forbidden tools: {used & FORBIDDEN_TOOLS}"
```
These look like normal pytest tests because they are normal pytest tests. They live in your repo alongside other tests, run on every commit, fail fast, and produce clear error messages. They're cheap to maintain — when a check starts failing, you usually know why within minutes.
What Layer 1 catches that's hard to catch elsewhere
The class of regression Layer 1 specifically prevents: silent contract violations. These are bugs where the agent's behavior changes in a way that breaks downstream consumers, without producing an obviously-wrong output.
Concrete examples that have hit production agents:
- A model swap (Sonnet → Haiku for cost) causes the agent to occasionally produce "confidence": "moderate" instead of one of the schema's allowed values ("high" | "medium" | "low"). The output looks reasonable; the downstream parser silently treats it as missing. Layer 1 catches this on the first run; Layer 2 might miss it if the judge isn't checking that specific field.
- A prompt edit causes the agent to add explanatory text around its tool calls, breaking a downstream tool-trace analyzer that expected clean function-call-only outputs. Layer 1 (assert tool calls match exact schema) catches it; Layer 2 (semantic quality grading) doesn't.
- A change to the citation-extraction prompt causes the agent to emit citations as [1] [2] footnote-style instead of inline-URL-style. The text reads fine to humans (and to LLM judges), but the citation-export tool produces empty CSV files. Layer 1 catches it; Layer 2 doesn't.
The pattern: these regressions don't change quality measurably, they change contract conformance. Layer 2 grades quality; Layer 1 grades contracts. Both matter.
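Two of the regressions above can be expressed as plain contract checks. The sketch below is illustrative only: the field names `confidence` and `answer`, and the rule that citations must be inline URLs rather than `[n]` markers, are assumptions drawn from the examples, not a known schema.

```python
import re

# Assumed contract, mirroring the examples above: confidence takes one of three
# values, and citations are inline URLs rather than [n] footnote markers.
ALLOWED_CONFIDENCE = {"high", "medium", "low"}
FOOTNOTE_MARKER = re.compile(r"\[\d+\]")


def contract_violations(output: dict) -> list:
    """Return human-readable contract violations for one captured agent output."""
    violations = []
    confidence = output.get("confidence")
    if confidence not in ALLOWED_CONFIDENCE:
        violations.append(f"confidence={confidence!r} is not an allowed value")
    if FOOTNOTE_MARKER.search(output.get("answer", "")):
        violations.append("answer uses footnote-style citation markers")
    return violations


# Hypothetical outputs: one conforming, one showing both regressions.
good = {"confidence": "medium", "answer": "Range improved (https://example.com/ev)."}
drifted = {"confidence": "moderate", "answer": "Range improved [1] [2]."}
```

Neither check needs a model call, and both fail with a message that names the broken contract.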
What Layer 1 cannot catch
For completeness, the failures Layer 1 is structurally blind to. Treating Layer 1 as enough means these slip through:
- Quality regressions in well-formed output. The agent's answer is the wrong answer, but the JSON is valid, the citations exist, and no tools were misused. Layer 1 says pass; the answer is still wrong.
- Subtle judgment shifts. A new prompt makes the agent slightly more confident on uncertain claims, or slightly more verbose, or slightly less likely to escalate edge cases. These show up in aggregate quality metrics, not in any individual structural check.
- Distributional issues across queries. "Works on these 50 tests" doesn't mean "works on the next 500 user queries." Layer 1 tests specific cases; it can't generalize.
These are what Layer 2 is for. The relationship: Layer 1 says "the agent still has a working contract"; Layer 2 says "the agent still produces good answers within that contract." Both questions matter; both layers exist because each answers one of them.
The discipline: write Layer 1 tests as part of feature work
The teams that get Layer 1 right treat it the same way they treat unit tests for normal code: every new behavior gets a test. New tool? Test that schema validation passes. New output field? Test that it appears in outputs. New refusal path? Test that the refusal patterns match. New budget? Test that it's enforced.
This sounds obvious; it's the discipline that drifts when a team is moving fast. The result of skipping it: a Layer 1 suite that covers the features from 6 months ago and silently misses regressions on features from this week. The fix is the same as for code unit tests — TDD if you're disciplined, retroactive coverage if you must, but always close the gap quickly.
If your Layer 1 suite is taking more than 30 seconds to run, something is wrong — either you've snuck Layer 2-style checks into the cheap layer (LLM judge calls hiding inside what should be deterministic), or you're running full agent invocations as part of Layer 1 (those belong in Layer 2). Layer 1 checks should test contracts on captured outputs, not generate fresh outputs. Cache the agent's response from an earlier run if you need an output to check against; that keeps the check fast and the failure cause obvious.
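A minimal sketch of checking a captured output rather than a fresh one, assuming the capture is stored as a JSON string. The field names and types in `REQUIRED_FIELDS` are hypothetical placeholders for whatever your own contract declares.

```python
import json

# Hypothetical contract: field name -> expected Python type after JSON decoding.
REQUIRED_FIELDS = {"claims": list, "tool_calls_made": int}


def check_captured_output(raw: str, tool_budget: int) -> list:
    """Check one captured agent output (a JSON string) without calling the model."""
    try:
        output = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(output, dict):
        return ["output is not a JSON object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(output.get(field), expected_type):
            problems.append(f"field {field!r} missing or not {expected_type.__name__}")

    calls = output.get("tool_calls_made")
    if isinstance(calls, int) and calls > tool_budget:
        problems.append(f"tool budget exceeded: {calls} > {tool_budget}")
    return problems
```

Because the input is a stored string, the check runs in microseconds and a failure points at the contract, not at model variance.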
Tool-ordering constraints sometimes belong at Layer 1. The question is whether the sequence constraint is genuinely structural or a quality preference. "Tool X must come before tool Y" can be a Layer 1 assertion if the constraint is hard (the agent's API requires Y to follow X) or if violating it always indicates a bug.
"The agent should usually fetch only relevant URLs" is not Layer 1 material — that's a quality judgment that Layer 2 handles. The line is: Layer 1 catches must-not-happen violations, not should-rarely-happen patterns.
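A hard ordering constraint of the "X before Y" kind can be sketched as a scan over the tool names in a trace. The tool names used in the usage example are hypothetical.

```python
from typing import List, Optional


def first_order_violation(tool_calls: List[str], before: str, after: str) -> Optional[int]:
    """Return the index of the first `after` call made before any `before` call.

    Returns None when the ordering constraint holds for the whole trace.
    """
    seen_before = False
    for index, name in enumerate(tool_calls):
        if name == before:
            seen_before = True
        elif name == after and not seen_before:
            return index
    return None


# Hypothetical tools: fetch_record requires a prior authenticate call.
ok_trace = ["authenticate", "search_docs", "fetch_record"]
bad_trace = ["search_docs", "fetch_record", "authenticate"]
```

A test would then assert `first_order_violation(trace, "authenticate", "fetch_record") is None`, which is a must-not-happen check rather than a preference.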
Two patterns make Layer 1 checks work on non-deterministic outputs. First, invariant checks: regardless of the specific output, certain properties always hold (output validates against schema, citations are non-fabricated, budget is respected). These work fine on non-deterministic outputs because the invariant is the same regardless of the specific values.
Second, property-based testing: run the agent multiple times on the same input and check that all runs satisfy the same invariants. If 9/10 runs validate against the schema and 1/10 doesn't, that's a Layer 1 bug — the invariant should hold every time. Strict mode (chapter 0.3) makes schema invariants nearly guaranteed; without it, you might see occasional violations that are worth catching.
Either way, you're not asserting "the output is X" — you're asserting "the output has property P." Properties are stable across non-determinism in a way that specific values aren't.
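The repeated-run pattern can be sketched as a helper that takes any agent callable and any invariant. `make_fake_agent` is a stand-in invented for this example so the sketch runs without a model; in practice `run` would wrap your real agent or replay captured runs.

```python
from typing import Callable, Iterable, List


def failing_runs(run: Callable[[str], dict], query: str,
                 invariant: Callable[[dict], bool], n_runs: int = 10) -> List[int]:
    """Call the agent n_runs times and return the indices of runs that break the invariant."""
    return [i for i in range(n_runs) if not invariant(run(query))]


def has_valid_confidence(output: dict) -> bool:
    """Invariant: the confidence field is one of the allowed values (assumed contract)."""
    return output.get("confidence") in {"high", "medium", "low"}


def make_fake_agent(confidences: Iterable[str]) -> Callable[[str], dict]:
    """Stand-in agent that replays a fixed sequence of outputs, one per call."""
    replay = iter(confidences)
    return lambda query: {"confidence": next(replay)}


fake_agent = make_fake_agent(["high", "low", "moderate", "medium"])
broken = failing_runs(fake_agent, "any query", has_valid_confidence, n_runs=4)
```

The assertion in a test is `broken == []`: the property must hold on every run, whatever values the runs produce.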
A flaky Layer 1 test is a bug in either the agent or the test itself. The whole point of Layer 1 is determinism — if the property you're checking depends on the model's behavior, it's not actually a Layer 1 property and you've miscategorized it.
Diagnose: is the test checking a contract (should pass 100% of the time, full stop) or a quality bar (depends on the model's behavior)? If contract, find why it's flaky and fix that. If quality, move it to Layer 2 where statistical evaluation is appropriate.
Letting flaky tests accumulate poisons the well — teams stop trusting CI signals, the failure-investigation reflex weakens, real bugs get ignored as "probably flaky." Hold the Layer 1 line at zero flakiness.
Layer 2: the offline judge-based eval suite.
This is the layer most teams already have, often calling it "evals" without qualifier. It's also the layer chapters 3.1 and 3.3 covered in depth — 3.1 taught the discipline of using it (hypothesis, prediction, verdict), 3.3 taught the grading mechanism (LLM-as-judge with calibration). This step puts both in context: Layer 2 catches behavioral changes that pass Layer 1 but produce worse answers, runs slower and more expensively than Layer 1, and has specific cost and curation disciplines that determine whether the layer is sustainable.
What goes here, by category
Layer 2 evals are full agent runs against a curated set of representative queries, with LLM judges grading the output along several dimensions. Five categories of check belong here:
Task completion. Did the agent successfully complete the task it was given? Did it return a substantive answer, escalate appropriately, or fail gracefully? Binary or graded; can be checked by a judge ("did the agent address the user's question?").
Factual accuracy. When the agent makes claims, are they correct? When it cites sources, do those sources actually support the claims? Chapter 4.3's citation-faithfulness check lives here. Programmatic where the source can be fetched; LLM judge for the verdict.
Quality judgments. "Was the answer good?" — graded by an LLM judge with a rubric. Chapter 3.3's pairwise and pointwise patterns apply. The judges should be calibrated against humans on a held-out calibration set (also from chapter 3.3).
Trajectory sensibility. Did the agent's path to the answer make sense? Tool calls in a reasonable order, no redundant searches, no obviously-wasted steps. A judge can evaluate this by looking at the trace.
Edge-case handling. A small set of curated tricky queries (ambiguous inputs, prompt injection attempts, out-of-scope requests, adversarial users) where the right behavior is specific. Did the agent behave correctly on each?
The shape of a Layer 2 suite
A typical mature Layer 2 setup is the architecture chapter 3.1 was describing, with the chapter 3.3 LLM-as-judge methodology powering the grading. Layer 2 is where those previous chapters' machinery actually runs.
The two-tier cadence: fast subset vs full suite
The discipline that makes Layer 2 sustainable in CI: a fast subset that runs on every PR, and a full suite that runs less often. The fast subset is 5–10 queries chosen to cover the main dimensions cheaply; it runs in 2–5 minutes and costs cents. The full suite is 30–100 queries; it runs in 20–60 minutes and costs more.
How to choose the fast subset: pick queries that exercise each major dimension once. One easy task-completion test, one typical citation-faithfulness test, one trajectory-sensibility test, etc. The fast subset's job isn't comprehensive coverage — it's regression detection on the dimensions you most care about. If the fast subset catches the regression, you don't need the full suite to confirm.
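One way to pick a fast subset is to take the first query listed for each dimension. This is only a sketch: the `dimension` and `id` keys and the sample cases are assumed for illustration, and a real selection would probably also weigh cost and runtime.

```python
from typing import Dict, List


def pick_fast_subset(eval_set: List[dict]) -> List[dict]:
    """Keep the first case seen for each dimension, preserving eval-set order."""
    chosen: Dict[str, dict] = {}
    for case in eval_set:
        chosen.setdefault(case["dimension"], case)
    return list(chosen.values())


# Hypothetical eval set with a 'dimension' tag on every case.
eval_set = [
    {"id": "q1", "dimension": "task_completion"},
    {"id": "q2", "dimension": "task_completion"},
    {"id": "q3", "dimension": "citation_faithfulness"},
    {"id": "q4", "dimension": "trajectory_sensibility"},
    {"id": "q5", "dimension": "citation_faithfulness"},
]
fast_subset = pick_fast_subset(eval_set)
```

Ordering the eval set so the most representative case per dimension comes first keeps the selection explicit and reviewable.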
The full suite runs:
- Triggered by a PR label (eval-full) when changes warrant it (prompt rewrites, model swaps, retrieval changes).
- On a schedule (nightly or weekly) regardless of PR activity, catching drift.
- Before any production release, as a release gate.
- Three times, with scores averaged, when noisy metrics feed a release-gate decision (chapter 3.1's multi-run pattern).
This two-tier structure gives you fast PR feedback (the fast subset signals regressions in minutes) and thorough release confidence (the full suite gives high-quality numbers when it matters).
Cost discipline at Layer 2
Layer 2 is the layer where eval costs can run away. Each query is a full agent run plus judge calls — typically $0.10–$5 per query depending on agent complexity. Multiply by 30-100 queries, by every PR that triggers the full suite, by multi-run averaging, and you can be looking at hundreds of dollars per release cycle.
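The multiplication above can be made concrete with a back-of-envelope estimate. The inputs in the example (a $1.00 query, 50 queries, three full-suite triggers, triple runs) are illustrative values chosen from within the ranges quoted, not measurements.

```python
def release_cycle_cost(cost_per_query: float, n_queries: int,
                       full_suite_triggers: int, runs_per_trigger: int = 1) -> float:
    """Estimate eval spend in dollars for one release cycle.

    cost_per_query covers the agent run plus its judge calls.
    """
    return cost_per_query * n_queries * full_suite_triggers * runs_per_trigger


# Illustrative mid-range scenario: $1.00 per query, 50 queries,
# three full-suite triggers in the cycle, each run three times for averaging.
estimate = release_cycle_cost(1.00, 50, full_suite_triggers=3, runs_per_trigger=3)
```

Under these assumed inputs the estimate is $450, which is how a suite that costs $50 per run turns into hundreds of dollars per release cycle.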
The disciplines that keep this manageable:
Cache aggressively. Apply chapter 2.2's prompt caching to both the agent's system prompt and the judges' system prompts. For a stable eval suite running multiple times, the cache hit rate should be 70%+ on input tokens. Eval-suite cost drops 50%+ vs uncached.
Right-size the subsets. The fast subset should be small enough to run in 2-5 minutes on every PR. The full suite should be small enough that you'd actually run it on the cadence you want. A "comprehensive" 500-query suite that nobody runs is worse than a 50-query suite that runs every release.
Use cheap judges where possible. Judge calls don't always need the most capable model. Citation-faithfulness checks ("does this source support this claim?") are simple enough that Haiku is fine. Answer-quality grading with complex rubrics benefits from Sonnet. Save the expensive judges for the dimensions that need them.
Batch where feasible. The Batch API (chapter 2.2) at 50% discount works for non-time-sensitive eval runs. Nightly or weekly full-suite runs can use Batch and halve the cost. Pre-release runs typically need the standard latency.
The curation problem
The hardest part of Layer 2 isn't running the evals — it's maintaining the curated query set. Three forces work against a good eval set over time:
Drift between eval and production. Your eval set captures the agent's intended use case at the time you built it. Six months later, real users are asking different questions. The agent might score 0.85 on the eval set and 0.65 on actual production queries. The fix: continuously sample production traffic and add representative cases to the eval set. Layer 3 → Layer 2 feedback loop (Step 4 of this chapter).
Overfitting to the eval set. Teams iterate against scores. Over time, prompt changes get optimized for "the eval suite passes" rather than "real users get better answers." This is Goodhart's Law in eval form. The defenses: refresh the eval set regularly (replace 10–20% per quarter with new cases), keep some queries held-out from the iteration loop (used only for release-gate decisions), and watch the gap between eval-set scores and Layer 3 production metrics.
Maintenance burden. Each query in the set needs to stay relevant, have its expected behaviors stay accurate, and not become obsolete (a query about a deprecated feature is useless). Quarterly grooming of the eval set — review every query, drop the obsolete ones, update expected behaviors — is real engineering work. Skipping it makes the suite gradually decay.
The honest framing: a Layer 2 eval set is a piece of software with the same maintenance needs as production code. Teams that treat it as build-once-and-forget end up with a suite whose scores don't reflect reality. Teams that treat it as a living thing maintain a useful signal indefinitely.
What Layer 2 specifically catches
The class of failure Layer 2 detects that Layer 1 can't:
- Quality drift from prompt edits. A small prompt change makes the agent slightly worse on a measurable dimension. Layer 1 sees nothing wrong; Layer 2 sees the score drop.
- Quality drift from model swaps. Swapping Sonnet for Haiku on a step that turns out to need Sonnet's reasoning. Outputs still validate; quality drops.
- Trajectory regressions. The agent now makes 8 tool calls where it used to make 4, or chooses the wrong tool for a class of queries. Layer 1 doesn't see this; Layer 2 catches it in trajectory-sensibility scores.
- Edge-case regressions. A change that improves common cases at the cost of correctly handling refusals or adversarial inputs. The "edge case" tier of the eval set is specifically there to catch this.
The pattern: anything where the failure is "the output is worse, but still well-formed" lives at Layer 2.
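For the size of the eval set, 50 queries is a reasonable starting point and supports the four-tier structure (15 easy + 25 typical + 10 edge). For a mature agent with stable scope, this gives meaningful signal, because most regressions will surface across multiple queries. A 4-point drop on 50 queries is detectable when you compare graded scores query by query and average over several runs; on a single run of pass/fail checks it amounts to two queries and can be noise.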
When to grow: as your agent expands scope (new tools, new domains, new user segments), add 10–20 queries per segment. A multi-domain agent might end up at 100–200 queries. Beyond that you're often paying without learning more — at some point, more queries don't improve signal proportionally, and your time is better spent on Layer 3 production feedback.
When to shrink: if some queries always pass or always fail, they're not providing signal. Replace them with cases that actually discriminate. The right size is the one where every query in the set tells you something useful about the current state of the agent.
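Finding the queries that no longer discriminate can be sketched from a per-query pass/fail history. The history format here (query ID mapped to a list of booleans across recent runs) is an assumption for illustration.

```python
from typing import Dict, List


def non_discriminating(history: Dict[str, List[bool]]) -> Dict[str, str]:
    """Flag queries whose recorded results never vary across runs."""
    flagged = {}
    for query_id, results in history.items():
        if not results:
            continue  # no runs recorded yet, so nothing to conclude
        if all(results):
            flagged[query_id] = "always passes"
        elif not any(results):
            flagged[query_id] = "always fails"
    return flagged


# Hypothetical pass/fail history over the last four full-suite runs.
history = {
    "q-easy-01": [True, True, True, True],
    "q-typical-07": [True, False, True, True],
    "q-edge-03": [False, False, False, False],
}
candidates = non_discriminating(history)
```

Flagged queries are candidates for review rather than automatic deletion: an always-passing edge case may still be a useful guard against a known regression.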
The eval set should mix real and synthetic queries, with most weight on real. Synthetic queries are good for testing specific behaviors you can describe but haven't seen organically (edge cases, adversarial inputs, refusals). Real user queries are better for the bulk of the set because they capture the actual distribution your agent serves.
The privacy concern: real user queries may contain PII. The fix is sanitization — replace specific personal data with placeholders while keeping the structural shape ("user [USER_ID] wants to refund order [ORDER_ID]"). This preserves the eval signal while removing PII. For some categories of agent, even the structural shape leaks information; in those cases, write synthetic versions that match the real distribution.
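A minimal sketch of placeholder-based sanitization using regular expressions. The patterns are assumptions for illustration (a simple email shape and a hypothetical `ORD-<digits>` order-ID format); real PII removal usually needs broader coverage and human review.

```python
import re

# Each pattern is replaced by a placeholder that keeps the query's structure.
PLACEHOLDERS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\bORD-\d+\b"), "[ORDER_ID]"),  # hypothetical order-ID format
]


def sanitize(query: str) -> str:
    """Replace matched personal data with placeholders, leaving other text intact."""
    for pattern, placeholder in PLACEHOLDERS:
        query = pattern.sub(placeholder, query)
    return query


raw = "I am jane.doe@example.com and want to refund order ORD-48213"
clean = sanitize(raw)
```

The sanitized query keeps the intent (a refund request tied to an account and an order) while dropping the specific identifiers.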
Pure-synthetic eval sets are a warning sign. They tend to capture what the team imagines users do, which differs systematically from what users actually do. The eval-vs-production gap is large with synthetic-only sets.
LLM judges carry their own biases into Layer 2, which is why chapter 3.3's calibration discipline is essential. The bias doesn't disappear at Layer 2; it just sits inside the scores. The "calibration tier" in the eval-set structure (the 15 hand-labeled queries) is specifically for tracking judge accuracy over time. When the calibration agreement drops, you know the judge has drifted, and the Layer 2 scores need to be interpreted with that in mind.
The discipline: run calibration alongside every full-suite run. Report calibration agreement as a metric in the scoreboard. If it drops below threshold (70-80% per chapter 3.3), pause judge-driven decisions until the judge is recalibrated.
Without this, your Layer 2 numbers are scores on a scale that's silently shifting. With it, you can tell when the judge changed vs. when the agent changed.
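Calibration agreement can be sketched as the fraction of hand-labeled queries where the judge matches the human label. The 0.75 threshold below is an assumed midpoint of the 70-80% range mentioned above, and the labels are made up for the example.

```python
from typing import Sequence


def calibration_agreement(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of calibration queries where the judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_is_trusted(agreement: float, threshold: float = 0.75) -> bool:
    """Gate judge-driven decisions on agreement; the threshold is an assumption."""
    return agreement >= threshold


# Hypothetical calibration tier: 15 hand-labeled queries, judge disagrees on 3.
human = ["pass"] * 10 + ["fail"] * 5
judge = ["pass"] * 9 + ["fail"] * 4 + ["pass"] * 2
agreement = calibration_agreement(judge, human)
```

Reporting `agreement` next to the suite scores on every full run makes a shifting judge visible as its own metric.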
Layer 3: production telemetry.
Layer 1 catches contract violations. Layer 2 catches quality regressions against a curated set. Layer 3 catches what your eval set doesn't know to look for. Real user traffic surfaces edge cases your imagination didn't generate, distributional shifts your benchmark can't capture, and emerging failure modes that exist only in production. Without Layer 3, your evals tell you "the agent passes our tests" — they don't tell you "users are getting value." Those are different statements; both matter; Layer 3 is the half you'd miss.
What you can measure without ground truth
Production traffic doesn't have labels. You don't know whether a given response was "good" — no human graded it, no judge ran on it (or if a judge did, it costs real money to run on every query). Layer 3 has to extract signal from the limited information it does have.
Six categories of signal that are measurable on real traffic, in increasing order of inference difficulty:
1. Operational metrics (always trivially available). Latency, cost, error rate, token usage, cache hit rate, tool-call counts. These come for free from your observability infrastructure (chapter 2.1). They don't tell you anything about quality, but they tell you when something has gone structurally wrong: a spike in cost often means runaway agent loops, and a latency spike often means slow tools or model congestion. Set alerts on these; they're your fastest production warning system.
2. Tool-call patterns. Which tools is the agent using, in what proportions, with what error rates? A sudden shift — "the agent started calling search_docs 3× as often this week" — usually means something changed (the user-query distribution, an upstream prompt, the tool's behavior). Catch the shifts; investigate the causes.
3. Behavioral distributions. Tool-call count per query, escalation rate, refusal rate, output length distribution. These tell you about how the agent is behaving in aggregate. Distributional shifts are usually signals — the average research run jumped from 12 to 18 tool calls; the refusal rate dropped from 8% to 3%. Each is a question worth investigating.
4. User-derived signals. Thumbs up/down, retry rates, conversation-length-before-handoff, click-through on linked sources. These are noisy but real — when users thumbs-down at 2× the usual rate, something has degraded. The signal is statistical, not per-query; you need volume to see it clearly.
5. Downstream outcomes. Did the user complete the task the agent was helping with? Did the support ticket get resolved without human escalation? Did the user come back next week? These are the closest thing to true quality signal — they measure whether the agent created real value — but they're delayed (you find out a week later) and confounded with many other factors. Use them as a check, not the primary signal.
6. Sampled quality assessment. Run Layer 2-style judges on a random sample of production traffic. The sample is small enough to be affordable (say, 1% of queries) and large enough to be statistically meaningful at scale. You get a continuous read on "what would the eval set say about today's traffic?" Catches the eval-vs-production gap as it opens.
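One way to pick the sample for sampled quality assessment is to hash a stable query identifier, so the same query is always in or out of the sample no matter which worker handles it. The sketch below assumes each query carries a string ID; the name `in_judge_sample` and the 1% default are illustrative and not tied to any particular observability stack.

```python
import hashlib


def in_judge_sample(query_id: str, rate: float = 0.01) -> bool:
    """Return True if this query should be sent to the Layer 2-style judge.

    Hypothetical helper: hashing the ID makes the decision deterministic
    and roughly uniform across queries.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare against the matching fraction of the 64-bit range.
    return bucket < rate * 2**64


# Illustrative usage with made-up IDs.
ids = [f"query-{i}" for i in range(10_000)]
sampled = [q for q in ids if in_judge_sample(q)]
print(f"judging {len(sampled)} of {len(ids)} queries")
```

Hash-based selection avoids shared state between workers and makes it easy to re-derive which queries were judged when you revisit a trace later.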
The shape of Layer 3 dashboards
What a production dashboard for an agent looks like, focused on the metrics that actually signal problems:
Three tiers of metrics, looked at on different cadences. Tier 1 wakes you up at 2am if something breaks. Tier 2 is your daily check-in. Tier 3 informs your iteration loop — these are the numbers that tell you what to work on next.
The feedback loop from Layer 3 to Layer 2
This is the discipline that ties everything together. Layer 3 reveals what your eval set doesn't know about; Layer 2 is where that knowledge gets encoded for future regression detection. The loop:
This is what makes the three-layer system robust over time. Without the feedback loop, Layer 2 becomes a static set of checks that doesn't keep up with reality. With it, every production-surfaced issue results in permanent test coverage.
The discipline: when you investigate a production issue, the fix isn't complete until you've added a Layer 2 case for it. "Fixed in production" without "added to eval suite" means the bug can return undetected. The engineering rhythm: every Layer 3 alert that leads to a fix also leads to a small Layer 2 PR.
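A minimal sketch of what that small Layer 2 PR might contain, assuming eval cases are stored as JSON records and traces are dictionaries with `trace_id` and `user_query` keys. The `EvalCase` fields and the `case_from_trace` helper are hypothetical; adapt them to whatever schema your suite already uses, and scrub PII from the query before committing it.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """One Layer 2 regression case derived from a production incident.

    Hypothetical schema, for illustration only.
    """
    case_id: str
    query: str
    expected_behavior: str  # what the judge should look for
    source_incident: str    # pointer back to the alert or ticket
    tags: list[str] = field(default_factory=list)


def case_from_trace(trace: dict, expected_behavior: str, incident: str,
                    tags: Optional[list[str]] = None) -> EvalCase:
    """Turn an investigated production trace into a reusable eval case."""
    return EvalCase(
        case_id=f"prod-{trace['trace_id']}",
        query=trace["user_query"],
        expected_behavior=expected_behavior,
        source_incident=incident,
        tags=list(tags or []),
    )


# Illustrative usage with made-up values.
trace = {"trace_id": "t-001", "user_query": "How do I rotate an API key?"}
case = case_from_trace(
    trace,
    expected_behavior="Answer uses the detailed technical section, not the overview.",
    incident="INC-EXAMPLE",
    tags=["fetch_docs", "from-production"],
)
print(json.dumps(asdict(case), indent=2))
```

Keeping the incident reference on the case makes it possible to explain, months later, why a given query is in the suite.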
What Layer 3 specifically catches
The class of failure Layer 3 detects that the lower layers can't:
- Distributional shifts. Your eval set's distribution doesn't match real traffic's distribution, and the gap widens as users discover new ways to use the agent. Layer 3's sampled judging surfaces this gap. Without Layer 3, you don't know it's happening.
- Novel failure modes. The agent encounters a query type your eval set didn't anticipate, and produces poor output. No prior test covers it, no judge graded it. Layer 3's user signals (thumbs-down, retry rate, retention) surface this in aggregate even when individual cases are missed.
- Drift over time. Models get updated, tools get changed by their vendors, retrieval corpora drift. The agent's quality silently changes over weeks or months. Layer 3's continuous monitoring catches this in a way that Layer 2 (running on a snapshot) doesn't.
- Adversarial use. Real users probe the agent in ways your eval set didn't include. Some of those probes succeed at extracting unintended behavior. Layer 3's pattern detection on tool-call sequences and output formats can flag this.
For agents shipping to many users, Layer 3 is where most of your iteration loop's information comes from after the first few months. By then the agent mostly passes the eval set you launched with; what you'll discover at month 6 lives in production traffic.
A regression caught at each layer in turn.
To anchor everything in this chapter: a representative agent project, three real-shape regressions, each caught at a different layer. The point is to show what each layer's signal looks like in practice — and what gets through when one layer is missing.
The agent
A customer-support agent at a fintech SaaS, similar to chapter 4.4's example. Three specialized peer agents (billing, technical support, account management) plus a routing layer, with Layer 1, Layer 2, and Layer 3 all in place. We'll watch three regressions hit the system across three weeks.
Week 1: Layer 1 catches a contract regression
The change. A prompt edit to the billing agent: a sentence is added to encourage more empathetic phrasing on refund declines.
What happened. The new prompt occasionally causes the agent to wrap its structured response in apologetic prose — instead of {"action": "decline", "reason": "..."} the agent emits "I'm sorry, but here's what I can do: {"action": "decline", "reason": "..."}". The schema validator rejects the response because the outer text breaks JSON parsing.
How it was caught. CI ran Layer 1 tests on the PR. The schema-validation test failed on 3 of 8 representative agent runs. CI blocked the merge. Engineer saw the failures, read the new prompt, noticed the issue, reverted the change. Total time from PR submission to caught bug: 4 minutes.
Layer 1 catches contract regressions cheaply and immediately. No full-suite run wasted, no judge time consumed, no production traffic affected. The engineer doesn't even need to understand the underlying failure: "the schema validator failed" is sufficient signal to investigate the change. Compare to the same regression slipping past Layer 1: it would be caught hours later by Layer 2 (when the full suite runs), would consume a chunk of eval budget, and would block the release rather than the PR. Layer 1 is the smallest possible feedback loop on contract issues.
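The week 1 check can be sketched as a strict-JSON contract test. This is a minimal illustration, assuming the billing agent must return a bare JSON object with `action` and `reason` fields as in the example above; the `ALLOWED_ACTIONS` set and the function names are hypothetical.

```python
import json

# Hypothetical set of actions; only "decline" appears in the chapter's example.
ALLOWED_ACTIONS = {"approve", "decline", "escalate"}


def validate_billing_response(raw: str) -> dict:
    """Parse the agent's raw output and enforce the response contract.

    Raises ValueError on any violation, including prose around the JSON.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not pure JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("response must be a JSON object")
    action = payload.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    reason = payload.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return payload


def test_clean_response_passes():
    raw = '{"action": "decline", "reason": "outside refund window"}'
    assert validate_billing_response(raw)["action"] == "decline"


def test_prose_wrapped_response_fails():
    raw = ("I'm sorry, but here's what I can do: "
           '{"action": "decline", "reason": "outside refund window"}')
    try:
        validate_billing_response(raw)
    except ValueError:
        return  # expected: the leading prose breaks JSON parsing
    raise AssertionError("prose-wrapped response should be rejected")
```

The test functions follow pytest naming so a runner would discover them, but they use plain assertions and need no imports beyond the standard library.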
Week 2: Layer 2 catches a quality regression
The change. The routing agent's model is swapped from Sonnet to Haiku, for cost reduction. The PR description: "trying Haiku for routing — should be cheap and fast."
What happened. Haiku's classifications are mostly correct, but it misroutes about 1 in 12 tickets — sending a billing-with-technical-context query to the pure billing agent, which then produces correct billing answers that miss the technical issue underneath. Layer 1 doesn't see anything wrong (the routing emits valid structured output; the billing agent emits valid structured output). But the user is getting an answer that doesn't actually solve their problem.
How it was caught. The PR is labeled eval-full because it changes a core component. Layer 2 runs. Among the 50 queries in the suite, 4 are designed to test cross-domain routing (chapter 4.4's pattern). On 3 of those 4, the new Haiku-based routing classifies wrong. The full suite reports:
## 📊 eval-results
vs main (e7b3c20)
**overall: 0.821 → 0.768 (-0.053) ↓ [REAL]**
| metric                 | base  | pr    | delta  | verdict |
| ---------------------- | ----- | ----- | ------ | ------- |
| task_completion        | 0.880 | 0.840 | -0.040 | ✓ REAL  |
| factual_accuracy       | 0.940 | 0.940 | 0.000  | ✓       |
| answer_quality         | 0.810 | 0.790 | -0.020 | noise   |
| trajectory_sensibility | 0.780 | 0.620 | -0.160 | ✓ REAL  |
| edge_case_correctness  | 0.700 | 0.640 | -0.060 | ✓ REAL  |
cost: $1.84 (full suite) · runtime: 22m
The trajectory_sensibility drop is the smoking gun — it's measuring "did the agent take a reasonable path to the answer," and on the cross-domain cases the routing-then-handling path is now wrong. The reviewer comments: "Looks like Haiku misroutes the cross-domain cases. Keep Sonnet for routing or train a custom Haiku-tuned router; either way, this PR can't ship as-is." Author closes the PR with a learning note.
Layer 2 caught what Layer 1 couldn't see: outputs that are well-formed but worse. The agent never violated any contract; it just produced answers that missed the user's actual need. The only way to detect this was full agent runs against representative queries, with judges grading the trajectory. Without Layer 2, the change would have shipped, the routing failures would have appeared in production, and users would have gotten the wrong help — a much more expensive way to learn the same thing.
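The report separates real drops from noise, but this section does not say how that verdict is computed. One plausible approach, offered only as an assumption, is to rerun the suite several times on the base commit, measure how much each metric wobbles between runs, and call a delta real only when it exceeds that wobble. The function name, the factor of 2, and the repeat scores below are all invented for illustration.

```python
from math import sqrt
from statistics import stdev


def verdict(delta: float, baseline_runs: list[float], z: float = 2.0) -> str:
    """Label a metric delta as "REAL" or "noise".

    Assumed method: the difference between two independent runs has a
    standard deviation of sqrt(2) times the per-run standard deviation,
    so the noise band is z * sqrt(2) * stdev(baseline_runs).
    """
    if len(baseline_runs) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    noise_band = z * sqrt(2) * stdev(baseline_runs)
    return "REAL" if abs(delta) > noise_band else "noise"


# Made-up repeat scores for one metric on the base commit.
repeat_scores = [0.81, 0.80, 0.82, 0.81, 0.80]
print(verdict(-0.020, repeat_scores))  # inside the band for these scores
print(verdict(-0.160, repeat_scores))  # far outside the band
```

With only a handful of repeat runs the noise estimate is itself rough, so a paired comparison on per-query scores is a reasonable refinement if the suite records them.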
Week 3: Layer 3 catches what neither lower layer knew to check for
The change. No code change. The agent has been running stably. The Layer 2 eval suite passes at 0.821 (baseline) on every nightly run.
What happened. Over 10 days, the thumbs-down rate on technical-support tickets drifts from 8% to 13%. The escalation rate on technical-support tickets stays flat. Investigation reveals: a vendor SaaS that the technical-support agent uses for documentation lookup has changed its docs site structure. The agent's web-fetch tool now returns pages where the relevant content is buried below new marketing material at the top of each page. The agent reads the top of the page and answers based on outdated overview content, missing the technically correct detailed sections lower down.
How it was caught. Layer 3's thumbs-down-rate dashboard caught the drift. The metric crossed its alert threshold (baseline + 2σ for 24 hours) and pinged the on-call engineer. Investigation pulled the affected traces (chapter 2.1), surfaced the pattern that all the bad answers came from queries that had used fetch_docs against the affected vendor, and identified the root cause.
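The alert rule above ("baseline + 2σ for 24 hours") can be sketched as follows. This reads the rule as "the hourly metric has stayed above mean plus two standard deviations of a baseline period for 24 consecutive hours", which is one interpretation; the function name and the sample numbers are hypothetical.

```python
from statistics import mean, stdev


def should_alert(baseline: list[float], recent_hourly: list[float],
                 sigmas: float = 2.0, sustain_hours: int = 24) -> bool:
    """Return True if the metric has exceeded baseline + sigmas * stdev
    for each of the last `sustain_hours` hourly readings."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline readings")
    threshold = mean(baseline) + sigmas * stdev(baseline)
    window = recent_hourly[-sustain_hours:]
    # Require a full window so a short burst does not page anyone.
    return len(window) == sustain_hours and all(v > threshold for v in window)


# Illustrative numbers: baseline thumbs-down rate near 8%, recent rate at 13%.
baseline_rates = [0.07, 0.08, 0.09, 0.08, 0.08]
print(should_alert(baseline_rates, [0.13] * 24))           # sustained breach
print(should_alert(baseline_rates, [0.13] * 23 + [0.08]))  # breach interrupted
```

Requiring the breach to be sustained trades detection speed for fewer false pages, which suits a noisy user-derived signal like thumbs-down rate.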
The fix. The team patched the tool's content-extraction logic to skip marketing material and prioritize technical content blocks. Crucially: they also added two queries to the Layer 2 eval set — one that requires fetching from this vendor and synthesizing detailed technical content, one that exercises the content-extraction logic specifically. The new tests would catch a future regression of the same shape.
Layer 3 caught what neither lower layer could have caught. Layer 1 saw nothing wrong (tool calls succeeded, outputs validated). Layer 2 had no query in its set that exercised this failure mode — the eval set was built before this vendor's docs changed. The only way to know the agent was failing was to watch real users' reactions on real production traffic. After the fix, the new Layer 2 cases close the gap: this specific failure shape can't recur silently. The feedback loop from Layer 3 to Layer 2 turned a one-time production issue into permanent coverage.
What the three weeks teach together
Three regressions, three layers, three different failure types, each caught at the lowest layer able to see it:
- The contract regression in week 1 would have surfaced as a quality bug at Layer 2 (slow, expensive) or a user-facing bug at Layer 3 (very expensive). Layer 1 caught it within minutes of the PR opening.
- The quality regression in week 2 would have shipped if only Layer 1 existed. Layer 2 caught it before merge. Layer 3 would have caught it eventually, but only after users had received bad routing.
- The drift in week 3 had no way to be caught by Layers 1 or 2 alone. Only continuous production monitoring saw it. The fix then included a Layer 2 update so the same drift couldn't slip past unnoticed again.
This is the architectural value of having all three layers. Each catches a class of failure the others structurally can't. Each contributes to the others (Layer 3 informs Layer 2's curation; Layer 2 ensures Layer 3's baseline is meaningful). The cost of running all three is modest compared to the cost of shipping any of these regressions to production uncaught, and modest compared to the cost of ad-hoc bug investigation after something has gone wrong.
Deliverable
A working understanding of evals as a three-layer architecture, not a single activity. Layer 1 (deterministic checks, every commit, seconds and cents) catches contract violations. Layer 2 (offline judge suite, PRs and releases, minutes and dollars) catches quality regressions. Layer 3 (production telemetry, continuous, real-time) catches what your eval set didn't know to look for. The feedback loop from Layer 3 to Layer 2 makes the system robust over time. A clear picture of what each layer specifically catches that the others can't, and the cost discipline that keeps each layer sustainable.
- Layer 1 suite: dozens of pytest-style checks on every commit, full suite under 30s
- Layer 1 covers schema, citations, tool-call correctness, budget compliance, forbidden actions, refusal patterns
- Layer 2 suite: 30–100 curated queries across tiers (easy / typical / hard / edge / calibration)
- Layer 2 grades on five dimensions: task completion, accuracy, quality, trajectory, edge cases
- Two-tier cadence: fast subset on every PR, full suite on PRs that need it + release gates
- Layer 2 with judge calibration tracked, prompt caching applied, batch API for offline runs
- Layer 3 dashboard with three tiers (alert-worthy / daily / weekly) including sampled judge scores
- Layer 3 → Layer 2 feedback discipline: every production fix includes a Layer 2 PR
- Eval-to-prod gap measured: Layer 2 score vs Layer 3 sampled estimate
- Quarterly grooming of the Layer 2 eval set: refresh 10–20%, drop obsolete cases
- Production query sampling pipeline that respects PII; samples flow into Layer 2 candidate set