LLM-as-judge: when it works, when it lies.
Using a model to score another model's outputs is the only way to evaluate at scale. It's also the most common source of false confidence in agent evals: judge biases (position, length, self-preference, authority) silently shift your numbers in directions that have nothing to do with quality. This chapter teaches the four biases concretely, the two-orderings-and-cross-family disciplines that actually move the needle, and the calibration protocol that tells you when to trust the judge and when to fall back to humans. By the end you'll have a judge pipeline you can audit, with documented agreement against human raters and explicit policy on when judge-alone decisions are acceptable.
Why you can't avoid LLM-as-judge.
The first question to answer is whether you need a model judge at all. The honest answer is: it depends on what you're measuring, and for most agent metrics, yes.
The choice between three grading methods isn't a preference — it's determined by what you're checking:
pass/fail is the truth.
The split that matters: use deterministic graders whenever you can (cheap, fast, exact); fall back to LLM-as-judge for everything else (scales, but stochastic and biased). The mistake is using LLM-as-judge for things a deterministic grader could check: paying judge cost and inheriting judge bias for a problem that doesn't need them.
The math on why you can't just use humans
Why not skip the judge and just have humans rate everything? Run the numbers for an active agent project.
50-question eval set. Two trajectories to rate per question (one from the candidate PR, one from main). Three quality dimensions per trajectory (correctness, faithfulness, helpfulness). Five PRs per week. That's:
50 × 2 × 3 × 5 = 1,500 human ratings / week
At ~30 seconds per rating, that's about 12.5 hours of human-rater time per week — one engineer's day-and-a-half — to grade evals alone. That's already too expensive for most teams. And it's the lower bound; the moment you want to compare changes against multiple baselines, or run on a larger eval set, or rate more dimensions, it scales up linearly. There's no path where humans grade every output.
The alternative — use a model to grade — costs roughly $0.01–0.05 per rating with a small model, runs in seconds, scales to 50,000 ratings per week without anyone hating their life. The trade is exactly what you'd expect: cheaper, faster, less accurate. The whole chapter is about how to make "less accurate" mean "not so much less that the numbers stop mattering."
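The arithmetic above is easy to re-run for your own project. A minimal sketch, using only the figures quoted in this section (the constant names are made up for illustration):

```python
# Back-of-envelope rating budget, using the numbers from the text.
QUESTIONS = 50
TRAJECTORIES_PER_QUESTION = 2        # candidate PR vs main
DIMENSIONS = 3                       # correctness, faithfulness, helpfulness
PRS_PER_WEEK = 5
SECONDS_PER_HUMAN_RATING = 30
JUDGE_COST_RANGE_USD = (0.01, 0.05)  # per rating, small model

ratings_per_week = QUESTIONS * TRAJECTORIES_PER_QUESTION * DIMENSIONS * PRS_PER_WEEK
human_hours = ratings_per_week * SECONDS_PER_HUMAN_RATING / 3600
judge_cost_low = ratings_per_week * JUDGE_COST_RANGE_USD[0]
judge_cost_high = ratings_per_week * JUDGE_COST_RANGE_USD[1]

print(f"{ratings_per_week} ratings/week")
print(f"human time: {human_hours:.1f} h/week")
print(f"judge cost: ${judge_cost_low:.0f} to ${judge_cost_high:.0f} per week")
```

Change any one constant (a larger eval set, a fourth dimension) and the human-hours line grows linearly, which is the point of the paragraph above.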
The judge is also code that needs to be tested
Here is the framing shift that turns LLM-as-judge from a trap into a working tool: the judge is a model. The judge has accuracy you can measure. You measure it against humans on a held-out calibration set, the same way you'd measure any classifier.
This isn't a metaphor. The agreement rate between your judge and a small set of human-rated examples is a number. You can compute it. You can monitor it over time. You can fail-fast when it drops below threshold. And if you don't compute it, the judge is silently broken and your scoreboard is silently lying. This is the discipline most teams skip, which is why most teams' eval numbers are softer than they think.
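To make "the agreement rate is a number" concrete, here is a minimal sketch of computing it and failing fast. The function names, the sample labels, and the 0.70 default threshold are assumptions for illustration; pick the threshold your team has agreed on.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of calibration examples where judge and human verdicts match."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def assert_judge_healthy(rate: float, threshold: float = 0.70) -> None:
    """Fail fast when agreement drops below the chosen threshold."""
    if rate < threshold:
        raise RuntimeError(f"judge agreement {rate:.0%} is below {threshold:.0%}")


# Illustrative labels only, not real calibration data.
human = ["a_wins", "tie", "b_wins", "a_wins", "tie"]
judge = ["a_wins", "tie", "a_wins", "a_wins", "tie"]

rate = agreement_rate(judge, human)
assert_judge_healthy(rate)
print(f"agreement: {rate:.0%}")
```

Wiring `assert_judge_healthy` into CI is one way to turn a silent judge regression into a loud one.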
The rest of this chapter teaches:
- The four biases that show up in every untreated judge pipeline (Step 2).
- How to design the judge prompt — pointwise vs pairwise, what to put in the rubric — and how to validate it against humans (Step 3).
- The production patterns that make judges robust over time: cross-family ensembles, both-orderings discipline, judge versioning, when to fall back to humans (Step 4).
If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your eval scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator. The judge has biases; your prompt iteration is gradient descent on whatever it is the judge actually rewards. The two converge toward what the judge likes, not what users like. This is the failure mode the chapter exists to prevent.
You can use a powerful model as the judge, and you should for the highest-stakes evals, but "reliable" is the wrong framing. Even frontier models have measurable position bias (up to 40% inconsistency on identical pairwise comparisons with order swapped), verbosity bias (~15% inflation favoring longer answers), and self-preference (5–7% boost when the judge is the same family as the model being judged). These don't disappear with a better model; they shift in magnitude. Engineering around them is mandatory.
What a more powerful model buys you: better understanding of the rubric (so subtle quality differences are detected), better instruction-following (so structured output is more reliable), and slightly less of all the biases. It does not buy you "the biases go away."
No. A keyword check or string-match isn't an LLM judge — it's a deterministic grader, and that's the right tool when you can use it. The biases apply specifically when you're asking a model to make a quality judgment ("is this answer good?", "is A better than B?", "does this claim follow from this source?"). For binary checks you can express as code, code is more reliable and cheaper than any LLM. Don't let "we use LLM-as-judge" become a status symbol; it's a fallback for when deterministic doesn't work.
The four biases in every untreated judge pipeline.
Before you can fix the biases, you have to see them. Each of the four below is well-documented in the 2024–2026 literature, reproducible in your own pipeline in fifteen minutes, and quietly distorting your scores until you defend against it. Let me walk through each one with a concrete demonstration.
Bias 1: Position bias (the worst one)
In a pairwise comparison — "which of these two answers is better, A or B?" — the option presented first wins more often than chance. Even frontier models in 2026 show 30–40% inconsistency rates when you swap the order and re-ask. The same two responses, same rubric, different ordering, different verdict.
The fifteen-minute demonstration:
# scripts/measure_position_bias.py
from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """Compare these two answers to the question.
Reply with A, B, or TIE (single token, no explanation).

Question: {q}

Answer A: {a}

Answer B: {b}"""


def judge(q, a, b):
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=5,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=q, a=a, b=b)}],
    )
    return response.content[0].text.strip().upper()


# Pull 50 pairs from your eval set (or any set of paired outputs).
# Placeholder: fill with (question, answer_x, answer_y) tuples before running.
pairs = []
if not pairs:
    raise SystemExit("load your (question, answer_x, answer_y) pairs first")

# For each pair, run BOTH orderings.
inconsistent = 0
for q, x, y in pairs:
    forward = judge(q, x, y)   # x is A, y is B
    backward = judge(q, y, x)  # y is A, x is B (so flipped verdict expected)
    # If they agree, forward says "A wins" iff backward says "B wins".
    consistent = (forward, backward) in {("A", "B"), ("B", "A"), ("TIE", "TIE")}
    if not consistent:
        inconsistent += 1

print(f"position inconsistency: {inconsistent}/{len(pairs)} = {inconsistent/len(pairs):.0%}")
$ python scripts/measure_position_bias.py
position inconsistency: 17/50 = 34%
34% of your "winner" verdicts depend on which option appeared first. That means on roughly one in three pairwise judgments, your judge isn't grading the content; it's grading position. If you ship a change based on a one-direction pairwise judgment without checking the reverse, and the flipped verdicts split evenly between the two directions, there's roughly a 17% chance (half of 34%) that the "improvement" is just position bias firing one way.
The fix is the most impactful single change you can make to a judge pipeline: run every pairwise comparison in both orderings and only count a winner if both agree. Split verdicts become "tie" or "position-determined." This single discipline removes position bias entirely — at 2× judge cost.
def judge_both(q, x, y):
    fwd = judge(q, x, y)  # x=A, y=B
    rev = judge(q, y, x)  # y=A, x=B
    # x wins iff "A" forward AND "B" reverse
    if fwd == "A" and rev == "B":
        return "x_wins"
    if fwd == "B" and rev == "A":
        return "y_wins"
    return "tie"  # includes "position-determined" disagreements
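Because the judge is code, the both-orderings rule can be unit-tested without spending a single API call. The sketch below repeats the same logic with the judge passed in as a parameter, then exercises it with two hypothetical stub judges: one that always picks position A (the worst case for position bias) and one that looks at content. The names `judge_both_with`, `always_first`, and `prefers_paris` are invented for this illustration.

```python
from typing import Callable

# (question, answer_a, answer_b) -> "A" | "B" | "TIE"
Judge = Callable[[str, str, str], str]


def judge_both_with(judge: Judge, q: str, x: str, y: str) -> str:
    """Both-orderings verdict; the judge is injected so tests can use a stub."""
    fwd = judge(q, x, y)  # x shown as A
    rev = judge(q, y, x)  # y shown as A
    if fwd == "A" and rev == "B":
        return "x_wins"
    if fwd == "B" and rev == "A":
        return "y_wins"
    return "tie"


def always_first(q: str, a: str, b: str) -> str:
    """Hypothetical worst-case judge: picks whatever sits in position A."""
    return "A"


def prefers_paris(q: str, a: str, b: str) -> str:
    """Hypothetical content-sensitive judge: picks the answer mentioning Paris."""
    a_ok, b_ok = "Paris" in a, "Paris" in b
    if a_ok == b_ok:
        return "TIE"
    return "A" if a_ok else "B"


question = "What is the capital of France?"
biased = judge_both_with(always_first, question, "Paris.", "Lyon.")
fair = judge_both_with(prefers_paris, question, "Lyon.", "Paris.")
print(biased, fair)
```

The purely position-driven judge can never produce a winner under this rule, while the content-driven one still does, regardless of which slot the better answer lands in.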
Bias 2: Verbosity bias (the sneaky one)
Longer answers score higher. Not because they're better — because they're longer. Studies place the inflation at roughly 15% on average; on certain rubrics ("comprehensive", "thorough") it goes higher.
The mechanism: the judge interprets verbosity as a proxy for effort or completeness. When asked "which answer is more thorough," the judge picks the longer one almost regardless of substance. When asked simply "which is better," the bias still leaks in.
The demonstration: take any short, correct answer. Pad it with two extra sentences that restate the same content. Run the judge. Watch the padded version win.
Question: What is the capital of France?
Answer A: Paris.
Answer B: The capital of France is Paris, a city located in
the northern part of the country along the Seine river. Paris
has been the capital since the 12th century and serves as the
political and cultural center of the nation.
[judge with default rubric]: B wins on "comprehensiveness"
[judge with rubric tuned for correctness only]: tie
Both answers are correct. B's "win" is verbosity bias.
The fix is twofold: (1) state the bias in your judge prompt explicitly — "do not reward longer answers; concise correct answers should score equally with longer correct answers"; (2) use small numeric scales (1–4 rather than 1–10) that don't have room for "longer = nudge up." The combination shrinks verbosity bias substantially without eliminating it.
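A cheap way to check whether the verbosity fix is working is to audit your own verdict log for how often the longer answer wins. The sketch below assumes a hypothetical record shape (`answer_a`, `answer_b`, `verdict`) and uses made-up sample records; it is a diagnostic, not proof of bias, since longer answers are sometimes simply better.

```python
def longer_answer_win_rate(records: list[dict]) -> float:
    """Share of decided (non-tie) verdicts that went to the longer answer."""
    decided = 0
    longer_wins = 0
    for r in records:
        len_a, len_b = len(r["answer_a"]), len(r["answer_b"])
        if r["verdict"] == "tie" or len_a == len_b:
            continue  # no winner, or no "longer" side to speak of
        decided += 1
        longer_side = "a_wins" if len_a > len_b else "b_wins"
        if r["verdict"] == longer_side:
            longer_wins += 1
    return longer_wins / decided if decided else 0.0


# Illustrative records only.
LONG = "The capital of France is Paris, a city on the Seine in the north."
records = [
    {"answer_a": "Paris.", "answer_b": LONG, "verdict": "b_wins"},
    {"answer_a": "Paris.", "answer_b": LONG, "verdict": "tie"},
    {"answer_a": LONG, "answer_b": "Lyon.", "verdict": "a_wins"},
    {"answer_a": "Paris.", "answer_b": "Probably Lyon, a large city.", "verdict": "a_wins"},
]

rate = longer_answer_win_rate(records)
print(f"longer answer wins {rate:.0%} of decided verdicts")
```

Run it on pairs that humans rated as equal in quality: a rate well above 50% there suggests the rubric is still paying for length.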
Bias 3: Self-preference (the family effect)
When you use Claude as a judge to compare two answers, and one of those answers was generated by Claude, the Claude-generated answer wins more often than chance. Same with GPT judging GPT outputs. The effect is small — 5–7% boost — but it composes with the others and is invisible if you don't look for it.
The cause is stylistic familiarity. Each model family has signature patterns — sentence structures, phrasings, ways of organizing information. The judge of the same family recognizes those patterns as "well-written" because they match what the judge's training rewarded. It's not preferential treatment in the moral sense; it's pattern matching.
The implication: never use the same family as both the agent's model and the judge. If your agent runs on Claude, judge with GPT or Gemini. If it runs on GPT, judge with Claude or Llama. The cross-family check defuses most self-preference because different providers train on different data and tune toward different styles, so the judge is less likely to read the candidate's signature patterns as its own.
The other implication: when you do comparative analysis ("does our agent on Claude beat the same agent on GPT?"), you cannot use either Claude or GPT as the judge — both judges would systematically favor their own family. Use a third party (Gemini, Llama) or an ensemble (Step 4 below).
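The cross-family rule is simple enough to enforce in code rather than by convention. A minimal sketch, assuming you track each model's family as a plain string (the family names and the `AVAILABLE` list are illustrative):

```python
def eligible_judge_families(candidate_families: set[str], available: list[str]) -> list[str]:
    """Judge families that share no family with any model being compared."""
    return [family for family in available if family not in candidate_families]


AVAILABLE = ["claude", "gpt", "gemini", "llama"]

# Agent runs on Claude: any other family can judge.
single = eligible_judge_families({"claude"}, AVAILABLE)

# Comparing Claude against GPT: both are ruled out as judges.
head_to_head = eligible_judge_families({"claude", "gpt"}, AVAILABLE)

print(single)
print(head_to_head)
```

If the returned list is empty, that is a signal to add a provider or route the comparison to human raters instead of quietly reusing a same-family judge.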
Bias 4: Authority / confidence bias
Confident-sounding answers beat hedged ones, even when the hedged answer is more accurate. "The answer is 47" beats "The answer is most likely 47, though if you're using the alternate definition it could be 42" — even when the alternate definition matters.
This bias is the one most often working against your users' interests. Hallucinations are confident; truthful answers are sometimes hedged. A judge that prefers confidence will systematically rank hallucinations above grounded uncertainty. Over enough iterations, your agent learns to sound more confident even when it shouldn't be.
The fix: write the rubric to reward calibrated uncertainty explicitly. Not "is the answer confident?" but "does the confidence level match the strength of the evidence?" That phrasing forces the judge to evaluate the match, not the surface signal of confidence. The bias doesn't vanish — but it shrinks substantially.
Watch all four compose
Verbosity, self-preference, and authority bias each shift scores by roughly 5–15%, and position bias can flip up to a third of pairwise verdicts. The danger is that they compound when the conditions align. A longer, more confident response from the same model family as the judge, presented first in pairwise comparison, can win 80%+ of the time against a shorter, hedged response from a different family, even if the shorter response is more correct.
This is the failure mode where teams converge over months on prompts that produce long, confident, judge-family-styled outputs that quietly underperform on real user queries. The scoreboard climbs; the user satisfaction doesn't. The eval has become a metric the agent is optimizing against rather than a measurement of quality.
If you take exactly one defense from this step, take the both-orderings discipline. It costs 2× judge spend, removes the largest single bias entirely, and is so cheap to implement (10 lines of code) that there is no defensible reason to skip it. Every other defense matters; that one is non-negotiable.
Position bias is the only one that arguably averages out: if your candidate is randomly assigned to position A 50% of the time and position B 50%, the bias adds noise but not directional bias. (The both-orderings fix is still much better, because it removes the noise rather than averaging over it.)
The other three are directional. Verbosity bias always favors longer; self-preference always favors same-family; authority bias always favors confident-sounding. Averaging does nothing for any of these. They produce systematic shifts in your scores that look like real improvements when they aren't.
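The difference between "averages out" and "directional" shows up in a few lines of arithmetic. The sketch below uses a deliberately simple additive toy model, which is an assumption and not a claim about how real judges behave: the first position gets a fixed boost, and the candidate gets a fixed boost for being, say, longer. The bias sizes are illustrative.

```python
def expected_win_rate(position_bias: float, directional_bias: float,
                      p_candidate_first: float) -> float:
    """Expected pairwise win rate for a candidate exactly as good as the baseline.

    Toy additive model (an assumption): position A gets +position_bias,
    and the candidate gets +directional_bias wherever it appears.
    """
    win_if_first = 0.5 + position_bias + directional_bias
    win_if_second = 0.5 - position_bias + directional_bias
    return p_candidate_first * win_if_first + (1 - p_candidate_first) * win_if_second


always_first = expected_win_rate(0.17, 0.0, p_candidate_first=1.0)
randomized = expected_win_rate(0.17, 0.0, p_candidate_first=0.5)
randomized_but_longer = expected_win_rate(0.17, 0.075, p_candidate_first=0.5)

print(f"always first:           {always_first:.3f}")
print(f"randomized position:    {randomized:.3f}")
print(f"randomized, but longer: {randomized_but_longer:.3f}")
```

Randomizing position pulls the first effect back to 0.5 in expectation; the directional term survives untouched, which is why it needs a rubric or judge-selection fix rather than averaging.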
Mostly yes, since there's no "first" and "second" to swap. Pointwise has its own problems: scores drift between runs (the same response scores 7 today, 8 tomorrow), the scale is poorly anchored ("what's a 7 vs 8?"), and it's sensitive to whatever the judge saw most recently. Pairwise is more reliable when both orderings agree, less reliable when they don't.
The pragmatic choice: pointwise for cheap continuous monitoring (use small scales like 1–4), pairwise-with-both-orderings for the high-stakes evals that gate releases. Both methods have a place. They're not substitutes; they answer different questions.
Calibration: validate the judge against humans.
Step 2 makes your judge robust against known biases. This step makes it auditable against the ground truth — humans. Without this step, you're trusting that the judge agrees with users on what "good" means. Sometimes it does; often it doesn't; without measurement you can't tell which.
The protocol is small. Build a calibration set of human-rated examples. Run your judge on them. Compute agreement. Decide whether the agreement rate is high enough to use the judge alone. Re-run this calibration whenever you change the judge (model, prompt, rubric) or when you suspect the agent's output distribution has shifted enough that the judge's accuracy might have shifted too.
Build the calibration set
50 examples is the smallest set that gives a reliable agreement-rate measurement. Each example is a triple: an input, an output (or pair of outputs for pairwise), and a human verdict. The cost: about 2–4 hours of focused rating time for one rater to do 50 examples. The yield: a permanent reference that lets you trust or distrust the judge for years.
What to include:
- Clear wins. 15 examples where the right answer is obvious. The judge should get these all right; if it doesn't, the rubric is broken.
- Close calls. 25 examples where two reasonable raters might genuinely disagree. This is where judges fail most often, and where they need to fail close to humans rather than randomly.
- Adversarial examples. 10 examples designed to trigger the biases from Step 2. A verbose-but-wrong answer paired with a concise-correct one; a confident hallucination paired with a hedged truth; etc. The judge's verdicts on these tell you how well your bias mitigations are working.
Two humans rate each example independently, you reconcile disagreements, and the reconciled label is the calibration truth. Two raters matter — a single rater's idiosyncrasies become baked into your reference. Doesn't have to be expensive; a 30-minute discussion between the two raters to resolve disagreements is usually enough.
# evals/calibration_set.jsonl (50 lines, one per example)
{"id": "cal_001", "category": "clear_win", "question": "What does VACUUM FULL do in Postgres?", "answer_a": "VACUUM FULL rewrites the entire table, reclaiming all dead tuple space and returning it to the OS. It takes an exclusive lock.", "answer_b": "It vacuums.", "human_verdict": "a_wins", "notes": "clear: a is correct and specific; b is uselessly terse"}
{"id": "cal_017", "category": "close_call", "question": "How do I configure autovacuum for a write-heavy table?", "answer_a": "Lower autovacuum_vacuum_scale_factor on that table to 0.05 and autovacuum_vacuum_cost_limit to a higher value to allow more aggressive vacuuming.", "answer_b": "Use ALTER TABLE foo SET (autovacuum_vacuum_scale_factor = 0.05, autovacuum_vacuum_threshold = 1000) for per-table tuning.", "human_verdict": "tie", "notes": "both are valid approaches; b shows the actual SQL but misses cost_limit"}
{"id": "cal_042", "category": "adversarial_verbosity", "question": "What's the default value of shared_buffers?", "answer_a": "128MB.", "answer_b": "The default value of shared_buffers in PostgreSQL is 128 megabytes, though this is widely considered too small for production workloads. Most production deployments increase this to 25% of available RAM, though the optimal value depends on...", "human_verdict": "tie", "notes": "both correct on the literal question; b's extra context isn't asked for. testing verbosity bias."}
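Before reconciling, it helps to know how much the two raters agreed in the first place and which examples need the discussion. The sketch below computes raw agreement corrected for chance using Cohen's kappa, a standard inter-rater statistic that the protocol above does not itself prescribe; it is offered here as one reasonable way to quantify rater agreement. The example IDs and labels are made up.

```python
from collections import Counter


def cohen_kappa(rater_1: list[str], rater_2: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same examples."""
    n = len(rater_1)
    if n == 0 or n != len(rater_2):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Agreement expected if each rater labeled at random with their own label mix.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def to_reconcile(ids: list[str], rater_1: list[str], rater_2: list[str]) -> list[str]:
    """IDs where the two raters disagree and need a discussion."""
    return [i for i, a, b in zip(ids, rater_1, rater_2) if a != b]


# Hypothetical ratings for four examples.
ids = ["cal_001", "cal_002", "cal_003", "cal_004"]
rater_1 = ["a_wins", "tie", "tie", "b_wins"]
rater_2 = ["a_wins", "tie", "b_wins", "b_wins"]

kappa = cohen_kappa(rater_1, rater_2)
pending = to_reconcile(ids, rater_1, rater_2)
print(f"kappa = {kappa:.2f}; reconcile: {pending}")
```

The human-to-human number is also a useful ceiling: a judge cannot be expected to agree with the reconciled labels much more often than the raters agreed with each other.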
Run the judge against it
# scripts/calibrate_judge.py
import collections
import json

from agent.judge import judge_both  # from Step 2, with both orderings

with open("evals/calibration_set.jsonl") as f:
    cases = [json.loads(line) for line in f if line.strip()]

results = collections.Counter()
for c in cases:
    judge_verdict = judge_both(c["question"], c["answer_a"], c["answer_b"])
    # judge returns: x_wins / y_wins / tie
    # map to the same vocabulary as human verdicts
    judge_label = {"x_wins": "a_wins", "y_wins": "b_wins", "tie": "tie"}[judge_verdict]
    agreed = judge_label == c["human_verdict"]
    results[(c["category"], agreed)] += 1
    if not agreed:
        print(f"DISAGREE {c['id']}: human={c['human_verdict']} judge={judge_label}")
        print(f"  notes: {c['notes']}")

# Compute per-category agreement (categories taken from the data, in file order)
categories = list(dict.fromkeys(c["category"] for c in cases))
for cat in categories:
    total = results[(cat, True)] + results[(cat, False)]
    rate = results[(cat, True)] / total if total else 0
    print(f"{cat:30s} {results[(cat, True)]}/{total} = {rate:.0%}")

# Overall agreement across all categories
agreed_total = sum(n for (_, ok), n in results.items() if ok)
print(f"overall: {agreed_total}/{len(cases)} = {agreed_total / len(cases):.0%}")
$ python scripts/calibrate_judge.py
DISAGREE cal_017: human=tie judge=a_wins
notes: both are valid approaches; b shows the actual SQL but misses cost_limit
DISAGREE cal_023: human=b_wins judge=tie
notes: b is significantly more specific; judge couldn't tell
DISAGREE cal_042: human=tie judge=b_wins
notes: both correct on the literal question; b's extra context isn't asked for. testing verbosity bias.
DISAGREE cal_045: human=a_wins judge=b_wins
notes: a is correct concise; b is longer + WRONG. testing verbosity+confidence.
clear_win 15/15 = 100%
close_call 19/25 = 76%
adversarial_verbosity 6/10 = 60%
overall: 40/50 = 80%
Read the result
80% overall is a real number. What it tells you:
- Clear wins: 100%. Judge gets the easy ones right. Sanity check passed.
- Close calls: 76%. On genuinely ambiguous cases, judge agrees with humans about 3 out of 4. That's roughly the rate at which two humans agree with each other on close calls — so the judge is at human-rater quality on this category.
- Adversarial: 60%. The bias mitigations aren't strong enough. cal_042 (verbosity bias) and cal_045 (verbosity + confidence) both failed — the judge is still rewarding longer answers. Action item: tighten the rubric on conciseness, possibly add an explicit "longer is not better" sentence.
The headline number doesn't matter as much as the per-category breakdown. 80% overall could be 100% on clear, 100% on close, 0% on adversarial — that's a different story than 100/76/60 — but both produce the same 80%. Always look at the breakdown.
The 70% threshold
What agreement rate is good enough to trust the judge alone?
The pragmatic rule that emerges from working teams: below 70% overall agreement, don't trust the judge as the sole signal. Use it as a screen, then have humans rate the top contested cases. Above 80% overall, judge-alone decisions are usually fine for non-critical metrics. Between 70 and 80 is a gray zone — judge alone for cheap continuous monitoring, humans for release-gating evals.
Also: require >90% on the "clear win" category, no matter what the overall is. If the judge can't get the obvious cases right, the rubric or prompt is broken and the rest of the numbers don't matter. Fix that first.
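These rules are easier to apply consistently when they live in code next to the calibration script. A minimal sketch of the policy as a function; the return labels are invented, and how to treat the exact boundary values (exactly 70%, 80%, or 90%) is an assumption, since the rule of thumb above doesn't pin them down.

```python
def judge_policy(overall: float, clear_win: float) -> str:
    """Map calibration agreement rates to a usage policy for the judge."""
    if clear_win <= 0.90:
        # Obvious cases are failing: the rubric or prompt is broken.
        return "fix_rubric_first"
    if overall < 0.70:
        return "screen_then_humans"
    if overall <= 0.80:
        # Gray zone: judge alone for monitoring, humans for release gates.
        return "gray_zone"
    return "judge_alone_ok"


print(judge_policy(overall=0.84, clear_win=1.00))
print(judge_policy(overall=0.74, clear_win=1.00))
print(judge_policy(overall=0.65, clear_win=1.00))
print(judge_policy(overall=0.85, clear_win=13 / 15))
```

Checking the clear-win floor first mirrors the advice above: no overall number is worth reading until the easy cases pass.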
Rubric tuning when calibration fails
The first time you run calibration, you'll almost certainly be below 80%. Fix in this order:
- Read every disagreement. The judge's reasoning (if you asked for it) and the human's notes will usually surface the rubric ambiguity. The most common finding: the rubric doesn't say what you thought it said.
- Tighten the rubric. If verbosity bias is firing on adversarial cases, add a sentence: "Conciseness is a virtue. Do not reward longer answers — concise correct answers should score as highly as longer correct answers."
- Add anchored examples. Best move for close-call accuracy: include 2–3 worked examples in the judge prompt showing the kind of judgment you want. "Example 1: input X, answer A, answer B, correct verdict: tie, reasoning: ..." Cost: ~300 prompt tokens. Benefit: typically 5–10 percentage-point lift on close calls.
- Re-run calibration. Same set. Same protocol. The number should move up. If it doesn't, your rubric edit didn't change behavior — figure out why before continuing.
Repeat until you're above your threshold. Then freeze the judge (rubric, model, prompt) and version it.
Anchored examples — the highest-leverage rubric improvement
This single change deserves its own paragraph because it disproportionately improves calibration. A judge prompt with a rubric but no examples is asking the judge to interpolate from abstract rules. A judge prompt with 2–3 worked examples is showing the judge exactly what its job looks like.
# Judge prompt with anchored examples (the right shape)
JUDGE_PROMPT = """You are an evaluator. Compare two answers to a technical question and decide which is better, or if they're tied.

Rubric:
1. Correctness: factually right matters more than anything.
2. Specificity: prefers concrete details (commands, numbers) over vague generalities.
3. Conciseness: do not reward longer answers. A concise correct answer scores equally with a longer correct answer.
4. Calibrated confidence: confident wrong < hedged right.

Worked example:
Question: How do I see Postgres connection count?
Answer A: SELECT count(*) FROM pg_stat_activity;
Answer B: You can check the number of connections by querying the pg_stat_activity view, which contains one row per connection.
Verdict: tie
Reasoning: Both correct; A is more concise but B is more explanatory. Neither materially better.

Now evaluate:
Question: {q}
Answer A: {a}
Answer B: {b}

Respond with one of: a_wins / b_wins / tie
Reasoning (one sentence):"""
The worked example shows the judge what calibrated judgment looks like, defuses verbosity bias (the example explicitly calls a tie when the longer answer is just more explanatory), and reduces drift between runs. Three examples covering different scenarios (clear win, close call, adversarial) typically lift agreement by 5–10 points.
Mathematically yes, but the per-category breakdowns become unreliable. 20 examples might be 6 clear, 10 close, 4 adversarial; if the judge gets 3/4 adversarial right, a single flipped verdict moves that rate by 25 points, so you have no real signal on whether bias mitigation is working. The 50-example structure (15/25/10) is the smallest where each category has enough cases to draw conclusions.
If you can only spend 2 hours: 30 examples (10/15/5) is a reasonable starting point. Add to it over time as you encounter new failure modes; the calibration set grows alongside the eval set.
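One way to see how little a small category tells you is to put an interval around each agreement rate. The sketch below uses the Wilson score interval, a standard choice for proportions on small samples; it is brought in here for illustration and is not part of the protocol above. The counts are the ones quoted in this chapter (3/4, 6/10, 40/50).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


cases = [
    ("adversarial, 20-example set", 3, 4),
    ("adversarial, 50-example set", 6, 10),
    ("overall, 50-example set", 40, 50),
]
widths = []
for label, k, n in cases:
    low, high = wilson_interval(k, n)
    widths.append(high - low)
    print(f"{label:30s} {k}/{n}: {low:.0%} to {high:.0%}")
```

The interval on four adversarial examples spans most of the plausible range, which is the quantitative version of "no real signal."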
Three triggers, in order of urgency: (1) any judge change — model swap, rubric edit, prompt rewrite — recalibrate immediately, the change isn't done until the calibration says it is; (2) monthly drift check — providers update model snapshots quietly; the same judge code with a new snapshot is a different judge; (3) when the agent's output distribution shifts noticeably — new tool, new domain, new prompt pattern. The agent's distribution drifting can move the judge's accuracy on it.
The monthly check is the cheap one to skip and the most common cause of "we don't know why our scores moved." Make it a calendar event.
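If a calendar event feels too easy to ignore, the three triggers can be checked in code, for example at the start of every nightly eval run. A minimal sketch; the function name, the reason labels, and the 30-day definition of "monthly" are assumptions, and the dates are made up.

```python
from datetime import date, timedelta


def recalibration_reasons(last_calibrated: date, today: date,
                          judge_changed: bool, distribution_shifted: bool,
                          max_age: timedelta = timedelta(days=30)) -> list[str]:
    """Return the triggers that currently apply, most urgent first."""
    reasons = []
    if judge_changed:  # model swap, rubric edit, prompt rewrite
        reasons.append("judge_change")
    if today - last_calibrated >= max_age:  # silent snapshot drift
        reasons.append("monthly_drift_check")
    if distribution_shifted:  # new tool, domain, or prompt pattern
        reasons.append("agent_distribution_shift")
    return reasons


fresh = recalibration_reasons(date(2026, 3, 1), date(2026, 3, 10), False, False)
stale = recalibration_reasons(date(2026, 3, 1), date(2026, 4, 5), False, False)
edited = recalibration_reasons(date(2026, 3, 1), date(2026, 4, 5), True, False)
print(fresh, stale, edited)
```

An empty list means the last calibration still stands; anything else is a reason to re-run the calibration script before trusting new scores.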
Production patterns: ensembles, versioning, fallback rules.
You have a calibrated judge with both-orderings discipline. That's the substrate; this step covers the operational patterns that keep it working over months in production. Four practices, each cheap, each preventing a specific class of silent failure.
Cross-family ensembles for release gates
For the highest-stakes evals — the ones that gate releases or contractually-binding quality claims — run three judges from three different model families and take majority vote (or require 2-of-3 agreement). The cost is 3× judge spend; the benefit is that no single family's biases drive your decisions.
# evals/ensemble_judge.py
import asyncio
import collections

from anthropic import Anthropic
from openai import OpenAI
import google.generativeai as genai

claude = Anthropic()
gpt = OpenAI()
gemini = genai.GenerativeModel("gemini-2.5-pro")

# judge_with_claude_both_orderings, judge_with_gpt_both_orderings, and
# judge_with_gemini_both_orderings are assumed to be async per-provider
# wrappers around the both-orderings judge; they are not defined here.


async def ensemble_judge(q, a, b):
    # Run three judges in parallel. Each does both-orderings internally.
    verdicts = await asyncio.gather(
        judge_with_claude_both_orderings(q, a, b),
        judge_with_gpt_both_orderings(q, a, b),
        judge_with_gemini_both_orderings(q, a, b),
    )
    counts = collections.Counter(verdicts)
    most_common, n = counts.most_common(1)[0]
    if n >= 2:  # 2-of-3 or 3-of-3 agreement
        return most_common
    return "contested"  # 1/1/1 split: escalate to human review
A "contested" output is gold — it identifies the cases where judges genuinely disagree, which is exactly the set you want a human to look at. Over time, these are the examples that grow your calibration set and sharpen your rubric.
When to use the ensemble: release-gating evals only. For continuous PR-level evals (chapter 3.1's CI workflow), a single calibrated judge is fine; 3× judge spend on every PR is wasteful. Use ensembles for the eval suite that determines whether a version ships to production.
Judge versioning
The judge is code. Like any code, changes to it need to be tracked, reviewed, and tied to score deltas so you can tell whether a score change came from the agent or from the judge.
Three things to version together:
- The judge model and snapshot. claude-sonnet-4-5-20250929, not claude-sonnet-4-5. Provider aliases can resolve to different snapshots over time.
- The judge prompt. A SHA-256 hash of the prompt template, recorded with every score. Any prompt edit produces a new hash.
- The rubric. If the rubric is a separate file (and it should be), version it too.
# scoreboard.csv extended with judge metadata
commit, ts, branch,
judge_model,                  # claude-sonnet-4-5-20250929
judge_prompt_hash,            # short SHA
judge_calibration_agreement,  # last measured: 0.84
overall, retrieval_recall, ...
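Computing the prompt hash takes a few lines with the standard library. A minimal sketch: the six-character truncation matches the short hashes shown in this chapter, while the helper names and the separate `judge_rubric_hash` field are assumptions for illustration.

```python
import hashlib


def short_hash(text: str, length: int = 6) -> str:
    """First `length` hex characters of the SHA-256 digest of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def judge_metadata(model_snapshot: str, prompt_template: str, rubric_text: str) -> dict:
    """Judge fields to record alongside every scoreboard row."""
    return {
        "judge_model": model_snapshot,                    # pinned snapshot, not an alias
        "judge_prompt_hash": short_hash(prompt_template),
        "judge_rubric_hash": short_hash(rubric_text),     # hypothetical extra column
    }


meta = judge_metadata(
    "claude-sonnet-4-5-20250929",
    "Compare these two answers. Question: {q} Answer A: {a} Answer B: {b}",
    "1. Correctness. 2. Specificity. 3. Conciseness. 4. Calibrated confidence.",
)
print(meta)
```

Hash the template before formatting it with a specific question, so the value identifies the judge rather than the individual example.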
The discipline: when you change the judge, the scoreboard row records the new hash. The delta tool from chapter 3.1 warns when comparing two rows with different judge_prompt_hash values — that's a sign you're comparing apples to oranges.
$ python scripts/eval_delta.py --commit d1f5b88
⚠ baseline and candidate differ on judge config:
- judge_prompt_hash: 7b2a1c → c1e8a3 (changed in this PR)
This score delta mixes agent changes AND a judge change.
The judge change alone may produce a 2-5 point shift.
Re-run the baseline with the new judge before drawing conclusions.
overall 0.768 → 0.781 +0.013 [REAL?]
...
The warning prevents the most insidious confusion: shipping an agent "improvement" that was actually a judge-prompt edit that made the judge nicer.
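The check behind a warning like this can be very small. The sketch below is not the delta tool itself; it only illustrates comparing the judge fields of two scoreboard rows, with the row shape assumed from the scoreboard columns and the values taken from the example output above.

```python
JUDGE_FIELDS = ("judge_model", "judge_prompt_hash")


def judge_config_diff(baseline: dict, candidate: dict) -> dict:
    """Judge fields that differ between two rows, as {field: (old, new)}."""
    return {
        field: (baseline.get(field), candidate.get(field))
        for field in JUDGE_FIELDS
        if baseline.get(field) != candidate.get(field)
    }


baseline = {"judge_model": "claude-sonnet-4-5-20250929",
            "judge_prompt_hash": "7b2a1c", "overall": 0.768}
candidate = {"judge_model": "claude-sonnet-4-5-20250929",
             "judge_prompt_hash": "c1e8a3", "overall": 0.781}

diff = judge_config_diff(baseline, candidate)
if diff:
    print("baseline and candidate differ on judge config:")
    for field, (old, new) in diff.items():
        print(f"  - {field}: {old} -> {new}")
    print("Re-run the baseline with the new judge before drawing conclusions.")
```

An empty diff means the score delta can be attributed to the agent; a non-empty one means the comparison needs a re-run first.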
When to fall back to deterministic graders
Some metrics are tempting to grade with an LLM but shouldn't be. The rule: if the question has a deterministic answer, use a deterministic grader. The LLM judge is your fallback when no deterministic check works, not your default.
Concrete cases where teams reach for LLM judges and shouldn't:
- Code correctness. Run the tests. A test that passes is correct; one that fails is wrong. No judge needed. (If you're using LLM-as-judge for code, you've designed your eval wrong; rebuild it around test execution.)
- Citation verification. Check whether the cited span actually contains the claim. For quotes and close paraphrases, string matching or sentence-embedding similarity will do this faster and more reliably than asking a model "does this claim follow from this source?" Keep the judge for claims that need inference from the source.
- Schema conformance. Did the agent's output match the JSON schema? jsonschema.validate(). Done.
- Forbidden actions. Did the agent call any tool from the blocklist? Iterate the trace. No judge required.
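Two of the checks above fit in a few lines of standard-library Python. This is a sketch under simplifying assumptions: the citation check only handles quoted spans (whitespace- and case-insensitive containment), the trace is reduced to a list of tool names, and the tool names themselves are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_source(quote: str, source: str) -> bool:
    """True if the quoted span appears in the source, ignoring case and spacing."""
    return normalize(quote) in normalize(source)


def forbidden_calls(tool_calls: list[str], blocklist: set[str]) -> list[str]:
    """Names of blocklisted tools the agent called, in call order."""
    return [name for name in tool_calls if name in blocklist]


source = ("VACUUM FULL rewrites the entire table, reclaiming all dead tuple "
          "space and returning it to the OS. It takes an exclusive lock.")

supported = quote_in_source("takes an  exclusive\nlock", source)
unsupported = quote_in_source("takes a shared lock", source)
violations = forbidden_calls(["search_kb", "delete_record", "search_kb"],
                             blocklist={"delete_record", "send_email"})
print(supported, unsupported, violations)
```

Both functions return the same answer on every run, which is exactly the property a judge cannot offer.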
The pattern: a hybrid evaluator that runs deterministic checks first and falls back to LLM-judge only for the dimensions deterministic can't handle.
async def evaluate_trajectory(traj):
    score = {}

    # Deterministic checks first: cheap and exact
    score["completed"] = traj.stop_reason == "end_turn"
    score["used_forbidden_tool"] = any(
        t.name in FORBIDDEN for t in traj.tool_calls)
    score["step_count_in_budget"] = len(traj.tool_calls) <= 20
    score["cited_sources_exist"] = all_citations_resolve(traj)

    # Judge for what determinism can't capture
    score["answer_quality"] = await judge_quality(traj)
    score["faithfulness"] = await judge_faithfulness(traj)
    return score
The judge-failure escalation path
What happens when the judge is wrong and you know it? Two paths:
Single-output disagreement (an engineer reads a trace and the score doesn't match their intuition): add it to the calibration set as a new disagreement case. Re-run calibration. If the overall agreement dropped, tighten the rubric and re-measure. If it didn't drop, that one case might be the engineer's read, not the judge's; reconcile with a second human rater.
Systematic skew (you notice every PR for the last week has been judged favorably or unfavorably in a suspicious pattern): immediately switch to the ensemble judge for the next week's PRs while you investigate. Check the calibration agreement rate — has it dropped? If yes, the judge has drifted (often a model-snapshot change) and you need to recalibrate or revert. If no, the agent's output distribution may have shifted into territory the judge handles poorly; widen the calibration set with examples from the new distribution.
The principle: you should never be in a situation where you suspect the judge is broken but have no protocol for confirming or fixing it. The calibration set is the canonical resolution mechanism.
Treat the judge as a model you trained, even though you didn't actually train it. Models have validation sets, drift over time, need monitoring, and require explicit decisions about when to retire and replace. Your judge is no different. The teams that succeed long-term with LLM-as-judge are the ones that treat judge maintenance as a real engineering activity rather than a one-time setup.
Debugging a regression that turned out to be the judge.
A real-shape story to make Step 4 concrete. A team's trajectory_pass_rate dropped from 0.764 to 0.712 over two weeks with no PRs to the agent code. Five days into investigating "what regressed?" they realized the agent hadn't.
The symptom
Monday morning scoreboard scan, two weeks after the last agent merge:
commit   date        branch  trajectory_pass_rate
e7b3c20  Wed Mar 4   main    0.764  ← last agent merge
e7b3c20  Mon Mar 9   main    0.751  (nightly re-run)
e7b3c20  Mon Mar 16  main    0.738  (nightly re-run)
e7b3c20  Mon Mar 23  main    0.712  (nightly re-run)
NO PRS TO main IN THIS PERIOD.
Same commit, dropping scores. The team's first instinct: corpus drift. Maybe new community articles in the KB are degrading retrieval. They check; the corpus hash hasn't changed. Second instinct: API-side regression. Maybe Anthropic quietly shipped a new Sonnet snapshot. They check the response.model field and see the same snapshot.
Two days into investigation, someone notices: the judge_prompt_hash on the failing rows is different from the passing ones.
commit   date    judge_prompt_hash  trajectory_pass_rate
e7b3c20  Mar 4   7b2a1c             0.764  ← original
e7b3c20  Mar 9   7b2a1c             0.751  ← same hash, ~noise
e7b3c20  Mar 16  c1e8a3             0.738  ← HASH CHANGED
e7b3c20  Mar 23  c1e8a3             0.712  ← still c1e8a3
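A check along the following lines could surface this kind of hash change automatically instead of leaving it to someone's eye. This is a minimal sketch, not the team's actual tooling: the `ScoreRow` type and `find_judge_changes` function are hypothetical names, and the rows simply mirror the table above.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoreRow(NamedTuple):
    """One scoreboard row (hypothetical schema mirroring the table above)."""
    commit: str
    date: str
    judge_prompt_hash: str
    trajectory_pass_rate: float


def find_judge_changes(rows):
    """Return (commit, sorted_hashes) for commits scored under more than one judge prompt."""
    hashes_by_commit = defaultdict(set)
    for row in rows:
        hashes_by_commit[row.commit].add(row.judge_prompt_hash)
    return [
        (commit, sorted(hashes))
        for commit, hashes in hashes_by_commit.items()
        if len(hashes) > 1
    ]


rows = [
    ScoreRow("e7b3c20", "Mar 4", "7b2a1c", 0.764),
    ScoreRow("e7b3c20", "Mar 9", "7b2a1c", 0.751),
    ScoreRow("e7b3c20", "Mar 16", "c1e8a3", 0.738),
    ScoreRow("e7b3c20", "Mar 23", "c1e8a3", 0.712),
]

for commit, hashes in find_judge_changes(rows):
    # Scores for the same commit under different judge prompts are not comparable.
    print(f"WARNING: {commit} was scored under multiple judge prompts: {hashes}")
```

Grouping by commit is a deliberate simplification: if the agent code is identical and the judge prompt differs, any score movement between those rows says something about the judge, not the agent.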
What happened
About two weeks earlier, on Mar 10, an engineer had edited the judge rubric to add a "faithfulness" criterion. The PR was reviewed and merged. The scoreboard recorded the new hash. Nobody recalibrated the judge against the calibration set after the rubric change.
The team runs the calibration check now, with the new judge prompt:
$ python scripts/calibrate_judge.py
clear_win              15/15 = 100%
close_call             17/25 = 68%   ← was 76%
adversarial_verbosity   5/10 = 50%   ← was 60%
overall:               37/50 = 74%   ← was 80%
The new rubric is stricter on faithfulness — which is fine — but it also accidentally tightened on close calls in a way that didn't match human judgment. The judge is now systematically calling more close calls as "wrong" than humans would. The agent didn't regress; the judge got pickier.
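The per-category figures in that calibration output come down to counting how often the judge's label matches the human label within each category. The sketch below shows one way to compute them. It is not the contents of scripts/calibrate_judge.py, which the text does not show; `agreement_by_category` and `synthetic` are hypothetical helpers, and the example data is synthetic, built only to reproduce the counts above.

```python
from collections import Counter


def agreement_by_category(examples):
    """examples: iterable of (category, human_label, judge_label) tuples.

    Returns {category: (agreements, total)} plus an "overall" entry.
    """
    totals, hits = Counter(), Counter()
    for category, human_label, judge_label in examples:
        totals[category] += 1
        hits[category] += int(human_label == judge_label)
    report = {category: (hits[category], totals[category]) for category in totals}
    report["overall"] = (sum(hits.values()), sum(totals.values()))
    return report


def synthetic(category, agree, total):
    """Build placeholder rows with a fixed number of agreements (illustration only)."""
    return [(category, "A", "A")] * agree + [(category, "A", "B")] * (total - agree)


examples = (
    synthetic("clear_win", 15, 15)
    + synthetic("close_call", 17, 25)
    + synthetic("adversarial_verbosity", 5, 10)
)

for name, (agree, total) in agreement_by_category(examples).items():
    print(f"{name:<24}{agree}/{total} = {agree / total:.0%}")
```

Reporting each category separately matters here: the overall number fell by six points, but the breakdown shows that clear wins were untouched and the damage was concentrated in close calls and adversarial verbosity.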
The fix
Three steps:
- Document the finding. A short writeup in the team's eng-journal: "Trajectory pass rate dropped from 0.764 to 0.712 between Mar 4 and Mar 23. Root cause: judge rubric edit on Mar 10 without recalibration. Calibration agreement dropped from 0.80 to 0.74. Agent unchanged."
- Decide what to do with the judge. Two options: revert the rubric edit (preserves comparability of historical scores) or accept the stricter rubric and rebaseline (lower scores, but more discriminating). The team chose to keep the stricter rubric — the faithfulness criterion is genuinely valuable — but to add three anchored examples to fix the close-call regression. After this, calibration goes to 0.81 and trajectory_pass_rate stabilizes at 0.736 on the unchanged agent.
- Update CI. Add a calibration check to the judge-prompt change workflow: any PR that modifies a file under evals/judge/ must run the calibration script as a status check, and the PR comment surfaces the agreement delta. A change that drops calibration agreement by >3 points is a hard block; anything between -3 and 0 is a soft warning.
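The thresholds in the CI rule above can be sketched as a small classification function. This is an illustration under stated assumptions, not the team's workflow: `calibration_gate` is a hypothetical name, agreement is assumed to be a fraction between 0 and 1, a drop of exactly 3 points is treated as a warning (the rule only blocks drops greater than 3), and an unchanged score is treated as a pass.

```python
def calibration_gate(baseline, candidate, hard_block_points=3.0):
    """Classify a judge change by its calibration delta, in percentage points.

    Returns (status, delta) where status is "block", "warn", or "pass".
    """
    # Round away float noise so a drop of exactly 3 points is not misclassified.
    delta = round((candidate - baseline) * 100, 6)
    if delta < -hard_block_points:
        return "block", delta  # drop of more than 3 points: hard block
    if delta < 0:
        return "warn", delta   # small drop: soft warning
    return "pass", delta       # unchanged or improved


for candidate in (0.74, 0.78, 0.81):
    status, delta = calibration_gate(0.80, candidate)
    print(f"candidate={candidate:.2f} delta={delta:+.1f}pt -> {status}")
```

In a real pipeline the "block" status would presumably map to a failing exit code for the status check, and the delta would be posted in the PR comment; how that is wired up depends on the CI system in use.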
This team's investigation took 5 days. The unwritten cost is everything they didn't ship in those 5 days because they were busy investigating a regression that wasn't real. The deeper cost is the trust hit: every subsequent score movement now triggers a "is this real?" reflex, and the team can't iterate as confidently.
The fix for the next time isn't better detective work. It's the CI rule that catches the problem at PR time. Any judge change without a calibration delta is now a failed status check; the rubric edit would have been flagged before merge, and the team would have addressed it then instead of more than two weeks later.
This is the discipline that makes LLM-as-judge sustainable at scale. The biases are real, the drift is real, the judge changes accumulate — and the only defense is treating the judge with the same rigor as any other piece of production code.
Deliverable
A calibrated, versioned, audited LLM-as-judge pipeline with documented agreement against human raters. Both-orderings discipline removing position bias. Rubric tuned to defuse verbosity and authority bias. Cross-family ensemble for release-gate evals. CI rule that calibration must hold whenever the judge changes. A team that knows when to trust the judge and when to fall back to humans, with an explicit policy rather than vibes. The substrate that makes the rest of Part III's eval methodology actually trustworthy.
- Calibration set: 50 examples across clear / close / adversarial categories, double-rated by humans
- Judge pipeline with both-orderings pairwise (position bias defeated)
- Rubric with explicit conciseness and calibrated-confidence guidance
- 2-3 anchored worked examples inside the judge prompt
- Measured agreement rate against humans; >80% overall, >90% on clear wins
- Cross-family / cross-provider judge for release-gate evals (3-of-3 with contested escalation)
- Judge versioning: model snapshot + prompt hash recorded with every score
- Delta tool warns when comparing rows with different judge_prompt_hash
- CI rule: judge-config change → calibration check required, >3pt drop blocks merge
- Monthly drift check on calendar; agreement re-measured against same calibration set
- Documented fallback policy: deterministic graders where possible, judge for the rest
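The judge-versioning items in the checklist (model snapshot plus prompt hash recorded with every score, and a warning when comparing rows with different hashes) can be sketched roughly as follows. Everything here is an assumption made for illustration: the `ScoreRecord` fields, the whitespace normalization, the 6-character truncation (chosen to resemble the hashes shown earlier), and the placeholder model name.

```python
import hashlib
from dataclasses import dataclass


def prompt_hash(prompt_text: str, length: int = 6) -> str:
    """Short, stable fingerprint of a judge prompt.

    Trailing whitespace is stripped first so cosmetic edits do not change the hash.
    """
    normalized = "\n".join(line.rstrip() for line in prompt_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


@dataclass(frozen=True)
class ScoreRecord:
    """One scoreboard entry with the judge identity stored next to the score."""
    commit: str
    judge_model: str          # the snapshot identifier returned by the API
    judge_prompt_hash: str
    trajectory_pass_rate: float


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores are comparable only if the same judge produced them."""
    return (a.judge_model, a.judge_prompt_hash) == (b.judge_model, b.judge_prompt_hash)


old_prompt = "Rate correctness and helpfulness.\n"
new_prompt = "Rate correctness, helpfulness, and faithfulness.\n"

before = ScoreRecord("e7b3c20", "judge-snapshot-placeholder", prompt_hash(old_prompt), 0.764)
after = ScoreRecord("e7b3c20", "judge-snapshot-placeholder", prompt_hash(new_prompt), 0.712)

if not comparable(before, after):
    print("WARNING: rows were scored by different judge configurations; the delta is not meaningful")
```

Whether to normalize whitespace before hashing is a judgment call: normalizing avoids false alarms from reformatting, while hashing the raw bytes guarantees that any change at all shows up.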