Verifier-Guided Search

Deep Dive · Reasoning & Test-Time Compute

The verifier is the product; the generator is just a proposal distribution.

Every search method in this section ends at a selector. When that selector is a learned reward model rather than a vote, you get verifier-guided search: best-of-N re-ranking, beam search, or tree search steered by an outcome reward model (ORM) or a process reward model (PRM). This essay contrasts ORM and PRM, shows where each plugs into the search, and defends the central claim — the verifier's quality is the ceiling on everything.

STEP 1

Two reward shapes: outcome vs process.

An ORM scores a finished trajectory: one scalar for the final answer's correctness. It is cheap to label (you only need outcome labels) and trivially compatible with best-of-N, but it is blind to where a trajectory went wrong, so it gives no signal to prune mid-search.

A PRM scores each step: a per-step correctness estimate over the partial trace. It enables pruning and beam search because you can rank partial states, but step-level labels are expensive and noisy to collect (what is a "correct step" in a derivation that still reaches the wrong answer?).

The 2025 trend is generative verifiers: GenPRM/ThinkPRM-style models that reason about a step before scoring it, which reduces the labeled-data burden and is reported to outperform discriminative PRMs under beam search.
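To make the two reward shapes concrete, here is a minimal sketch. The scorer signatures, the helper names, and the choice of `min` as the aggregation rule are assumptions for illustration (product and last-step score are other common choices); none of them come from a specific library.

```python
from typing import Callable, List

# Hypothetical signatures, for illustration only.
OutcomeScorer = Callable[[str, List[str]], float]  # (question, full trace) -> one scalar
StepScorer = Callable[[str, List[str]], float]     # (question, prefix) -> score of the prefix's last step


def orm_score(orm: OutcomeScorer, question: str, steps: List[str]) -> float:
    """ORM shape: a single scalar for the finished trajectory."""
    return orm(question, steps)


def prm_scores(prm: StepScorer, question: str, steps: List[str]) -> List[float]:
    """PRM shape: one score per step, each computed on the prefix ending at that step."""
    return [prm(question, steps[: i + 1]) for i in range(len(steps))]


def aggregate_min(step_scores: List[float]) -> float:
    """Collapse per-step scores into a trace score: the trace is as good as its weakest step."""
    return min(step_scores)


def first_bad_step(step_scores: List[float], threshold: float = 0.5) -> int:
    """Index of the first step scoring below threshold, or -1 if there is none."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1
```

The last helper is the practical difference: with per-step scores you can locate the step where a trace went wrong and stop expanding it, which a single outcome scalar cannot tell you.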

STEP 2

Where the verifier plugs in.

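# Best-of-N with an outcome verifier (ORM)
# model, ORM, q and N stand in for your sampler, reward model, prompt and sample budget.
cands = [model.sample(q) for _ in range(N)]
# Python has no built-in argmax; max(..., key=...) returns the top-scoring candidate.
best  = max(cands, key=lambda c: ORM.score(q, c))

# Beam search with a process verifier (PRM)
from heapq import nlargest

beam = [root(q)]
for _ in range(DEPTH):
    # Every surviving state proposes K next steps.
    nxt = [s.extend(t) for s in beam for t in s.expand(K)]
    # Keep the B partial traces the PRM ranks highest.
    beam = nlargest(B, nxt, key=lambda s: PRM.step_score(s))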

ORM is a pure re-ranker bolted onto independent samples. PRM is a steering signal inside the search loop. Same generator; the verifier decides what survives.
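The two plug-in points can be sketched as self-contained functions. This is a toy under stated assumptions: states are tuples of integers, and `expand` and `step_score` are caller-supplied stand-ins for the generator and the PRM, not real model APIs.

```python
from heapq import nlargest
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
State = Tuple[int, ...]  # toy partial trace: the steps taken so far


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """ORM-style re-ranking: score finished candidates, keep the best one."""
    return max(candidates, key=score)


def beam_search(
    root: State,
    expand: Callable[[State], List[int]],   # stand-in for the generator proposing next steps
    step_score: Callable[[State], float],   # stand-in for the PRM scoring a partial trace
    depth: int,
    beam_width: int,
) -> List[State]:
    """PRM-style steering: after every step, keep only the top-scoring partial traces."""
    beam = [root]
    for _ in range(depth):
        candidates = [state + (step,) for state in beam for step in expand(state)]
        if not candidates:
            break  # nothing left to expand
        beam = nlargest(beam_width, candidates, key=step_score)
    return beam
```

For example, with `expand = lambda s: [1, 2, 3]` and a toy scorer `lambda s: -abs(sum(s) - 2 * len(s))`, a depth-3 search keeps only traces whose running sum tracks two per step. The scorer is consulted inside the loop, which is what separates this from re-ranking finished samples.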

STEP 3

The verifier is everything: the asymmetry argument.

Verifier-guided search works because verification is often easier than generation. Best-of-N converts a weak generator into a strong system only when two conditions hold: the generator puts non-negligible probability mass on a correct trajectory, so that one shows up somewhere in N samples, and the verifier can reliably tell that correct trajectory from a plausible-but-wrong one. The first condition can often be met by sampling more; the second is the hard discrimination, and it falls entirely on the verifier. This is why a small policy model with a good PRM can, under compute-optimal test-time scaling, match a much larger model without one. The corollary is brutal: once the generator covers the correct answer, your system is only as good as your verifier. For tasks where a usable verifier exists, money spent improving the verifier typically buys more than money spent improving the generator.
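The generator's side of the argument is simple arithmetic. As a sketch, assume each sample is independently correct with probability p (a simplification, since real samples from one model are correlated). The function names below are illustrative.

```python
import math


def coverage(p_correct: float, n: int) -> float:
    """P(at least one of n independent samples is correct) = 1 - (1 - p)^n."""
    if not 0.0 <= p_correct <= 1.0 or n < 0:
        raise ValueError("need 0 <= p_correct <= 1 and n >= 0")
    return 1.0 - (1.0 - p_correct) ** n


def samples_needed(p_correct: float, target: float) -> int:
    """Smallest n whose coverage reaches target, under the same independence assumption."""
    if not 0.0 < p_correct < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p_correct < 1 and 0 < target < 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_correct))
```

Coverage is only an upper bound on best-of-N accuracy: it is what a perfect verifier would achieve. Whatever the real verifier fails to discriminate is subtracted from that bound, which is the asymmetry the paragraph above describes.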

STEP 4

Reward hacking: search optimizes the proxy you actually wrote.

Search is an optimizer pointed at the verifier's score, so it finds the verifier's blind spots. Crank N high enough and best-of-N stops finding the best answer and starts finding the answer that best games the reward model — verbose, confident, superficially well-structured outputs that the RM over-rewards. This is reward hacking at inference time, and it gets worse with more compute, not better: the curve of true accuracy vs N can rise, plateau, then bend down even as the RM score keeps climbing.

Always plot true task accuracy (held-out, ground-truth) against N, not just mean reward. If accuracy turns over while reward keeps rising, you are over-optimizing a flawed proxy. The fix is a better/harder verifier or a tighter N — never "add more samples."
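A minimal sketch of that check, assuming you have already measured held-out accuracy and mean reward at several values of N. The function name, the `tolerance` parameter, and the simple peak-versus-last-point rule are illustrative choices, not a standard diagnostic.

```python
from typing import Optional, Sequence


def find_turnover(
    ns: Sequence[int],
    accuracy: Sequence[float],
    mean_reward: Sequence[float],
    tolerance: float = 0.0,
) -> Optional[int]:
    """Return the N where accuracy peaked if accuracy has since dropped by more
    than `tolerance` while mean reward kept rising; otherwise return None."""
    if not ns or not (len(ns) == len(accuracy) == len(mean_reward)):
        raise ValueError("ns, accuracy and mean_reward must be non-empty and equal length")
    # max() returns the first index that attains the maximum accuracy.
    peak = max(range(len(ns)), key=lambda i: accuracy[i])
    accuracy_dropped = accuracy[-1] < accuracy[peak] - tolerance
    reward_rose = mean_reward[-1] > mean_reward[peak]
    return ns[peak] if (accuracy_dropped and reward_rose) else None
```

A non-None result is the signal to stop increasing N and work on the verifier instead. Set `tolerance` to roughly the noise in your accuracy estimate so that small evaluation jitter is not read as a turn-over.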

STEP 5

Picking the regime by what you can actually score.

If a ground-truth checker exists (unit tests, a SAT/LP solver, a proof checker, an executable spec), use that as the verifier: it is unhackable in the ways learned RMs are not, and best-of-N over it is the strongest cheap lever in the section. If only a learned ORM is available, prefer best-of-N with conservative N and monitor the turn-over. Reach for a PRM with beam/tree search only when the problem has long multi-step structure where pruning early saves real compute and you can afford the step-scoring calls (each of the B states in the beam pays K generator + K verifier calls, so a full beam step costs about B×K of each). If you have no usable verifier, you are not doing verifier-guided search; you are doing self-consistency or nothing, and you should be honest about which.
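The step-scoring cost can be estimated before committing to a regime. This sketch counts calls for the beam loop from Step 2, assuming one generator call per proposed step and one verifier call per candidate. Per expanded state that is K of each; summed over the states currently in the beam it is up to B×K per level. Function names are illustrative.

```python
from typing import Dict


def best_of_n_calls(n: int) -> Dict[str, int]:
    """Best-of-N: n full samples, each scored once."""
    return {"generator": n, "verifier": n}


def beam_calls(depth: int, beam_width: int, k: int) -> Dict[str, int]:
    """Beam search starting from a single root state.

    Each level expands every state in the beam into k candidates and scores
    each candidate once, then keeps at most beam_width of them.
    """
    beam_size = 1  # the loop starts from the root only
    total = 0
    for _ in range(depth):
        candidates = beam_size * k
        total += candidates
        beam_size = min(beam_width, candidates)
    return {"generator": total, "verifier": total}
```

Beam generator calls are single steps while best-of-N generator calls are whole trajectories, so the two counts are not directly comparable in tokens. The verifier counts are the more useful comparison when step scoring is the expensive part.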

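Prefer a cheap exact checker over an expensive learned RM whenever the task admits one, even a partial one (type-checks, a few property tests). A noisy RM with 0.9 AUC still misorders about one in ten correct/incorrect pairs, and the more candidates you rank, the more likely one of those misordered wrong answers ends up on top. A sound checker never accepts a wrong answer, so the system is capped by the generator's coverage of correct trajectories, which is usually far higher.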
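One way to combine the two, sketched under assumptions: filter candidates through the exact (possibly partial) checker first, and consult a learned score only to break ties among survivors. `select_with_checker` and `passes_property_tests` are hypothetical helpers, and the sorting task is a toy stand-in for a real spec.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_with_checker(
    candidates: Sequence[T],
    checker: Callable[[T], bool],
    rm_score: Optional[Callable[[T], float]] = None,
) -> Optional[T]:
    """Keep only candidates the checker accepts; rank survivors by rm_score if given."""
    passing = [c for c in candidates if checker(c)]
    if not passing:
        return None  # no candidate survived: report failure rather than guess
    if rm_score is None:
        return passing[0]
    return max(passing, key=rm_score)


def passes_property_tests(sort_fn: Callable[[list], list]) -> bool:
    """Toy partial checker: a candidate sort function must agree with sorted() on a few inputs."""
    cases = [[], [1], [3, 1, 2], [2, 2, 1], [5, 4, 3, 2, 1]]
    try:
        return all(sort_fn(list(xs)) == sorted(xs) for xs in cases)
    except Exception:
        return False  # a crashing candidate fails the check
```

A handful of property tests is necessary but not sufficient evidence of correctness, so a partial checker narrows the field rather than certifying the winner. Returning `None` when nothing passes keeps the system from falling back to the learned score on candidates the checker has already rejected.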

STEP 6

The honest tradeoff.

Verifier-guided search converts extra inference compute into accuracy at exactly the conversion rate your verifier's discrimination allows — and a flawed verifier turns more compute into more confidently-hacked output. Invest in the verifier before the generator, prefer an exact checker to a learned one, and watch the accuracy-vs-N curve for the turn-over that says you are now optimizing the proxy, not the task.