Process vs Outcome Reward Models

Deep Dive · Training Agentic Models

Process vs outcome rewards: pay for steps, or pay for the answer.

An outcome reward model (ORM) scores only the final result: was the answer right, did the tests pass. A process reward model (PRM) scores each intermediate step. An ORM is cheap to label and, when the outcome is checked by a trustworthy verifier, hard to game, but it gives almost no credit-assignment signal across a long trajectory. A PRM gives dense per-step signal but costs far more to label and introduces a new gameable surface. This essay is about that trade-off and how to decide which to pay for.

STEP 1

The two reward shapes

An ORM is a function of the terminal state only: r = orm(final). One scalar per trajectory, regardless of length. A PRM assigns a value to every step: r_t = prm(step_t), turning one trajectory into a dense vector of signal. ORM answers "was it right?"; PRM answers "was this step on a good path?" — a much more informative, much more expensive question.

# ORM: one terminal scalar per trajectory
orm_r = orm(traj.final)
# PRM: one score per step (dense signal)
prm_r = [prm(s) for s in traj.steps]
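
For a version that actually runs, here is a minimal sketch on a toy task where each step claims a running sum. The `Trajectory` class, the addition task, and both scoring functions are hypothetical stand-ins chosen for illustration, not a real reward model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    steps: List[int]   # hypothetical: each step is a claimed running total
    final: int         # the answer the trajectory ends on

def orm(final: int, target: int) -> float:
    """Outcome reward: 1.0 if the final answer matches the target, else 0.0."""
    return 1.0 if final == target else 0.0

def prm(steps: List[int], addends: List[int]) -> List[float]:
    """Process reward: 1.0 for each step that follows correctly from the previous one."""
    scores, prev = [], 0
    for claimed, addend in zip(steps, addends):
        scores.append(1.0 if claimed == prev + addend else 0.0)
        prev = claimed  # judge the next step relative to what was actually claimed
    return scores

addends = [2, 3, 4, 5]
# The third step should be 9; the arithmetic slip carries into the final answer.
traj = Trajectory(steps=[2, 5, 10, 15], final=15)

orm_r = orm(traj.final, target=sum(addends))  # one scalar for the whole trajectory
prm_r = prm(traj.steps, addends)              # one score per step
```

In this toy case the ORM reports only that the trajectory failed, while the per-step scores single out the third step as the one that broke.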

STEP 2

Why outcome reward starves long trajectories

Recall the credit-assignment problem (T3): a single terminal scalar shared across a 20-step trajectory tells the policy that it failed, not where. Every step receives the same credit, so the update treats the brilliant step three and the fatal step seventeen identically. As the horizon grows, the share of that one scalar attributable to any single step shrinks toward noise, and sample efficiency collapses. ORM is excellent when trajectories are short or the outcome is the only thing that is cleanly verifiable, and increasingly inadequate as steps multiply.
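
To make the starvation concrete, the sketch below assumes an undiscounted, REINFORCE-style setup in which each step's credit is the terminal reward minus a baseline. The function name and the 0.5 baseline are illustrative assumptions, not part of any particular training recipe.

```python
from typing import List

def outcome_credit(num_steps: int, terminal_reward: float, baseline: float = 0.0) -> List[float]:
    """Broadcast one terminal scalar to every step (undiscounted, no per-step reward)."""
    advantage = terminal_reward - baseline
    return [advantage] * num_steps

# A failed 20-step trajectory, with an assumed baseline of 0.5.
credits = outcome_credit(num_steps=20, terminal_reward=0.0, baseline=0.5)

# Step 3 (good) and step 17 (fatal) receive exactly the same credit.
same_credit = credits[2] == credits[16]
distinct_values = len(set(credits))  # how many different credit values the policy sees
```

Under these assumptions the policy sees a single distinct credit value across all 20 steps, so nothing in the signal points at where the trajectory went wrong.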

STEP 3

Why process reward helps — and what it costs

A PRM localizes the error. The policy learns "step 17 was the mistake" instead of "something in those 20 steps was," which is typically far more sample-efficient on long-horizon reasoning and multi-step tool tasks. The empirical pattern: dense process reward tends to beat sparse outcome reward mainly when the horizon is long and the failure is localizable, as in multi-step math, multi-hop tool use, and agentic coding.

The cost is steep. PRM labels require a human or a strong model to judge every step, so annotation volume grows with trajectory length instead of staying at one outcome label per trajectory, and each judgment is harder to make. Worse, the PRM is itself a learned model, so it is a new gameable surface: the policy can learn to emit steps that look good to the PRM while the trajectory still fails. You have traded one proxy problem for a denser, more expensive one.
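
A rough count of the labeling burden, assuming every step gets exactly one judgment and ignoring how hard each judgment is. Both are simplifications, and the dataset size and horizon are made-up numbers for illustration.

```python
def label_counts(num_trajectories: int, steps_per_trajectory: int) -> dict:
    """Count judgments needed under outcome-only vs per-step labeling."""
    orm_labels = num_trajectories                         # one per trajectory
    prm_labels = num_trajectories * steps_per_trajectory  # one per step
    return {
        "orm": orm_labels,
        "prm": prm_labels,
        "ratio": prm_labels / orm_labels,
    }

# Hypothetical dataset: 10,000 trajectories of 20 steps each.
counts = label_counts(num_trajectories=10_000, steps_per_trajectory=20)
```

The count ratio equals the horizon. The real cost gap can be wider, since judging an intermediate step usually demands more from the grader than checking a final answer.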

STEP 4

The labeling-cost ledger

ORM labels — one judgment per trajectory; often free when an executable verifier exists (tests, proof checker). Cheap, objective, robust on the dimension it measures.
PRM labels — one judgment per step, requiring graders who can assess intermediate reasoning. Expensive, slower, and subjective unless steps are themselves verifiable.
Automated PRM — bootstrap step labels from rollouts (a step is "good" if continuations from it tend to succeed). Cheaper than humans, but noisier and inherits the base policy's blind spots.
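
The automated option can be sketched as follows: label each step by the fraction of sampled continuations from it that succeed. This is a minimal sketch under stated assumptions. `rollout_succeeds` is a hypothetical callback standing in for "sample a continuation with the base policy and run the outcome verifier", and `toy_rollout` fakes that with fixed success probabilities.

```python
import random
from typing import Callable, List

def mc_step_labels(
    steps: List[str],
    rollout_succeeds: Callable[[List[str], random.Random], bool],
    num_rollouts: int = 32,
    seed: int = 0,
) -> List[float]:
    """Label each step with the success rate of continuations sampled after it."""
    rng = random.Random(seed)
    labels = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]  # the trajectory up to and including step t
        wins = sum(rollout_succeeds(prefix, rng) for _ in range(num_rollouts))
        labels.append(wins / num_rollouts)
    return labels

def toy_rollout(prefix: List[str], rng: random.Random) -> bool:
    """Toy stand-in: prefixes containing a bad step rarely recover."""
    if "bad" in prefix:
        return rng.random() < 0.05
    return rng.random() < 0.8

labels = mc_step_labels(["ok", "ok", "bad", "ok"], toy_rollout, num_rollouts=200)
```

The estimated value drops sharply at the first bad step and stays low afterward. Because the labels come from the base policy's own continuations, a step the policy cannot recover from, or one it recovers from by luck, will be mislabeled, which is the noise and blind-spot cost noted above.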

The decision usually reduces to: is your outcome cheaply and objectively verifiable, and is your horizon short enough that ORM credit assignment still works? If yes, ORM is the disciplined default. PRM is what you buy when the horizon defeats ORM and you can afford the labels.
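
That decision can be written down as a small rule of thumb. The function below is only a sketch of the reasoning in this section. The `short_horizon_max` threshold of 5 steps is an arbitrary assumption, and the right value depends on the task.

```python
def choose_reward(
    outcome_verifiable: bool,
    horizon: int,
    can_afford_step_labels: bool,
    short_horizon_max: int = 5,  # assumed cutoff; tune per task
) -> str:
    """Pick a reward shape from the two questions in the text."""
    if outcome_verifiable and horizon <= short_horizon_max:
        return "orm"  # the disciplined default
    if can_afford_step_labels:
        return "prm"  # horizon defeats ORM and labels are affordable
    return "orm_with_cheaper_dense_signal"  # fall back to cheaper densification

short_task = choose_reward(outcome_verifiable=True, horizon=3, can_afford_step_labels=False)
long_task = choose_reward(outcome_verifiable=True, horizon=40, can_afford_step_labels=True)
long_task_no_budget = choose_reward(outcome_verifiable=True, horizon=40, can_afford_step_labels=False)
```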

STEP 5

The pragmatic middle: cheap dense signal without a full PRM

You rarely need a fully human-labeled PRM to escape sparse-reward starvation. Cheaper densifications often capture most of the benefit:

Verifiable subgoals — milestones with programmatic checks (compiles, intermediate test passes), giving step-ish signal for free.
Outcome-supervised process labels — derive step credit from many sampled completions per step (Monte-Carlo rollouts), no per-step human grading.
PRM at inference only — use a PRM to rerank/search at decode time rather than as a training reward; captures much of the gain without the policy learning to game it.
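
As one illustration of the inference-only option, here is a best-of-n reranking sketch. Scoring a trajectory by its weakest step (the minimum step score) is an assumption made for this example, and other aggregations such as a product are also used. `toy_prm` is a hypothetical keyword-based scorer standing in for a trained model.

```python
from typing import Callable, List, Sequence

def rerank_with_prm(
    candidates: Sequence[List[str]],
    prm: Callable[[str], float],
) -> List[str]:
    """Return the candidate whose weakest step scores highest under the PRM."""
    def trajectory_score(steps: List[str]) -> float:
        return min(prm(s) for s in steps)  # assumed aggregation: weakest link
    return max(candidates, key=trajectory_score)

def toy_prm(step: str) -> float:
    """Toy stand-in for a trained PRM: penalize steps that admit guessing."""
    return 0.1 if "guess" in step else 0.9

candidates = [
    ["parse input", "guess the formula", "compute"],
    ["parse input", "derive the formula", "compute"],
]
best = rerank_with_prm(candidates, toy_prm)
```

Because the PRM only selects among samples and never enters the training loss, the policy gets no gradient that rewards fooling it.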

Often the right answer is ORM for the training reward plus a few verifiable subgoal checks — not a full hand-labeled PRM.
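
A minimal sketch of that combination, assuming a coding task whose run is summarized as a dictionary of boolean check results. The field names, the check functions, and the 0.1 bonus weight are all hypothetical.

```python
from typing import Callable, Dict

Run = Dict[str, bool]

def shaped_reward(
    run: Run,
    outcome_check: Callable[[Run], bool],
    subgoal_checks: Dict[str, Callable[[Run], bool]],
    subgoal_weight: float = 0.1,  # assumed: small, so the outcome still dominates
) -> float:
    """Terminal outcome reward plus a small bonus per passed programmatic milestone."""
    outcome = 1.0 if outcome_check(run) else 0.0
    bonus = sum(subgoal_weight for check in subgoal_checks.values() if check(run))
    return outcome + bonus

# Hypothetical run: compiled and passed unit tests, but failed the full suite.
run = {"compiles": True, "unit_tests_pass": True, "all_tests_pass": False}
checks = {
    "compiles": lambda r: r["compiles"],
    "unit_tests": lambda r: r["unit_tests_pass"],
}
reward = shaped_reward(run, outcome_check=lambda r: r["all_tests_pass"], subgoal_checks=checks)
```

Keeping the milestone bonuses small relative to the outcome reward is a design choice meant to stop the policy from collecting milestones while ignoring the final result.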

STEP 6

When NOT to build a PRM

Don't build a PRM if your horizon is short (ORM's signal is fine), if your outcome verifier is already cheap and trustworthy (don't add a gameable model on top), or if you cannot afford labels good enough that the PRM is more accurate than the policy it supervises. A bad PRM is worse than honest sparse reward, because the policy learns to satisfy a wrong step-judge. Use the sparsest reward that still assigns credit; pay for process supervision only when the horizon makes outcome reward go silent and you can afford a step-judge you trust.