RLHF & RLAIF

Deep Dive · Training Agentic Models

RLHF and RLAIF: what each stage actually fixes.

RLHF is not one technique; it is a pipeline: SFT to set a competent base, a reward model trained from human preferences, and a policy-optimization step that moves the policy toward preferred outputs while staying near the base. PPO and GRPO do this by optimizing against the reward model; DPO does it directly on the preference pairs. RLAIF swaps the human labeler for a model judge, often guided by a written constitution. This essay walks the pipeline stage by stage and is precise about which problem each stage solves, and which it does not.

STEP 1

Why preferences, not demonstrations

SFT can teach a model to follow instructions, but it cannot teach which of two plausible answers is better — humans are far better at comparing outputs than at writing the ideal one. RLHF exploits exactly this asymmetry. You collect pairs (a, b) and a human label "a ≻ b"; the signal is relative, cheap to produce, and covers behaviors (helpfulness, harmlessness, tone) no demonstration corpus captures cleanly. The job of RLHF is to convert that comparative signal into a policy.
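
A minimal sketch of the data this stage produces, assuming a simple record layout. The names `PreferencePair` and `from_label` are hypothetical, chosen for illustration; real datasets usually carry extra fields such as annotator ID or confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One comparative judgment: for `prompt`, `chosen` was preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def from_label(prompt: str, a: str, b: str, a_preferred: bool) -> PreferencePair:
    """Normalize a raw (a, b, label) judgment into chosen/rejected order."""
    if a_preferred:
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


# The label says "b is better", so b becomes the chosen completion.
pair = from_label("Explain KL divergence.", "answer A", "answer B", a_preferred=False)
print(pair.chosen)  # answer B
```

Storing pairs in chosen/rejected order up front means every later stage can ignore how the comparison was originally presented to the labeler.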

STEP 2

The reward model: a learned, gameable proxy

A reward model r_phi is trained on the preference pairs, typically with a Bradley–Terry objective, to output a scalar where preferred completions score higher.

# Bradley-Terry preference loss
loss = -log(sigmoid(r_phi(x, y_win) - r_phi(x, y_lose)))

The reward model is the heart of RLHF and its weakest joint. It is a learned approximation of human judgment, trained on finite data, and the policy step will optimize against it relentlessly. Anywhere the reward model is wrong — out-of-distribution, length-biased, sycophancy-rewarding — the policy will find and exploit that error. Stage clarity: the reward model fixes "we have no scalar to optimize"; it does not fix "the scalar is a faithful proxy for what we want."
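
The Bradley-Terry loss from this section can be written as a small runnable sketch. It works on plain floats standing in for the two reward-model outputs; a real trainer would compute the same expression on batched tensors. The helper names are illustrative, not taken from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(r_win: float, r_lose: float) -> float:
    """-log sigmoid(r_win - r_lose): small when the preferred completion scores higher."""
    return -log_sigmoid(r_win - r_lose)


def preference_probability(r_win: float, r_lose: float) -> float:
    """Probability the model assigns to 'win is preferred over lose'."""
    return math.exp(log_sigmoid(r_win - r_lose))


# Equal scores: the model is indifferent, so the loss is log 2 (about 0.693).
print(bradley_terry_loss(0.0, 0.0))
# Correct ordering with a wide margin gives a loss close to zero.
print(bradley_terry_loss(4.0, 0.0))
# Wrong ordering is penalized roughly linearly in the margin.
print(bradley_terry_loss(0.0, 4.0))
```

Only the difference between the two scores enters the loss, so the reward scale is arbitrary up to an additive constant. That is one reason raw reward values should not be read as absolute quality.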

STEP 3

The policy step, and the KL leash

With a reward model in hand, the policy is optimized to increase reward while a KL penalty keeps it from drifting far from the SFT reference. Without that leash the policy collapses onto reward-model artifacts and stops being a coherent language model.

# RLHF policy objective (PPO-style)
objective = E[ r_phi(x, y) ] - beta * KL(pi_theta || pi_ref)
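
One common way to realize this objective, sketched under assumptions: fold the KL term into the reward for each sampled completion, using the summed per-token log-probability gap as a single-sample estimate of the KL. The function name, the sequence-level aggregation, and `beta=0.1` are illustrative choices, not values from this essay.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus beta times a one-sample KL estimate.

    logp_policy[t] and logp_ref[t] are the log-probabilities that the policy
    and the frozen reference assign to token t of a completion sampled from
    the policy.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("per-token log-prob sequences must have equal length")
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


# The policy is more confident than the reference on both tokens, so the
# estimate is positive and the shaped reward drops below the raw score.
print(kl_shaped_reward(1.0, [-0.5, -0.2], [-1.0, -0.7], beta=0.1))  # about 0.9
```

Raising `beta` tightens the leash: the same drift from the reference costs more reward.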

PPO uses a learned value function and clipped updates; GRPO drops the value model and estimates advantage from a group of sampled completions per prompt — cheaper and now common for agentic RL. DPO is the sharpest shortcut: it derives a closed-form loss directly on preference pairs, eliminating the explicit reward model and the sampling loop entirely.

# DPO: implicit reward, no separate RM, no rollout loop
# logratio(y) = log pi_theta(y | x) - log pi_ref(y | x)
loss = -log(sigmoid(
  beta * logratio(y_win) - beta * logratio(y_lose)))
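
A runnable sketch of the same loss for a single preference pair. Here each log-ratio is the policy's sequence log-probability minus the reference model's. The inputs are assumed to be log-probabilities already summed over tokens, and `beta=0.1` is an illustrative default, not a recommendation.

```python
import math


def dpo_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, y_win, y_lose) triple."""
    logratio_win = logp_win - ref_logp_win
    logratio_lose = logp_lose - ref_logp_lose
    margin = beta * (logratio_win - logratio_lose)
    # -log(sigmoid(margin)), computed without overflow for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log 2 (about 0.693).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the winner relative to the reference.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss depends only on how the policy has moved relative to the reference, which is where the implicit KL anchoring comes from.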

DPO is simpler and usually more stable to train, but it folds the reward into the loss: you lose the explicit, inspectable reward model, and because standard DPO trains on a fixed set of pairs, you also lose online exploration. PPO/GRPO keep a separate reward you can audit and red-team. That trade is the real choice, not "DPO is newer."
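
On the PPO/GRPO side of that trade, the group-relative advantage GRPO uses in place of a value model can be sketched in a few lines. This version normalizes with the population standard deviation and a small `eps`; implementations differ on both details, so treat them as assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each completion relative to its own sampling group.

    `rewards` holds the scores of several completions sampled for the same
    prompt. The group mean acts as the baseline a value model would
    otherwise provide.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt: two scored 1.0, two scored 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every completion in a group gets the same reward, every advantage is zero and that prompt contributes no gradient, so group size and reward variance matter in practice.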

STEP 4

RLAIF: replace the human labeler, not the pipeline

Human preference labels are slow, expensive, inconsistent, and a privacy surface. RLAIF replaces the labeler with a model that judges which completion is better, usually steered by an explicit written constitution: a set of principles the judge applies ("prefer the response that is more honest about uncertainty"). Constitutional AI is the canonical instance, and it has two phases. In the supervised phase, the model critiques and revises its own outputs against the constitution, and the revisions become fine-tuning data. In the RL phase, a model judge picks the better of two responses under a constitutional principle; those AI-generated labels train a preference model, and the standard policy step runs on top.

What RLAIF actually fixes: the throughput and consistency of the preference signal, and the legibility of the value judgments, which become explicit and editable as text rather than implicit in a crowd of annotators. What it does not fix: a model judge tends to share blind spots with the policy (especially when both descend from the same base model), can be gamed the same way a reward model can, and inherits the constitution author's omissions.
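
A sketch of the labeling step, under stated assumptions. The `Judge` signature is hypothetical and stands in for whatever model call you use. The order-swap check is one cheap consistency filter added here for illustration, not something this essay's pipeline prescribes: it discards verdicts that flip when the two responses trade places.

```python
from typing import Callable, Optional, Tuple

# Hypothetical judge: (principle, prompt, first, second) -> "first" or "second"
Judge = Callable[[str, str, str, str], str]


def label_pair(
    judge: Judge, principle: str, prompt: str, a: str, b: str
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None if the verdict depends on presentation order."""
    a_wins_forward = judge(principle, prompt, a, b) == "first"
    a_wins_swapped = judge(principle, prompt, b, a) == "second"
    if a_wins_forward != a_wins_swapped:
        return None  # inconsistent under swap: do not train on this label
    return (a, b) if a_wins_forward else (b, a)


# Toy stand-in judges, for demonstration only.
def longer_is_better(principle: str, prompt: str, first: str, second: str) -> str:
    return "first" if len(first) > len(second) else "second"


def always_first(principle: str, prompt: str, first: str, second: str) -> str:
    return "first"


principle = "Prefer the response that is more honest about uncertainty."
print(label_pair(longer_is_better, principle, "q", "a longer reply", "short"))
print(label_pair(always_first, principle, "q", "a longer reply", "short"))  # None
```

The toy `longer_is_better` judge is perfectly consistent and still wrong in an exploitable way, which is the point of the paragraph above: consistency checks catch noise, not shared bias.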

STEP 5

What each stage fixes — the honest ledger

  • SFT — fixes "the model can't do the task format at all." Sets the prior. Does not fix preference among good answers.
  • Reward model — fixes "no scalar to optimize." Introduces a new risk: a gameable proxy.
  • Policy step + KL — fixes "turn the scalar into behavior without destroying the model." Does not make a bad reward good.
  • RLAIF / constitution — fixes "human labeling doesn't scale and its values are implicit." Does not fix shared blind spots between judge and policy.

The recurring failure: treating reward-model score as ground truth. It is a model. Past a point, higher reward-model score means the policy is exploiting the reward model, not getting better. Hold out human (or independent) evals that the policy never trains against.
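
A minimal sketch of that check: track reward-model score and a held-out eval per checkpoint, and flag the first checkpoint where the proxy keeps rising while the held-out score falls. The function name is hypothetical and the numbers are made up for illustration; a real check would also account for eval noise.

```python
from typing import Optional, Sequence


def first_divergence(
    rm_scores: Sequence[float], heldout_scores: Sequence[float]
) -> Optional[int]:
    """Index of the first checkpoint where RM score rose but held-out score fell."""
    n = min(len(rm_scores), len(heldout_scores))
    for i in range(1, n):
        rm_up = rm_scores[i] > rm_scores[i - 1]
        heldout_down = heldout_scores[i] < heldout_scores[i - 1]
        if rm_up and heldout_down:
            return i
    return None


# Illustrative values only: the proxy climbs throughout, while the held-out
# eval peaks at checkpoint 2 and then drops.
rm = [0.10, 0.40, 0.70, 0.90]
heldout = [0.50, 0.58, 0.61, 0.55]
print(first_divergence(rm, heldout))  # 3
```

A flagged checkpoint is a prompt to inspect samples by hand, not an automatic stopping rule.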

STEP 6

When NOT to run the full pipeline

If you have not exhausted SFT and prompting, RLHF is premature: it is the most operationally heavy way to move a metric. If your preference data is thin or noisy, DPO on a small, cleaned subset is usually a safer bet than a fragile PPO loop on a bad reward model. If your judgments are stable and writable, RLAIF with a tight constitution can replace most human labeling at a fraction of the cost. RLHF converts comparisons into policy; every stage is a proxy, and the pipeline is only as honest as the weakest proxy in it.