RLHF and RLAIF: what each stage actually fixes.
RLHF is not one technique; it is a pipeline: SFT to set a competent base, a reward model trained from human preferences, and a policy-optimization step that pushes the policy toward preferred outputs while staying near the base (PPO or GRPO against the reward model, or DPO directly on the preference pairs). RLAIF swaps the human labeler for a model judge, often guided by a written constitution. This essay walks the pipeline stage by stage and is precise about which problem each stage solves, and which it does not.
Why preferences, not demonstrations
SFT can teach a model to follow instructions, but it cannot teach which of two plausible answers is better — humans are far better at comparing outputs than at writing the ideal one. RLHF exploits exactly this asymmetry. You collect pairs (a, b) and a human label "a ≻ b"; the signal is relative, cheap to produce, and covers behaviors (helpfulness, harmlessness, tone) no demonstration corpus captures cleanly. The job of RLHF is to convert that comparative signal into a policy.
The reward model: a learned, gameable proxy
A reward model r_phi is trained on the preference pairs, typically with a Bradley–Terry objective, to output a scalar where preferred completions score higher.
    # Bradley-Terry preference loss
    loss = -log(sigmoid(r_phi(x, y_win) - r_phi(x, y_lose)))
The reward model is the heart of RLHF and its weakest joint. It is a learned approximation of human judgment, trained on finite data, and the policy step will optimize against it relentlessly. Anywhere the reward model is wrong — out-of-distribution, length-biased, sycophancy-rewarding — the policy will find and exploit that error. Stage clarity: the reward model fixes "we have no scalar to optimize"; it does not fix "the scalar is a faithful proxy for what we want."
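As a concrete reading of the Bradley-Terry loss above, here is a minimal sketch in plain Python. It assumes the reward model has already produced scalar scores for each completion; the function names `bradley_terry_loss` and `batch_loss` are illustrative, not taken from any library.

```python
import math

def bradley_terry_loss(reward_win: float, reward_lose: float) -> float:
    """Negative log-likelihood that the preferred completion wins."""
    margin = reward_win - reward_lose
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def batch_loss(pairs):
    """Mean loss over a non-empty list of (reward_win, reward_lose) pairs."""
    return sum(bradley_terry_loss(w, l) for w, l in pairs) / len(pairs)
```

Note that only the score difference enters the loss, so the reward model's absolute scale is unconstrained. That is one reason raw reward values should not be read as calibrated quality.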
The policy step, and the KL leash
With a reward model in hand, the policy is optimized to increase reward while a KL penalty keeps it from drifting far from the SFT reference. Without that leash the policy collapses onto reward-model artifacts and stops being a coherent language model.
    # RLHF policy objective (PPO-style)
    objective = E[ r_phi(x, y) ] - beta * KL(pi_theta || pi_ref)
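To make the objective concrete, the sketch below estimates it from a batch of completions sampled from the current policy. It assumes sequence-level log-probabilities under both the policy and the reference are already available; the function name and the default `beta` are illustrative. Real PPO implementations usually apply the KL term per token and add clipping, which this omits.

```python
def kl_penalized_objective(rewards, logp_policy, logp_ref, beta=0.1):
    """Monte Carlo estimate of E[r] - beta * KL(pi_theta || pi_ref).

    Each list holds one entry per completion sampled from the policy;
    logp_* are sequence log-probabilities of that completion.
    """
    n = len(rewards)
    if n == 0 or not (n == len(logp_policy) == len(logp_ref)):
        raise ValueError("inputs must be non-empty and the same length")
    mean_reward = sum(rewards) / n
    # For samples drawn from pi_theta, the mean log-ratio estimates the KL.
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref)) / n
    return mean_reward - beta * kl_estimate
```

Raising `beta` tightens the leash: the same reward gain is worth less once the policy's log-probabilities move away from the reference.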
PPO uses a learned value function and clipped updates; GRPO drops the value model and estimates advantage from a group of sampled completions per prompt — cheaper and now common for agentic RL. DPO is the sharpest shortcut: it derives a closed-form loss directly on preference pairs, eliminating the explicit reward model and the sampling loop entirely.
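The group-based advantage estimate that GRPO substitutes for a value model can be sketched as follows. This is a simplified illustration, assuming one scalar reward per sampled completion and population standard deviation for normalization; implementations differ on those details, and the function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled completions.

    Each completion is scored relative to the group mean and scaled by
    the group's standard deviation, so no learned value function is needed.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two completions per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.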
    # DPO: implicit reward, no separate RM, no rollout loop
    # logratio(y) = log pi_theta(y | x) - log pi_ref(y | x)
    loss = -log(sigmoid(beta * logratio(y_win) - beta * logratio(y_lose)))
DPO is simpler and usually more stable to train, but it folds the reward into the loss: you lose the explicit, inspectable reward model, and because standard DPO trains offline on a fixed preference set, you lose online exploration as well. PPO/GRPO keep a separate reward you can audit and red-team. That trade is the real choice, not "DPO is newer."
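A minimal sketch of the DPO loss for a single preference pair, assuming the four sequence log-probabilities (policy and reference, winner and loser) have been computed elsewhere. The function names and the default `beta` are illustrative.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """The reward DPO implicitly assigns: beta times the policy/reference log-ratio."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """-log(sigmoid(implicit_reward(win) - implicit_reward(lose)))."""
    margin = (implicit_reward(logp_win, ref_logp_win, beta)
              - implicit_reward(logp_lose, ref_logp_lose, beta))
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

This has the same shape as the Bradley-Terry loss, with the reward-model score replaced by a log-ratio against the reference. The implicit reward can still be computed for inspection, but it is tied to the policy's own weights rather than living in a separate model.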
RLAIF: replace the human labeler, not the pipeline
Human preference labels are slow, expensive, inconsistent, and a privacy surface. RLAIF replaces the labeler with a model that judges which completion is better, usually steered by an explicit written constitution: a set of principles the judge applies ("prefer the response that is more honest about uncertainty"). Constitutional AI is the canonical instance, and it has two phases. In the supervised phase, the model critiques and revises its own outputs against the constitution, and the revisions become fine-tuning data. In the RL phase, a model judge compares pairs of responses against the constitution to generate the preference data, a preference model is trained on those labels, and standard RLHF runs on top.
What RLAIF actually fixes: it raises the throughput and consistency of the preference signal, and it makes the value judgments explicit and editable as text rather than implicit in a crowd of annotators. What it does not fix: a model judge tends to share blind spots with the policy (especially when both descend from the same base model), can be gamed the same way, and inherits the constitution author's omissions.
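The labeling step can be sketched as below. Everything here is hypothetical scaffolding: the judge signature, the majority vote across principles, and `toy_judge`, which stands in for a real model call so the example runs on its own. Actual pipelines differ, for instance by sampling one principle per comparison.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge signature: (principle, prompt, response_a, response_b) -> "a" or "b".
Judge = Callable[[str, str, str, str], str]

def label_pair(judge: Judge, constitution: Sequence[str], prompt: str,
               response_a: str, response_b: str) -> Optional[Tuple[str, str]]:
    """Return (winner, loser), or None when the principles split evenly."""
    # Any judge output other than "a" is counted as a vote for b.
    votes_a = sum(
        1 for principle in constitution
        if judge(principle, prompt, response_a, response_b) == "a"
    )
    votes_b = len(constitution) - votes_a
    if votes_a == votes_b:
        return None  # no usable preference signal for this pair
    if votes_a > votes_b:
        return (response_a, response_b)
    return (response_b, response_a)

def toy_judge(principle: str, prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for a model call: picks a only if a alone admits uncertainty."""
    a_hedges = "not sure" in response_a.lower()
    b_hedges = "not sure" in response_b.lower()
    return "a" if a_hedges and not b_hedges else "b"
```

The constitution is just a list of strings, which is the point: changing a value judgment means editing text and relabeling, not retraining annotators.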
What each stage fixes — the honest ledger
- SFT — fixes "the model can't do the task format at all." Sets the prior. Does not fix preference among good answers.
- Reward model — fixes "no scalar to optimize." Introduces a new risk: a gameable proxy.
- Policy step + KL — fixes "turn the scalar into behavior without destroying the model." Does not make a bad reward good.
- RLAIF / constitution — fixes "human labeling doesn't scale and its values are implicit." Does not fix shared blind spots between judge and policy.
The recurring failure: treating reward-model score as ground truth. It is a model. Past a point, higher reward-model score means the policy is exploiting the reward model, not getting better. Hold out human (or independent) evals that the policy never trains against.
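One way to operationalize that check is to track both scores across training checkpoints and flag the first point where they diverge. The sketch below is an illustrative heuristic, not a standard metric; the function name and the `tolerance` parameter are assumptions.

```python
def first_overoptimized_checkpoint(rm_scores, heldout_scores, tolerance=0.0):
    """Index of the first checkpoint where reward-model score rises
    while held-out quality falls below its best value so far.

    Returns None if no checkpoint shows that divergence.
    """
    if len(rm_scores) != len(heldout_scores):
        raise ValueError("score lists must be the same length")
    if not rm_scores:
        return None
    best_heldout = heldout_scores[0]
    for i in range(1, len(rm_scores)):
        rm_improved = rm_scores[i] > rm_scores[i - 1]
        heldout_regressed = heldout_scores[i] < best_heldout - tolerance
        if rm_improved and heldout_regressed:
            return i
        best_heldout = max(best_heldout, heldout_scores[i])
    return None
```

The held-out scores must come from an evaluation the policy never trains against, for example human ratings or an independent judge. Otherwise the check only compares one proxy with another that shares its flaws.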
When NOT to run the full pipeline
If you have not exhausted SFT and prompting, RLHF is premature: it is the most operationally heavy way to move a metric. If your preference data is thin or noisy, DPO on a small clean set often beats a fragile PPO loop on a bad reward model. If your judgments are stable and writable, RLAIF with a tight constitution can replace most human labeling at a fraction of the cost. RLHF converts comparisons into policy; every stage is a proxy, and the pipeline is only as honest as the weakest proxy in it.