SFT, Rejection Sampling & Distillation

Deep Dive · Training Agentic Models

SFT, rejection sampling, and distillation: bootstrap data before you reach for RL.

Between "prompt it" and "run RL" sits a family of supervised techniques that quietly solve most agentic training problems: SFT on demonstrations, rejection sampling / best-of-N to manufacture a better SFT set than you could collect by hand, expert iteration to compound that improvement, and distillation to fold a strong-but-expensive agent into a cheap one. These methods give you much of RL's outcome with a fraction of its instability. This essay covers when and how to use each.

STEP 1

SFT is imitation, and imitation has a hard ceiling

SFT maximizes the likelihood of demonstrated trajectories. Its ceiling is exactly the quality of the demonstrations: it will faithfully reproduce the demonstrator's reasoning, its shortcuts, and its mistakes, and it will not invent a strategy absent from the data. The entire game of advanced supervised training is therefore one question: where do better demonstrations come from when human-collected ones are scarce, inconsistent, or capped at human skill?
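Written out, the objective makes the ceiling visible. The notation here is an assumption for illustration and does not come from the text above: $\mathcal{D}$ is the demonstration set, $\tau$ a trajectory of $T_\tau$ steps, $s_t$ the context at step $t$ (prompt, earlier actions, tool outputs), $a_t$ the demonstrated action, and $\pi_\theta$ the model.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{\tau \sim \mathcal{D}}
    \left[ \sum_{t=1}^{T_\tau} \log \pi_\theta\!\left(a_t \mid s_t\right) \right]
```

Nothing in this loss refers to task success. It rewards agreement with $\mathcal{D}$ and nothing else, so a mistake in the data is a target like any other.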

STEP 2

Rejection sampling: let the model write its own training set

The key move: a model is often capable of occasionally producing an excellent trajectory even if it usually does not. Sample N trajectories per task, filter to the ones that a verifier or reward model says are good, and SFT on the survivors. You have manufactured a training set above the model's average behavior using its own best behavior.

# Rejection sampling / best-of-N -> new SFT set
cands = [sample(model, task) for _ in range(N)]
good  = [c for c in cands if verify(c)]   # keep only passing
model = sft(model, good)                    # imitate your own best; keep the updated model

Rejection sampling is the highest-leverage, lowest-risk technique in this whole essay. It needs only a verifier and sampling (no RL loop, no value function, no KL tuning) and often captures much of the gain that teams turn to PPO for. Try it before any RL.
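To make the pseudocode above concrete, here is a minimal runnable sketch on a toy task. Everything in it is a stand-in: the "model" is only a success probability, the task is integer addition, and `max_per_task` is an added assumption (a cap so that easy tasks do not dominate the SFT set), not something the method requires.

```python
import random

def sample(model, task, rng):
    """Hypothetical rollout: correct with probability model['p_correct']."""
    a, b = task
    if rng.random() < model["p_correct"]:
        answer = a + b
    else:
        answer = a + b + rng.randint(1, 9)  # always off by at least 1
    return {"task": task, "answer": answer}

def verify(traj):
    """Programmatic verifier: exact match on the sum."""
    a, b = traj["task"]
    return traj["answer"] == a + b

def rejection_sample(model, tasks, n, rng, max_per_task=2):
    """Sample n candidates per task and keep at most max_per_task passing ones."""
    kept = []
    for task in tasks:
        cands = [sample(model, task, rng) for _ in range(n)]
        passing = [c for c in cands if verify(c)]
        kept.extend(passing[:max_per_task])
    return kept

rng = random.Random(0)
model = {"p_correct": 0.2}               # succeeds only occasionally
tasks = [(i, i + 1) for i in range(50)]
sft_set = rejection_sample(model, tasks, n=16, rng=rng)
```

Every trajectory in `sft_set` passes the verifier even though the sampler is wrong about 80% of the time. If a single sample passes with probability p, at least one of N passes with probability 1 - (1 - p)^N, which is why a modest N already covers most tasks the model can sometimes solve.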

STEP 3

Expert iteration: compound the bootstrap

Do it again. SFT on the filtered best raises the model's average; now its best-of-N is better still; filter and SFT again. Each round, the policy distills its own search into its weights, and the next round's sampling starts from a stronger prior. This is expert iteration, of which STaR (the Self-Taught Reasoner) is a well-known instance: a poor man's RL that can close much of the gap to PPO on verifiable tasks.

# Expert iteration: search, distill, repeat
for _ in range(K):
    trajs = sample_many(model, tasks)
    good  = [t for t in trajs if verify(t)]   # keep only passing
    model = sft(model, good)        # next round samples from a better prior
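A toy version of the loop above shows the compounding. This is a sketch under heavy assumptions: the "policy" is a categorical distribution over three hypothetical strategies with fixed success rates, and `sft` is simply the maximum-likelihood refit of that distribution to the surviving trajectories. None of these names or numbers come from a real system.

```python
import random
from collections import Counter

# Hypothetical strategies and their fixed success probabilities.
SUCCESS_PROB = {"guess": 0.1, "partial_check": 0.3, "full_check": 0.8}

def sample_many(model, n, rng):
    """Draw n trajectories: pick a strategy by policy weight, then succeed or fail."""
    names = list(model)
    weights = [model[s] for s in names]
    picks = rng.choices(names, weights=weights, k=n)
    return [{"strategy": s, "passed": rng.random() < SUCCESS_PROB[s]} for s in picks]

def verify(traj):
    return traj["passed"]

def sft(model, good):
    """MLE refit of the categorical policy to the kept trajectories."""
    if not good:
        return model  # nothing passed; leave the policy unchanged
    counts = Counter(t["strategy"] for t in good)
    return {s: counts[s] / len(good) for s in model}

def expected_pass_rate(model):
    return sum(w * SUCCESS_PROB[s] for s, w in model.items())

rng = random.Random(0)
model = {s: 1 / 3 for s in SUCCESS_PROB}      # uniform prior over strategies
history = [expected_pass_rate(model)]
for _ in range(4):
    good = [t for t in sample_many(model, 2000, rng) if verify(t)]
    model = sft(model, good)
    history.append(expected_pass_rate(model))
```

In this toy, `history` starts at 0.4 and climbs toward 0.8 as probability mass shifts to the best strategy. It cannot pass 0.8: the loop reweights strategies the policy already samples and never invents a new one.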

The relationship to RL is precise: rejection-sampling + SFT is RL with a degenerate, hard-thresholded advantage and no exploration bonus. It is more stable, far easier to debug, and the correct first thing to try when a verifier exists.
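One way to see this, sketched under the assumption of a binary reward $R(\tau) = \mathbb{1}[\mathrm{verify}(\tau)]$ and samples drawn from the current policy $\pi_\theta$ (the notation is illustrative, not from the text above), is to start from the REINFORCE gradient without a baseline:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
  = \Pr_{\tau \sim \pi_\theta}\!\left[\mathrm{verify}(\tau)\right] \cdot
    \mathbb{E}_{\tau \sim \pi_\theta \,\mid\, \mathrm{verify}(\tau)}\!\left[ \nabla_\theta \log \pi_\theta(\tau) \right]
```

The second factor is the SFT ascent direction on the filtered samples, so the first SFT step on a freshly filtered batch points the same way as a policy-gradient step, rescaled by the pass rate. The two diverge after that: SFT keeps taking steps on a fixed batch that is no longer on-policy, and failed samples contribute nothing rather than a negative advantage.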

STEP 4

Distillation: fold a strong agent into a cheap one

Distillation trains a smaller/cheaper student on a stronger teacher's trajectories. Crucially, what transfers is the teacher's behavior on your task distribution — its tool-use patterns, reasoning structure, and recovery moves — not generic capability. The teacher can be a frontier model, an expensive ensemble, or your own slow best-of-N pipeline collapsed into a single fast forward pass.

Distillation copies the teacher's failure modes and biases as faithfully as its strengths, and the student inherits a capability ceiling at the teacher's level. It also raises licensing/ToS questions when the teacher is a third-party API — settle that before you build a pipeline on it.
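A minimal sketch of trajectory-level distillation data collection follows. It assumes a verifier is available and uses it to drop bad teacher trajectories, which is one way to limit how many teacher failures the student copies; that filtering step is an added assumption, not part of distillation as such. `toy_teacher`, the task format, and the record schema are all hypothetical.

```python
import itertools

_calls = itertools.count()

def toy_teacher(task):
    """Hypothetical teacher that is wrong on every third call."""
    k = next(_calls)
    result = task["x"] * 2 if k % 3 else task["x"] * 2 + 1
    return [{"tool": "double", "args": task["x"]}, {"final": result}]

def verify(task, traj):
    """Check the final answer of a trajectory against the known target."""
    return traj[-1]["final"] == task["x"] * 2

def build_distillation_set(teacher, tasks, verify, samples_per_task=4):
    """Keep the first verified teacher trajectory per task as an SFT record."""
    records = []
    for task in tasks:
        for _ in range(samples_per_task):
            traj = teacher(task)
            if verify(task, traj):
                records.append({"prompt": task, "completion": traj})
                break  # one verified trajectory per task is enough here
    return records

tasks = [{"x": i} for i in range(5)]
records = build_distillation_set(toy_teacher, tasks, verify)
```

The student is then trained with ordinary SFT on `records`. The tool calls are kept in the completion because the behavior on the task is what is being transferred, not just the final answer.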

STEP 5

How to choose among them

SFT — you already have good trajectories. Fastest path, ceiling = data quality.
Rejection sampling — you have a verifier and a model that sometimes succeeds. Manufactures a better SFT set; the default bootstrap.
Expert iteration — rejection sampling worked once and is still improving round-over-round. Compound it until it plateaus.
Distillation — a strong teacher exists and you need it cheaper or smaller. Behavior transfer, not magic.
RL (T3) — only after the above plateau and the residual gap justifies the instability and infra.
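The list above can be read as a set of preconditions. The helper below is one possible encoding of that reading, offered as a sketch: the field names are invented for illustration, and real decisions will involve more than eight booleans.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    # All fields are hypothetical flags mirroring the conditions in the list.
    has_good_trajectories: bool = False
    has_verifier: bool = False
    model_sometimes_succeeds: bool = False
    last_round_improved: bool = False
    strong_teacher_exists: bool = False
    need_cheaper_model: bool = False
    supervised_plateaued: bool = False
    gap_justifies_rl: bool = False

def recommend(s):
    """Return every method whose precondition from the list is met."""
    options = []
    if s.has_good_trajectories:
        options.append("SFT")
    if s.has_verifier and s.model_sometimes_succeeds:
        # Expert iteration is rejection sampling that is still paying off.
        options.append("expert iteration" if s.last_round_improved else "rejection sampling")
    if s.strong_teacher_exists and s.need_cheaper_model:
        options.append("distillation")
    if s.supervised_plateaued and s.gap_justifies_rl:
        options.append("RL")
    return options

first_try = recommend(Situation(has_verifier=True, model_sometimes_succeeds=True))
```

Here `first_try` is `["rejection sampling"]`, matching its role as the default bootstrap; RL appears only when both the plateau and the cost justification hold.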

STEP 6

When NOT to use these

Rejection sampling and expert iteration are useless without a verifier or a trustworthy reward model: without filtering you are just SFT-ing on average behavior. Distillation is the wrong tool if no model already exhibits the behavior, because there is nothing to copy. And none of these can exceed the best trajectory they were trained on; if you need behavior better than anything currently producible, that is the one case that genuinely requires RL. Bootstrap with supervision until it plateaus; reach for RL only for the gap that supervision has demonstrably failed to close.
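"Until it plateaus" needs an operational definition. One simple option, sketched here with assumed thresholds, is to track held-out pass rate per round and stop when several consecutive rounds each gain less than a minimum amount. The `min_gain` and `patience` defaults are placeholders to tune, not recommendations.

```python
def has_plateaued(pass_rates, min_gain=0.01, patience=2):
    """True if each of the last `patience` rounds improved by less than min_gain.

    pass_rates: held-out pass rate after each round, oldest first.
    """
    if len(pass_rates) < patience + 1:
        return False  # not enough rounds to judge
    recent = pass_rates[-(patience + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(g < min_gain for g in gains)

still_climbing = has_plateaued([0.40, 0.62, 0.71])
flat = has_plateaued([0.40, 0.62, 0.71, 0.715, 0.716])
```

With these inputs `still_climbing` is `False` and `flat` is `True`. Measuring on held-out tasks matters: pass rate on the tasks used for sampling can keep rising after generalization has stalled.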