Prompt, Fine-Tune, or RL?

Deep Dive · Training Agentic Models

Prompt, fine-tune, or RL: pick the cheapest tool that closes the gap.

Three interventions can change what an agent does: change the prompt (no weight update), supervised fine-tuning on demonstrations (imitate a target distribution), or reinforcement learning against a reward (optimize an objective you can score but not demonstrate). They form a ladder of escalating cost, data, and skill. Most teams reach for training when a prompt would have worked. This essay is the decision tree, with the failure modes that put you on each rung.

STEP 1

Each lever changes a different thing

Prompting changes the conditioning, not the model. You are steering a fixed policy with context: instructions, exemplars, tools, retrieved facts. SFT changes the policy to match a distribution of demonstrations — it makes the model more likely to produce outputs that look like your examples. RL changes the policy to maximize a scalar reward over its own sampled trajectories — it makes the model more likely to produce outputs that score well, even ones no human demonstrated.

The crisp distinction: SFT teaches the model to imitate ("do what these examples did"); RL teaches it to optimize ("do whatever gets a high score"). Prompting teaches it nothing — it just asks. Knowing which of imitate, optimize, or ask your problem needs is most of the decision.

STEP 2

Don't train if a prompt works

A prompt change ships in minutes, is auditable in a diff, costs no GPUs, and is reversible. A fine-tune costs a data pipeline, a training run, an eval harness, a serving artifact, and a permanent maintenance liability that drifts every time the base model updates. The expected-value math overwhelmingly favors exhausting prompting first: better instructions, few-shot exemplars, decomposition, tool affordances, retrieval, and a stronger base model.

Heuristic: if a capable engineer can lift the metric by editing the prompt and tool spec for a day, you do not have a training problem yet. Training is justified when prompting has plateaued and the residual gap is worth a standing pipeline.

The two situations where prompting genuinely cannot win: the behavior needs tacit knowledge or a style too large to fit or specify in context (an SFT signal), or the behavior needs the model to discover a strategy better than anything you can write down, judged by an outcome you can score (an RL signal).

STEP 3

SFT: when you can demonstrate it but not say it

Reach for SFT when you have — or can cheaply produce — a corpus of good trajectories and the goal is to make those the model's default. SFT excels at format adherence, domain tone, tool-call syntax, and compressing a long brittle prompt into weights. It is imitation: the ceiling is the quality of your demonstrations. SFT will not invent a behavior absent from the data, and it will faithfully reproduce your demonstrators' mistakes and shortcuts.

# SFT objective: maximize likelihood of demonstrated tokens
loss = -log_prob(model, demo.completion | demo.prompt)
# ceiling = quality of demos; cannot exceed the demonstrator

Teams routinely run SFT to "improve quality" with demonstrations that are merely average. You will get an average model, faster and cheaper — not a better one. Curate ruthlessly before you train.

STEP 4

RL: when you can score it but not demonstrate it

Reach for RL when you can cheaply evaluate an outcome but cannot enumerate the behavior that produces it: code that passes tests, a proof that checks, a multi-step task that reaches a verified end state, a response a reward model prefers. RL lets the model search its own action space and reinforces whatever trajectories score well — including strategies no human would have written.

# RL objective: maximize expected reward over sampled rollouts
traj  = sample(policy, task)
r     = reward(traj)            # scorable, not demonstrated
loss  = -advantage(r) * log_prob(policy, traj)

RL's power is also its hazard: it optimizes the reward you wrote, not the one you meant. It needs a usable signal at scale, a stable base policy to start from (almost always an SFT'd model), and far more infrastructure and ML judgment than SFT. It is the top of the ladder for a reason.

STEP 5

The ladder, and how the rungs compose

Prompt — minutes, zero training data, low skill, fully reversible. Always the first move.
SFT — days–weeks, hundreds–thousands of curated trajectories, moderate skill. The workhorse; the right answer for most "training" problems.
RL — weeks–months, a robust reward at scale, high skill and infra. The right answer only when imitation provably cannot reach the bar.

They are not exclusive — they stack. The standard recipe is prompt-first; if that plateaus, SFT to set a strong policy; only then RL on top of the SFT model to push past the imitation ceiling. Skipping the rungs (RL on a weak base, SFT on uncurated data) is the most common way teams burn a quarter.

STEP 6

When NOT to climb the ladder

Do not train if the metric is moving on prompt edits, if you cannot articulate the target as either demonstrations or a score, if the base model is about to be replaced, or if you lack an eval that would prove the trained model is better. A fine-tune you cannot evaluate is a liability you cannot retire. Prompting asks, SFT imitates, RL optimizes — and the cheapest lever that closes the gap is the correct one, not the most powerful.