Prompt, fine-tune, or RL: pick the cheapest tool that closes the gap.
Three interventions can change what an agent does: change the prompt (no weight update), supervised fine-tuning on demonstrations (imitate a target distribution), or reinforcement learning against a reward (optimize an objective you can score but not demonstrate). They form a ladder of escalating cost, data, and skill. Most teams reach for training when a prompt would have worked. This essay is the decision tree, with the failure modes that put you on each rung.
Each lever changes a different thing
Prompting changes the conditioning, not the model. You are steering a fixed policy with context: instructions, exemplars, tools, retrieved facts. SFT changes the policy to match a distribution of demonstrations — it makes the model more likely to produce outputs that look like your examples. RL changes the policy to maximize a scalar reward over its own sampled trajectories — it makes the model more likely to produce outputs that score well, even ones no human demonstrated.
The crisp distinction: SFT teaches the model to imitate ("do what these examples did"); RL teaches it to optimize ("do whatever gets a high score"). Prompting teaches it nothing — it just asks. Knowing which of imitate, optimize, or ask your problem needs is most of the decision.
Don't train if a prompt works
A prompt change ships in minutes, is auditable in a diff, costs no GPUs, and is reversible. A fine-tune costs a data pipeline, a training run, an eval harness, a serving artifact, and a permanent maintenance liability that drifts every time the base model updates. The expected-value math overwhelmingly favors exhausting prompting first: better instructions, few-shot exemplars, decomposition, tool affordances, retrieval, and a stronger base model.
Heuristic: if a capable engineer can lift the metric by editing the prompt and tool spec for a day, you do not have a training problem yet. Training is justified when prompting has plateaued and the residual gap is worth a standing pipeline.
The two situations where prompting genuinely cannot win: the behavior needs tacit knowledge or a style too large to fit or specify in context (an SFT signal), or the behavior needs the model to discover a strategy better than anything you can write down, judged by an outcome you can score (an RL signal).
SFT: when you can demonstrate it but not say it
Reach for SFT when you have — or can cheaply produce — a corpus of good trajectories and the goal is to make those the model's default. SFT excels at format adherence, domain tone, tool-call syntax, and compressing a long brittle prompt into weights. It is imitation: the ceiling is the quality of your demonstrations. SFT will not invent a behavior absent from the data, and it will faithfully reproduce your demonstrators' mistakes and shortcuts.
# SFT objective: maximize likelihood of demonstrated tokens loss = -log_prob(model, demo.completion | demo.prompt) # ceiling = quality of demos; cannot exceed the demonstrator
Teams routinely run SFT to "improve quality" with demonstrations that are merely average. You will get an average model, faster and cheaper — not a better one. Curate ruthlessly before you train.
RL: when you can score it but not demonstrate it
Reach for RL when you can cheaply evaluate an outcome but cannot enumerate the behavior that produces it: code that passes tests, a proof that checks, a multi-step task that reaches a verified end state, a response a reward model prefers. RL lets the model search its own action space and reinforces whatever trajectories score well — including strategies no human would have written.
# RL objective: maximize expected reward over sampled rollouts traj = sample(policy, task) r = reward(traj) # scorable, not demonstrated loss = -advantage(r) * log_prob(policy, traj)
RL's power is also its hazard: it optimizes the reward you wrote, not the one you meant. It needs a usable signal at scale, a stable base policy to start from (almost always an SFT'd model), and far more infrastructure and ML judgment than SFT. It is the top of the ladder for a reason.
The ladder, and how the rungs compose
- Prompt — minutes, zero training data, low skill, fully reversible. Always the first move.
- SFT — days–weeks, hundreds–thousands of curated trajectories, moderate skill. The workhorse; the right answer for most "training" problems.
- RL — weeks–months, a robust reward at scale, high skill and infra. The right answer only when imitation provably cannot reach the bar.
They are not exclusive — they stack. The standard recipe is prompt-first; if that plateaus, SFT to set a strong policy; only then RL on top of the SFT model to push past the imitation ceiling. Skipping the rungs (RL on a weak base, SFT on uncurated data) is the most common way teams burn a quarter.
When NOT to climb the ladder
Do not train if the metric is moving on prompt edits, if you cannot articulate the target as either demonstrations or a score, if the base model is about to be replaced, or if you lack an eval that would prove the trained model is better. A fine-tune you cannot evaluate is a liability you cannot retire. Prompting asks, SFT imitates, RL optimizes — and the cheapest lever that closes the gap is the correct one, not the most powerful.