Training Agentic Models
Post-training for agentic ability: SFT, rejection sampling, distillation, RLHF/RLAIF, RL for tool use, reward design.
- Prompt, Fine-Tune, or RL? The decision tree for changing agent behavior: prompting asks, SFT imitates, RL optimizes. Pick the cheapest lever that closes the gap.
- RLHF & RLAIF: Walking the RLHF pipeline stage by stage (SFT, reward model, PPO/GRPO/DPO), and what swapping human labels for an AI judge actually fixes.
- RL for Tool Use & Multi-Step Tasks: Why RL over tool trajectories is hard: sparse terminal reward, credit assignment across steps, and why a trustworthy verifier is the whole game.
- Reward Design & Reward Hacking: The reward is always a proxy: concrete agent reward-hacking patterns, the KL leash to the base policy, and the discipline of auditing the top, not the mean.
- SFT, Rejection Sampling & Distillation: The supervised techniques that solve most agentic training problems before RL: rejection sampling, expert iteration, and distilling a strong agent into a cheap one.
- Process vs Outcome Reward Models: Pay for the answer or pay for the steps: when dense process reward beats sparse outcome reward, and the labeling-cost trade that decides it.