Training Agentic Models

Post-training for agentic ability — SFT, rejection sampling, distillation, RLHF/RLAIF, RL for tool use, reward design.

  1. Prompt, Fine-Tune, or RL?
    The decision tree for changing agent behavior: prompting asks, SFT imitates, RL optimizes — pick the cheapest lever that closes the gap.
  2. RLHF & RLAIF
    Walking the RLHF pipeline stage by stage (SFT, reward model, then policy optimization with PPO or GRPO, or DPO, which skips the explicit reward model) and what swapping human labels for an AI judge actually fixes.
  3. RL for Tool Use & Multi-Step Tasks
    Why RL over tool trajectories is hard: sparse terminal reward, credit assignment across steps, and why a trustworthy verifier is the whole game.
  4. Reward Design & Reward Hacking
    The reward is always a proxy: concrete agent reward-hacking patterns, the KL penalty that leashes the policy to its reference model, and the discipline of auditing the highest-reward rollouts, not just the mean.
  5. SFT, Rejection Sampling & Distillation
    The supervised techniques that solve most agentic training problems before RL: rejection sampling, expert iteration, and distilling a strong agent into a cheap one.
  6. Process vs Outcome Reward Models
    Pay for the answer or pay for the steps: when dense process reward beats sparse outcome reward, and the labeling-cost trade-off that decides it.