RL for Tool Use & Multi-Step Tasks

Deep Dive · Training Agentic Models

RL over tool trajectories: sparse rewards, long horizons, and verifiable signal.

Single-turn RLHF optimizes one response. Agentic RL optimizes a trajectory: a sequence of model turns interleaved with tool calls and environment responses, judged mostly by whether the end state is correct. That shift removes what made RLHF comparatively easy: the reward becomes sparse and terminal, credit must be assigned across many steps, and the environment becomes part of the training system. This essay is about why tool-use RL is hard and what makes it tractable: verifiable rewards and disciplined environment design.

STEP 1

The unit of optimization is a trajectory, not a turn

A rollout now looks like s0 → a0 → tool → o0 → a1 → tool → o1 → … → done. The policy emits actions; the environment emits observations; reward typically arrives only at done. The model is being trained to act under partial observability over a long horizon, where most individual tokens it emits are tool-call plumbing, not the thing you care about. This is closer to classical RL than to RLHF, with all the instability that implies.
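To make the shape of a rollout concrete, here is a minimal sketch of the interaction loop. The `Step` and `Trajectory` types, the `ToyCounterEnv` environment, and `scripted_policy` are hypothetical names introduced purely for illustration; a real setup would put a language model behind `policy` and a sandboxed tool runtime behind the environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str        # what the policy emitted (e.g. a tool call)
    observation: str   # what the environment returned


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # assigned once, at the end of the episode


class ToyCounterEnv:
    """Hypothetical toy environment: reach a target count using 'inc' calls."""

    def __init__(self, target: int, max_steps: int = 10):
        self.target = target
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> str:
        self.count = 0
        self.t = 0
        return f"count=0 target={self.target}"

    def step(self, action: str) -> Tuple[str, bool]:
        """Apply one action; return (observation, done)."""
        self.t += 1
        if action == "inc":
            self.count += 1
        done = action == "submit" or self.t >= self.max_steps
        return f"count={self.count}", done

    def verify(self) -> float:
        """Terminal, binary reward: judged only on the end state."""
        return 1.0 if self.count == self.target else 0.0


def rollout(env: ToyCounterEnv, policy: Callable[[str], str]) -> Trajectory:
    traj = Trajectory()
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)
        obs, done = env.step(action)
        traj.steps.append(Step(action, obs))
    traj.reward = env.verify()  # the only reward signal in the episode
    return traj


def scripted_policy(obs: str) -> str:
    """Stand-in for a model: increment until the observed count reaches 3."""
    count = int(obs.split()[0].split("=")[1])
    return "submit" if count >= 3 else "inc"
```

Calling `rollout(ToyCounterEnv(target=3), scripted_policy)` produces a four-step trajectory whose intermediate steps carry no reward of their own; the single scalar on `traj.reward` is all the learner sees.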

STEP 2

Sparse terminal reward and the credit-assignment problem

If a 14-step task fails, which step was the mistake? A single terminal scalar gives the policy almost no information about where it went wrong, only that it did. With sparse reward the gradient signal per step is tiny and high-variance; the policy needs enormous sample counts to learn which actions mattered.

# Sparse, terminal: one scalar for the whole trajectory
# (verify, baseline, final_state and traj are assumed to be defined elsewhere)
r = 1.0 if verify(final_state) else 0.0
# Every step shares the same return -> weak credit assignment
advantage = {}
for t, step in enumerate(traj):
    advantage[t] = r - baseline(step)

This is the core difficulty. Mitigations: shorter horizons, group-relative baselines (GRPO) that compare trajectories on the same task to extract signal from a binary outcome, and — where you can afford it — per-step rewards (covered in T6). The honest framing: sparse-reward long-horizon RL is sample-hungry and unstable; you fight it with environment and reward design, not optimizer tricks.
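The group-relative idea can be sketched in a few lines: sample several rollouts for the same task, then score each one against the group's own mean and spread. This is a simplified illustration of outcome-level GRPO-style advantages, not a full training loop; the function name and the `eps` value are assumptions, and implementations differ on details such as whether they divide by the standard deviation at all.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's terminal reward against its own group.

    `rewards` holds one terminal reward per rollout, all sampled for the
    same task. Every step of rollout i would share advantage i.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One success out of four attempts: the success is pushed up,
# the failures are pushed down by a smaller amount each.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# A group with identical outcomes carries no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case is the practical catch: if a task is so hard that every rollout in the group fails (or so easy that every one succeeds), all advantages are zero and the batch contributes nothing, which is one reason task difficulty has to be matched to the current policy.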

STEP 3

Verifiable rewards are what make this work at all

The reason code and math are the breakout domains for agentic RL is that they have cheap, programmatic, hard-to-game verifiers: did the tests pass, does the proof check, does the SQL return the expected rows. A verifiable reward is objective, scalable to millions of rollouts, and resistant to the reward hacking that plagues learned reward models.

# Verifiable reward: an executable oracle, not a learned RM
r = 1.0 if run_tests(agent_patch) else 0.0

Before designing tool-use RL, ask: is there a cheap programmatic oracle for success? If yes, you have a strong project. If success can only be judged by a model or a human, you have a much harder, more expensive, more hackable one — solve the verifier first.
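As one concrete instance of a programmatic oracle, here is a sketch of the "does the SQL return the expected rows" check using Python's standard `sqlite3` module. The function name, the fixture schema, and the order-insensitive comparison are illustrative assumptions; note also that an in-memory database gives a clean state per check but is not by itself a security sandbox.

```python
import sqlite3
from typing import List, Tuple


def sql_reward(agent_query: str, setup_sql: str, expected_rows: List[Tuple]) -> float:
    """Return 1.0 if the agent's query yields exactly the expected rows, else 0.0."""
    conn = sqlite3.connect(":memory:")  # fresh database per check
    try:
        conn.executescript(setup_sql)
        rows = conn.execute(agent_query).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return 0.0  # invalid SQL is a task failure, not a crash
    finally:
        conn.close()
    # Order-insensitive comparison; compare with == if the task requires ORDER BY.
    return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0


# Hypothetical fixture for a single task.
SETUP_SQL = """
CREATE TABLE users (id INTEGER, name TEXT, active INTEGER);
INSERT INTO users VALUES (1, 'ada', 1), (2, 'bob', 0), (3, 'cy', 1);
"""
EXPECTED = [(1, "ada"), (3, "cy")]

good = sql_reward("SELECT id, name FROM users WHERE active = 1", SETUP_SQL, EXPECTED)
bad = sql_reward("SELECT id, name FROM users", SETUP_SQL, EXPECTED)
broken = sql_reward("SELEC nonsense", SETUP_SQL, EXPECTED)
```

The check is cheap, objective, and repeatable, which is the property the paragraph above is asking for. It is still only as good as the fixture: a task whose expected rows can be produced by a trivially wrong query has a loophole in the verifier.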

STEP 4

The environment is now part of the model

In agentic RL the training distribution is generated by interacting with the environment, so the environment's properties become training properties. It must be deterministic enough to be a stable signal, fast and parallelizable enough for millions of rollouts, sandboxed so the agent cannot cheat or cause harm, and resettable to a clean state. Flakiness compounds: if a tool times out on 5% of calls, independently, then about half of all 14-call rollouts hit at least one timeout (1 - 0.95^14 ≈ 0.51), and that noise reaches the reward as something the policy will either learn to exploit or be confused by.

Most tool-use RL failures are environment bugs, not algorithm bugs: nondeterministic tools, stale fixtures, a verifier with a loophole, an env that leaks the answer into an observation. The policy will find every one of these. Treat the environment with more rigor than the training code.
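Two of these bug classes can be caught with very small smoke tests before any training run. The sketch below replays a fixed action script to check that an environment resets cleanly, and scans observations for the hidden answer. The `Env` protocol, both toy environments, and the helper names are hypothetical; a real suite would run the same checks against the actual tool sandbox.

```python
from typing import Callable, List, Protocol


class Env(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> str: ...


def replay(env: Env, actions: List[str]) -> List[str]:
    """Reset the env, then record every observation for a fixed action script."""
    return [env.reset()] + [env.step(a) for a in actions]


def is_reproducible(make_env: Callable[[], Env], actions: List[str], runs: int = 3) -> bool:
    """True if the same script yields identical observations after each reset."""
    env = make_env()
    first = replay(env, actions)
    return all(replay(env, actions) == first for _ in range(runs - 1))


def leaks_answer(observations: List[str], secret: str) -> bool:
    """True if the verifier's hidden answer appears in anything the agent sees."""
    return any(secret in obs for obs in observations)


class CleanEnv:
    def reset(self) -> str:
        self.calls = 0
        return "ready"

    def step(self, action: str) -> str:
        self.calls += 1
        return f"{action}:{self.calls}"


class StaleEnv:
    """Bug on purpose: reset() forgets to clear state from the previous episode."""

    def __init__(self) -> None:
        self.calls = 0

    def reset(self) -> str:
        return "ready"

    def step(self, action: str) -> str:
        self.calls += 1
        return f"{action}:{self.calls}"


script = ["ls", "cat notes.txt"]
clean_ok = is_reproducible(CleanEnv, script)
stale_ok = is_reproducible(StaleEnv, script)
```

Here `CleanEnv` passes and `StaleEnv` fails, because its second episode starts from the first episode's leftover counter. Checks like these are cheap enough to run on every environment change.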

STEP 5

Why tool-use RL is genuinely hard — concretely

Horizon — variance compounds with steps; a 30-step task has far more ways to fail than a 3-step one, and a sparse reward sees only the end.
Partial observability — the policy acts on a context window that is a lossy view of true state; tool outputs may be huge, truncated, or misleading.
Exploration cost — each rollout runs real tools; exploration is orders of magnitude more expensive than sampling text tokens.
Reward sparsity — binary terminal signal, weak per-step credit, high-variance gradients.
Distribution shift — as the policy improves, it visits states the verifier and environment were never tested on.

None of these are fatal, but together they explain why tool-use RL needs more infrastructure, more compute, and more careful evaluation than any other rung on the training ladder.
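The horizon effect is easy to quantify with a back-of-the-envelope model. The sketch below assumes every step succeeds independently with the same probability, which real tasks do not satisfy exactly; the function names are illustrative.

```python
def trajectory_success_rate(per_step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent, identical steps."""
    return per_step_success ** num_steps


def expected_rollouts_per_success(per_step_success: float, num_steps: int) -> float:
    """Average number of rollouts needed to observe one successful trajectory."""
    return 1.0 / trajectory_success_rate(per_step_success, num_steps)


short = trajectory_success_rate(0.95, 3)    # roughly 0.86
long_ = trajectory_success_rate(0.95, 30)   # roughly 0.21
```

With the same 95% per-step reliability, the 3-step task succeeds about 86% of the time and the 30-step task about 21% of the time, so positive terminal rewards become several times rarer while each rollout also costs roughly ten times as many tool calls.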

STEP 6

When NOT to do tool-use RL

Skip it if you have no cheap verifier (you will be reward-hacked before you are improved), if the horizon is so long that credit assignment is hopeless without process rewards you cannot afford, or if SFT on a few thousand strong trajectories already clears the bar — it usually gets you most of the way at a fraction of the cost. Tool-use RL pays off exactly when success is cheap to verify and hard to demonstrate; absent a trustworthy verifier, you are not training an agent, you are training a reward exploiter.