Tool Granularity & Composition

Deep Dive · Tool & Capability Design

Granularity is the central tool-design decision: coarse enough to be safe, fine enough to be flexible.

Every tool sits somewhere on a spectrum from one fat do-everything call to a hundred atomic primitives, and where you place it determines how well the agent reasons, how many round-trips a task costs, how much can go wrong per call, and how badly your tool list bloats. This essay is about choosing the grain deliberately: when to consolidate steps behind one call, when to split a tool apart, and how to recognize the tool-explosion failure that is silently degrading agents across the 2025–2026 ecosystem.

STEP 1

Coarse tools and fine tools fail in opposite directions.

A coarse tool — one call that finds the order, refunds it, and notifies the customer — is easy for the model to invoke correctly and cheap in round-trips, but it hides decisions the agent might need to make, can't be recombined for a case you didn't anticipate, and concentrates blast radius (one wrong call does three things). A fine tool — separate find, refund, notify — is flexible and recombinable, but multiplies round-trips, multiplies the places the model can desynchronize, and inflates the tool count. Neither is "correct"; the question is always which failure mode you can least afford for this capability.

STEP 2

Consolidate when the steps are an indivisible job the model shouldn't orchestrate.

If the model has no meaningful decision to make between step A and step B — it will always do B after A, and doing A without B is a bug — then exposing them separately only creates a way to fail. Collapse them. The test is decisional, not technical: not "are these one transaction in the database" but "does the agent ever legitimately want to do one without the other?" Refund-then-notify is one tool because a refund nobody is told about is never the goal. Search-then-read is two tools because the agent legitimately searches, reasons, and reads only some results.

# No decision between the steps -> one coarse tool
refund_order(order_id, reason)    # find + refund + notify, atomic

# Real decision between the steps -> keep them split
search_issues(query) -> [ids]     # model reasons over results...
get_issue(id)                     # ...then reads only the ones it chose
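
As a rough illustration of the consolidated case, here is a minimal sketch of what a coarse refund tool might look like on the server side. The in-memory `ORDERS` and `OUTBOX` structures and the `RefundResult` type are hypothetical stand-ins, not part of any real system. The point is only that the model sees one call while the three steps stay an internal detail.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for an order store and a notification outbox.
ORDERS = {"A100": {"customer": "kim@example.com", "total": 42.0, "refunded": False}}
OUTBOX = []


@dataclass
class RefundResult:
    ok: bool
    detail: str


def refund_order(order_id: str, reason: str) -> RefundResult:
    """Coarse tool: find + refund + notify behind a single call."""
    # Step 1: find the order.
    order = ORDERS.get(order_id)
    if order is None:
        return RefundResult(False, f"order {order_id} not found")
    if order["refunded"]:
        return RefundResult(False, f"order {order_id} already refunded")

    # Step 2: refund it.
    order["refunded"] = True

    # Step 3: notify the customer. The model never gets a chance to skip this.
    OUTBOX.append((order["customer"], f"Refunded {order['total']:.2f}: {reason}"))
    return RefundResult(True, f"refunded {order['total']:.2f} and notified customer")
```

Because the notification lives inside the tool, "refund without notify" is not a state the agent can reach by forgetting a call. A real implementation would also need to decide what to report if the notification step itself fails after the refund has gone through.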

STEP 3

Split when one tool is overloaded with modes or unbounded output.

The opposite move. A tool that does subtly different things depending on a mode or action argument is two tools wearing one coat: the model has to pick the tool, then pick the mode, which doubles the places it can go wrong, and the schema becomes a union in which half the fields are only conditionally relevant. Split it. Likewise, a single tool whose output is unbounded ("return everything about this customer") should be split into a cheap discovery call and a targeted fetch, so the model pays for only what it reads.

If a tool's description uses the word "or" to say what it does ("creates or updates," "searches or fetches"), that "or" is usually a seam where two tools were welded together. Splitting on the "or" almost always raises tool-selection accuracy.
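
To make the seam concrete, here is a sketch of a mode-overloaded tool next to its split form, written as JSON-Schema-style tool definitions in Python dicts. The tool names (`manage_issue`, `create_issue`, `update_issue`), their fields, and the `missing_required` helper are all hypothetical, chosen only for illustration.

```python
# Hypothetical overloaded tool: one schema, two behaviours selected by `action`.
manage_issue = {
    "name": "manage_issue",
    "description": "Creates or updates an issue.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update"]},
            "issue_id": {"type": "string"},  # only meaningful for update
            "title": {"type": "string"},     # needed for create, optional for update
            "body": {"type": "string"},
        },
        # A flat `required` list cannot say which other fields each mode needs.
        "required": ["action"],
    },
}

# Split on the "or": each tool states exactly what it needs.
create_issue = {
    "name": "create_issue",
    "description": "Creates a new issue.",
    "parameters": {
        "type": "object",
        "properties": {"title": {"type": "string"}, "body": {"type": "string"}},
        "required": ["title"],
    },
}

update_issue = {
    "name": "update_issue",
    "description": "Updates an existing issue.",
    "parameters": {
        "type": "object",
        "properties": {
            "issue_id": {"type": "string"},
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["issue_id"],
    },
}


def missing_required(tool: dict, args: dict) -> list[str]:
    """Return the required fields that a proposed call leaves out."""
    return [name for name in tool["parameters"]["required"] if name not in args]
```

With the overloaded form, a call like `{"action": "update"}` passes the flat required check even though it names no issue. With the split form, the same mistake against `update_issue` is caught by the schema before the tool runs.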

STEP 4

Composition is a property you design for, not one you get for free.

Fine tools are only valuable if the agent can actually chain them, and chaining works only when the output of one tool is shaped to be the input of the next. A search that returns opaque internal IDs the fetch tool doesn't accept is two tools that don't compose — the model has to invent a translation step and will get it wrong. Designing for composition means agreeing on shared identifiers and shapes across a tool family so the agent can pipe them, the same way Unix pipes only work because every tool speaks lines of text.
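
A small sketch of a tool pair designed to compose, under the assumption of a hypothetical in-memory `ISSUES` store. The names `IssueRef`, `search_issues`, and `get_issue` are illustrative only. The design choice being shown is that the search result carries the same public identifier the fetch tool accepts, so no translation step sits between them.

```python
from typing import TypedDict


class IssueRef(TypedDict):
    id: str     # the same public identifier that get_issue accepts
    title: str  # just enough for the model to decide what to read


# Hypothetical in-memory issue store.
ISSUES = {
    "ISSUE-1": {"title": "Login fails on Safari", "body": "Steps to reproduce..."},
    "ISSUE-2": {"title": "Login page typo", "body": "The heading is misspelled."},
    "ISSUE-3": {"title": "Export is slow", "body": "Large exports take too long."},
}


def search_issues(query: str) -> list[IssueRef]:
    """Cheap discovery call: returns short references, not full records."""
    needle = query.lower()
    return [
        {"id": issue_id, "title": data["title"]}
        for issue_id, data in ISSUES.items()
        if needle in data["title"].lower()
    ]


def get_issue(issue_id: str) -> dict:
    """Targeted fetch: takes exactly the id that search_issues returned."""
    return {"id": issue_id, **ISSUES[issue_id]}


# The agent searches, reasons over the short results, and reads only some of them.
hits = search_issues("login")
chosen = [hit for hit in hits if "Safari" in hit["title"]]
details = [get_issue(hit["id"]) for hit in chosen]
```

If `search_issues` returned internal row numbers instead, the model would have to guess how to map them to the ids `get_issue` expects, which is the failure described above.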

STEP 5

Tool explosion is a measurable, ecosystem-wide failure.

Past a certain point, more tools do not mean more capability; they mean context bloat and worse tool selection. The 2025–2026 numbers are blunt: a standard MCP stack (Playwright + GitHub + an IDE server) can consume over 20% of the context window before the agent starts, and correct-tool-selection accuracy has been measured dropping from ~95% with a focused set to ~71% with a full GitHub MCP server loaded, a ~24-point loss caused purely by tool count. GitHub Copilot cut its tool set from 40 to 13 and Block went from 30+ Linear tools to 2; both changes produced benchmark improvements. Anthropic shipped deferred tool loading in late 2025 precisely so agents stop paying for tools they aren't using.

Every tool you add taxes every call the agent ever makes, whether or not it uses that tool, because the description sits in context and competes for selection. The marginal tool is rarely free; treat the tool list as a budget, not a buffet.
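
One way to treat the tool list as a budget is to measure it. The sketch below estimates how much context a set of tool definitions consumes and checks it against a cap. The 4-characters-per-token ratio and the 5% default cap are loud assumptions for illustration, not measured values, and the helper names are hypothetical; a real check would use the target model's own tokenizer.

```python
import json


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    # Swap in the model's real tokenizer for actual numbers.
    return max(1, len(text) // 4)


def tool_list_cost(tools: list[dict]) -> int:
    """Approximate tokens spent on tool definitions before the agent does anything."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


def over_budget(tools: list[dict], context_window: int, max_fraction: float = 0.05) -> bool:
    """True if the tool list exceeds the chosen share of the context window."""
    return tool_list_cost(tools) > context_window * max_fraction


def most_expensive(tools: list[dict], top: int = 3) -> list[tuple[str, int]]:
    """Name and estimated cost of the costliest tools, largest first."""
    costs = [(tool["name"], estimate_tokens(json.dumps(tool))) for tool in tools]
    return sorted(costs, key=lambda pair: pair[1], reverse=True)[:top]
```

Running a check like this when the tool list changes makes the cost of the marginal tool visible at review time, and `most_expensive` points at the descriptions worth trimming or deferring first.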

STEP 6

When fine-grained is worth the cost.

Coarse-by-default is the right bias for most agents, but it is wrong when the agent's value is open-ended composition — a coding or data-analysis agent needs primitive read/write/run tools precisely because the tasks were not anticipated and consolidation would amputate them. Consolidate when you know the workflow; keep primitives when the workflow is the thing the agent is supposed to invent — and never let primitive count grow unbounded just because each one looked cheap.