Tool Docs & Discoverability

Deep Dive · Tool & Capability Design

The name and description are the entire discoverability surface — the model can't use a tool it can't find.

Before a tool is ever called correctly it has to be selected correctly, and selection happens purely on the name, the description, and whatever examples sit in the schema, read against every other tool competing for the same slot. A perfectly engineered tool with a vague name is invisible; a precise name and a description that says exactly when to reach for it are the difference between an agent that finds the right capability and one that flails. This essay is about the discovery surface and the too-many-tools problem that is degrading it across the ecosystem.

STEP 1

Tool selection is a retrieval problem the model solves from text alone.

Faced with a task and a list of tools, the model is doing retrieval: matching intent against descriptions. It has no telemetry, no usage stats, no ability to "try one and see" cheaply — it picks from the words. That means the name and description are not metadata, they are the index. If two tools' descriptions are near-synonyms, the model will confuse them; if a description says what the tool does but never when to use it, the model can't tell whether this task is the one. Write descriptions to disambiguate against the rest of the list, not in isolation.
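One way to catch near-synonym descriptions before the model does is a crude overlap check across the whole list. The sketch below is an illustration under stated assumptions, not a measure of how any model actually scores tools: the tool names, the descriptions, and the 0.6 threshold are hypothetical, and word-set Jaccard similarity is only a rough proxy for "reads the same".

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for how similar two descriptions read."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words across both texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def confusable_pairs(tools: dict[str, str], threshold: float = 0.6):
    """Return (name_a, name_b, score) for description pairs at or above threshold."""
    pairs = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        score = jaccard(desc_a, desc_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


# Hypothetical tool list: two near-synonyms and one clearly distinct tool.
TOOLS = {
    "lookup": "Look up a customer record by email.",
    "find": "Find a customer record by email.",
    "refund_order": "Refund a completed order. Do not use for unshipped orders.",
}

for name_a, name_b, score in confusable_pairs(TOOLS):
    print(f"{name_a} vs {name_b}: overlap {score:.2f}, rewrite to disambiguate")
```

Any pair this flags is a candidate for a rename or a sharper boundary sentence; a pair it misses can still be confusable, so treat it as a first-pass lint rather than a guarantee.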

STEP 2

The name carries most of the selection signal — spend it well.

Names are read first and weighted heaviest. A good tool name states the action and the object and is distinct from its neighbors: refund_order beats process; search_customers beats lookup. When a system has many tools, namespacing by service and resource is the single most effective disambiguator — Anthropic's guidance explicitly recommends prefixes like asana_search vs jira_search, and asana_projects_search vs asana_users_search, because the shared prefix tells the model which family it is in before it even reads the description.

# Unnamespaced: model confuses which 'search' is which
search(q)   lookup(q)   find(q)   query(q)

# Namespaced by service + resource: selection is obvious
asana_projects_search(q)
asana_users_search(q)
jira_issues_search(q)
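The naming rule can be checked mechanically. The following is a minimal sketch, assuming a service_resource_action convention and a hand-picked list of generic verbs; both the convention and the verb list are illustrative choices, not a standard.

```python
from collections import defaultdict

# Hypothetical list of verbs that carry no object or namespace on their own.
GENERIC_VERBS = {"search", "lookup", "find", "query", "process", "get", "run"}


def name_problems(name: str) -> list[str]:
    """Return human-readable problems with a tool name (empty list if none found)."""
    problems = []
    if name in GENERIC_VERBS:
        problems.append("bare generic verb: no object, no namespace")
    elif "_" not in name:
        problems.append("single word: state both the action and the object")
    return problems


def families(names: list[str]) -> dict[str, list[str]]:
    """Group tool names by their first underscore-separated segment."""
    groups = defaultdict(list)
    for name in names:
        groups[name.split("_")[0]].append(name)
    return dict(groups)


NAMES = ["search", "asana_projects_search", "asana_users_search", "jira_issues_search"]

for name in NAMES:
    for problem in name_problems(name):
        print(f"{name}: {problem}")

print(families(NAMES))
```

Grouping by prefix also makes it easy to see which family is growing large enough to need a resource segment as well as a service segment.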

STEP 3

A description must say when to use it AND when not to.

The most common description defect is stating the capability and stopping. The model needs the boundary: "Use this to refund a completed order. Do not use for cancelling an order that has not shipped; use cancel_order for that." The negative half is what prevents the confident-wrong selection. A description that only says what the tool does invites calls in adjacent situations it was never meant for.

Add one "use this when… not when…" sentence to every tool whose neighbor could plausibly be confused with it. This single sentence does more for selection accuracy than any amount of detail about the tool's internals.
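As a sketch of what that looks like in a tool definition, the dictionary below uses the name / description / input_schema shape common to JSON-Schema-based tool APIs. The argument names and the has_boundary heuristic are assumptions for illustration; the phrase matching is deliberately crude and will miss boundaries worded differently.

```python
# Hypothetical tool definition; the description wording follows the essay's example.
REFUND_ORDER = {
    "name": "refund_order",
    "description": (
        "Use this to refund a completed order. "
        "Do not use for cancelling an order that has not shipped; "
        "use cancel_order for that."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "reason": {"type": "string"},
        },
        "required": ["order_id", "reason"],
    },
}


def has_boundary(description: str) -> bool:
    """Crude lint: does the description state both when to use and when not to?"""
    text = description.lower()
    says_when = "use this" in text
    says_when_not = "do not use" in text or "not when" in text
    return says_when and says_when_not


print(has_boundary(REFUND_ORDER["description"]))
print(has_boundary("Refunds an order."))
```

Running a check like this over every tool in a catalog gives a quick list of descriptions that state the capability and stop.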

STEP 4

One concrete example outperforms a paragraph of prose.

Models pattern-match. A worked invocation in the description, such as 'Example: to refund order 8f3a for a defect, call with order_id="order_8f3a", reason="defective"', does more to produce correct calls than three sentences explaining the arguments abstractly, because it shows the shape instead of describing it. Examples are also where you encode the non-obvious: the units, the ID format, the one combination that is the common case. Spend the description budget on the example before the explanation.
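An example in a description is only useful while it still matches the schema. One possible way to keep the two in sync, sketched here with a hypothetical schema and helper names, is to store the example arguments as data, check them against the schema, and render the description sentence from them.

```python
# Hypothetical schema for refund_order; argument names follow the essay's example.
SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "reason": {"type": "string"},
    },
    "required": ["order_id", "reason"],
}

EXAMPLE_ARGS = {"order_id": "order_8f3a", "reason": "defective"}


def example_errors(example: dict, schema: dict) -> list[str]:
    """Check that the example uses only known arguments and includes required ones."""
    errors = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in example:
            errors.append(f"missing required argument: {key}")
    for key in example:
        if key not in properties:
            errors.append(f"unknown argument: {key}")
    return errors


def render_description(summary: str, example: dict) -> str:
    """Append a worked invocation to the summary so the call shape is visible."""
    args = ", ".join(f'{key}="{value}"' for key, value in example.items())
    return f"{summary} Example: call with {args}."


assert example_errors(EXAMPLE_ARGS, SCHEMA) == []
print(render_description("Use this to refund a completed order.", EXAMPLE_ARGS))
```

This only checks argument names, not types or formats; a full JSON Schema validator would be the natural next step if the examples grow more complex.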

STEP 5

Too many tools destroy discoverability, and this is now measured, not theorized.

Every tool's description sits in the context window competing for selection, so the discovery surface degrades with size. The 2025–2026 evidence is concrete: correct-tool-selection accuracy was measured falling from ~95% with a focused set to ~71% with a full GitHub MCP server loaded, and a standard MCP stack was found to eat 20%+ of context before work begins. The fixes that worked were curation and loading discipline: GitHub Copilot cut its set from 40 tools to 13, Block went from 30+ to 2, and Anthropic introduced deferred tool loading in late 2025 so the model discovers tools on demand instead of carrying all of them. Discoverability is inversely related to inventory.

Adding a tool to "help the model" past roughly a dozen often hurts every selection, including for tasks unrelated to the new tool, because it dilutes the index and burns context that every call reads. Each additional tool is a discoverability cost, not a free capability gain.
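The on-demand idea can be illustrated with a toy search over a catalog that stays out of the prompt. This is a sketch of the general pattern only, not Anthropic's or any vendor's implementation: the catalog entries, the search_tools helper, and the word-overlap scoring are all hypothetical, and a real system would likely use a stronger retriever.

```python
import re

# Hypothetical catalog kept outside the model's context; only matches get loaded.
CATALOG = {
    "asana_projects_search": "Search Asana projects by name or owner.",
    "asana_users_search": "Search Asana users by name or email.",
    "jira_issues_search": "Search Jira issues by key, text, or assignee.",
    "refund_order": "Refund a completed order. Do not use for unshipped orders.",
    "cancel_order": "Cancel an order that has not shipped.",
}


def words(text: str) -> set[str]:
    """Lowercase alphanumeric words; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tools(query: str, catalog: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k tool names ranked by word overlap with the query."""
    query_words = words(query)
    scored = []
    for name, description in catalog.items():
        overlap = len(query_words & (words(name) | words(description)))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


print(search_tools("refund a completed order", CATALOG, k=2))
```

The model then sees one search tool plus the handful of definitions it asked for, instead of paying the context cost of the full catalog on every call.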

STEP 6

When polishing docs is not the bottleneck.

Description quality has a ceiling: no wording rescues a tool whose shape is wrong (K1–K3) or that shouldn't be in the list at all (K2). If the agent keeps mis-selecting after the description clearly states when and when-not, the problem is too many overlapping tools or a bad granularity boundary: fix the inventory, not the prose.