Tool-Use Loops & Error Recovery

D7
Deep Dive · Architectures & Patterns

Every pattern in this section calls tools, and tools fail: timeouts, malformed arguments, empty results, partial successes. Whether an architecture survives contact with reality is decided here, in the error-handling layer that diagrams omit. This essay covers the failure taxonomy, the recovery loop, and the durability concerns of side-effecting tools.

STEP 1

A taxonomy of tool failures — because each needs a different response.

"The tool failed" is not actionable. Production agents must distinguish failure classes, because the correct recovery differs by class and conflating them is the root cause of most agent flakiness:

  • Malformed call. The model produced arguments that violate the schema or are semantically invalid (bad date, nonexistent ID). Recovery: return a precise, structured error to the model and let it retry — this is the failure models recover from best.
  • Transient infrastructure failure. Timeout, 503, rate limit, network blip. Recovery: deterministic retry with exponential backoff in the harness, not via the model. The model should not see a transient blip unless the retry budget is exhausted.
  • Empty / unhelpful result. The call succeeded but returned nothing useful (no search hits, empty list). Recovery: this is a reasoning problem, not an error — surface it as an observation so the model can reformulate or change approach.
  • Hard semantic failure. Permission denied, resource gone, precondition unmet. Recovery: usually unrecoverable by retry; the model must replan or escalate to a human.
  • Partial success. A batch/multi-item tool succeeded for some items and failed for others. Recovery: the hardest case — return a structured per-item result so the agent can act on the successful subset and decide about the rest.

The dominant production anti-pattern: collapsing all of these into a generic "Error: something went wrong" string fed back to the model. The model cannot distinguish a transient blip it should ignore from a permission failure it must escalate, so it either retries forever or gives up — both look like a hung or "dumb" agent. Error taxonomy is not polish; it is the difference between a robust agent and a flaky one.
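The taxonomy above can be made explicit in the harness so that every tool response is tagged with a class before anything reaches the model. The following is a minimal sketch, assuming the tools sit behind an HTTP-style API and report per-item counts for batch calls. The names `FailureClass`, `RECOVERY`, and `classify`, and the status-code groupings, are illustrative assumptions rather than part of any particular framework.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    MALFORMED_CALL = "malformed_call"
    TRANSIENT = "transient"
    EMPTY_RESULT = "empty_result"
    HARD_FAILURE = "hard_failure"
    PARTIAL_SUCCESS = "partial_success"


# Who acts on each class, following the taxonomy above.
RECOVERY = {
    FailureClass.MALFORMED_CALL: "model retries with corrected arguments",
    FailureClass.TRANSIENT: "harness retries with backoff",
    FailureClass.EMPTY_RESULT: "model reformulates or changes approach",
    FailureClass.HARD_FAILURE: "model replans or escalates to a human",
    FailureClass.PARTIAL_SUCCESS: "model acts on successes, decides about the rest",
}

# Assumed status-code groupings; adjust to what your tools actually return.
_MALFORMED = {400, 422}
_TRANSIENT = {408, 429, 502, 503, 504}


def classify(status: int, succeeded: int = 0, failed: int = 0) -> Optional[FailureClass]:
    """Return the failure class for one tool response, or None for a clean success."""
    if status in _MALFORMED:
        return FailureClass.MALFORMED_CALL
    if status in _TRANSIENT:
        return FailureClass.TRANSIENT
    if status >= 400:
        # 401, 403, 410, 412 and anything unrecognized: never retried blindly.
        return FailureClass.HARD_FAILURE
    if failed and succeeded:
        return FailureClass.PARTIAL_SUCCESS
    if failed:
        return FailureClass.HARD_FAILURE  # every item in the batch failed
    if succeeded == 0:
        return FailureClass.EMPTY_RESULT
    return None
```

Treating unrecognized error statuses as hard failures is a deliberately conservative choice in this sketch: an unknown error on a side-effecting tool is safer to surface than to retry.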

STEP 2

The recovery loop: layered, with the model as the outermost ring.

# Tool dispatch with layered recovery.
# Assumes the harness provides schema, TOOLS, MAX_RETRIES, backoff, ok, and
# tool_error, plus the exception types SchemaError, Transient, and HardFailure.
def dispatch(call):
    try:
        args = schema.validate(call.args)              # layer 0: pre-flight
    except SchemaError as e:
        return tool_error("INVALID_ARGS", e.detail)    # → model retries

    for attempt in range(MAX_RETRIES):                 # layer 1: deterministic
        try:
            return ok(TOOLS[call.name](**args))
        except Transient:
            if attempt < MAX_RETRIES - 1:              # no wasted sleep after the last try
                backoff(attempt)                       # model never sees this
        except HardFailure as e:
            return tool_error("UNRECOVERABLE", e.reason)   # → model replans
    return tool_error("EXHAUSTED_RETRIES", call.name)  # → model replans

The principle: recover deterministically at the lowest layer that can; only escalate to the model what genuinely requires reasoning. Schema validation catches malformed calls before any side effect and returns them to the model for correction. The retry loop absorbs transients invisibly. Beyond malformed calls, only hard semantic failures and exhausted retries reach the model, each as a structured observation it can reason about, never a raw stack trace.
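The dispatch code leaves `backoff(attempt)` undefined. One common choice is capped exponential backoff with full jitter, sketched below under stated assumptions: `backoff_ceiling` and `call_with_retry` are hypothetical helper names, the default `base` and `cap` values are arbitrary, and `sleep` and `rng` are injected only so the behavior can be tested without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Transient(Exception):
    """Marker for retryable failures (timeout, 503, rate limit)."""


def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound in seconds for a zero-based attempt: base * 2^attempt, capped."""
    return min(cap, base * 2 ** attempt)


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, absorbing Transient errors until the retry budget is spent."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Transient:
            if attempt == max_retries - 1:
                raise  # budget spent; the caller reports EXHAUSTED_RETRIES
            # Full jitter: sleep a random fraction of the current ceiling.
            sleep(rng() * backoff_ceiling(attempt))
    raise ValueError("max_retries must be at least 1")
```

The jitter spreads out retries from many concurrent agents so they do not hit a recovering service in lockstep. The cap keeps a long retry budget from producing multi-minute sleeps.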

STEP 3

Error messages are prompts. Engineer them.

When an error does reach the model, that message is a prompt fragment that steers the next action. A vague error produces a flailing retry; a specific one produces a targeted correction. Compare:

BAD:  Error: request failed
GOOD: INVALID_ARGS on create_ticket: field "priority" must be one
      of [low, medium, high, urgent]; received "P1". Re-call with a
      valid value.

The good message states the error class, the offending field, the constraint, the received value, and the corrective action. Models recover from malformed calls far more reliably when given this than when given "request failed." This is arguably the highest-ROI piece of agent error handling, and one of the most often skipped.

Make your tool layer return errors in a fixed, parseable shape ({code, field, constraint, received, hint}). It improves model recovery, makes failure classes measurable, and lets you alert on classes (a spike in UNRECOVERABLE is an incident; a spike in INVALID_ARGS is a prompt/schema bug).
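A fixed error shape can be as small as one dataclass that serializes for metrics and renders for the model. This is a minimal sketch, not a prescribed schema: the class name `ToolError`, the extra `tool` field, and the `render` and `to_json` methods are assumptions added for illustration. The remaining fields follow the {code, field, constraint, received, hint} shape described above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ToolError:
    code: str                         # failure class, e.g. "INVALID_ARGS"
    tool: str                         # assumed extra field: which tool was called
    field: Optional[str] = None       # offending argument, if any
    constraint: Optional[str] = None  # what the argument must satisfy
    received: Any = None              # what the model actually sent
    hint: Optional[str] = None        # the corrective action to take

    def to_json(self) -> str:
        """Machine-readable form for logs, metrics, and alerting."""
        return json.dumps(asdict(self), sort_keys=True)

    def render(self) -> str:
        """Model-facing form: class, field, constraint, received value, action."""
        detail = ""
        if self.field is not None:
            detail = (
                f': field "{self.field}" {self.constraint}; '
                f"received {json.dumps(self.received)}"
            )
        hint = f" {self.hint}" if self.hint else ""
        return f"{self.code} on {self.tool}{detail}.{hint}"


err = ToolError(
    code="INVALID_ARGS",
    tool="create_ticket",
    field="priority",
    constraint="must be one of [low, medium, high, urgent]",
    received="P1",
    hint="Re-call with a valid value.",
)
print(err.render())
```

Because `code` is a plain string on every error, counting errors by class for dashboards or alerts reduces to a group-by over logged `to_json()` records.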

STEP 4

Side effects: idempotency, durability, and the ground truth.

Read-only tools forgive retries. Side-effecting tools (send email, charge card, mutate prod) do not — a blind retry double-sends. The disciplines that make recovery safe:

  • Idempotency keys. Derive a key from the intended effect; the tool dedupes server-side. Now the harness retry layer is safe for mutating calls.
  • The tool result is ground truth, not the model's belief. The agent should treat the email as sent only if the tool returned success. State transitions follow tool results, never the model's narration: a model that says "done" after a failed call is the classic silent-corruption bug.
  • Durable execution for long/critical flows. Persist the trajectory in a workflow or checkpoint engine so that a crash mid-plan resumes from the last completed step instead of re-running side effects. This is why plan-and-execute's explicit, persistable plan is operationally safer than an in-memory ReAct history for high-stakes work.
  • Compensating actions. When a multi-step flow fails midway with committed side effects, the agent needs explicit rollback/compensation tools — there is no automatic undo for "money was moved."
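
The idempotency-key discipline from the list above can be illustrated with a small sketch. It assumes the key is a hash of the tool name, its arguments, and a caller-chosen `scope` such as a task and step identifier, so that a retry of the same step maps to the same key while a deliberate repeat in a later step does not. `idempotency_key`, `EmailServer`, and `scope` are hypothetical names, and the in-memory class only stands in for a real backend that dedupes server-side.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def idempotency_key(tool: str, args: Dict[str, Any], scope: str) -> str:
    """Derive a stable key from the intended effect, not from the attempt."""
    canonical = json.dumps(
        {"scope": scope, "tool": tool, "args": args},
        sort_keys=True,          # argument order must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EmailServer:
    """In-memory stand-in for a tool backend that dedupes by key."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []
        self._results: Dict[str, Dict[str, Any]] = {}

    def send(self, key: str, to: str, body: str) -> Dict[str, Any]:
        if key in self._results:
            # Replay the stored result; the email is not sent a second time.
            return self._results[key]
        self.sent.append((to, body))
        result = {"status": "sent", "message_id": len(self.sent)}
        self._results[key] = result
        return result


server = EmailServer()
args = {"to": "a@example.com", "body": "hi"}
key = idempotency_key("send_email", args, scope="task-1/step-3")
first = server.send(key, **args)
retry = server.send(key, **args)   # harness retry after a lost response
```

In this sketch the retry returns the stored result of the first call, which is also what keeps the tool result usable as ground truth: the agent sees the same success record whether or not the first response was lost in transit.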

STEP 5

The cross-cutting takeaway.

Error recovery is not a pattern you choose; it is the layer that determines whether the pattern you chose works in production. ReAct's loop, plan-and-execute's replanning, reflection's verifier, the router's fallback — every one degrades into thrash, silent corruption, or a hung agent without a disciplined error taxonomy, deterministic low-layer recovery, engineered error messages, and idempotent side effects. The honest tradeoff is blunt: this layer is unglamorous and absent from every architecture diagram, and it is where the majority of production agent reliability is actually won or lost.