4.4
Part IV / Specialize · The pattern with the worst hype-to-reality ratio

Multi-agent systems: when to decompose, and when not.

"Multi-agent" is the agent topic with the highest hype-to-reality ratio. The marketing picture — agents collaborating like a team, dynamically dispatching tasks, debating each other to better answers — is mostly fiction. The engineering reality is narrower: a small set of decomposition patterns work in production, the rest don't, and applying any of them costs roughly 10–20× a single-agent baseline. This chapter teaches when multi-agent is the right answer, the three patterns that actually ship at scale, the coordination problems that sink naive implementations, and how to evaluate a multi-agent system honestly. By the end you'll know when to reach for it and when to refuse — and have the engineering vocabulary to defend either choice.

STEP 1

When multi-agent makes sense — and when it doesn't.

The default position on multi-agent should be skepticism. Most "multi-agent" architectures you see in demos and tutorials would work equally well, faster, and at a tenth the cost as a single-agent loop with well-designed tools. The honest question to ask before reaching for multi-agent: what specifically am I gaining that I couldn't get from a single agent with better tools?

The answer is a real one in some cases. The four properties below describe the situations where multi-agent decomposition genuinely earns its cost. If your problem doesn't have at least one of them strongly, you probably don't need multi-agent.

Property 1: parallelizable independent work

The strongest justification. If a task decomposes into N sub-tasks that can run independently (no sub-task depends on another's output), then multi-agent lets you run them concurrently. Wall-clock time drops from the sum of sub-task durations to the max of them. For a research query with 5 independent investigation threads of about 5 minutes each, this is the difference between 25 minutes and 5 minutes.

The test: can you describe the sub-tasks before any work happens? If yes (orchestrator can plan upfront), multi-agent parallelism is genuinely available. If no (the next sub-task depends on what the previous one found), you're describing a sequential agent, not a parallel one — multi-agent gives you no parallelism, only overhead.
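A minimal sketch of the sum-versus-max effect, using asyncio.sleep as a stand-in for real worker runs. The function names and the scaled-down durations are illustrative assumptions, not part of any particular framework:

```python
import asyncio
import time


async def sub_task(name: str, seconds: float) -> str:
    # Stand-in for a worker; a real one would call a model and tools.
    await asyncio.sleep(seconds)
    return name


async def run_sequential(durations: list[float]) -> float:
    start = time.perf_counter()
    for i, d in enumerate(durations):
        await sub_task(f"task-{i}", d)
    return time.perf_counter() - start


async def run_parallel(durations: list[float]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(sub_task(f"task-{i}", d) for i, d in enumerate(durations)))
    return time.perf_counter() - start


durations = [0.05] * 5  # five independent sub-tasks, scaled down from minutes
sequential_s = asyncio.run(run_sequential(durations))  # roughly sum(durations)
parallel_s = asyncio.run(run_parallel(durations))      # roughly max(durations)
print(f"sequential ~{sequential_s:.2f}s, parallel ~{parallel_s:.2f}s")
```

The gain only exists because every duration is known to be independent of the others; if task 2 needed task 1's result, the gather call could not be written at all.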

This is the property behind the research-agent pattern in chapter 4.3. It's why parallel research works and why "agents debating to improve an answer" usually doesn't — the latter is fundamentally sequential, with each "round" depending on the previous.

Property 2: role specialization at the prompt level

Some tasks involve cognitive operations different enough that one prompt can't be optimized for both. A planner needs different reasoning patterns than a synthesizer; a code-reviewer needs different criteria than a code-writer; a moderator needs different judgment than a creative writer. Trying to encode all of these in one system prompt produces a mediocre prompt that does each badly.

The fix: separate prompts for separate roles, with each role's prompt optimized for its specific cognitive task. This is multi-agent in the form of "multiple specialized prompts called in sequence or in coordination," even when those prompts are running on the same model and inside one workflow.

The test: are the cognitive demands of your sub-tasks so different that one prompt can't serve both well? If a single prompt can do "find sources and synthesize them," it's not a multi-agent case. If "deeply explore one angle" and "reconcile findings across angles" need fundamentally different instructions, that's a real role split.

Property 3: context isolation

From chapter 4.3's research-agent discussion: as a single agent's context grows with accumulated tool results, performance degrades. Lost-in-the-middle effects kick in. The agent forgets early findings. Synthesis quality drops.

Multi-agent lets each sub-agent maintain a context window dedicated to its own narrow task. The lead/orchestrator only sees the compressed summaries from each, not the raw context that produced them. This is the architecture's most subtle benefit — and often the one that justifies the cost when parallelism alone wouldn't.

The test: does your task involve enough raw input that a single agent would suffer context degradation? Research with many sources, code review across many files, document analysis across many pages — all are context-bound. Pure reasoning tasks with small inputs aren't.

Property 4: fault containment

A single agent that gets stuck in a bad reasoning chain — wrong assumption, hallucinated tool result, infinite loop — corrupts its own context and the entire task. Recovery requires either restarting the whole task or surgical context editing, neither of which is great.

Multi-agent contains faults to single sub-agents. If subagent #3 spirals into an unproductive loop, the orchestrator can detect this (the sub-agent's findings are missing or incoherent), retry it from a clean context, or substitute a different approach. The other sub-agents' work is unaffected. The task as a whole survives the local failure.

The test: is partial failure recoverable in your task? If you can ship the answer even when one sub-task fails (with a noted gap), multi-agent gives you graceful degradation. If every sub-task is essential, fault containment is less valuable.

The four-property test, summarized

┌─────────────────────────────────────────────────────────────┐
│  SHOULD YOU GO MULTI-AGENT?                                 │
│                                                             │
│  How many of these are TRUE for your problem?               │
│                                                             │
│  □ Parallelizable: sub-tasks are independent, knowable      │
│    upfront                                                  │
│                                                             │
│  □ Role-specialized: sub-tasks need genuinely different     │
│    prompts                                                  │
│                                                             │
│  □ Context-bound: single agent would hit context limits     │
│    or degradation                                           │
│                                                             │
│  □ Partial-failure-ok: graceful degradation is valuable     │
│                                                             │
│  Score:                                                     │
│    3-4 properties → multi-agent is likely right             │
│    2 properties   → consider it, but try single-agent first │
│    0-1 properties → almost certainly use single-agent       │
│                                                             │
│  The 15× cost multiplier needs ≥3 of these to pay back.     │
└─────────────────────────────────────────────────────────────┘

This is a deliberately conservative test. The default failure mode in this space is over-engineering — building multi-agent systems for problems a well-designed single-agent loop would handle, then spending months on coordination logic. The cost shows up later, when you're paying 15× per query for a benefit you couldn't articulate.
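The checklist can be written down as a small function, which is sometimes useful in design reviews. This is only a sketch: the ProblemProfile type and recommend function are hypothetical names, and the thresholds are copied from the scoring rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemProfile:
    parallelizable: bool      # sub-tasks independent and knowable upfront
    role_specialized: bool    # sub-tasks need genuinely different prompts
    context_bound: bool       # a single agent would hit context degradation
    partial_failure_ok: bool  # graceful degradation is valuable


def recommend(profile: ProblemProfile) -> str:
    score = sum([
        profile.parallelizable,
        profile.role_specialized,
        profile.context_bound,
        profile.partial_failure_ok,
    ])
    if score >= 3:
        return "multi-agent is likely right"
    if score == 2:
        return "consider it, but try single-agent first"
    return "almost certainly use single-agent"


chatbot = ProblemProfile(False, False, False, True)
research = ProblemProfile(True, True, True, True)
print(recommend(chatbot))
print(recommend(research))
```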

The patterns that look multi-agent but aren't

Three patterns are commonly described as "multi-agent" that are not multi-agent in any meaningful engineering sense — and treating them as such adds complexity without benefit.

An agent that calls multiple LLMs. If your agent's loop dispatches a Sonnet call for some steps and a Haiku call for others (chapter 2.2's model cascade), that's not multi-agent. That's one agent with model-routing per step. The agent has a single coherent state, single conversation history, single loop. No coordination problem exists.

An agent with many tools. An agent with web search, database access, code execution, file ops, and email — even 20 tools — is still a single agent. Tools aren't agents. Each tool call is structured and returns to the same agent's loop. The agent doesn't coordinate with itself.

A pipeline of LLM calls. A workflow that does "summarize this doc, then translate the summary, then post it to Slack" via three sequential LLM calls is a pipeline, not multi-agent. There's no shared task, no coordination, no decomposition. Each call is its own unit, processed in order.

Genuine multi-agent has agents with their own independent loops that interact via structured communication. Multiple model calls aren't enough; multiple tools aren't enough; sequential workflows aren't enough. The communication-between-loops is what makes it multi-agent, and that's where the coordination cost lives.

Question
What about "agents debating to reach a better answer"? That sounds like a clear multi-agent win.

It's the most-cited pattern in research papers and one of the least-effective in production. The mechanism: spawn two agents, have them argue or critique each other's answers, take the consensus. The hope: errors get caught, reasoning improves.

What actually happens, most of the time: the agents converge on whatever was said most confidently first (especially when both agents are the same underlying model — they have correlated reasoning, so they "agree" with each other for the same reasons they'd have produced the same wrong answer alone). The result: 2× cost for marginal quality improvement, often within noise.

Where debate does work: when the agents are different models (Claude debating GPT, with their genuinely different training distributions), and when the question has objective verifiable structure (math, code with tests). The papers showing dramatic debate benefits are usually one of these. Generalizing to "any task" doesn't replicate.

Question
What about "agent X dynamically deciding to spawn agent Y based on the situation"? That feels powerful.

It's powerful in concept and brittle in practice. The failure mode: agent X has poor judgment about when to spawn agent Y. It spawns too often (cost balloons), too rarely (misses opportunities), or with bad task specs (Y produces useless work).

The way this pattern works in production — the chapter 4.3 architecture — is to constrain "dynamic spawning" tightly. The orchestrator decides the decomposition once at planning time, based on the user's query. It doesn't dynamically spawn during execution. This shifts the decision from a hard runtime judgment ("should I create a new agent?") to a one-time planning decision ("what are the sub-questions for this query?"), which is much more reliable.

The unrestricted version — agents spawning other agents during execution, recursively, based on their own judgment — is a footgun in current capabilities. It's been demoed; it doesn't ship reliably.

Question
How do I decide between "more tools" and "more agents"?

Default to "more tools" until the four-property test fires. Tools are cheaper, simpler, easier to evaluate, and the agent's reasoning stays in one coherent context. You only reach for "more agents" when one of the four properties strongly applies — and even then, the right shape is usually one orchestrator plus a small number of specialized workers, not a swarm.

Heuristic: if you can describe your additional capability as "the agent should be able to do X," that's a tool. If you can describe it as "the agent should be able to delegate Y to a separate context that handles it independently and returns a summary," that's a subagent. Most "I need another agent for this" intuitions turn out to be "I need a better tool for this."

STEP 2

The three patterns that work in production.

Engineering practice has converged on three multi-agent patterns that ship reliably. Most production multi-agent systems are one of these, occasionally with two combined. The patterns are orchestrator-worker, pipeline, and specialized peer routing. Each has a specific shape, a specific use case, and specific failure modes — and most of the exotic patterns from the research literature are variations on these or genuinely don't generalize.

Pattern 1: orchestrator-worker

The pattern from chapter 4.3, generalized beyond research. One coordinating agent (orchestrator) decomposes the task, spawns N parallel workers each handling one sub-task, and synthesizes their structured outputs into the final result.

┌─────────────────────────────────────────────────────────────┐
│  PATTERN 1: ORCHESTRATOR-WORKER                             │
│                                                             │
│                     ┌───────────────┐                       │
│            task →   │ ORCHESTRATOR  │                       │
│                     │ (plans + sync)│                       │
│                     └───────┬───────┘                       │
│                             │                               │
│             ┌───────────────┼───────────────┐               │
│             ▼               ▼               ▼               │
│          ┌─────┐         ┌─────┐         ┌─────┐            │
│          │ W₁  │         │ W₂  │         │ W₃  │ (parallel) │
│          └──┬──┘         └──┬──┘         └──┬──┘            │
│             │               │               │               │
│             └───────────────┼───────────────┘               │
│                             ▼                               │
│                     ┌───────────────┐                       │
│                     │    result     │                       │
│                     └───────────────┘                       │
│                                                             │
│  Best for: tasks decomposable into parallel sub-tasks       │
│  Examples: research, multi-document review, audit-style     │
│            tasks that fan out then converge                 │
└─────────────────────────────────────────────────────────────┘

The structure is simple but it's the workhorse pattern. Anthropic's Research feature uses it. OpenAI's Deep Research uses it. Most "AI does work in parallel" products are this shape under the hood. What makes it work in practice:

  • The orchestrator plans upfront. All workers are dispatched after one planning step. The orchestrator doesn't make spawning decisions during execution.
  • Workers share no state with each other. Worker #2 doesn't know about Worker #1's existence, much less its findings. This isolation is the whole point: it's what gives you context isolation and fault containment.
  • Structured worker outputs. Each worker returns a typed object (per the chapter 4.3 COMPRESSED_FINDING schema), not free-form chat. The orchestrator can reliably parse and synthesize.
  • Sized to the task. Effort scaling: 2 workers for simple comparisons, 10+ for complex problems. The orchestrator decides at planning time.
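The four properties above can be sketched in a few lines. Everything here is a stand-in: plan, worker, and synthesize are hypothetical functions that a real system would back with model calls, and Finding is a simplified placeholder for whatever typed schema you use.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    sub_task: str
    summary: str


def plan(task: str) -> list[str]:
    # Hypothetical planner: a real orchestrator would ask a model for sub-tasks.
    return [f"{task}: part {i}" for i in range(1, 4)]


async def worker(sub_task: str) -> Finding:
    # Each worker sees only its own sub-task, never its peers' context.
    await asyncio.sleep(0)  # placeholder for model and tool calls
    return Finding(sub_task=sub_task, summary=f"summary of {sub_task}")


def synthesize(findings: list[Finding]) -> str:
    return "\n".join(f"- {f.summary}" for f in findings)


async def orchestrate(task: str) -> str:
    sub_tasks = plan(task)  # one planning step, before any worker runs
    findings = await asyncio.gather(*(worker(s) for s in sub_tasks))
    return synthesize(list(findings))


report = asyncio.run(orchestrate("audit repo"))
print(report)
```

Note that the number of workers is fixed by plan before gather is called; nothing inside a worker can spawn more work.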

Where this pattern fits beyond research: legal-document review (each section reviewed by a worker), code audit (each file by a worker), financial analysis (each thesis-line by a worker), customer-feedback synthesis (each cluster by a worker). The common shape: a body of work that splits naturally into independent pieces, each piece can be analyzed alone, and the value comes from synthesizing across them.

Pattern 2: pipeline (sequential specialization)

A different shape: each agent handles one stage of a fixed workflow, and the output of stage N is the input of stage N+1. No parallelism; the value is role specialization across sequential phases.

┌─────────────────────────────────────────────────────────────┐
│  PATTERN 2: PIPELINE                                        │
│                                                             │
│  input → Agent A → Agent B → Agent C → output               │
│          (extract) (analyze) (format)                       │
│                                                             │
│  Each agent:                                                │
│    • Has its own optimized prompt for its phase             │
│    • Receives structured input from the previous            │
│    • Emits structured output for the next                   │
│    • Knows nothing about earlier/later phases               │
│                                                             │
│  Best for: tasks with clearly distinct phases that each     │
│            need their own prompt-engineering                │
│  Examples: document → extract entities → classify by type   │
│            → format as report                               │
└─────────────────────────────────────────────────────────────┘

The pipeline pattern wins when the phases of work are genuinely distinct cognitive operations and each benefits from a focused prompt. Classic example: a content moderation pipeline that does (1) extract the claim being made → (2) categorize the claim type → (3) verify against policy → (4) format the moderation decision. Trying to do all four in one prompt produces a long, complex prompt that scores worse on each individual phase.

What's important to recognize: a pipeline isn't the same as a multi-step single-agent loop. The difference is that pipeline stages are independent processes — they don't share conversation state, don't iterate together, don't loop back. Each stage is essentially a structured-input-to-structured-output transformation.
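As a rough illustration of "structured input to structured output," here is a three-stage pipeline where each stage only sees the typed object produced by the previous one. The stage bodies are toy rules standing in for model calls, and all type and function names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extracted:
    entities: list[str]


@dataclass(frozen=True)
class Classified:
    by_type: dict[str, list[str]]


def extract(document: str) -> Extracted:
    # Stage A stand-in: treat capitalized words as entities.
    words = [w.strip(".,") for w in document.split()]
    return Extracted(entities=[w for w in words if w[:1].isupper()])


def classify(stage_input: Extracted) -> Classified:
    # Stage B stand-in: a toy rule instead of a model call.
    by_type: dict[str, list[str]] = {"org": [], "other": []}
    for entity in stage_input.entities:
        by_type["org" if entity.endswith("Corp") else "other"].append(entity)
    return Classified(by_type=by_type)


def format_report(stage_input: Classified) -> str:
    # Stage C: formatting only; it never sees the original document.
    return "; ".join(
        f"{kind}: {', '.join(names) or 'none'}"
        for kind, names in stage_input.by_type.items()
    )


def run_pipeline(document: str) -> str:
    return format_report(classify(extract(document)))


report = run_pipeline("AcmeCorp hired Dana.")
print(report)
```

The type signatures make the coupling explicit: classify cannot reach back into the document, which is both the isolation benefit and the inflexibility the pattern trades for it.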

Anthropic's Claude Code internally uses this for some workflows: a "planning" stage produces a structured plan, the "execution" stage consumes that plan and runs against it, the "verification" stage checks the results. Each stage has its own model and prompt; they're loosely coupled via the structured intermediate outputs.

The trade-off: pipelines lose the flexibility of an agent loop. If stage B realizes that stage A's extraction was wrong, it can't go back — it has to either work with the wrong extraction or fail. Pipelines fit when the phases are well-understood and stable; they don't fit when the work has to iterate.

Pattern 3: specialized peer routing

A small set of agents with distinct roles, and a routing layer that decides which agent handles which request. The agents themselves don't communicate directly; the router does the dispatch.

┌─────────────────────────────────────────────────────────────┐
│  PATTERN 3: SPECIALIZED PEER ROUTING                        │
│                                                             │
│                      ┌──────────────┐                       │
│           request →  │    ROUTER    │                       │
│                      │ (classifies) │                       │
│                      └──────┬───────┘                       │
│                             │                               │
│           ┌─────────────────┼─────────────────┐             │
│           ▼                 ▼                 ▼             │
│       ┌────────┐        ┌────────┐        ┌────────┐        │
│       │Billing │        │Tech    │        │Account │        │
│       │ agent  │        │support │        │mgmt    │        │
│       └────────┘        └────────┘        └────────┘        │
│                 (each tuned for one domain)                 │
│                                                             │
│  Best for: support systems with distinct request types      │
│            requiring different tools/prompts                │
│  Examples: customer support triage, multi-domain assistant  │
└─────────────────────────────────────────────────────────────┘

This is the support-agent pattern most enterprise deployments converge on. A customer asks a question; a routing model classifies it (billing question? technical support? account management?); the right specialized agent handles it. Each specialized agent has its own tools, system prompt, and domain knowledge.

The win: each agent can be deeply tuned for its domain — the billing agent has access to billing-system APIs and a system prompt with billing-specific rules; the tech-support agent has docs and runbooks for the product. Trying to make one agent good at all of these produces a generalist that's mediocre at each.

Two important variants:

Pure routing: the router picks one agent and that agent handles the entire conversation. Simple, clear escalation paths. Limitation: if the conversation drifts into another domain mid-way, you need explicit re-routing.

Routing with handoff: a specialized agent can hand off to a peer if the question turns out to be outside its domain. More flexible, but the handoff protocol becomes a coordination problem. Implementation: the agent emits a special "handoff" tool call that the router interprets to redirect.
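A sketch of routing with handoff, under the assumption that each agent returns either a reply or a typed handoff signal. The agents and the keyword classifier are toy stand-ins for model-backed components, and the handoff cap is an added safeguard rather than something the pattern requires.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Reply:
    agent: str
    text: str


@dataclass
class Handoff:
    target: str  # name of the peer that should take over


AgentResult = Union[Reply, Handoff]


def billing_agent(message: str) -> AgentResult:
    if "password" in message.lower():
        return Handoff(target="account")  # outside billing's domain
    return Reply(agent="billing", text="Handled by billing.")


def account_agent(message: str) -> AgentResult:
    return Reply(agent="account", text="Handled by account management.")


AGENTS: dict[str, Callable[[str], AgentResult]] = {
    "billing": billing_agent,
    "account": account_agent,
}


def classify(message: str) -> str:
    # Stand-in for a routing model: a keyword rule.
    return "billing" if "invoice" in message.lower() else "account"


def route(message: str, max_handoffs: int = 2) -> Reply:
    target = classify(message)
    for _ in range(max_handoffs + 1):
        result = AGENTS[target](message)
        if isinstance(result, Reply):
            return result
        target = result.target  # the router, not the agent, performs the redirect
    raise RuntimeError("handoff limit exceeded")
```

Agents never call each other directly; a handoff is just data the router acts on, which keeps the coordination logic in one place.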

Picking the right pattern

The decision often comes down to the shape of your task:

| Task shape | Right pattern | Why |
|---|---|---|
| Fan out, then synthesize across many pieces | Orchestrator-worker | Parallelism + context isolation + structured synthesis |
| Sequential phases, each different cognitively | Pipeline | Role specialization, structured intermediate outputs |
| Many distinct request types from users | Specialized peer routing | Each domain agent stays focused and tunable |
| Single ongoing conversation, iterative | Single agent (not multi-agent) | Multi-agent overhead doesn't pay for iterative work |
| Code generation, then verification | Single agent (chapter 4.1) | The "verification" is tool use, not another agent |

Combinations: when patterns layer

Production systems sometimes layer patterns. The most common combination: specialized peer routing at the top, with each specialized agent itself using orchestrator-worker for complex tasks within its domain. A customer-support deployment might route a "complex billing dispute" to the billing agent, which then spawns workers to investigate different aspects of the dispute in parallel.

The general principle: layering is fine when each layer's pattern fits the problem at that level. Don't layer patterns just to be sophisticated; layer them when the additional structure genuinely matches the work.

The anti-pattern: layering patterns to fit a framework you've adopted, regardless of whether the layering helps. A common failure: someone adopts CrewAI or LangGraph, builds a 7-agent system where one would have worked, and ships a system that's 10× more expensive and harder to evaluate than the simple version.

Question
What about the "manager-worker-evaluator" pattern, where an evaluator agent grades the worker's output and feeds back?

This is real and works in specific cases — it's essentially the LLM-as-judge pattern (chapter 3.3) embedded in the agent loop. The worker produces a candidate output, the evaluator scores it, and either the worker iterates or the output ships.

Where it works: tasks with clear quality criteria the evaluator can apply (code with tests, structured output matching a schema, document containing required sections). The evaluator's prompt is focused on grading, the worker's on producing.

Where it breaks: when the evaluator's standards drift or the worker can't act on the evaluator's feedback (the worker rewrites everything from scratch each iteration instead of revising). Both are common failure modes in production. If you adopt this pattern, instrument it carefully — and have a clear iteration cap so you don't loop forever.
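A minimal sketch of a worker-evaluator loop with an iteration cap. The refine helper and the toy worker and evaluator are hypothetical; the point is that the worker receives its previous draft plus feedback (so it can revise rather than restart) and that the loop cannot run forever.

```python
from typing import Callable, Optional

REQUIRED_SECTIONS = ["Summary", "Risks"]


def refine(
    produce: Callable[[Optional[str], Optional[str]], str],
    evaluate: Callable[[str], Optional[str]],
    max_iterations: int = 3,
) -> tuple[str, bool]:
    """Return (draft, accepted). evaluate returns None to accept, else feedback."""
    draft: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        draft = produce(draft, feedback)  # worker revises the previous draft
        feedback = evaluate(draft)
        if feedback is None:
            return draft, True
    return draft or "", False  # cap reached: ship with a flag or escalate


def toy_worker(previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None or feedback is None:
        return "Summary: ..."
    missing = feedback.removeprefix("missing section: ")
    return f"{previous}\n{missing}: ..."


def toy_evaluator(draft: str) -> Optional[str]:
    for section in REQUIRED_SECTIONS:
        if f"{section}:" not in draft:
            return f"missing section: {section}"
    return None


draft, accepted = refine(toy_worker, toy_evaluator)
```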

Question
Are frameworks like LangGraph, CrewAI, AutoGen worth using for multi-agent?

Sometimes. They give you primitives (agent definitions, message passing, state graphs) that match the patterns above. Used well, they save you from reinventing coordination plumbing.

Used poorly, they encourage over-engineering — making it easy to spin up many agents tempts you to use many agents whether the problem needs them or not. The frameworks themselves aren't the issue; the encouragement to "make it more agentic" is.

The pragmatic test: if you can articulate which of the three patterns above you're implementing and why, the framework is a good fit. If you're using the framework's "agent network" or "swarm" features because they sound cool, you're likely over-engineering. Most production multi-agent systems are simple enough that you don't strictly need a framework — but the frameworks are convenient and not harmful when used with discipline.

Question
How do agents in these patterns actually communicate? Function calls? Message queues? Shared state?

Three options, with different trade-offs:

  • Direct function calls / structured returns. The orchestrator-worker pattern usually uses this. The orchestrator calls await run_subagent(...) and the subagent's structured output is the return value. Simplest, in-process, no infrastructure required.
  • Message queues (Redis streams, SQS). Useful when sub-agents are independent processes that might be on different machines. Adds operational complexity. Use this when you need to scale sub-agents independently of the orchestrator, or when sub-agent runs are long enough to need durability.
  • Shared state (database, key-value store). Some patterns benefit from sub-agents reading shared state. The pipeline pattern uses this implicitly (the output of stage A is read by stage B). Other multi-agent patterns mostly avoid shared mutable state because it creates coordination headaches.

Default: structured function calls. Reach for queues or shared state only when the architecture genuinely needs them.

STEP 3

Coordination: the hard part.

Multi-agent's reputation as fragile-in-production isn't because the patterns are bad — they're solid. The fragility lives in coordination: how agents pass work to each other, how failures are handled, how partial successes are reconciled. Coordination problems sink more multi-agent systems than agent-quality problems do. This step covers the patterns that hold up.

Task specification: the highest-leverage discipline

Chapter 4.3 named this in the research-agent context; it generalizes. The most important determinant of multi-agent quality is how well the orchestrator describes each sub-task to the worker. Sloppy task specs produce sloppy work; tight task specs produce tight work.

A good task spec contains five elements:

| Element | What it specifies | Why it matters |
|---|---|---|
| Objective | The single concrete question this worker answers | Without this, workers wander or duplicate each other |
| Scope boundaries | What's in-scope and what's NOT | Prevents scope drift; flags genuine ambiguity |
| Tool grants | Which tools this worker may use | Narrow toolsets focus the worker; broad toolsets confuse it |
| Budget | Maximum tool calls or token spend | Hard cap prevents runaway sub-agents |
| Output schema | Exact structure of the expected return | Makes synthesis reliable; prevents free-form mush |

An orchestrator that emits task specs with all five elements gets dramatically better worker output than one that emits "investigate X." This is exactly the discipline chapter 4.3 covered for research; it applies to every orchestrator-worker setup.

Concretely in code:

# Bad: vague task spec
await run_worker("Investigate Acme's recent product launches")

# Good: complete task spec
await run_worker(
    objective="Identify product launches by Acme Corp in the last 6 months "
              "with launch date, product name, and one-sentence positioning",
    scope_in=["products launched publicly",
              "feature releases announced as 'new product'"],
    scope_out=["feature updates to existing products",
               "acquisitions",
               "hiring announcements"],
    tools=["web_search", "web_fetch"],
    budget_tool_calls=10,
    output_schema=PRODUCT_LAUNCH_SCHEMA,   # typed object
)

The good version is much more code at the call site, but the orchestrator has to do that work somewhere, and structured task-spec code is a better place for it than an ad-hoc prompt string. The structured spec is also easier to log, audit, and reuse.
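One way to get the logging and reuse benefits is to make the spec a typed object. The sketch below is an assumption about how that could look: the TaskSpec class, its validation rules, and the schema dictionary are all hypothetical, with field names mirroring the keyword arguments in the call above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskSpec:
    objective: str
    scope_in: list[str]
    scope_out: list[str]
    tools: list[str]
    budget_tool_calls: int
    output_schema: dict

    def __post_init__(self) -> None:
        # Reject specs that are missing one of the five elements.
        if not self.objective.strip():
            raise ValueError("objective must be a concrete question")
        if not self.tools:
            raise ValueError("grant at least one tool")
        if self.budget_tool_calls <= 0:
            raise ValueError("budget must be a positive hard cap")


# Hypothetical schema, shown only to make the example self-contained.
PRODUCT_LAUNCH_SCHEMA = {"launches": [{"date": "str", "name": "str", "positioning": "str"}]}

spec = TaskSpec(
    objective="Identify product launches by Acme Corp in the last 6 months",
    scope_in=["products launched publicly"],
    scope_out=["acquisitions", "hiring announcements"],
    tools=["web_search", "web_fetch"],
    budget_tool_calls=10,
    output_schema=PRODUCT_LAUNCH_SCHEMA,
)

logged = json.dumps(asdict(spec))  # one line per dispatched worker in the audit log
```

A call site could then pass the fields along with something like run_worker(**asdict(spec)), assuming run_worker accepts those keyword arguments.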

Failure handling: every worker can fail

The naive multi-agent implementation assumes workers succeed. They don't. A worker can fail in several distinct ways, and a robust system handles each.

Hard failure: the worker errors out. Network issue, API rate limit, exception in the loop. The orchestrator's await raises. Easiest to handle: catch, log, decide whether to retry or skip. A worker that hard-fails on a transient error gets one retry; one that hits a persistent error is logged and skipped (with the sub-question noted as unanswered in the final output).

Soft failure: the worker returns, but its output is bad. The structured output validates against the schema but the content is wrong — empty findings, off-topic results, hallucinated sources. Harder to handle because the orchestrator can't tell from the return value alone. Detection patterns: schema-level checks ("at least N claims required"), content-level checks (an LLM judge inspects the worker's output for "did this actually answer the sub-question?"), source-validity checks (sources must resolve to real URLs).

Timeout: the worker never returns. Stuck in a loop, or just slow. The orchestrator's wait blocks. Mitigation: each worker call has a wall-clock timeout (typically 2–5× the budgeted runtime). On timeout, the worker is canceled and treated as a hard failure.

Partial success: the worker returns a degraded but usable output. "Only investigated 2 of the 3 sources I planned." "Found 1 of the expected 5 claims." The orchestrator includes this in synthesis with a note about the limitation, rather than treating it as either complete success or total failure.

# Failure-aware worker invocation
# (run_worker and validate_worker_output are assumed to be defined elsewhere)
import asyncio

async def run_worker_safe(task_spec, timeout_s=300):
    try:
        # wait_for cancels the worker if it exceeds the wall-clock timeout
        result = await asyncio.wait_for(
            run_worker(**task_spec),
            timeout=timeout_s,
        )
        # Soft-failure check: did the worker actually do useful work?
        if not validate_worker_output(result, task_spec):
            return {"status": "soft_fail",
                    "reason": "output did not meet quality bar",
                    "partial": result}
        return {"status": "success", "result": result}
    except asyncio.TimeoutError:
        return {"status": "timeout", "task_spec": task_spec}
    except Exception as e:
        return {"status": "error", "error": str(e),
                "task_spec": task_spec}

async def gather_worker_results(task_specs):
    # Orchestrator gathers all worker results, including failures
    worker_results = await asyncio.gather(*[
        run_worker_safe(spec) for spec in task_specs
    ])

    successes  = [w["result"] for w in worker_results if w["status"] == "success"]
    soft_fails = [w for w in worker_results if w["status"] == "soft_fail"]
    hard_fails = [w for w in worker_results if w["status"] in {"timeout", "error"}]
    return successes, soft_fails, hard_fails

The synthesis step then handles each category appropriately — include successes, mention soft failures as partial information, note hard failures as gaps. The final output is honest about what worked and what didn't.
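The validate_worker_output call above is left undefined. A minimal sketch of what it might check, assuming (hypothetically) that a worker result is a dict with a "claims" list whose items carry "text" and "source_url", and that the task spec may carry a "min_claims" key:

```python
from urllib.parse import urlparse


def validate_worker_output(result: dict, task_spec: dict) -> bool:
    # Schema-level check: enough claims to count as an answer.
    min_claims = task_spec.get("min_claims", 1)  # hypothetical spec key
    claims = result.get("claims", [])
    if len(claims) < min_claims:
        return False
    for claim in claims:
        if not claim.get("text", "").strip():
            return False
        # Source check: well-formed http(s) URL. This does not fetch the URL,
        # so it cannot prove the source actually resolves.
        parsed = urlparse(claim.get("source_url", ""))
        if parsed.scheme not in {"http", "https"} or not parsed.netloc:
            return False
    return True


good = {"claims": [{"text": "Acme launched X", "source_url": "https://example.com/x"}]}
empty = {"claims": []}
bad_source = {"claims": [{"text": "Acme launched X", "source_url": "not a url"}]}
```

Content-level checks (an LLM judge asking whether the sub-question was actually answered) would sit on top of this; they are omitted here because they depend on your model client.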

The "structured output for handoffs" principle

The most common multi-agent failure mode: free-form chat being passed between agents. Agent A writes "I found that the customer's last payment was 2024-03-15 and the amount was $1,200." Agent B has to parse this from natural language. Sometimes the parsing works; sometimes it doesn't; the inconsistency is invisible until it bites you.

The fix: every inter-agent handoff goes through a typed schema, not free-form text. Agent A doesn't write a sentence about the payment — it emits a structured object: {"last_payment_date": "2024-03-15", "last_payment_amount_usd": 1200.00}. Agent B receives this object directly. No parsing required.

This is the equivalent of the citation-discipline argument from chapter 4.3: structural commitments prevent classes of errors that documentation can't. A schema enforces what a prompt-instruction can only request. The agents can still produce explanatory text — but the load-bearing communication is structured.
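A small sketch of the payment handoff as a typed object. PaymentInfo and its from_dict constructor are hypothetical names; the field names come from the example object above. The receiving agent gets a validated value or an immediate error, never a sentence to parse.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PaymentInfo:
    last_payment_date: date
    last_payment_amount_usd: float

    @classmethod
    def from_dict(cls, payload: dict) -> "PaymentInfo":
        # Raises KeyError or ValueError on a malformed handoff, so the failure
        # is visible at the boundary instead of downstream.
        amount = float(payload["last_payment_amount_usd"])
        if amount < 0:
            raise ValueError("payment amount cannot be negative")
        return cls(
            last_payment_date=date.fromisoformat(payload["last_payment_date"]),
            last_payment_amount_usd=amount,
        )


info = PaymentInfo.from_dict(
    {"last_payment_date": "2024-03-15", "last_payment_amount_usd": 1200.00}
)
```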

The token cost of coordination, made honest

The chapter 4.3 number (15× chat) is the baseline. Three operational realities make it potentially worse if not managed:

Each worker's context includes redundant material. Every worker needs to know the original task, the relevant sub-question, the tool docs, the output schema. This baseline context is duplicated across N workers. Mitigation: prompt caching (chapter 2.2) on the per-worker boilerplate cuts this substantially.

The orchestrator's synthesis context includes everyone's summaries. If 10 workers each return a 500-token finding, the orchestrator's synthesis call has 5,000 tokens of input on top of its own context. Mitigation: structured findings (not chat text) are more compact than prose summaries.

Failed/retried workers double-pay. If a worker times out and is retried, you paid for both attempts. Mitigation: identify the soft-failure pattern and abandon (don't retry) when retry is unlikely to help; track retry rates per worker type as a metric.

A well-engineered multi-agent system runs at 10–15× single-agent cost. A poorly engineered one runs at 30–50×. The difference is exactly these three coordination economies. They sound boring; they're where the cost savings live.
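The first two effects can be put into a back-of-the-envelope model. This is a deliberately simplified sketch: the function name, the token counts, and the 0.1 price factor for cached boilerplate are all illustrative assumptions, and real caching behavior and pricing depend on your provider.

```python
def estimate_input_tokens(
    n_workers: int,
    boilerplate_tokens: int,
    work_tokens_per_worker: int,
    finding_tokens: int,
    orchestrator_base_tokens: int,
    boilerplate_price_factor: float = 1.0,
) -> int:
    """Rough billed-input estimate, in full-price token equivalents."""
    # Boilerplate (task, tool docs, schema) is repeated in every worker.
    per_worker = boilerplate_tokens * boilerplate_price_factor + work_tokens_per_worker
    # Synthesis reads every worker's finding on top of its own context.
    synthesis = orchestrator_base_tokens + n_workers * finding_tokens
    return round(n_workers * per_worker + synthesis)


# 10 workers, each returning a 500-token finding, as in the example above.
uncached = estimate_input_tokens(10, 2_000, 8_000, 500, 1_000)
cached = estimate_input_tokens(10, 2_000, 8_000, 500, 1_000,
                               boilerplate_price_factor=0.1)
print(uncached, cached)
```

Even a crude model like this makes it obvious which term dominates for a given configuration, which is where optimization effort should go first.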

The patterns that look good in papers and break in production

In the interest of honesty, here are the multi-agent patterns that the literature is enthusiastic about and that production teams have largely abandoned:

Free-form agent debate. Multiple agents critique each other's outputs through chat. Sounds powerful; in practice produces ~5% quality improvement at 3-5× cost. Same-model debate is especially weak. Cross-model debate (Claude + GPT) sometimes works for verifiable tasks but most teams don't ship it.

Hierarchical agent trees. Manager spawns sub-managers, who spawn workers. Cost compounds exponentially with depth; coordination becomes intractable. Production systems max out at one level of delegation.

Open-ended agent societies. A "team" of agents with roles (CEO, engineer, marketer) that "collaborate" on tasks. Demos well, ships poorly. The collaboration is too unstructured; agents talk past each other, duplicate work, fail to converge.

Self-modifying agent systems. Agents that update their own prompts based on performance. Demoed; the loop doesn't converge. The agent's judgment about its own prompt isn't reliable enough to drive prompt improvement autonomously.

None of these are impossible — they may work in the future, with better models or with novel techniques. They don't ship reliably in 2026. If you're building a multi-agent system right now, stick to the three patterns from Step 2.

The framework that helps you distinguish "exciting multi-agent paper" from "production-shipping multi-agent system": ask whether the paper's success is measured on closed-domain benchmarks (where coordination overhead is amortized over many runs and tuning is possible) versus open-ended production traffic (where coordination overhead hits every query and tuning has to generalize). Many patterns that beat baselines on benchmarks don't survive open-ended production traffic.

Question
What's a reasonable retry strategy for failed workers?

Three rules:

  • Hard failures (network errors, timeouts): retry once with exponential backoff. If retry fails too, the worker is dead — note the gap and continue.
  • Soft failures (bad output): retry at most once, with a revised prompt that includes feedback ("your previous answer didn't include the required claims; please ensure..."). If the second attempt also soft-fails, escalate (different model, different scoping, or skip).
  • Don't retry indefinitely. Set a per-task retry budget (typically 2 retries max) and respect it. Infinite retry is how you turn a 5-minute multi-agent run into a 30-minute one without improving quality.

Log retry rates per worker type. High retry rates indicate a worker's prompt or task spec needs work, not that you need more retries.
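The three rules above can be sketched as one small wrapper. This is a minimal sketch, not a reference implementation: `HardFailure`, `call_worker`, and `validate` are hypothetical names standing in for whatever transport errors, model call, and output check your stack uses. The caps (one hard retry, one soft retry, two retries total) mirror the numbers in the text.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class HardFailure(Exception):
    """Hypothetical marker for transport-level failures (network error, timeout)."""


@dataclass
class WorkerResult:
    output: Optional[dict]  # None means "note the gap and continue"
    attempts: int
    status: str             # "ok", "hard_failure", or "soft_failure"


def run_with_retry_budget(
    call_worker: Callable[[str], dict],
    validate: Callable[[dict], Optional[str]],
    prompt: str,
    max_retries: int = 2,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> WorkerResult:
    hard_retries = soft_retries = 0
    current_prompt = prompt
    while True:
        attempts = hard_retries + soft_retries + 1
        budget_spent = attempts > max_retries
        try:
            output = call_worker(current_prompt)
        except HardFailure:
            # Hard failure: retry once after a pause, then give up on this worker.
            if hard_retries >= 1 or budget_spent:
                return WorkerResult(None, attempts, "hard_failure")
            sleep(backoff_s)
            hard_retries += 1
            continue
        problem = validate(output)  # None means the output is acceptable
        if problem is None:
            return WorkerResult(output, attempts, "ok")
        # Soft failure: retry at most once, feeding the problem back to the worker.
        if soft_retries >= 1 or budget_spent:
            return WorkerResult(None, attempts, "soft_failure")
        soft_retries += 1
        current_prompt = f"{prompt}\n\nYour previous answer was rejected: {problem}"
```

Returning a `status` with every result, instead of raising, keeps the orchestrator in control of escalation and makes the per-worker retry rate easy to log.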

Question
How do I debug a multi-agent system when something goes wrong?

Observability is non-negotiable (chapter 2.1). Every agent's full conversation, every tool call, every structured handoff, every retry — all spans, all traced. When something breaks in production, you'll be reading these traces to figure out which agent did what, in what order, with what inputs.

Specific things to instrument that single-agent systems don't need:

  • Per-worker latency and token cost (so you can spot the one expensive worker).
  • Worker output validation rate (how often workers produce schema-conforming output on the first try vs. needing a retry).
  • Synthesis-input size (the orchestrator's context just before synthesis; bloat here is a warning sign).
  • End-to-end latency P95 (parallelism should bound this by the slowest worker; if it's growing, something has gone serial).

Teams that ship multi-agent without proper observability spend their first month firefighting things they could have prevented with proper traces.
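As a rough illustration of the per-worker instrumentation listed above, the sketch below aggregates recorded spans by worker type. The `WorkerSpan` shape is an assumption made for this example; in practice these fields would come from whatever tracing backend you already use.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WorkerSpan:
    worker_type: str
    latency_s: float
    tokens: int
    valid_first_try: bool  # schema-conforming output without a retry


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; simple and adequate for dashboards."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_workers(spans: list[WorkerSpan]) -> dict[str, dict[str, float]]:
    by_type: dict[str, list[WorkerSpan]] = defaultdict(list)
    for span in spans:
        by_type[span.worker_type].append(span)
    summary = {}
    for worker_type, group in by_type.items():
        summary[worker_type] = {
            "latency_p95_s": nearest_rank_percentile([s.latency_s for s in group], 95),
            "mean_tokens": sum(s.tokens for s in group) / len(group),
            "first_try_validity": sum(s.valid_first_try for s in group) / len(group),
        }
    return summary
```

Grouping by worker type is the point: an average across all workers hides the one expensive or flaky worker the text warns about.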

Question
When sub-agents need to share state during execution (not just at the end), how do I do that without breaking context isolation?

Three options, in increasing order of intrusiveness:

  • Shared blackboard via tool. Add a post_finding tool that sub-agents can call to publish a finding to a shared blackboard, and a read_blackboard tool to consume what others have posted. Each sub-agent only sees the blackboard when it chooses to. Doesn't bloat the per-agent context until needed.
  • Orchestrator-mediated relay. Sub-agents send messages to the orchestrator (via structured output); the orchestrator decides what to relay to whom. More controlled, more code; useful when "who needs to know what" is policy-driven.
  • Direct sub-agent communication. Rarely a good idea. Hard to reason about; coordination problems multiply.

For most cases, the simpler design — sub-agents work in isolation and the orchestrator handles synthesis at the end — is enough. Reach for shared state only when sub-agents genuinely benefit from each other's findings mid-execution, which is less often than it sounds.
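For the first option, a shared blackboard exposed as two tools might look like the sketch below. The tool names `post_finding` and `read_blackboard` come from the text; the `Finding` fields, the `since` cursor, and the in-process lock are assumptions for illustration (a real deployment would more likely sit on a database or queue).

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    author: str
    topic: str
    text: str


class Blackboard:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._findings: list[Finding] = []

    def post_finding(self, author: str, topic: str, text: str) -> int:
        """Tool: publish a finding. Returns the new board size, usable as a cursor."""
        with self._lock:
            self._findings.append(Finding(author, topic, text))
            return len(self._findings)

    def read_blackboard(
        self, reader: str, topic: Optional[str] = None, since: int = 0
    ) -> list[Finding]:
        """Tool: fetch other agents' findings, optionally filtered by topic."""
        with self._lock:
            return [
                f
                for f in self._findings[since:]
                if f.author != reader and (topic is None or f.topic == topic)
            ]
```

Because reading is a tool call, nothing enters a sub-agent's context unless that sub-agent asks for it, which is what keeps this option the least intrusive of the three.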

STEP 4

Evaluating a multi-agent system.

A single-agent system has one set of metrics — task-level success, cost, latency. A multi-agent system has all those plus a layer of metrics specific to the multi-agent structure: how well does the orchestrator decompose? How well do workers complete their tasks? How well does synthesis combine them? Evaluating the system requires looking at each layer.

The two-layer eval structure

For any orchestrator-worker system, two distinct evaluation questions:

Layer 1: end-to-end quality. Did the final output meet user needs? This is the chapter 3.1 eval applied unchanged. From the user's perspective, all that matters is the output — they don't care how many agents were involved.

Layer 2: per-agent quality. How well did each agent do its job? This is multi-agent-specific. Three sub-questions: did the orchestrator decompose the task sensibly? Did each worker complete its assigned sub-task? Did the synthesis preserve and combine the workers' findings faithfully?

Why both layers: if your end-to-end quality is bad, you need the per-agent diagnostics to know where the failure originated. A poor output could be the orchestrator's planning failure, a worker's investigation failure, or the synthesis losing what the workers found. Without layer-2 metrics, you can't distinguish these — and you don't know which prompt to fix.

Orchestrator quality: did the decomposition make sense?

The first thing to grade. Given a user query, did the orchestrator's plan break it into sensible sub-questions? Three concrete checks:

Coverage. Do the sub-questions, taken together, address the user's full query? An LLM judge can grade this: given the user query and the orchestrator's plan, are there obvious aspects of the query that aren't covered by any sub-question?

Non-overlap. Do the sub-questions duplicate each other? (Chapter 4.3's named failure mode — two sub-agents investigating the same thing in parallel.) Judge: do these sub-questions describe distinct investigations?

Right-sizing. Is the number of sub-agents appropriate to the question complexity? Anthropic's documented rules (1 for fact-check / 2-4 for comparison / 10+ for complex research) make this measurable: take a labeled dataset of queries with their expected complexity tier, run the orchestrator, check whether the actual number of sub-agents matches.

These three metrics let you measure orchestrator quality without grading downstream worker output, which is essential — you want to be able to fix the orchestrator independent of fixing workers.
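The right-sizing check is the easiest of the three to automate without a judge. A minimal sketch, assuming a labeled dataset of `(tier, actual_subagent_count)` pairs and using the tier ranges quoted above; the tier names are made up for this example.

```python
from typing import Optional

# (min, max) sub-agents per complexity tier; None means "no upper bound".
TIER_RANGES: dict[str, tuple[int, Optional[int]]] = {
    "fact_check": (1, 1),
    "comparison": (2, 4),
    "complex_research": (10, None),
}


def is_right_sized(tier: str, subagent_count: int) -> bool:
    low, high = TIER_RANGES[tier]
    return subagent_count >= low and (high is None or subagent_count <= high)


def right_sizing_rate(runs: list[tuple[str, int]]) -> float:
    """Fraction of labeled runs where the orchestrator spawned a tier-appropriate count."""
    if not runs:
        raise ValueError("need at least one labeled run")
    return sum(is_right_sized(tier, count) for tier, count in runs) / len(runs)
```

Coverage and non-overlap still need an LLM judge (or a human), since they depend on what the sub-questions say rather than how many there are.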

Worker quality: did each worker complete its task?

For each worker invocation, two questions:

Schema-validity. Did the worker emit structured output matching the expected schema? This is binary — output either validates or doesn't. Track the validity rate as a per-worker-type metric. Drops indicate prompt issues with that worker type.

Task-completion. Given the task spec and the worker's output, did the worker actually answer its sub-question? An LLM judge grades this: read the task spec, read the output, is the output a substantive answer to the spec's objective?

The task-completion check is the one that catches "worker returned a schema-valid object full of nothing." A worker that hit its tool budget without finding answers might emit an empty findings array — schema-valid, task-incomplete.
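A cheap pre-filter can catch both the schema failures and the "valid but empty" case before spending judge tokens. The sketch below assumes a hypothetical worker schema of `{"findings": [{"claim": str, "source": str}]}`; substitute your own schema and validator.

```python
from typing import Any


def schema_errors(output: Any) -> list[str]:
    """Validate against the assumed schema; an empty list means schema-valid."""
    if not isinstance(output, dict) or not isinstance(output.get("findings"), list):
        return ["'findings' must be a list"]
    errors = []
    for i, finding in enumerate(output["findings"]):
        if not isinstance(finding, dict):
            errors.append(f"findings[{i}] must be an object")
            continue
        for key in ("claim", "source"):
            value = finding.get(key)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"findings[{i}].{key} must be a non-empty string")
    return errors


def triage_worker_output(output: Any) -> str:
    """Return 'schema_invalid', 'task_incomplete', or 'needs_judge'."""
    if schema_errors(output):
        return "schema_invalid"
    if not output["findings"]:
        # Schema-valid but empty: the worker answered nothing.
        return "task_incomplete"
    return "needs_judge"  # non-empty and valid: hand to the LLM judge
```

Only outputs that reach `needs_judge` require the more expensive task-completion grading.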

Synthesis quality: did the orchestrator faithfully combine?

The final step, and the one most likely to silently degrade. Chapter 4.3's citation-faithfulness check applies directly: every substantive claim in the synthesized output should trace to a worker's finding. The worker findings themselves are already source-attributed; the synthesis just needs to preserve the attribution.

Two specific failure modes to test for:

Confabulation in synthesis. The synthesizer adds claims that weren't in any worker's findings, drawing from training data without source. Detection: for each claim in the output, trace it to a worker finding; if no match, flag.

Dropped findings. The synthesizer omits relevant findings from workers (perhaps because the synthesizer's context was busy with other findings). Detection: count workers whose findings appear in the synthesis vs. workers whose findings are absent. If a worker's substantive findings are absent from the synthesis without justification, flag.
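Both detections reduce to set operations once claims carry attribution. The sketch below assumes, hypothetically, that every worker finding has an ID and that each synthesized claim lists the finding IDs it relies on. Matching free-text claims to findings without such IDs would need an LLM judge or similar, which this sketch does not attempt.

```python
def synthesis_report(
    worker_findings: dict[str, set[str]],
    claims: list[tuple[str, set[str]]],
) -> dict:
    """worker_findings: worker -> finding IDs; claims: (claim text, cited finding IDs)."""
    known_ids = set().union(*worker_findings.values())
    # Confabulation: a claim that cites no finding any worker actually produced.
    confabulated = [text for text, cited in claims if not (cited & known_ids)]
    cited_ids = set().union(*(cited for _, cited in claims))
    # Dropped: a worker that produced findings, none of which the synthesis cites.
    with_findings = [w for w, ids in worker_findings.items() if ids]
    dropped = sorted(w for w in with_findings if not (worker_findings[w] & cited_ids))
    return {
        "confabulation_rate": len(confabulated) / len(claims) if claims else 0.0,
        "dropped_findings_rate": len(dropped) / len(with_findings) if with_findings else 0.0,
        "confabulated_claims": confabulated,
        "dropped_workers": dropped,
    }
```

A flagged worker is a prompt for review, not an automatic failure: the synthesizer may have had a good reason to leave a finding out.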

The combined eval dashboard

For a multi-agent system in production, the metrics worth tracking, in priority order:

┌─────────────────────────────────────────────────────────────┐
│ MULTI-AGENT EVAL DASHBOARD                                  │
│                                                             │
│ End-to-end (the user's perspective):                        │
│  ─ task_completion_rate      (0-1)                          │
│  ─ end_to_end_latency_p95    (seconds)                      │
│  ─ cost_per_run_usd          (dollars)                      │
│  ─ citation_faithfulness     (0-1)                          │
│                                                             │
│ Orchestrator-specific:                                      │
│  ─ plan_coverage_score       (0-1 judge)                    │
│  ─ subquestion_overlap_rate  (0-1 judge, lower better)      │
│  ─ subagent_count_p50/p95    (calibration vs complexity)    │
│                                                             │
│ Worker-specific (per worker type):                          │
│  ─ schema_validity_rate      (0-1)                          │
│  ─ task_completion_rate      (0-1 judge)                    │
│  ─ retry_rate                (0-1, lower better)            │
│  ─ tool_budget_utilization   (mean fraction of budget used) │
│                                                             │
│ Synthesis-specific:                                         │
│  ─ confabulation_rate        (0-1, lower better)            │
│  ─ dropped_findings_rate     (0-1, lower better)            │
└─────────────────────────────────────────────────────────────┘

This is more instrumentation than a single-agent system needs, and it's the cost of running multi-agent in production. The metrics aren't optional — without them, when end-to-end quality degrades, you have no way to localize the regression. With them, "orchestrator's plan_coverage dropped 8 points after Tuesday's deploy" tells you exactly what to look at.
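Localizing a regression from these metrics can be as simple as diffing two snapshots. A minimal sketch, assuming each snapshot is a dict of the 0-1 metrics from the dashboard (latency and cost are left out because they live on different scales) and that a 5-point drop is worth flagging. The threshold is an arbitrary illustration.

```python
# Metrics where a higher value is worse; everything else is higher-is-better.
LOWER_IS_BETTER = {
    "subquestion_overlap_rate",
    "retry_rate",
    "confabulation_rate",
    "dropped_findings_rate",
}


def regressions(
    before: dict[str, float], after: dict[str, float], threshold: float = 0.05
) -> dict[str, float]:
    """Return metrics that got worse by at least `threshold`, with the size of the change."""
    flagged = {}
    for name, old_value in before.items():
        if name not in after:
            continue
        change = after[name] - old_value
        worsening = change if name in LOWER_IS_BETTER else -change
        if worsening >= threshold:
            flagged[name] = round(worsening, 4)
    return flagged
```

Run it on the snapshots taken before and after a deploy; a single flagged orchestrator metric points at the orchestrator prompt rather than at the workers or the synthesizer.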

WORKED EXAMPLE

A customer-support triage system, traced through the patterns it earned.

To anchor everything in this chapter concretely: a realistic system that started simple and ended multi-agent, but only after the team had clear evidence the simpler version didn't work. It is the honest version of "we became multi-agent because we had to," not "we were multi-agent from day one."

The setup

A SaaS company's customer support is overwhelmed. Tickets arrive at 800/day; the team can answer ~600. The 200 that don't get answered create churn risk. They want an AI agent to handle the simple ones, freeing humans for the hard ones.

The product requirement: handle tickets where the answer is unambiguous, escalate to humans on anything ambiguous, never make things worse (no wrong answers to billing questions, no incorrect technical guidance).

Week 1: single-agent attempt

The team's first build: one Sonnet-based agent with web access, knowledge-base search, an account-lookup tool, and two actions, "respond" and "escalate to human." System prompt: "You are a customer support agent. Handle the ticket if you can answer confidently; escalate if you can't."

Results after a week on shadow traffic (the agent's responses are produced but not sent; humans evaluate retroactively):

  • Tickets confidently handled: 68% (target: 70%, close)
  • Wrong-answer rate: 14% (target: <2%, dramatic miss)
  • Escalation rate: 32% (target: 30%, fine)

The wrong-answer rate is the killer. The agent confidently answered billing questions with technical-support reasoning, gave incorrect refund guidance, and misclassified account types. The team's investigation: the single agent's system prompt was trying to handle "billing rules" + "technical troubleshooting" + "account management" + "escalation judgment" all at once, and was mediocre at each.

Week 2: pattern 3 — specialized peer routing

The team adopts the pattern-3 (specialized peer routing) architecture. Three specialized agents (billing, tech support, account management) each with their own system prompt and tools, plus a routing agent that classifies incoming tickets and dispatches to the right specialist.

Each specialist has only the tools relevant to its domain: the billing agent has access to billing APIs but not the technical knowledge base; the tech agent has KB access but not billing systems. This is itself a major improvement: cross-domain tool misuse becomes impossible by construction, because an agent cannot call a tool it was never given.
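The tool scoping described here can be enforced in code rather than left to the prompt. A minimal sketch under stated assumptions: the specialist names follow the text, while the tool names (`billing_api`, `kb_search`, `account_lookup`) and the registry shape are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Specialist:
    name: str
    tools: frozenset[str]  # the only tools this agent is ever allowed to call


SPECIALISTS = {
    "billing": Specialist("billing", frozenset({"billing_api"})),
    "tech": Specialist("tech", frozenset({"kb_search"})),
    "account": Specialist("account", frozenset({"account_lookup"})),
}


def dispatch(category: str) -> Optional[Specialist]:
    """Map the router's category to a specialist; None means escalate to a human."""
    return SPECIALISTS.get(category)


def call_tool(
    specialist: Specialist, tool_name: str, registry: dict[str, Callable[[], str]]
) -> str:
    # Enforce scoping at the call site, independent of what the prompt says.
    if tool_name not in specialist.tools:
        raise PermissionError(f"{specialist.name} agent may not call {tool_name}")
    return registry[tool_name]()
```

Note what this does not protect against: if the router sends a ticket to the wrong specialist, that specialist still answers confidently with the tools it has.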

Results after a week:

  • Tickets confidently handled: 71%
  • Wrong-answer rate: 3.2% (much better, still missing the <2% target)
  • Escalation rate: 28%

The wrong-answer rate dropped a lot but isn't yet acceptable. Investigation: most remaining wrong answers are in the routing step — the router classifies a ticket as "billing" when it's actually a technical issue about a paid feature, and the billing agent confidently produces wrong technical guidance.

Week 3: adding orchestrator-worker for complex cases

The team adds a layer: complex tickets (multi-topic, ambiguous category, or VIP customer) get routed to an "orchestrator" path instead of directly to a specialist. The orchestrator does pattern 1 — it dispatches sub-questions to relevant specialists in parallel and synthesizes their answers.

This is more expensive (10× per ticket) but only fires on the ~12% of tickets that need it. For the other 88%, the cheap routing path still applies.
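The gate that decides which path a ticket takes could be sketched as follows. It assumes the router emits a score per category (a hypothetical output shape), and the two thresholds are illustrative values that would need tuning against labeled tickets.

```python
from dataclasses import dataclass


@dataclass
class RoutedTicket:
    category_scores: dict[str, float]  # router's score per category (assumed shape)
    is_vip: bool = False


def choose_path(ticket: RoutedTicket, confident: float = 0.8, plausible: float = 0.3) -> str:
    ranked = sorted(ticket.category_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_category, top_score = ranked[0]
    plausible_topics = [category for category, score in ranked if score >= plausible]
    ambiguous = top_score < confident          # router is unsure of the category
    multi_topic = len(plausible_topics) > 1    # more than one category looks relevant
    if ticket.is_vip or ambiguous or multi_topic:
        return "orchestrator"
    return f"specialist:{top_category}"
```

Keeping the gate as plain code, separate from the router prompt, makes the share of tickets on the expensive path easy to measure and to adjust.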

Results after a week:

  • Tickets confidently handled: 73%
  • Wrong-answer rate: 1.4% (under target!)
  • Escalation rate: 26%
  • Cost: $0.04 per ticket average ($0.02 for simple, $0.20 for complex)

The wrong-answer rate is now acceptable. The architectural shape: pattern 3 (routing) for the common case, pattern 1 (orchestrator-worker) for the edge cases. The composition isn't elegant in any abstract sense — it's an honest response to the failure modes the team actually hit.

The failure mode that surfaced in week 4

The system reached production quality at the end of week 3. Week 4 brought a different problem: end-to-end latency for the complex-case path was 12 seconds (5 sub-agents running in parallel, plus orchestrator overhead). Customers noticed the lag and abandoned the conversation. The 12% of customers on the complex path were typing "are you still there?" before the agent's response arrived.

The team's fix: streaming partial responses to the customer as sub-agents complete, with a status indicator ("checking your account... looking up your invoice... preparing your response..."). End-to-end latency unchanged, perceived latency dropped to ~3 seconds (chapter 2.4's streaming pattern). Customer abandonment on the complex path dropped to baseline.
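One way to produce status updates as sub-agents finish is `asyncio.as_completed`, which yields results in completion order rather than submission order. The sketch below is illustrative only: `run_sub_agent` sleeps instead of calling a model, and `emit` stands in for whatever pushes a message to the customer's chat window.

```python
import asyncio
from typing import Callable


async def run_sub_agent(name: str, delay_s: float) -> tuple[str, str]:
    """Stand-in for a real sub-agent call; sleeps instead of calling a model."""
    await asyncio.sleep(delay_s)
    return name, f"{name}: done"


async def answer_with_status(
    sub_agents: dict[str, float], emit: Callable[[str], None]
) -> dict[str, str]:
    tasks = [
        asyncio.create_task(run_sub_agent(name, delay))
        for name, delay in sub_agents.items()
    ]
    results: dict[str, str] = {}
    # as_completed yields awaitables in the order the tasks finish.
    for next_done in asyncio.as_completed(tasks):
        name, result = await next_done
        results[name] = result
        emit(f"finished {name} ({len(results)}/{len(tasks)})")
    emit("preparing your response...")
    return results
```

Total wall-clock time is still set by the slowest sub-agent; only the time until the customer first sees progress changes.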

This is a chapter 2.4 problem, not a multi-agent problem — but it surfaces in multi-agent contexts more often because the operations are longer-running.

What the trace teaches

Three lessons that generalize beyond this specific system:

Don't start multi-agent. The team's instinct could have been "this is a multi-domain problem, let's build multi-agent from day one." They didn't — they tried single-agent first, found specific failure modes, then adopted multi-agent patterns to address specific failures. The eventual architecture is justified because the team can name the specific problems multi-agent solved.

Compose patterns based on real needs. The final shape (routing + orchestrator-worker for edge cases) isn't a textbook pattern. It's two patterns layered because the team had two distinct problems (most tickets are simple-and-specialized; some tickets are complex-and-cross-domain). Both needed solving; the architecture is the union of solutions.

Operations problems compound with multi-agent. Latency, observability, cost — all get harder. Each new agent in the system is another moving part. Plan for the operational reality, not just the agent-quality story.

The honest economics of multi-agent

This system runs at about $0.04 per ticket on average: a 10× cost increase over a naive single-agent baseline, but still small in absolute terms. The cost is more than justified because the alternative (200 unhandled tickets per day creating churn) is far more expensive. Multi-agent's reputation for being expensive is true relative to single-agent baselines; it's cheap relative to the human labor it replaces. Whether your specific problem fits depends on the relative costs in your situation.
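The blended per-ticket figure is a weighted average of the two paths. As a sanity check, the sketch below plugs in figures quoted in the worked example ($0.02 on the simple path, a 10× multiplier for the orchestrator path, and the 12% complex-path share from week 4); treat them as illustrative inputs and substitute your own.

```python
def blended_cost(simple_cost: float, complex_multiplier: float, complex_share: float) -> float:
    """Average cost per ticket when a fraction of traffic takes the expensive path."""
    complex_cost = simple_cost * complex_multiplier
    return (1 - complex_share) * simple_cost + complex_share * complex_cost


average = blended_cost(simple_cost=0.02, complex_multiplier=10, complex_share=0.12)
print(f"${average:.2f} per ticket")  # prints "$0.04 per ticket"
```

The useful property is sensitivity: because the complex path dominates the average, a few extra points of traffic on that path moves the blended cost far more than any optimization of the simple path.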

End of chapter 4.4

Deliverable

A working understanding of multi-agent systems as a specific tool with real costs and real benefits. The four-property test for when multi-agent is justified. The three patterns that ship in production (orchestrator-worker, pipeline, specialized peer routing) and what each is good for. The coordination disciplines (task specs, structured handoffs, failure handling) that determine whether the architecture works or breaks. The two-layer evaluation framework that lets you debug multi-agent failures. The honest discussion of patterns that don't ship despite literature enthusiasm. Most importantly: the discipline to default to single-agent and only reach for multi-agent when the four properties strongly apply.

  • Four-property test applied to a candidate problem; multi-agent justified only when at least one property applies strongly
  • Pattern selected based on task shape (orchestrator-worker / pipeline / peer routing)
  • Task specs include objective, scope, tools, budget, output schema
  • Inter-agent handoffs are structured (typed schemas), never free-form chat
  • Failure handling for hard fails (retry-once), soft fails (retry-with-feedback), timeouts (wall-clock), partial successes
  • Per-worker retry budgets enforced (max 2); retry rate tracked as a metric
  • Prompt caching applied to per-worker boilerplate to limit token bloat
  • Observability instruments per-worker latency, schema validity, retry rate
  • Two-layer eval: end-to-end metrics AND per-agent diagnostics
  • Orchestrator-specific evals: coverage, non-overlap, right-sizing
  • Synthesis-specific evals: confabulation rate, dropped-findings rate
  • Default position: single-agent until specific failure modes justify decomposition