Debate, Voting & Ensembles

G4
Deep Dive · Multi-Agent Systems

Debate, voting and ensembles: where the gains actually come from.

Multi-agent debate (MAD) — agents argue, critique, and converge — is one of the most cited multi-agent patterns and one of the most misunderstood. The 2025–2026 literature is now sharp on this: most of the measured benefit comes from ensembling (sampling several answers and voting), not from the debate rounds themselves, and debate collapses to the initial majority when agents lack diversity. This essay separates the part that works from the part that mostly does not, and tells you when each is worth its multiplied cost.

STEP 1

Ensembling is the workhorse: independent samples plus a vote.

Run the same problem N times independently and aggregate — majority vote for discrete answers, or a judge/synthesis for open-ended ones. This is self-consistency generalized across agents, and it is robust because errors that are independent tend not to coincide, so the modal answer is more often right than any single sample. It is also embarrassingly parallel and easy to observe. Multiple 2025–2026 benchmark studies of MAD protocols find that simple majority voting over independent outputs already captures most of the reported gains.
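A short calculation makes the independence argument concrete. The sketch below assumes an idealized two-choice task in which each of n samples is correct with probability p and errors are fully independent; real samples are never perfectly independent, so treat the numbers as an upper bound. The function name `majority_accuracy` is illustrative.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a two-choice task, each voter correct with probability p,
    and fully independent errors (an idealization). Ties count as wrong.
    """
    need = n // 2 + 1  # votes required for a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

for n in (1, 3, 5, 9):
    print(n, round(majority_accuracy(0.7, n), 3))
```

Under these assumptions, p = 0.7 gives roughly 0.70, 0.78, 0.84 and 0.90 for n = 1, 3, 5 and 9: the vote improves on a single sample only because the errors were assumed not to coincide.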

# ensemble: independent samples, then majority vote
from collections import Counter

def ensemble(agent, task, n):
    answers = [agent.run(task) for _ in range(n)]  # n independent samples
    return Counter(answers).most_common(1)[0][0]   # most common answer; or judge-synthesis

STEP 2

Debate adds value only on top of ensembling, and only sometimes.

In debate, agents see each other's answers and reasoning across rounds and may revise. The honest 2025–2026 finding: once you control for the ensemble baseline, the debate rounds add little systematic benefit unless coupled with explicit corrective structure — a dedicated critic, asymmetric roles, or a stopping rule that detects stability. Debate's real wins are on tasks where one agent can verify another's step (math proofs, code, factual chains) so a wrong line gets caught and corrected. Where verification is hard (open-ended judgment), debate mostly converts compute into confident agreement.
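One way to picture "corrective structure" is a debate loop that starts as a plain ensemble and spends extra rounds only on answers a verifier rejects. The sketch below is a minimal illustration, not a reference protocol: `propose`, `revise` and `verify` are hypothetical callables standing in for agent calls and a step checker, and the toy lambdas at the bottom exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable, List

def debate_with_verifier(
    propose: List[Callable[[str], str]],
    revise: Callable[[str, str, List[str]], str],
    verify: Callable[[str, str], bool],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Ensemble first; debate only to repair answers the verifier rejects."""
    answers = [agent(task) for agent in propose]  # round 0: plain ensemble
    for _ in range(max_rounds):
        ok = [verify(task, a) for a in answers]
        if all(ok):  # nothing left to correct
            break
        # Only rejected answers are revised; each reviser sees all current answers.
        revised = [a if good else revise(task, a, answers)
                   for a, good in zip(answers, ok)]
        if revised == answers:  # stable: further rounds would change nothing
            break
        answers = revised
    verified = [a for a in answers if verify(task, a)]
    # Vote among verified answers, falling back to all answers if none pass.
    return Counter(verified or answers).most_common(1)[0][0]

# Toy stand-ins (hypothetical): three "agents", a checker, and a reviser
# that adopts the most common peer answer.
agents = [lambda t: "4", lambda t: "5", lambda t: "4"]
verify = lambda t, a: a == "4"
revise = lambda t, own, peers: Counter(peers).most_common(1)[0][0]
print(debate_with_verifier(agents, revise, verify, "2+2"))
```

When no verifier can be written for the task, this loop has nothing to exploit and reduces to a vote, which matches the point above about open-ended judgment.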

STEP 3

Diversity is the load-bearing variable; without it, debate collapses.

The mechanism behind every ensemble and debate gain is error independence. If your agents are the same model with the same prompt and similar samples, their errors are correlated — the vote just re-counts one opinion, and debate dynamics go static and collapse back to the initial majority. 2026 work on multi-agent committees measures this directly as representational collapse: agents' reasoning becomes near-identical (high pairwise similarity, low effective rank), so adding agents adds cost but no information. Diversity must be engineered: different models, different prompts/personas, different temperatures, or different tool access — not assumed.
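A cheap collapse check is to compare the agents' reasoning pairwise. The sketch below assumes each agent's reasoning trace has already been turned into a numeric vector by some embedding model (the choice of model is not specified here and is an assumption), and it computes only mean pairwise cosine similarity. It is a simple proxy and does not compute effective rank.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all agent pairs; values near 1.0 suggest collapse."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two agents")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical 2-d "embeddings" purely for illustration.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pairwise_similarity(collapsed))
print(mean_pairwise_similarity(diverse))
```

A committee whose score sits near 1.0 round after round is a candidate for more varied models, prompts, or tools before any more agents are added.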

N copies of the same model with the same prompt is not an ensemble; it is one opinion sampled N times, priced at N×. The vote will look confident and be about as wrong as one call, while you pay the full multiplier. Correlated agents are the single most common reason debate "doesn't help."
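The cost of correlation can be shown with a deliberately crude toy model, which is an assumption of this sketch rather than a result from the literature: with probability `rho` all n agents copy one shared sample, and otherwise they answer independently on a two-choice task.

```python
from math import comb

def vote_accuracy(p: float, n: int, rho: float) -> float:
    """Majority-vote accuracy under a toy correlation model.

    With probability rho every agent copies one shared sample (accuracy p);
    with probability 1 - rho the agents are independent. Two-choice task,
    strict majority, ties count as wrong.
    """
    need = n // 2 + 1
    independent = sum(comb(n, k) * p**k * (1 - p) ** (n - k)
                      for k in range(need, n + 1))
    return rho * p + (1 - rho) * independent

for rho in (0.0, 0.5, 1.0):
    print(rho, round(vote_accuracy(0.7, 5, rho), 3))
```

In this model, five agents at p = 0.7 fall from about 0.84 with no correlation to exactly 0.70 at full correlation, which is the single-call accuracy bought at five times the price.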

STEP 4

Diversity helps only within a band: too little collapses, too much never converges.

There is a usable band, not a monotonic "more is better." Too little diversity → collapse to a single correlated opinion (no benefit, full cost). Too much, with no mechanism to resolve disagreement → the agents never converge, the judge sees noise, and you have spent N× to manufacture an unbreakable tie. The engineering target is calibrated diversity: agents that disagree for substantive reasons plus an aggregation rule (confidence-weighted vote, a strong judge, or a stability-detecting stopping rule) that can actually adjudicate the disagreement instead of averaging it into mush.

Weight votes by calibrated confidence, not raw count, and add a stability-detection stopping rule. 2025–2026 results show that confidence- and diversity-aware aggregation beats flat majority vote, and a stability check ends the debate in the first round that changes no minds, usually round one or two, saving the rest of the multiplier.
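Both ideas fit in a few lines. The sketch below assumes each agent returns an answer together with a confidence in [0, 1] that has already been calibrated upstream; how to calibrate is outside its scope, and the names `weighted_vote` and `is_stable` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def weighted_vote(ballots: List[Tuple[str, float]]) -> str:
    """Return the answer with the largest summed confidence.

    Each ballot is (answer, confidence); confidences are assumed to be
    calibrated upstream. Ties go to the answer seen first.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in ballots:
        totals[answer] += confidence
    return max(totals, key=lambda a: totals[a])

def is_stable(previous: List[str], current: List[str]) -> bool:
    """Stopping rule: no agent changed its answer in the latest round."""
    return previous == current

# Two lukewarm votes for "A" lose to one well-supported vote for "B".
ballots = [("A", 0.4), ("A", 0.4), ("B", 0.95)]
print(weighted_vote(ballots))
```

A flat majority would pick "A" here; the weighted vote picks "B". Whether that is an improvement depends entirely on the confidences being trustworthy, which is why calibration is the precondition.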

STEP 5

Cost scales linearly with agents and super-linearly with debate rounds.

An ensemble of N costs N× a single call. A debate of N agents over R rounds makes roughly N×R calls, and each round's context grows because agents re-read the transcript, so token cost climbs faster than the call count alone suggests. Spend this only where a correct answer is worth a multiple of a single call and where you have engineered diversity so the multiple buys real error reduction. On easy or low-stakes tasks the ensemble's accuracy gain is in the noise while the bill is not.
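A rough token model shows where the super-linear term comes from. The sketch below assumes every call reads a fixed task prompt and writes a fixed-length answer, and that in each debate round every agent re-reads all answers from all earlier rounds. Real systems may truncate or summarize the transcript, so this is a simplified upper-end estimate.

```python
def ensemble_tokens(n: int, prompt: int, answer: int) -> int:
    """Tokens for n independent calls: each reads the prompt and writes one answer."""
    return n * (prompt + answer)

def debate_tokens(n: int, rounds: int, prompt: int, answer: int) -> int:
    """Tokens for n agents over `rounds` rounds with a fully re-read transcript."""
    total = 0
    for r in range(rounds):
        transcript = r * n * answer  # every answer written in earlier rounds
        total += n * (prompt + transcript + answer)
    return total

# Illustrative numbers: 3 agents, 500-token prompt, 300-token answers.
print(ensemble_tokens(3, 500, 300))    # 2400
print(debate_tokens(3, 3, 500, 300))   # 15300
```

With these assumed sizes, three rounds cost 15,300 tokens rather than the 7,200 that a flat N×R estimate would predict. The extra 8,100 tokens are transcript re-reading, which grows with the square of the round count.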

STEP 6

When NOT to debate or ensemble.

Skip both when one good call already meets the bar, when agents are correlated (you will pay N× for one opinion), or when no aggregation rule can adjudicate the disagreement you will produce. Skip debate in particular when the task has no verifiable structure for it to exploit. Prefer the cheaper, more observable ensemble over debate unless the task supports step-level verification and you have a corrective structure. Ensembles convert diversity into accuracy; without engineered diversity and an adjudicator they convert money into confident wrongness, and debate doubly so.
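The checklist can be written down as a small decision function. This is one reading of the rules above, not a definitive policy: it treats "no verifiable structure" as ruling out debate rather than ensembling, and the boolean inputs are hypothetical judgments you would have to make for your own task.

```python
def choose_strategy(
    single_call_meets_bar: bool,
    agents_are_diverse: bool,
    can_adjudicate: bool,
    steps_are_verifiable: bool,
    has_corrective_structure: bool,
) -> str:
    """Return "single", "ensemble", or "debate" following the checklist above."""
    # Any of these makes the N× multiplier a waste.
    if single_call_meets_bar or not agents_are_diverse or not can_adjudicate:
        return "single"
    # Debate is worth its extra rounds only with verification plus a critic or stopping rule.
    if steps_are_verifiable and has_corrective_structure:
        return "debate"
    return "ensemble"

print(choose_strategy(False, True, True, False, False))
```

The ordering is deliberate: the cheap exits are checked first, and debate has to earn its place over the ensemble instead of being the default.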