Self-consistency works when the truth is the mode — and only then.
Self-consistency (Wang et al., 2022) samples many chains of thought at non-zero temperature and returns the majority answer. It is the cheapest, most parallel test-time-compute lever and often the highest-leverage one. This essay explains the statistical reason it works, the precise condition under which it fails, the shape of its diminishing returns, and how to spend the sampling budget you actually have.
The mechanism: marginalize over reasoning paths.
A single greedy decode commits to one path; a wrong token early dooms it. Self-consistency samples k independent traces and votes on the final answer, not the reasoning. The bet is that many distinct correct derivations converge on the same answer, while errors are diffuse and scatter across many wrong answers. So the correct answer accumulates a plurality even when no single sample is reliable. It is, in effect, Monte-Carlo marginalization over the latent reasoning path — you stop caring which path was taken and ask which answer the path distribution concentrates on.
# Self-consistency: sample, extract, vote samples = [model.call(q, temperature=0.7) for _ in range(k)] answers = [extract_final(s) for s in samples] ans = majority(answers) # mode of the answer distribution
The failure condition is exact: bias toward a wrong mode.
Self-consistency only recovers the truth if the correct answer is the modal sample. If the model is systematically biased — a common misconception, a flawed heuristic, a leading prompt — the wrong answer is the mode, and voting amplifies the bias. More samples make you more confident in the wrong answer, not less. This is the single most important caveat: self-consistency reduces variance (unlucky decoding) but does nothing about bias (a model that is wrong on average). It cannot manufacture a signal the sampling distribution does not already contain.
Never read a tight vote (e.g. 28/30 agree) as a confidence score. A confidently-wrong model produces lopsided votes for the wrong answer. Vote margin measures the model's certainty, not the answer's correctness — they coincide only when the model is unbiased on that task.
The structural constraint: a small, discrete answer space.
Voting needs answers that can be compared for equality and that collide. It works for a final integer, a multiple-choice label, a normalized short string. It is useless for free-form text, code, proofs, or essays — two correct programs are textually different and never collide, so every sample is its own singleton and the "majority" is meaningless. For those, you need a different selector (a verifier or judge over candidates), which is best-of-N, not self-consistency. Normalize aggressively before voting: canonicalize numbers, strip units, lowercase labels — a parsing miss splits a real majority into noise.
Diminishing returns: a saturating curve, not a line.
Accuracy versus k is concave and saturating. Early samples add real information; later ones increasingly retrace paths already seen, so the marginal gain decays. On contemporary strong models the curve typically flattens in the low-to-mid tens of samples for many reasoning benchmarks — beyond that, extra samples mostly duplicate existing traces and accuracy plateaus or, on some tasks, slightly degrades as low-quality traces dilute the vote. Cost, meanwhile, is strictly linear in k. You are buying a logarithmic-ish accuracy gain with linear spend; the economically sane region is the knee of the curve, not its tail.
Find the knee empirically per task class: sweep k ∈ {1,3,5,10,20,40} on a labeled dev set, plot accuracy vs cost, and pick the smallest k past which the slope goes flat. Then add adaptive early-stopping: stop sampling once the lead is statistically decisive, so easy items spend 3 samples and only hard ones spend 40.
Spend the budget better than uniform voting.
Plain majority weights every trace equally, including obviously broken ones. Cheap upgrades, in increasing order of dependency: filter traces that fail a free check (wrong format, failed unit test, contradiction) before voting; weight votes by a verifier or self-reported confidence when you have a usable one; or switch to weighted best-of-N when an answer-level verifier exists — that dominates voting because it can reward a rare-but-correct trace the majority would drown out. Diversity also matters: sampling at sane temperature with varied prompts/personas decorrelates errors and sharpens the mode faster than cranking k on one prompt.
The honest tradeoff.
Self-consistency is the best dollars-per-accuracy lever for tasks with a small discrete answer space where the model is roughly unbiased — and a confidence-inflating trap when it is biased. It cuts decoding variance for linear cost, saturates at the knee, and silently amplifies any systematic error. Measure the knee, cap k there, and reach for a verifier the moment the answer space stops being votable.