Reasoning is a dial with a price; the decision rule is task class × verifiability × budget.
The previous five essays each ended at the same place: extra reasoning compute pays in proportion to a usable signal, and uniform application wastes money. This essay is the synthesis — a single decision procedure for whether to spend reasoning compute on a workload, treating reasoning models and search as a tunable dial rather than a default, with concrete do/don't.
The three-factor rule.
Whether reasoning compute pays is the product of three factors. Task class: is the task decomposable (multi-hop, compositional — reasoning has headroom) or single-hop / intuition (lookup, classification, perceptual — reasoning is neutral-to-harmful)? Verifiability: is there a signal — exact checker, discriminative verifier, votable answer — that can convert extra candidates into a better selection? Budget: does the value of a marginally-better answer exceed the latency and token premium for this workload's volume? If any factor is near zero, the product is near zero: a decomposable task with no verifier and a tight QPS budget should not get search.
The escalation ladder — cheapest rung that clears the bar wins.
# Climb only as far as the quality bar forces you 1 single pass # default; measure accuracy first 2 + structured CoT # decomposable tasks 3 + self-consistency (k at knee) # small discrete answer space 4 + best-of-N w/ verifier # a usable verifier exists 5 + beam/tree w/ PRM # long structure, pruning pays 6 + reasoning model, budget=hi # verifiable + hard tail only
Each rung multiplies cost; only climb when the rung below has a measured accuracy gap on a labeled set. The single most common production error is starting at rung 5 or 6 because the task "felt hard," without proving rungs 1–3 were insufficient.
Reasoning models are a dial, not a category upgrade.
"Use the reasoning model" is not a decision; "set the thinking budget to X for query class Y" is. A reasoning model at minimum budget is roughly a normal model; at maximum budget it is the search-and-verify of essays N2–N5 baked into weights, with the same concave, turn-over curve. Treat reasoning effort as a per-query parameter chosen by a difficulty estimate: low for the easy majority, high only for the verifiable hard tail. A flat "max thinking everywhere" setting pays the worst-case premium on every query, including the ones a single pass would have answered identically.
The money-burning patterns, named.
- Search without a scorer. ToT/best-of-N over an LLM self-judge on un-checkable output — paying 30x to be confidently wrong (N3, N4).
- Voting a biased model. Self-consistency where the model's modal answer is wrong — inflating confidence in the error (N2).
- Reasoning on intuition tasks. Forced CoT on perceptual/pattern tasks that get worse with verbalization (N1).
- Uniform max budget. Full thinking on every query regardless of difficulty — most spend lands on queries with no headroom (N5).
- Over-N reward hacking. Cranking N until the system optimizes the proxy, not the task (N4).
Every one of these gets worse with more compute, not better. If a quality problem does not improve when you add reasoning, adding more reasoning is the wrong fix — you have a signal problem (no verifier, biased model, wrong task class), and compute amplifies a missing signal into expensive noise.
The do list.
Do: measure single-pass accuracy and cost before adding any reasoning; build a difficulty router before scaling any budget; secure an exact checker wherever the task admits one and prefer it to a learned RM; structure CoT so you can program against it and audit premises; plot true accuracy (not reward, not vote margin) against compute and locate the knee; cap every method at its knee and route the hard tail there only; re-measure after model upgrades — a stronger base model moves every knee and can make a rung you needed last quarter pure waste this quarter.
The portfolio view: reasoning compute is a budget allocated across queries, not a setting applied to a model. The optimum almost always spends most queries at rung 1–2 and concentrates the expensive rungs on a small, verifiable, high-value tail. If your spend is uniform, you are leaving accuracy and money on the table simultaneously.
The honest tradeoff.
Reasoning compute converts to accuracy only where the task is decomposable, the answer is verifiable, and the budget justifies the premium — applied uniformly it is the most expensive way to not improve. Default to the cheapest rung, escalate only on measured gaps, treat the thinking budget as a routed per-query dial, and remember the recurring law of this entire section: architecture amplifies a signal, it never manufactures one.