When Reasoning Helps (and When It Burns Money)

Deep Dive · Reasoning & Test-Time Compute

Reasoning is a dial with a price; the decision rule is task class × verifiability × budget.

The previous five essays each ended at the same place: extra reasoning compute pays in proportion to a usable signal, and uniform application wastes money. This essay is the synthesis: a single decision procedure for whether to spend reasoning compute on a workload. It treats reasoning models and search as a tunable dial rather than a default, and it closes with concrete do's and don'ts.

STEP 1

The three-factor rule.

Whether reasoning compute pays is the product of three factors.

Task class. Is the task decomposable (multi-hop, compositional), where reasoning has headroom, or single-hop / intuition (lookup, classification, perceptual), where reasoning is neutral-to-harmful?
Verifiability. Is there a signal (exact checker, discriminative verifier, votable answer) that can convert extra candidates into a better selection?
Budget. Does the value of a marginally better answer exceed the latency and token premium for this workload's volume?

If any factor is near zero, the product is near zero: a decomposable task with no verifier and a tight QPS budget should not get search.
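
As a rough illustration of the "product" framing, the sketch below scores a workload by multiplying three factors. The 0-to-1 scales, the field names, and the `reasoning_payoff` function are assumptions made for this example, not something the rule itself prescribes; the point is only that one near-zero factor collapses the whole score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # All three scales are hypothetical 0..1 estimates made by the team.
    decomposability: float   # 0 = lookup/perceptual, 1 = clearly multi-hop
    verifiability: float     # 0 = no usable signal, 1 = exact checker
    budget_headroom: float   # 0 = no room for extra latency/tokens, 1 = ample


def reasoning_payoff(w: Workload) -> float:
    """Multiply the three factors; any near-zero factor zeroes the result."""
    for name, value in vars(w).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w.decomposability * w.verifiability * w.budget_headroom


# A decomposable task with no verifier and a tight budget scores zero.
no_verifier = Workload(decomposability=0.9, verifiability=0.0, budget_headroom=0.2)
print(reasoning_payoff(no_verifier))
```

A product rather than a sum is the design choice that matters here: a sum would let a strong task class compensate for a missing verifier, which is exactly the case the rule says should not get search.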

STEP 2

The escalation ladder — cheapest rung that clears the bar wins.

# Climb only as far as the quality bar forces you
1  single pass                       # default; measure accuracy first
2  + structured CoT                  # decomposable tasks
3  + self-consistency (k at knee)    # small discrete answer space
4  + best-of-N w/ verifier           # a usable verifier exists
5  + beam/tree w/ PRM                # long structure, pruning pays
6  + reasoning model, budget=hi      # verifiable + hard tail only

Each rung multiplies cost; only climb when the rung below has a measured accuracy gap on a labeled set. The single most common production error is starting at rung 5 or 6 because the task "felt hard," without proving rungs 1–3 were insufficient.
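
The climbing rule can be sketched as a loop that measures each rung in cost order and stops at the first one that clears the quality bar. This is a minimal sketch: `cheapest_sufficient_rung` is a hypothetical helper, and each `measure` callable stands in for an evaluation of that rung on your own labeled set. The accuracy numbers in the usage lines are made up for illustration.

```python
from typing import Callable, Optional, Sequence

# (rung name, function returning measured accuracy on a labeled set)
Rung = tuple[str, Callable[[], float]]


def cheapest_sufficient_rung(
    rungs: Sequence[Rung], quality_bar: float
) -> tuple[Optional[str], list[tuple[str, float]]]:
    """Evaluate rungs cheapest-first; return the first that meets the bar.

    Also returns every measurement taken, so the gap that justified
    each climb is on record. Rungs above the chosen one are never run.
    """
    measured: list[tuple[str, float]] = []
    for name, measure in rungs:
        accuracy = measure()
        measured.append((name, accuracy))
        if accuracy >= quality_bar:
            return name, measured
    return None, measured  # no rung cleared the bar


# Illustrative accuracies only; real values come from your evaluation.
ladder = [
    ("single pass", lambda: 0.71),
    ("structured CoT", lambda: 0.83),
    ("self-consistency", lambda: 0.88),
]
chosen, trace = cheapest_sufficient_rung(ladder, quality_bar=0.80)
print(chosen, trace)
```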

STEP 3

Reasoning models are a dial, not a category upgrade.

"Use the reasoning model" is not a decision; "set the thinking budget to X for query class Y" is. A reasoning model at minimum budget is roughly a normal model; at maximum budget it is the search-and-verify of essays N2–N5 baked into weights, with the same concave, turn-over curve. Treat reasoning effort as a per-query parameter chosen by a difficulty estimate: low for the easy majority, high only for the verifiable hard tail. A flat "max thinking everywhere" setting pays the worst-case premium on every query, including the ones a single pass would have answered identically.

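One way to read "set the thinking budget to X for query class Y" is as a small routing function. The sketch below assumes a difficulty estimate in [0, 1] from some upstream classifier; the threshold and the token budgets are hypothetical placeholders, not recommended values, and `thinking_budget` is not any vendor's API.

```python
def thinking_budget(
    difficulty: float,
    verifiable: bool,
    low_tokens: int = 0,          # assumed minimum budget
    high_tokens: int = 8192,      # assumed maximum budget
    hard_threshold: float = 0.8,  # assumed cutoff for the "hard tail"
) -> int:
    """Return a per-query reasoning budget in tokens.

    High budget only when the query is both hard and verifiable;
    everything else gets the low budget.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError(f"difficulty must be in [0, 1], got {difficulty}")
    if verifiable and difficulty >= hard_threshold:
        return high_tokens
    return low_tokens


print(thinking_budget(0.9, verifiable=True))   # hard and verifiable
print(thinking_budget(0.9, verifiable=False))  # hard but un-checkable
print(thinking_budget(0.2, verifiable=True))   # easy majority
```

Note that a hard but un-checkable query still gets the low budget in this sketch, which mirrors the claim that extra compute without a signal does not buy accuracy.
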
STEP 4

The money-burning patterns, named.

Search without a scorer. ToT/best-of-N over an LLM self-judge on un-checkable output — paying 30x to be confidently wrong (N3, N4).
Voting a biased model. Self-consistency where the model's modal answer is wrong — inflating confidence in the error (N2).
Reasoning on intuition tasks. Forced CoT on perceptual/pattern tasks that get worse with verbalization (N1).
Uniform max budget. Full thinking on every query regardless of difficulty — most spend lands on queries with no headroom (N5).
Over-N reward hacking. Cranking N until the system optimizes the proxy, not the task (N4).

Every one of these gets worse with more compute, not better. If a quality problem does not improve when you add reasoning, adding more reasoning is the wrong fix — you have a signal problem (no verifier, biased model, wrong task class), and compute amplifies a missing signal into expensive noise.
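
The "voting a biased model" pattern is easy to see with toy data. In the sketch below the samples are fabricated for illustration: the model's modal answer is wrong, and drawing ten times more samples leaves the vote share unchanged while the cost grows tenfold. `majority_vote` is a plain majority vote, assumed here as a stand-in for self-consistency.

```python
from collections import Counter


def majority_vote(samples: list[str]) -> tuple[str, float]:
    """Return the most common answer and its share of the votes."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


truth = "41"  # toy ground truth

# Toy samples from a model biased toward the wrong answer "42".
small = ["42"] * 7 + ["41"] * 3
large = ["42"] * 70 + ["41"] * 30   # 10x the compute, same bias

for samples in (small, large):
    answer, share = majority_vote(samples)
    print(len(samples), answer, share, answer == truth)
```

The vote share looks like confidence, but it measures agreement among samples, not correctness. That is why true accuracy on labeled data, not vote margin, is the quantity to track.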

STEP 5

The do list.

Do:
Measure single-pass accuracy and cost before adding any reasoning.
Build a difficulty router before scaling any budget.
Secure an exact checker wherever the task admits one, and prefer it to a learned RM.
Structure CoT so you can program against it and audit premises.
Plot true accuracy (not reward, not vote margin) against compute and locate the knee.
Cap every method at its knee and route the hard tail there only.
Re-measure after model upgrades: a stronger base model moves every knee and can make a rung you needed last quarter pure waste this quarter.
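
"Locate the knee" can be made concrete in several ways. The sketch below uses one simple assumed definition: the last point on the accuracy-versus-compute curve before the marginal gain per unit of compute falls below a threshold you choose. The curve values and the threshold are illustrative only, and `find_knee` is a hypothetical helper.

```python
def find_knee(
    curve: list[tuple[float, float]], min_gain_per_unit: float
) -> tuple[float, float]:
    """Return the (compute, accuracy) point where climbing stops paying.

    `curve` holds measured (compute, true accuracy) pairs. Walking in
    order of compute, we stop at the first step whose accuracy gain per
    unit of extra compute is below `min_gain_per_unit`. A flat or
    downward step (the turn-over) therefore also stops the walk.
    """
    points = sorted(curve)
    if not points:
        raise ValueError("curve must contain at least one point")
    knee = points[0]
    for (c0, a0), (c1, a1) in zip(points, points[1:]):
        if c1 == c0:
            raise ValueError(f"duplicate compute value: {c0}")
        if (a1 - a0) / (c1 - c0) < min_gain_per_unit:
            break
        knee = (c1, a1)
    return knee


# Illustrative measurements: compute in samples per query, accuracy on labels.
curve = [(1, 0.60), (2, 0.70), (4, 0.76), (8, 0.78), (16, 0.77)]
print(find_knee(curve, min_gain_per_unit=0.01))
```

The threshold is where the budget factor enters: it is the accuracy gain you require per unit of extra compute, so a higher-value workload justifies a lower threshold and a later knee.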

The portfolio view: reasoning compute is a budget allocated across queries, not a setting applied to a model. The optimum almost always spends most queries at rung 1–2 and concentrates the expensive rungs on a small, verifiable, high-value tail. If your spend is uniform, you are leaving accuracy and money on the table simultaneously.
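
A back-of-envelope calculation shows the size of the gap between uniform and routed spend. All numbers below are assumptions chosen for illustration: 1,000 queries, a cheap rung costing 1 unit, an expensive rung costing 30 units, and a hard tail of 10% of traffic.

```python
def portfolio_cost(
    n_queries: int, tail_fraction: float, cheap_cost: float, expensive_cost: float
) -> float:
    """Total cost when only `tail_fraction` of queries get the expensive rung."""
    if not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be in [0, 1]")
    tail = round(n_queries * tail_fraction)
    return (n_queries - tail) * cheap_cost + tail * expensive_cost


# Hypothetical workload: every value here is an assumption.
uniform = portfolio_cost(1000, tail_fraction=1.0, cheap_cost=1, expensive_cost=30)
routed = portfolio_cost(1000, tail_fraction=0.1, cheap_cost=1, expensive_cost=30)
print(uniform, routed, uniform / routed)
```

Under these assumed numbers the uniform policy costs 30,000 units and the routed policy 3,900. The saving is only real if the router's difficulty estimate is good enough that the 90% sent to the cheap rung would have been answered the same way by the expensive one.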

STEP 6

The honest tradeoff.

Reasoning compute converts to accuracy only where the task is decomposable, the answer is verifiable, and the budget justifies the premium. Applied uniformly, it is the most expensive way to not improve. Default to the cheapest rung, escalate only on measured gaps, treat the thinking budget as a routed per-query dial, and remember the recurring law of this entire section: architecture amplifies a signal; it never manufactures one.