Inference-Time Scaling

Deep Dive · Reasoning & Test-Time Compute

Inference-time compute is a second scaling axis — with its own diminishing-returns frontier.

Pretraining scaling buys capability before deployment; inference-time scaling buys accuracy per query at run time, either internally (RL-trained reasoning models that emit a long chain of thought before answering, as in the o-series and R1-class systems) or externally (sampling, search, verification). This essay frames the two axes as substitutable compute, sketches the compute-optimal allocation, and pins down where extra thinking stops paying and starts hurting.

STEP 1

Two ways to spend a FLOP: bigger model or longer thought.

For a fixed accuracy target, you have a compute budget that you can spend on parameters (train a larger model and pay for it on every token, forever) or on test-time compute (keep the model and spend more FLOPs per hard query). DeepMind's 2024 result and the 2025 follow-ups show these trade against each other: on many reasoning tasks, a smaller model given a compute-optimal test-time strategy matches a much larger model run greedily, and test-time scaling can even make overtraining a smaller model the compute-optimal choice. The reframe that matters operationally: model size is amortized capability you pay for on every call; test-time compute is on-demand capability you pay for only on the queries that need it.
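The amortized-versus-on-demand framing can be made concrete with a back-of-the-envelope sketch. It assumes the common approximation of roughly 2 FLOPs per parameter per generated token for a forward pass; the model sizes, token count, hard-query fraction, and sample count are all hypothetical, chosen only to show the accounting.

```python
def flops_per_query(params: float, tokens: int, samples: int = 1) -> float:
    """Approximate inference FLOPs: ~2 * params per generated token, per sample."""
    return 2.0 * params * tokens * samples


def mean_flops_adaptive(params: float, tokens: int,
                        hard_fraction: float, hard_samples: int) -> float:
    """One pass on easy queries, `hard_samples` passes on the hard fraction."""
    easy = (1.0 - hard_fraction) * flops_per_query(params, tokens, 1)
    hard = hard_fraction * flops_per_query(params, tokens, hard_samples)
    return easy + hard


# Hypothetical numbers, for illustration only.
BIG, SMALL = 70e9, 7e9   # parameter counts
TOKENS = 500             # generated tokens per pass

# Large model, greedy decoding: the full cost is paid on every call.
big_greedy = flops_per_query(BIG, TOKENS)
# Small model: 16 samples, but only on the 20% of queries assumed to be hard.
small_adaptive = mean_flops_adaptive(SMALL, TOKENS, hard_fraction=0.2, hard_samples=16)

print(f"big, greedy:     {big_greedy:.2e} FLOPs/query")
print(f"small, adaptive: {small_adaptive:.2e} FLOPs/query")
print(f"ratio:           {big_greedy / small_adaptive:.1f}x")
```

This only compares cost. Whether the small model with extra samples actually reaches the large model's accuracy is an empirical question that has to be measured per task.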

STEP 2

Internal vs external scaling are the same lever, different ownership.

Internal: a reasoning model trained with RL to produce a long private chain before the answer (o-series, R1-class). You scale by raising a reasoning-effort/thinking-budget setting; the search and verification are baked into weights. External: you orchestrate sampling, best-of-N, beam/tree search, and a verifier around an ordinary model (the previous four essays). Internal is simpler to operate and often better calibrated per token; external gives you an explicit, inspectable verifier and exact control over the budget. They compose — best-of-N over a reasoning model is two scaling axes stacked — and they share the same diminishing-returns shape.
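A minimal sketch of the external side, best-of-N selection with a verifier, might look like the following. The `generate` and `verify` callables are hypothetical stand-ins, not any particular library's API. If `generate` wraps a reasoning model with its own thinking budget, the same function shows how the two axes stack.

```python
import random
from typing import Callable, List


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], float],
    n: int,
) -> str:
    """Sample n candidate answers and return the one the verifier scores highest."""
    candidates: List[str] = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: verify(question, answer))


# Toy stand-ins (hypothetical): a noisy "model" and an exact-match checker.
rng = random.Random(0)


def toy_generate(question: str) -> str:
    # Right answer "4" about 30% of the time, otherwise a wrong digit.
    return "4" if rng.random() < 0.3 else rng.choice(["3", "5", "6"])


def toy_verify(question: str, answer: str) -> float:
    return 1.0 if answer == "4" else 0.0


print(best_of_n("2 + 2 = ?", toy_generate, toy_verify, n=16))
```

The budget here is explicit (`n`) and the selection rule is inspectable (`verify`), which is the control that external scaling offers over an internal thinking-budget setting.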

STEP 3

The compute-optimal allocation is difficulty-adaptive.

The headline finding of compute-optimal test-time scaling: the right strategy depends on the policy model, the verifier, and the per-query difficulty, so a fixed budget is wasteful. The compute-optimal policy estimates difficulty and routes accordingly: easy queries get near-zero extra compute (one pass, maybe light self-revision); medium queries get moderate sampling or shallow beam search with verifier guidance; only the hard tail gets the full sampling/deep-search budget. Spending the same large budget on every query buys almost nothing on the easy majority. It does not underspend on the hard tail; it simply burns money on everything else.

# Difficulty-adaptive test-time budget.
# LO < HI are difficulty thresholds; V is the verifier; D is the search depth.
def answer(q):
    d = estimate_difficulty(q)                      # calibrated proxy, cheap
    if d < LO:
        return model.call(q)                        # easy: single pass
    if d < HI:
        return best_of_n(q, n=8, verifier=V)        # medium: moderate sampling
    return beam_search(q, depth=D, verifier=V)      # hard tail: full budget
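The snippet above leaves `estimate_difficulty` unspecified. One possible proxy, offered here as an assumption rather than as the estimator this essay has in mind, is disagreement among a handful of cheap samples: if a few quick answers all agree, the query is probably easy. The thresholds and tier names below are hypothetical.

```python
from collections import Counter
from typing import Callable


def estimate_difficulty(question: str, sample: Callable[[str], str], k: int = 4) -> float:
    """Disagreement among k cheap samples: 0.0 means unanimous, higher means harder."""
    answers = [sample(question) for _ in range(k)]
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / k


def choose_budget(difficulty: float, lo: float = 0.25, hi: float = 0.6) -> str:
    """Map a difficulty score to one of three tiers (thresholds are hypothetical)."""
    if difficulty < lo:
        return "single_pass"
    if difficulty < hi:
        return "best_of_n"
    return "beam_search"


# Usage with a canned sampler standing in for a cheap model call.
canned = iter(["12", "15", "12", "9"])
score = estimate_difficulty("toy question", lambda q: next(canned))
print(score, choose_budget(score))
```

The probe itself costs `k` cheap passes, so it only pays off when those passes are much cheaper than the budget they help avoid. The thresholds should be calibrated against measured accuracy per tier.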

STEP 4

The frontier bends, and past the bend it bends down.

Accuracy as a function of test-time compute is concave and saturating, exactly like the self-consistency curve, and for the same reason: marginal samples or thoughts increasingly retrace reasoning that has already been explored. Worse, the 2025 "test-time compute paradox" results show the curve can turn over: beyond a problem-dependent point, more thinking makes some reasoning models less accurate, through overthinking, anchoring on an early wrong commitment, or arguing themselves out of a correct answer. There is no universal optimal budget; there is a per-task-class knee that you must measure, and past it additional spend has negative ROI.
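Measuring the knee can be as simple as sweeping the budget on a held-out set for one task class and stopping where the marginal gain drops below what you are willing to pay. The sketch below assumes distinct budget values and a reasonably smooth curve; the sample measurements and the threshold are hypothetical.

```python
from typing import List, Tuple


def find_knee(curve: List[Tuple[float, float]], min_gain_per_unit: float) -> float:
    """Return the largest budget still worth paying for.

    `curve` holds (budget, measured_accuracy) points with distinct budgets.
    Walking up in budget, stop before the first step whose accuracy gain per
    extra unit of budget falls below `min_gain_per_unit`. Flat and negative
    steps therefore always stop the walk when the threshold is positive.
    """
    points = sorted(curve)
    knee = points[0][0]
    for (b0, a0), (b1, a1) in zip(points, points[1:]):
        if (a1 - a0) / (b1 - b0) < min_gain_per_unit:
            break
        knee = b1
    return knee


# Hypothetical sweep for one task class: (samples per query, accuracy).
sweep = [(1, 0.52), (2, 0.61), (4, 0.68), (8, 0.72), (16, 0.73), (32, 0.71)]
print(find_knee(sweep, min_gain_per_unit=0.005))
```

Because the walk stops at the first weak step, a noisy sweep can end it early. Averaging over several evaluation runs before calling `find_knee` is one way to reduce that risk.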

"It is a reasoning model, so give it maximum thinking budget" is a cost and an accuracy mistake. On easy and on intuition-heavy tasks, max budget pays a large latency/token premium for zero or negative accuracy change. The budget is a dial to tune per task, not a slider to max.

STEP 5

Where the axis actually pays.

Test-time scaling has the steepest, most reliable returns exactly where search did: verifiable domains (competition math, code with tests, formal/constraint problems) and a hard tail of queries where single-pass accuracy is the measured bottleneck and the value of a correct answer dwarfs the inference premium. It pays poorly on easy queries (nothing to think about), on open-ended generation with no verifier (more thought, no better selection), and under tight latency/QPS budgets (the multiplier is unaffordable). The decisive question is not "is this model a reasoner" but "does this query have headroom that more compute can convert, and can I tell which queries those are."

Build the difficulty router before scaling the budget. The single highest-ROI move in test-time scaling is not "more samples" — it is spending the samples only on the queries that have headroom, which usually cuts total cost while raising mean accuracy.
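The accounting behind that claim can be sketched with toy numbers. Everything below is made up for illustration: the traffic mix, the per-class accuracy table (which includes a slight drop on easy queries at the largest budget, mirroring the turnover described earlier), and the assumption that the router classifies every query correctly.

```python
from typing import Dict, Tuple

# Hypothetical per-class accuracy at each sample budget (toy numbers, not measurements).
ACCURACY: Dict[str, Dict[int, float]] = {
    "easy":   {1: 0.97, 8: 0.97, 32: 0.96},
    "medium": {1: 0.70, 8: 0.85, 32: 0.86},
    "hard":   {1: 0.20, 8: 0.40, 32: 0.55},
}
# Hypothetical share of traffic in each difficulty class.
TRAFFIC: Dict[str, float] = {"easy": 0.7, "medium": 0.2, "hard": 0.1}


def evaluate(policy: Dict[str, int]) -> Tuple[float, float]:
    """Return (mean samples per query, mean accuracy) for a class -> budget policy."""
    cost = sum(TRAFFIC[c] * policy[c] for c in TRAFFIC)
    accuracy = sum(TRAFFIC[c] * ACCURACY[c][policy[c]] for c in TRAFFIC)
    return cost, accuracy


uniform = {"easy": 32, "medium": 32, "hard": 32}   # same large budget everywhere
routed = {"easy": 1, "medium": 8, "hard": 32}      # budget follows difficulty

for name, policy in (("uniform", uniform), ("routed", routed)):
    cost, accuracy = evaluate(policy)
    print(f"{name:8s} mean samples/query = {cost:5.1f}   mean accuracy = {accuracy:.3f}")
```

With a real router, misclassified queries erode part of this gap, so the router's own error rate belongs in the evaluation alongside cost and accuracy.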

STEP 6

The honest tradeoff.

Inference-time compute is real, substitutable scaling — but it is concave, it turns over, and uniform budgets waste most of the spend on queries that never needed it. Treat the budget as a per-query dial set by a difficulty estimate and a verifier, measure the knee per task class, and stop at it: past the bend you are paying linearly for accuracy that is flat or falling.