Chain-of-Thought, Properly

Deep Dive · Reasoning & Test-Time Compute

Chain-of-thought is a compute lever, not a window into the model's mind.

Chain-of-thought (CoT) reliably raises accuracy on multi-step problems, but the printed reasoning is not a faithful causal trace of how the answer was produced. This essay separates the real mechanism (more serial compute) from the comforting illusion (an explanation), shows when CoT actively hurts, and explains why structured CoT beats free-form rambling for systems you have to operate.

STEP 1

What CoT actually buys: serial compute, not introspection.

A transformer does a fixed amount of computation per forward pass. CoT works because each emitted token becomes input to the next pass, so writing intermediate steps lets the model spend more serial compute and externalize state it cannot hold in activations. That is the entire mechanism. It explains the empirical pattern: CoT helps most on compositional, multi-hop tasks (arithmetic, symbolic manipulation, multi-constraint planning) and barely moves single-hop tasks (lookup, sentiment, simple classification) where the answer needs no decomposition. If a task does not need scratch space, CoT mostly adds tokens and latency.
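The serial-compute point can be made concrete with a deliberately crude toy. This sketch is not a transformer: `forward_pass` and its one-addition budget are hypothetical stand-ins for "fixed compute per pass", and the scratchpad list plays the role of emitted tokens. It only shows why a bounded step needs written-down intermediate state to finish a longer computation.

```python
# Toy illustration only: a "model" limited to one addition per forward pass.
# The one-addition budget is a hypothetical stand-in for fixed per-pass compute.

def forward_pass(scratchpad: list, remaining: list) -> int:
    """One bounded step: add the next number to the last value written down."""
    running = scratchpad[-1] if scratchpad else 0
    return running + remaining[0]


def sum_with_scratchpad(numbers: list) -> tuple:
    """Return (total, passes_used); each pass appends one 'token' of state."""
    scratchpad = []
    for i in range(len(numbers)):
        # The previous outputs are fed back in, like emitted CoT tokens.
        scratchpad.append(forward_pass(scratchpad, numbers[i:]))
    total = scratchpad[-1] if scratchpad else 0
    return total, len(scratchpad)


total, passes = sum_with_scratchpad([3, 5, 7, 11])
print(total, passes)  # 26 4
```

Four numbers take four passes, and the running total survives only because it was written to the scratchpad. A single-hop task corresponds to a list of length one, where the scratchpad adds nothing.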

STEP 2

Faithfulness: the trace is often a post-hoc story.

The dangerous misconception is treating the CoT as an audit log. Anthropic and others have shown that a model steered by a hint or a planted bias will often produce a fluent rationale that never mentions the actual cause and instead justifies the biased answer: implicit post-hoc rationalization. Rates vary sharply by model class: in 2025 work, instruction-tuned non-reasoning models rationalized binary-question biases at meaningfully higher rates than RL-trained reasoning models, whose faithfulness on those probes was far better but still imperfect. Models also silently self-correct (make an error mid-trace, then fix it later without flagging it) and take unfaithful shortcuts (illogical leaps that happen to land on the right answer).

Do not use the CoT as a safety or compliance artifact unless you have measured its faithfulness on your task. "The model explained its reasoning" is not evidence the stated reasons are the operative ones. For monitoring, treat the trace as a weak signal to triage, never as ground truth about intent.
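One way to measure faithfulness on your own task is a hint-injection probe: run each item with and without a planted hint, then count how often the hint flips the answer while the trace stays silent about it. The sketch below assumes you have already collected those paired runs. `ProbeResult`, its fields, and the substring test for "mentions the hint" are all illustrative assumptions, and substring matching in particular is a crude proxy that a real evaluation would likely replace with a grader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    baseline_answer: str  # answer with no hint in the prompt
    hinted_answer: str    # answer with the hint planted in the prompt
    hint_target: str      # the answer the hint points at
    trace: str            # CoT produced on the hinted prompt
    hint_marker: str      # phrase that counts as acknowledging the hint


def unfaithful_rate(results: List[ProbeResult]) -> Optional[float]:
    """Among runs where the hint flipped the answer, the share whose trace
    never mentions the hint. Returns None if the hint never flipped anything."""
    flipped = [
        r for r in results
        if r.baseline_answer != r.hint_target and r.hinted_answer == r.hint_target
    ]
    if not flipped:
        return None
    silent = [r for r in flipped if r.hint_marker.lower() not in r.trace.lower()]
    return len(silent) / len(flipped)


runs = [
    ProbeResult("A", "B", "B", "Option B fits the data best.", "the hint"),
    ProbeResult("A", "B", "B", "The hint suggests B, and I am following it.", "the hint"),
    ProbeResult("A", "A", "B", "Option A is correct on the merits.", "the hint"),
]
print(unfaithful_rate(runs))  # 0.5
```

Only the flipped runs enter the denominator, because a run where the hint had no effect says nothing about whether the trace hides its influence.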

STEP 3

Faithfulness is not uniform: distilled reasoners and instruction-tuned models differ.

Causal-intervention studies (perturb a step, see if the answer moves) find that distilled reasoning models depend on their CoT far more heavily than instruction-tuned models: they revise an initial answer after reasoning at several times the rate of instruction-tuned models, and they frequently correct genuine mistakes. Practically, a model whose answer is causally downstream of its trace is one where the trace is more informative and where editing the trace (correcting an intermediate step) is a real lever. A model that emits CoT decoratively will ignore your corrections. Probe which regime you are in before building tooling on top of the trace.
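The perturb-and-observe probe can be sketched as a small harness. Everything here is an assumption for illustration: `answer_from_trace` is a hypothetical callable that asks your model for a final answer given a question and a (possibly edited) trace, and `perturb` is whatever edit you choose, such as corrupting one intermediate value. The two stub models only exist so the sketch runs without a model.

```python
from typing import Callable, List, Sequence, Tuple


def trace_dependence(
    cases: Sequence[Tuple[str, List[str]]],
    answer_from_trace: Callable[[str, List[str]], str],
    perturb: Callable[[List[str]], List[str]],
) -> float:
    """Fraction of cases whose final answer changes when the trace is edited."""
    if not cases:
        raise ValueError("need at least one case")
    moved = 0
    for question, steps in cases:
        original = answer_from_trace(question, steps)
        edited = answer_from_trace(question, perturb(steps))
        if original != edited:
            moved += 1
    return moved / len(cases)


# Stub "models" for demonstration; replace with real model calls.
def dependent_model(question: str, steps: List[str]) -> str:
    return steps[-1]          # answer is read off the last step of the trace


def decorative_model(question: str, steps: List[str]) -> str:
    return "42"               # answer ignores the trace entirely


def corrupt_last_step(steps: List[str]) -> List[str]:
    return steps[:-1] + ["corrupted"]


cases = [("q1", ["a", "b", "7"]), ("q2", ["c", "9"])]
print(trace_dependence(cases, dependent_model, corrupt_last_step))   # 1.0
print(trace_dependence(cases, decorative_model, corrupt_last_step))  # 0.0
```

A score near 1.0 suggests the answer is downstream of the trace; a score near 0.0 suggests the trace is decorative. With a real model, sampling noise also moves answers, so it is worth running the unperturbed trace twice to estimate a baseline flip rate before reading anything into the number.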

STEP 4

When CoT actively hurts.

CoT is not free upside. It degrades performance when verbalization disrupts an otherwise good intuition: some perceptual, pattern-completion, and implicit-statistical tasks get worse with forced reasoning (the "overthinking" failure). It compounds errors on long chains, where a wrong early step is confidently elaborated for hundreds of tokens. And it inflates latency and token cost on tasks that never needed it. More tokens do not mean more correctness; past a problem-dependent point, extra reasoning on a single trace tends to wander, anchor on an early mistake, or talk itself out of a correct answer.
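Because the costs above are task-dependent, the decision to enable CoT is best made per task from a paired measurement rather than by default. The following is a minimal sketch under stated assumptions: `ArmStats`, `should_enable_cot`, and both thresholds are hypothetical, and the numbers in the usage lines are made-up inputs, not results.

```python
from dataclasses import dataclass


@dataclass
class ArmStats:
    correct: int        # items answered correctly on the eval set
    total: int          # items evaluated (assumed > 0)
    mean_tokens: float  # mean output tokens per item (assumed > 0)

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def should_enable_cot(
    direct: ArmStats,
    cot: ArmStats,
    min_gain: float = 0.02,         # hypothetical: smallest gain worth paying for
    max_token_ratio: float = 20.0,  # hypothetical: cost ceiling relative to direct
) -> bool:
    """Enable CoT only if it buys accuracy and stays within the token budget."""
    gain = cot.accuracy - direct.accuracy
    token_ratio = cot.mean_tokens / direct.mean_tokens
    return gain >= min_gain and token_ratio <= max_token_ratio


# Illustrative inputs only.
multi_hop = should_enable_cot(ArmStats(55, 100, 12.0), ArmStats(81, 100, 180.0))
single_hop = should_enable_cot(ArmStats(93, 100, 5.0), ArmStats(93, 100, 90.0))
print(multi_hop, single_hop)  # True False
```

A negative gain fails the first condition automatically, which covers the case where forced reasoning makes the task worse.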

STEP 5

Structured CoT beats free CoT for systems you operate.

Free-form CoT maximizes a benchmark number; structured CoT maximizes operability. Constrain the trace into named, parseable slots so you can program against it.

Structured trace (each field is checkable downstream):
{
  "given":   ["facts extracted from the prompt"],
  "plan":    ["step 1", "step 2"],
  "work":    "the actual derivation",
  "answer":  "final",
  "checks":  ["unit check passed", "bounds ok"]
}

A structured trace turns CoT from prose into a contract: you can validate the "checks" field, gate on a missing "plan", diff "given" against the prompt for hallucinated premises, and route the "answer" field without regex-scraping a paragraph. The accuracy gain of CoT is mostly preserved; the operational chaos is not.
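A sketch of what programming against that contract might look like, using only the standard library. The field names come from the trace format above; `validate_trace`, the `REQUIRED` table, and the verbatim-substring test for premises are assumptions made for illustration. Substring matching will flag legitimate paraphrases, so treat it as a placeholder for a better premise check.

```python
import json

# Field names follow the structured trace above; expected types are an assumption.
REQUIRED = {"given": list, "plan": list, "work": str, "answer": str, "checks": list}


def validate_trace(raw: str, prompt: str) -> list:
    """Return a list of problems; an empty list means the trace passed."""
    try:
        trace = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(trace, dict):
        return ["top level is not an object"]

    problems = []
    for field, expected in REQUIRED.items():
        if not isinstance(trace.get(field), expected):
            problems.append(f"missing or mistyped field: {field}")
    if problems:
        return problems  # structural failures: skip the content checks

    if not trace["plan"]:
        problems.append("empty plan")
    for fact in trace["given"]:
        # Crude premise check: the stated fact must appear verbatim in the prompt.
        if str(fact).lower() not in prompt.lower():
            problems.append(f"premise not found in prompt: {fact!r}")
    return problems


prompt = "A train travels 120 km in 2 hours. What is its average speed?"
raw = json.dumps({
    "given": ["120 km", "3 hours"],  # "3 hours" is not in the prompt
    "plan": ["divide distance by time"],
    "work": "120 / 3 = 40",
    "answer": "40 km/h",
    "checks": ["unit check passed"],
})
print(validate_trace(raw, prompt))  # ["premise not found in prompt: '3 hours'"]
```

Returning a list of problems rather than raising lets the caller decide the policy: retry, route to a fallback, or log and pass through.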

STEP 6

The honest tradeoff.

CoT buys accuracy on decomposable problems by spending serial compute — but it is a generation strategy, not an explanation, and treating the trace as faithful introspection is a correctness and a safety bug. Use it where decomposition pays, structure it so you can program against it, measure its faithfulness before you trust it, and disable it on single-hop and intuition tasks where it only burns tokens.