Generation & Sampling: Temperature

Generation and sampling: temperature explained.

"Why did I get a different answer to the exact same prompt?" is the question that reveals the most important shift in how to think about LLMs: the model does not return an answer, it returns a probability distribution, and a separate sampling step turns that into the text you see. This entry explains that step, demystifies the temperature dial you have seen in every API, and gives you a task-driven rule for setting it.

STEP 1

The model outputs a distribution, not a word.

At each generation step, the model does not choose the next token. It produces a probability distribution over every token in its vocabulary: a probability for each possible next token, of which there are often on the order of 100,000. Only after that does a separate sampling step pick one token from that distribution. Then the chosen token is appended and the whole process repeats for the next token.

Prompt: "The capital of France is"
Model's distribution over the next token:
  " Paris"    -> 0.85
  " the"      -> 0.04
  " a"        -> 0.02
  " located"  -> 0.02
  " home"     -> 0.01
  ... ~100,000 more tokens, almost all near zero ...

Sampling picks ONE. Then repeat for the token after that.

This is the mental model that dissolves a whole class of confusion: the model's actual output is the entire ranked distribution. "Generation" is a long sequence of distribution-then-pick steps. Whether the same prompt gives the same text depends mostly on how that pick is made (Step 4 covers the exceptions).
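A minimal sketch of the distribution-then-pick step, using the toy numbers from the example above. The "<other>" bucket standing in for the long tail, and the helper names sample_token and greedy_token, are illustrative assumptions, not part of any real API.

```python
import random

# Toy next-token distribution from the "capital of France" example.
# "<other>" lumps the remaining long tail together (an assumption for the demo).
next_token_probs = {
    " Paris": 0.85,
    " the": 0.04,
    " a": 0.02,
    " located": 0.02,
    " home": 0.01,
    "<other>": 0.06,
}

def sample_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_token(probs):
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)

rng = random.Random(0)  # fixed seed so the demo itself is repeatable
picks = [sample_token(next_token_probs, rng) for _ in range(1000)]
paris_share = picks.count(" Paris") / len(picks)

print("greedy pick:", repr(greedy_token(next_token_probs)))
print("share of sampled picks that were ' Paris':", paris_share)
```

The greedy pick never changes, while the sampled picks land on " Paris" most of the time but not always. That difference is the whole story behind "same prompt, different answer."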

STEP 2

Temperature: how sharply to favour the top choice.

Temperature is a single number that reshapes the distribution before sampling, controlling how strongly the most probable tokens are favoured over the rest. The allowed range depends on the provider; 0 to 1 and 0 to 2 are both common. It does not change what the model "thinks": the raw scores the model computes for each token are the same. It only changes how decisively sampling commits to the front-runner.

Temperature 0 — always take the single highest-probability token. The distribution is effectively collapsed to its peak. Most repeatable, most predictable, least varied. Sometimes called "greedy."
Temperature ≈ 1 — sample from the model's distribution roughly as-is. Its natural level of variability: usually the likely token, but plausible alternatives genuinely surface.
Temperature ≈ 2 — flatten the distribution. Unlikely tokens get a real chance. Output becomes more surprising and more diverse, but also less coherent and more error-prone.

  temperature 0          temperature 1          temperature 2
  |#                     |#                     |#
  |#                     |##                    |###
  |#                     |####                  |#####
  |#  . . . . .          |#######  . .          |#########
  one peak, always       likely wins but        nearly flat,
  picks the top          variety is real        anything can appear

  Same underlying distribution; temperature only reshapes it.
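The text does not spell out the arithmetic, so the following is a sketch of the usual formulation: divide the model's raw scores (logits) z_i by the temperature T, then apply a softmax, p_i = exp(z_i / T) / sum_j exp(z_j / T). The logit values and the function name apply_temperature are made up for illustration, and treating T = 0 as "pick the top token" is a convention, since dividing by zero is undefined.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores (logits) into probabilities at a given temperature."""
    if temperature <= 0:
        # Convention: temperature 0 means greedy, all mass on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for four candidate tokens (not from a real model).
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running this shows the top token's share shrinking as T grows, while the ranking of the tokens never changes. That is the sense in which temperature reshapes the distribution without changing what the model prefers.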

The practical mental image: temperature is a "play it safe ↔ take risks" dial applied to text. Low temperature hugs the most-likely path; high temperature wanders off it on purpose.

STEP 3

Choosing temperature by task, not by taste.

The setting should follow the task, not personal preference. A reliable guide:

Classification, extraction, structured output: 0 – 0.2. There is a correct answer; you want the highest-probability one. Variety is pure downside here.
Tool-calling / decisions in an agent: 0 – 0.3. You want predictable behaviour on the same input. Randomness in which tool gets called is a bug, not a feature.
Code generation: ~0.2 – 0.4. Low, but not zero — a little flexibility for novel problems while staying close to well-trodden patterns.
Summarising, rewriting: ~0.3 – 0.7. Variety in phrasing is genuinely desirable; the content is constrained anyway.
Brainstorming, creative writing: ~0.7 – 1.0+. You want surprise. The single most-likely token is often the most generic and forgettable one.
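One way to keep this guide out of individual call sites is a small lookup table. The sketch below encodes the ranges listed above; the task labels, the TEMPERATURE_RANGES name, and the choice to default to the midpoint of each range are all assumptions made for illustration.

```python
# Hypothetical task labels mapped to the (low, high) ranges from the guide.
# The open-ended "1.0+" for creative work is capped at 1.0 here.
TEMPERATURE_RANGES = {
    "classification": (0.0, 0.2),
    "extraction": (0.0, 0.2),
    "structured_output": (0.0, 0.2),
    "tool_calling": (0.0, 0.3),
    "code_generation": (0.2, 0.4),
    "summarising": (0.3, 0.7),
    "rewriting": (0.3, 0.7),
    "brainstorming": (0.7, 1.0),
    "creative_writing": (0.7, 1.0),
}

def default_temperature(task):
    """Return a starting temperature for a task: the midpoint of its range."""
    low, high = TEMPERATURE_RANGES[task]
    return round((low + high) / 2, 2)

for task in ("classification", "code_generation", "brainstorming"):
    print(task, default_temperature(task))
```

The midpoint is only a starting value; the point of the table is that the setting is chosen by task and reviewed in one place.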

You may also meet top_p (nucleus sampling), a related dial that truncates the distribution instead of reshaping it: sample only from the smallest set of top tokens whose probabilities sum to at least top_p (e.g. 0.9). Intuitively, it cuts off the long tail of improbable tokens so they can never be sampled. Most applications tune temperature alone and leave top_p at its default; reach for it only when you have a specific failure to mitigate.
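A rough sketch of the truncation that top_p performs, on a made-up four-token distribution. The function name top_p_filter and the token labels are hypothetical; real inference stacks apply the same idea to the full vocabulary.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total reaches top_p,
    then renormalise so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; everything after is dropped
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Illustrative distribution (not from a real model).
toy_probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}

nucleus = top_p_filter(toy_probs, 0.9)
print(nucleus)  # "D" is excluded and can no longer be sampled
```

With top_p = 0.9, tokens A, B, and C are needed to reach the threshold, so D is removed entirely. Temperature would only have made D less likely.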

STEP 4

The trap: "temperature 0 is deterministic."

The most consequential misconception. Temperature 0 makes output much more consistent, but it is not a guarantee of identical results, for reasons that have nothing to do with sampling randomness:

Floating-point non-determinism. Inference servers batch many requests together for efficiency. The composition of the batch can change the order in which tiny numerical additions are performed, and floating-point addition is not associative, so the same sum computed in a different order can differ in its last bits. Usually this is invisible, but occasionally it is enough to flip which token is ranked first, especially when the top two are nearly tied.
Model snapshot updates. The same API model name can point to slightly updated weights over time. Same call, slightly different distribution.
Server-side variation. Caching, routing, and fallback machinery can introduce small perturbations even at temperature 0.
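The floating-point point is easy to reproduce on any machine. The sketch below adds the same three numbers in two different orders and then shows how the last-bit difference can decide a near-tie. The token names and the rival score are contrived for illustration; real inference involves far larger sums, but the mechanism is the same.

```python
def add_in_order(terms):
    """Add terms strictly left to right with ordinary float addition."""
    total = 0.0
    for t in terms:
        total += t
    return total

terms = [0.1, 0.2, 0.3]
forward = add_in_order(terms)                  # (0.1 + 0.2) + 0.3
backward = add_in_order(list(reversed(terms)))  # (0.3 + 0.2) + 0.1

print(forward, backward, forward == backward)

def top_token(score_a, score_b):
    """Greedy pick between two hypothetical tokens; ties go to token_b."""
    return "token_a" if score_a > score_b else "token_b"

rival_score = 0.6  # contrived near-tie with the summed score
print(top_token(forward, rival_score))   # summed in one order
print(top_token(backward, rival_score))  # same terms, other order
```

The two sums differ only in the last bit, yet the greedy pick changes. No randomness is involved, which is why temperature 0 alone cannot rule this out.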

So treat temperature 0 as "low variance," never "no variance." If you need genuine reproducibility for tests, combine temperature 0 with a fixed seed where supported and, ideally, a pinned model version. Still expect rare drift: providers that offer a seed parameter generally document it as best-effort, not contractual.

The real upgrade: stop calling the model "wrong" when it gives a different answer to the same prompt. It is sampling from a distribution. The useful question is not "why isn't it deterministic?" but "is the distribution centred on the right answer with appropriate confidence?" — and that is measurable. A model that is right 80% of the time is not broken when one run lands in the other 20%; that is sampling, working as designed. Use a low temperature when you want consistency, a higher one when you want range, and judge quality over many runs, not one.
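Judging over many runs can be sketched as a small evaluation loop. Here fake_model is a hypothetical stand-in that is right 80% of the time, matching the figure in the paragraph above; in practice it would be replaced by a real model call, and the answers "Paris" and "Lyon" are placeholders.

```python
import random
from collections import Counter

def fake_model(rng):
    """Hypothetical stand-in for a model call: correct about 80% of the time."""
    return "Paris" if rng.random() < 0.8 else "Lyon"

def evaluate(n_runs, expected, seed=0):
    """Run the model n_runs times and report how often it gives `expected`."""
    rng = random.Random(seed)  # seeded so the evaluation itself is repeatable
    answers = Counter(fake_model(rng) for _ in range(n_runs))
    accuracy = answers[expected] / n_runs
    return accuracy, answers

accuracy, answers = evaluate(500, "Paris")
print(f"accuracy over 500 runs: {accuracy:.2f}")
print(answers.most_common())
```

A single "Lyon" in this tally says nothing about whether the model is broken; the accuracy across the whole sample is the number worth tracking.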