Generation and sampling: temperature explained.
"Why did I get a different answer to the exact same prompt?" is the question that reveals the most important shift in how to think about LLMs: the model does not return an answer, it returns a probability distribution, and a separate sampling step turns that into the text you see. This entry explains that step, demystifies the temperature dial you have seen in every API, and gives you a task-driven rule for setting it.
The model outputs a distribution, not a word.
At each generation step, the model does not choose the next token. It produces a probability distribution over every token in its vocabulary: one probability for each of the roughly 100,000 or more candidates for what could come next. Only after that does a separate sampling step pick one token from that distribution. The chosen token is then appended to the text, and the whole process repeats for the next token.
Prompt: "The capital of France is"
Model's distribution over the next token:
" Paris" -> 0.85
" the" -> 0.04
" a" -> 0.02
" located" -> 0.02
" home" -> 0.01
... ~100,000 more tokens, almost all near zero ...
Sampling picks ONE. Then repeat for the token after that.
This is the mental model that dissolves a whole class of confusion: the model's actual output is the entire ranked distribution. "Generation" is a long sequence of distribution-then-pick steps. Whether the same prompt gives the same text depends entirely on how that pick is made.
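The distribution-then-pick step can be sketched in a few lines. This is a minimal illustration, not how any particular model or API implements it: the probabilities are the toy figures from the example above, the "<other>" bucket is an assumption standing in for the rest of the vocabulary, and the function names are hypothetical.

```python
import random

# Toy next-token distribution from the "capital of France" example.
# "<other>" is an assumed bucket for the ~100,000 remaining tokens.
NEXT_TOKEN_PROBS = {
    " Paris": 0.85,
    " the": 0.04,
    " a": 0.02,
    " located": 0.02,
    " home": 0.01,
    "<other>": 0.06,
}


def sample_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_token(probs):
    """Pick the single most probable token, with no randomness."""
    return max(probs, key=probs.get)


rng = random.Random(0)  # seeded only so the sketch is repeatable
picks = [sample_token(NEXT_TOKEN_PROBS, rng) for _ in range(1000)]
paris_share = picks.count(" Paris") / len(picks)

print(f"' Paris' sampled in {paris_share:.0%} of 1000 draws")
print("greedy pick:", repr(greedy_token(NEXT_TOKEN_PROBS)))
```

The same distribution yields the same text every time under the greedy pick, and mostly-but-not-always the same text under weighted sampling. That difference is the whole story behind "different answer, same prompt."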
Temperature: how sharply to favour the top choice.
Temperature is a single number (typically 0 to about 2) that reshapes the distribution before sampling, controlling how strongly the most probable tokens are favoured over the rest. It does not change what the model "thinks": the scores the model computes are the same, and the ranking of tokens stays the same. It only changes how decisively sampling commits to the front-runner.
- Temperature 0 — always take the single highest-probability token. The distribution is effectively collapsed to its peak. Most repeatable, most predictable, least varied. Sometimes called "greedy."
- Temperature ≈ 1 — sample from the model's distribution roughly as-is. Its natural level of variability: usually the likely token, but plausible alternatives genuinely surface.
- Temperature ≈ 2 — flatten the distribution. Unlikely tokens get a real chance. Output becomes more surprising and more diverse, but also less coherent and more error-prone.
[Figure: the same distribution drawn at three temperatures. Temperature 0: one peak, always picks the top. Temperature 1: likely wins but variety is real. Temperature 2: nearly flat, anything can appear. Same underlying distribution; temperature only reshapes it.]
The practical mental image: temperature is a "play it safe ↔ take risks" dial applied to text. Low temperature hugs the most-likely path; high temperature wanders off it on purpose.
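For readers who want the arithmetic behind the dial: the usual formulation divides the model's raw scores (logits) by the temperature before normalising, which is equivalent to raising each probability to the power 1/T and rescaling so the results sum to 1. The sketch below assumes that formulation and treats temperature 0 as the greedy limit, which is a common convention rather than something every API guarantees. The function name and the "<other>" bucket are hypothetical.

```python
def apply_temperature(probs, temperature):
    """Return a reshaped copy of `probs` for the given temperature.

    Each probability is raised to the power 1/T and the results are
    re-normalised. Temperature 0 is handled as a special case: all
    probability goes to the top token (greedy).
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        top = max(probs, key=probs.get)
        return {tok: (1.0 if tok == top else 0.0) for tok in probs}
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


# Toy figures from the France example. Lumping the tail into one "<other>"
# bucket is a simplification: a real tail is many tiny tokens, so this
# understates how much high temperatures spread probability around.
PROBS = {
    " Paris": 0.85,
    " the": 0.04,
    " a": 0.02,
    " located": 0.02,
    " home": 0.01,
    "<other>": 0.06,
}

for t in (0, 0.5, 1.0, 2.0):
    reshaped = apply_temperature(PROBS, t)
    print(f"T={t}: P(' Paris') = {reshaped[' Paris']:.3f}")
```

Note that the ranking of tokens never changes for any positive temperature; only the gaps between them grow (low T) or shrink (high T).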
Choosing temperature by task, not by taste.
The setting should follow the task, not personal preference. A reliable guide:
- Classification, extraction, structured output: 0 – 0.2. There is a correct answer; you want the highest-probability one. Variety is pure downside here.
- Tool-calling / decisions in an agent: 0 – 0.3. You want predictable behaviour on the same input. Randomness in which tool gets called is a bug, not a feature.
- Code generation: ~0.2 – 0.4. Low, but not zero — a little flexibility for novel problems while staying close to well-trodden patterns.
- Summarising, rewriting: ~0.3 – 0.7. Variety in phrasing is genuinely desirable; the content is constrained anyway.
- Brainstorming, creative writing: ~0.7 – 1.0+. You want surprise. The single most-likely token is often the most generic and forgettable one.
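One way to keep this rule out of individual call sites is a small lookup table. The sketch below copies the ranges from the list above; the task keys, the helper name, and the choice to start from the midpoint of each range are all assumptions made for illustration, not a recommendation from any provider.

```python
# Ranges taken from the guide above; keys and helper are hypothetical.
TEMPERATURE_RANGES = {
    "classification": (0.0, 0.2),
    "extraction": (0.0, 0.2),
    "structured_output": (0.0, 0.2),
    "tool_calling": (0.0, 0.3),
    "code_generation": (0.2, 0.4),
    "summarising": (0.3, 0.7),
    "rewriting": (0.3, 0.7),
    "brainstorming": (0.7, 1.0),
    "creative_writing": (0.7, 1.0),
}


def starting_temperature(task):
    """Return the midpoint of the task's range as a first value to try."""
    if task not in TEMPERATURE_RANGES:
        # Fail loudly instead of silently falling back to some default.
        raise ValueError(f"no temperature guidance for task: {task!r}")
    low, high = TEMPERATURE_RANGES[task]
    return round((low + high) / 2, 2)


print(starting_temperature("code_generation"))
print(starting_temperature("brainstorming"))
```

Raising on an unknown task is deliberate: a silent default would reintroduce setting temperature by accident rather than by task.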
You may also meet top_p (nucleus sampling), a related dial that truncates the distribution instead of reshaping it: sample only from the smallest set of top-ranked tokens whose probabilities sum to at least top_p (e.g. 0.9), and discard the rest. Intuitively, it cuts off the long tail of improbable tokens. Most applications tune temperature alone and leave top_p at its default; reach for it only when you have a specific failure to mitigate.
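The truncation that top_p performs can be sketched as follows. This is a minimal version of nucleus filtering under the definition given above; the function name and the four-token distribution are hypothetical, and real implementations work on the full vocabulary and may differ in details such as tie handling.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose total mass reaches top_p.

    The surviving probabilities are re-normalised to sum to 1; every
    other token is dropped and can no longer be sampled.
    """
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Hypothetical toy distribution, chosen so the cut-off is easy to follow.
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}

nucleus = top_p_filter(toy, 0.9)
print(sorted(nucleus))  # "d" falls outside the nucleus and is dropped
```

With top_p = 0.9, tokens "a" and "b" together cover only 0.8, so "c" is needed to reach the threshold, and "d" is excluded entirely no matter what temperature is applied afterwards.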
The trap: "temperature 0 is deterministic."
The most consequential misconception. Temperature 0 makes output much more consistent, but it is not a guarantee of identical results, for reasons that have nothing to do with sampling randomness:
- Floating-point non-determinism. Inference servers batch many requests together for efficiency. The exact batch changes the order of tiny numerical additions, and floating-point addition is not associative: (a + b) + c can differ from a + (b + c) in the last few bits. The difference is usually invisible, but occasionally it is enough to flip which token is ranked first, especially when the top two are nearly tied.
- Model snapshot updates. The same API model name can point to slightly updated weights over time. Same call, slightly different distribution.
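- Server-side variation. Caching, routing, and fallback machinery mean that two identical requests are not guaranteed to run on the same hardware or code path, which can introduce small perturbations even at temperature 0.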
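The floating-point point is easy to reproduce on any machine. The sketch below is a deliberately contrived illustration, not a model of a real inference server: the two tokens are given scores that are mathematically identical (both sum to 0.6), so the only thing that breaks the tie is the order in which the terms are added. All names are hypothetical.

```python
# Same three numbers, two groupings, two different results.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)  # False


def accumulate(terms):
    """Add floats strictly left to right with a plain loop."""
    total = 0.0
    for term in terms:
        total += term
    return total


# Contrived: each token's score is the sum of the same three terms,
# listed in opposite orders, so the exact scores are tied.
CONTRIBUTIONS = {
    "token_a": [0.1, 0.2, 0.3],
    "token_b": [0.3, 0.2, 0.1],
}


def top_token(reverse_order):
    """Return the highest-scoring token for a given summation order."""
    scores = {}
    for token, terms in CONTRIBUTIONS.items():
        ordered = list(reversed(terms)) if reverse_order else terms
        scores[token] = accumulate(ordered)
    return max(scores, key=scores.get)


print(top_token(reverse_order=False))  # token_a
print(top_token(reverse_order=True))   # token_b
```

Greedy selection is only as stable as the arithmetic underneath it: when two candidates are this close, a change in summation order is enough to change the "deterministic" pick.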
So treat temperature 0 as "low variance," never "no variance." If you need genuine reproducibility for tests, combine temperature 0 with a fixed seed where supported and ideally a pinned model version — and still expect rare drift, since providers document seeds as best-effort, not contractual.
The real upgrade: stop calling the model "wrong" when it gives a different answer to the same prompt. It is sampling from a distribution. The useful question is not "why isn't it deterministic?" but "is the distribution centred on the right answer with appropriate confidence?" — and that is measurable. A model that is right 80% of the time is not broken when one run lands in the other 20%; that is sampling, working as designed. Use a low temperature when you want consistency, a higher one when you want range, and judge quality over many runs, not one.
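Judging over many runs can be as simple as the harness sketched below. It assumes you can wrap your model call in a zero-argument function that returns the answer as a string; here a toy stand-in that is right about 80% of the time takes the place of a real model, and every name is hypothetical.

```python
import math
import random


def accuracy_over_runs(generate, expected, runs):
    """Call `generate` repeatedly and report the hit rate and its standard error."""
    hits = sum(1 for _ in range(runs) if generate() == expected)
    rate = hits / runs
    std_error = math.sqrt(rate * (1 - rate) / runs)
    return rate, std_error


# Toy stand-in for a model call: correct roughly 80% of the time.
_rng = random.Random(42)  # seeded only so the sketch is repeatable


def toy_generate():
    return "Paris" if _rng.random() < 0.8 else "Lyon"


rate, std_error = accuracy_over_runs(toy_generate, "Paris", runs=500)
print(f"accuracy: {rate:.1%} (+/- {std_error:.1%})")
```

A single wrong run tells you almost nothing; the rate over a few hundred runs, with its uncertainty, tells you whether the distribution is centred where you need it.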