LLM Mental Model — The Agentic AI Field Guide

0.1

Part 0 / Foundations · The intuitions everything later assumes

The LLM mental model.

This chapter teaches you to think about a language model well enough to predict roughly what it will do on a given prompt, and to explain why a specific failure mode happens when it doesn't. Not "what is an LLM" — you've used one. The working intuitions that turn LLMs from black boxes into systems you can reason about: tokens as the real unit, context windows as a finite resource, sampling as a distribution to navigate, and hallucinations as predictable failure modes with named mechanisms. Everything in the rest of this guide assumes these intuitions.

STEP 1

Tokens: the real unit.

A language model does not read characters and does not read words. It reads tokens — subword units, typically between 2 and 6 characters in length for English, picked by the tokenizer to balance vocabulary size against sequence length. Internalizing this is the foundation everything else rests on, because half the bugs and surprises in LLM work trace back to confusion about what counts as a token.

The first thing to see

Run a tokenizer on a few strings. The counts below illustrate what a BPE tokenizer of the kind Anthropic and OpenAI use typically produces; exact splits and counts vary by tokenizer and version, so run your own before relying on any single number:

"Hello"           → 1 token
"Hello world"     → 2 tokens  (Hello | world)
"ChatGPT"         → 1 token   (one canonical chunk)
"Anthropic"       → 1 token
"unhappiness"     → 2–3 tokens  (un | happiness, or un | happi | ness)
"supercalifragilisticexpialidocious" → 8 tokens

"Czesław Miłosz"  → 8 tokens  (C | z | es | ł | a | w |  Mił | osz ...)
"日本語"          → 3 tokens  (one per character)
"🦀"             → 2 tokens  (the emoji decomposes into bytes)

"{\"name\":\"alex\"}"           → 7 tokens
"{ \"name\" : \"alex\" }"       → 10 tokens (extra spaces cost real tokens)

Four things in this list deserve attention.

"ChatGPT" is one token. Frequent strings in the training corpus get their own dedicated token because it's efficient. "Anthropic" gets one. "OpenAI" probably gets one or two. Common product names, API names, programming keywords — all one token, because they appeared often enough that the tokenizer training merged them. This is also why def function_name(): in Python is fewer tokens than the English description of the same thing.

Polish, Japanese, and emoji cost much more. "Czesław", a normal Polish name, costs six or seven tokens because the tokenizer was trained mostly on English text and learned few merges for Polish diacritics. The same name in English transliteration ("Czeslaw") would be 2–3 tokens. This is not a bug; it's a direct consequence of the training distribution. The practical impact: a chatbot serving non-English users can cost roughly 2–4× more per turn than the same chatbot serving English users, depending on language and tokenizer, and it has a smaller effective context window because each user message eats more tokens.

Whitespace and formatting are real tokens. The two JSON examples differ by whitespace and the difference is meaningful. Tabs, newlines, extra spaces — they all consume tokens. A "compact" prompt format is genuinely cheaper than a "readable" one. This matters most in production where you're paying for every token, multiplied by every request.

Tokenization is deterministic but not intuitive. The tokenizer is a piece of code that takes a string and produces a sequence of integers (token IDs). Same input, same output, every time. But the boundaries are picked by an algorithm trained on a corpus, not by linguistic rules. You cannot predict tokenization by inspection; you have to actually run the tokenizer to know.
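One way to build intuition for the non-English cost without a tokenizer at hand is to look at UTF-8 byte counts. The sketch below assumes a byte-level BPE tokenizer (tiktoken's encodings are of this kind), where a string can never need more tokens than it has bytes. The helper name `utf8_byte_count` is made up for illustration, and the byte count is only a ceiling, not a prediction of the real token count.

```python
# Worst case under a byte-level BPE tokenizer: one token per UTF-8 byte.
# Real tokenizers usually do better, because frequent byte sequences are merged.

def utf8_byte_count(text: str) -> int:
    """Return the number of UTF-8 bytes in text (a ceiling on byte-level BPE tokens)."""
    return len(text.encode("utf-8"))

samples = ["Hello", "Czeslaw", "Czesław", "日本語", "🦀"]

for s in samples:
    # Characters and bytes agree for ASCII and diverge for everything else.
    print(f"{s!r}: {len(s)} characters, {utf8_byte_count(s)} bytes")
```

ASCII text has one byte per character, so any merge at all pushes it well under the ceiling. Each Japanese character is three bytes and the emoji is four, which is why rare scripts fall back to several tokens per visible character.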

The implications for your prompt costs

The concrete consequence you'll feel in production:

Your prompt cost is not your character count, your word count, or any quantity you can eyeball. It's a tokenizer-output count, and the only way to know it is to measure. The provider's billing dashboard will show you the real number after the fact; if you want to know in advance, run the tokenizer locally on your inputs.

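# Pre-flight token counting (Anthropic); needs ANTHROPIC_API_KEY in the environment
from anthropic import Anthropic

# Placeholder input so the snippet runs on its own; substitute your real prompt
prompt = "Summarize the attached incident report in three bullet points."

client = Anthropic()
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": prompt}],
)
print(f"input tokens: {count.input_tokens}")

# Or locally with tiktoken for OpenAI models
import tiktoken

try:
    enc = tiktoken.encoding_for_model("gpt-5.5")
except KeyError:
    # tiktoken raises KeyError for model names it has no mapping for.
    # Falling back to o200k_base is an assumption; check which encoding your model uses.
    enc = tiktoken.get_encoding("o200k_base")
print(f"input tokens: {len(enc.encode(prompt))}")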

The implications for failures

Three failure modes that trace back to tokens, all of which look like other things until you check:

Truncation mid-name in JSON outputs. The model is generating a JSON object that contains a name field. max_tokens is set to 200. The model emits {"name": "Czesł and stops because the tokens to finish that Polish name plus the rest of the JSON exceeded the budget. The JSON is malformed; your downstream code crashes. If you'd measured token counts before setting max_tokens, you'd have given it 250.

Subword bleeding in structured output. You ask the model to output one of three categories: "critical", "warning", "info". You get back "criti" or "warning_" or "infos". The model started emitting the right tokens, but sampling drifted at a subword boundary: it stopped early or appended a stray piece. Strict mode and JSON-schema-constrained generation exist to prevent exactly this, by restricting each sampling step to tokens that keep the output valid.

Inconsistent behavior on edge cases. Your prompt works on examples typed with straight ASCII quotes. It fails on the same examples typed with curly ("smart") quotes. The reason: " and “ are different characters and tokenize differently. The model has seen the ASCII version far more often in training (say, a million times versus 10,000), so behavior diverges on inputs that look identical to humans.
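When constrained generation isn't available, a defensive parse catches subword bleeding before it reaches downstream code. This is a minimal sketch: the `ALLOWED_CATEGORIES` set and the `parse_category` helper are hypothetical names, and rejecting near-misses is one policy among several (you might prefer to retry the call).

```python
# Validate a free-text category against the closed set the prompt asked for.
ALLOWED_CATEGORIES = {"critical", "warning", "info"}

def parse_category(raw: str) -> str:
    """Normalize the model's output and reject anything outside the allowed set."""
    cleaned = raw.strip().strip('"').lower()
    if cleaned not in ALLOWED_CATEGORIES:
        # Near-misses such as "criti" or "infos" are rejected, not guessed at.
        raise ValueError(f"unexpected category: {raw!r}")
    return cleaned

print(parse_category(' "Warning" '))
```

Rejecting rather than fuzzy-matching keeps the failure visible, which makes it easier to notice when a prompt change starts producing malformed labels.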

The single highest-leverage habit for an LLM engineer: look at the tokenization of any prompt you're optimizing. The free tokenizer playgrounds (Anthropic and OpenAI both publish web tools) let you paste text and see the token boundaries colored. Five minutes of staring at how your prompt tokenizes will surface bugs you've been chasing for hours.

Question

How big are the tokenizer vocabularies?

Roughly 100,000–200,000 tokens. Claude's tokenizer is around 200K; OpenAI's cl100k_base is 100,256. Llama's tokenizers vary by version. The vocabulary is chosen at tokenizer training time; once frozen, it's part of the model. A bigger vocabulary means each token covers more text on average, which means lower token counts for the same input and therefore lower cost, at the price of more parameters in the embedding layer.

The practical takeaway: tokenizer choice affects cost, but you don't control it. Provider picks the tokenizer; you pay for what it gives you.

Question

Why is "ChatGPT" one token but "Czesław" six?

BPE (byte-pair encoding, the algorithm) merges the most frequent character pairs in the training corpus into vocabulary entries. "ChatGPT" appeared often enough during tokenizer training to earn a dedicated entry. "Czesław" did not, so it gets broken down into smaller pieces (and "ł" specifically is rare enough in English text that it usually gets its own token or even decomposes to UTF-8 bytes). The vocabulary is a frequency-weighted snapshot of the training corpus.

This is why technical English (with words like "function", "object", "request") tokenizes well — those words are extremely frequent in code-heavy training corpora.
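The merge process is easier to trust once you have watched it run. The sketch below is a toy, character-level BPE trainer on a made-up three-word corpus. Production tokenizers work on bytes, use far larger corpora, and add pre-tokenization rules, so the function names and the corpus here are illustrative only.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency; return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first appearance, which keeps the toy deterministic.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Run num_merges rounds of 'merge the most frequent adjacent pair'."""
    words = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return words, merges

# Made-up frequencies: "chat" is common, the Polish name appears once.
corpus = {"chat": 50, "chatbot": 30, "czesław": 1}
words, merges = train_bpe(corpus, num_merges=3)

print(merges)
for symbols, freq in words.items():
    print(freq, " | ".join(symbols))
```

After three merges the frequent word has collapsed into a single symbol while the rare name is still seven separate characters. That is the same asymmetry as in the answer above, at toy scale.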

Question

Does the model see token IDs or token strings?

Token IDs — integers. The model never sees the string "ChatGPT"; it sees something like [5942]. The model's embedding layer maps each ID to a high-dimensional vector, and from there it's all vector arithmetic. The string-to-ID mapping (the tokenizer) is a separate piece of code that runs before the model sees anything and after the model emits anything.

This is why models can output tokens that don't decode to valid UTF-8 in rare cases — the model is operating on IDs, the tokenizer is operating on strings, and the round-trip can fail for byte-fallback tokens. Streaming has to buffer until valid UTF-8 emerges.
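The buffering step can be sketched with Python's standard incremental decoder. The byte chunks below stand in for the bytes of successive streamed tokens; the split of the crab emoji across two chunks is contrived for illustration, and `stream_text` is a hypothetical helper, not part of any SDK.

```python
import codecs

def stream_text(byte_chunks):
    """Yield text only once the buffered bytes form complete UTF-8 characters."""
    # errors="replace" turns a truncated tail into U+FFFD instead of raising.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # returns "" while a character is incomplete
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush whatever is still buffered
    if tail:
        yield tail

crab = "🦀".encode("utf-8")  # four bytes
chunks = [b"Hi ", crab[:2], crab[2:], b"!"]  # emoji split across two chunks

pieces = list(stream_text(chunks))
print(pieces)
```

The second chunk produces no output on its own; the decoder holds those two bytes until the rest of the character arrives.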

STEP 2

The context window is a finite, position-sensitive resource.

You hear "200K context window" and intuit "200,000 tokens of free space." Both halves of that intuition are wrong in production. The space costs real money, and not all positions inside it are equal — content in the middle of a long context performs measurably worse than the same content at the start or end. Treating the context window as homogeneous flat memory is the source of subtle quality bugs that don't surface in dev.

Context costs scale with usage, not capacity

The headline number, Claude Sonnet 4.5's 200K context window, is a maximum, not a price tier. You pay for the tokens you actually use, and you pay for them on every turn. A multi-turn agent conversation that has accumulated 80K tokens of history pays for 80K input tokens on every subsequent model call, not once. Over 20 further calls, that's 20 × 80K = 1.6M input tokens paid for the same accumulated history, before counting anything the new turns add. Prompt caching (chapter 2.2) reduces this significantly when context is stable, but the baseline shape is "context cost compounds with conversation length."

The corollary: context is the most expensive resource an agent burns. Not model calls, not tool calls — tokens of accumulated context. Optimizing how much context an agent carries forward (summarization, truncation, subagent isolation as in chapter 1.3) is usually the highest-leverage cost optimization available.
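The compounding is simple arithmetic, and it helps to write it down once. In this sketch the price of $3 per million input tokens is an assumption (it matches the per-turn figure quoted later in this chapter, but check current pricing), and both function names are made up.

```python
PRICE_PER_MILLION_INPUT = 3.00  # assumed USD price; verify against current pricing

def flat_history_tokens(history_tokens: int, calls: int) -> int:
    """Input tokens billed when the same history is resent on every call."""
    return history_tokens * calls

def growing_history_tokens(base_tokens: int, added_per_turn: int, turns: int) -> int:
    """Input tokens billed when each turn adds to the history before the next call."""
    return sum(base_tokens + added_per_turn * t for t in range(1, turns + 1))

def dollars(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT

flat = flat_history_tokens(80_000, 20)
growing = growing_history_tokens(base_tokens=2_000, added_per_turn=4_000, turns=20)

print(f"flat 80K history, 20 calls: {flat:,} tokens, ${dollars(flat):.2f}")
print(f"growing history, 20 turns:  {growing:,} tokens, ${dollars(growing):.2f}")
```

The growing case is the more realistic one for agents: cost rises roughly with the square of the number of turns, because every turn pays again for everything before it.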

The lost-in-the-middle effect

The harder intuition: position inside the context window affects how reliably the model can use what's there. Multiple studies in 2023–2025 documented the pattern that has come to be called "lost in the middle": when you put 30 documents in context and ask the model to use one of them, the model is much more likely to use a document positioned at the start or end of the context than one positioned in the middle.

The shape of the effect:

[Figure: recall accuracy (0–100) by document position in a 30-doc context, from start of context to end of context. The curve is U-shaped: high at both ends, with a dip labeled "lost middle" in between.]

Recall is high at the start (the model anchored on the first few) and at the end (recency effect: these were seen most recently). In the middle, performance can drop by 20–40 percentage points.

The effect is well-established, varies in magnitude by model and task, and shrinks (but does not disappear) with newer models. Frontier models in 2026 handle long contexts better than 2023 models, but the middle-of-context penalty is still measurable. Plan around it.

What to do about it

Three practical implications for how you arrange information in your prompts:

Put the most important content at the start and end. If your system prompt has critical rules ("never reveal API keys"), put them either at the very top or very bottom — not buried in the middle of a 2000-token system prompt. If your retrieved context has 10 chunks ranked by relevance, put the top-ranked chunks at the boundaries, not the middle.

Restate critical instructions near the user's actual question. If the system prompt says "respond in JSON only" and then 50K tokens of conversation history follow, the model may forget the JSON constraint by the time it generates the response. The fix: include a short restatement at the end of the user message ("Respond in JSON only, as specified in the system prompt"). The redundancy looks ugly; it's worth it.

Don't fill the context just because you can. Adding more documents to retrieval-augmented generation often hurts quality at some point, not because the model can't read them, but because the relevant document is now buried in the middle and the model can't focus on it. Most production RAG settles at top-5 or top-10 chunks, not because that's all the model can handle, but because that's the sweet spot before lost-in-the-middle starts costing more than the marginal context adds.
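Placing top-ranked chunks at the boundaries can be done mechanically. The sketch below assumes the retriever returns chunks best-first and alternates them between the front and the back of the context, so the weakest chunks end up in the middle. The function name is hypothetical, and whether this ordering helps on your task is something to confirm on your own eval set.

```python
def order_for_boundaries(ranked_chunks):
    """Take chunks ranked best-first; return them with the best at both ends."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Even ranks fill the front in order; odd ranks are collected for the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reversing the back half puts the second-best chunk last.
    return front + back[::-1]

ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(order_for_boundaries(ranked))
```

With five chunks, the best one opens the context, the second-best closes it, and the lowest-ranked chunk sits in the middle where a recall drop costs the least.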

The mental model: attention as a budget

The mechanism underlying lost-in-the-middle: every token in context has to be attended to when generating the next token. The attention mechanism is fundamentally about weighing all input tokens — and with 100,000 tokens of input, each one gets a small slice of attention by default. Tokens at the start get "anchored" early in processing (the model's representation builds from left to right in a sense); tokens at the end are most recent in the sequence and get strong positional weighting. Middle tokens are neither anchored nor recent.

You don't need to understand the math to use the intuition: think of attention as a limited budget the model spreads across input tokens, with structural advantages for the start and end. Information you want the model to actually use should sit where the budget is biggest.

WHERE TO PUT WHAT

Start of context:
─ System prompt with critical rules
─ Most important retrieved documents
─ Tool definitions (model needs them throughout)

Middle of context:
─ Bulk content the model needs to read but probably won't refer back to constantly
─ Historical conversation turns the model can summarize

End of context (just before model generation):
─ User's most recent message
─ Restated critical instructions ("respond in JSON")
─ Any "do this next" prompt for the assistant

Long context vs RAG: a deeper question than it looks

"Just dump everything in context, the window is huge" is a tempting alternative to retrieval-augmented generation. Sometimes it's the right answer, sometimes not, and the trade-off has three axes:

Cost. 200K tokens of input on every turn is expensive — call it $0.60 per turn on Sonnet at 2026 pricing. RAG with 5 retrieved chunks costs about $0.015 per turn. The difference compounds.
Latency. Processing 200K tokens takes meaningful time even with optimized inference — typically 3–10 seconds before the first generated token. RAG with smaller context produces first-token-out in under a second.
Quality. Sometimes long context wins (when the right answer requires synthesizing across many documents). Sometimes RAG wins (when the right document needs to be at the start/end of context for the lost-in-the-middle reason). Measure on your eval set.

The honest answer: long context is a tool, not a replacement for retrieval. Use it when you need cross-document reasoning over a known corpus. Use RAG when the corpus is too large to fit, or when cost and latency matter, or when the bulk of queries only need a few chunks.

Question

Has lost-in-the-middle gotten better with newer models?

Yes, but not gone. Frontier 2026 models handle 100K+ contexts much better than 2023 models did — recall at middle positions is dramatically improved, especially with techniques like contextual training and improved attention variants. But on benchmark tests of mid-context retrieval, you can still measure a gap between start/end and middle positions of 5–15 percentage points depending on task.

For agent design: treat lost-in-the-middle as a real effect that you can mitigate (by positioning content well) but should not assume away. The cost of placing important content at the boundaries is essentially zero; the cost of not placing it there is occasional quality regression you can't reproduce.

Question

What about the "200K context but only 8K effective" claims I've seen?

Those claims come from "needle in a haystack" benchmarks where the model is asked to retrieve one specific fact from a long document. They're directionally correct — recall degrades with context length — but often overstate the magnitude for typical use. The real picture: for tasks that need cross-document reasoning, modern long-context models work well. For tasks that need to find one buried fact, they work less well, and retrieval-augmented approaches usually win.

The pragmatic rule: if your task is "find and use," use RAG. If your task is "synthesize across," consider long context. Either way, measure on your specific eval set rather than trusting general claims.

Question

Does prompt caching change any of this?

Prompt caching (chapter 2.2 covers it in depth) reduces the cost dimension dramatically: on Anthropic's pricing, tokens read from the cache are billed at roughly 10% of the normal input price, while writing to the cache carries a premium. It does not change the quality dimension; the model still has to attend to all tokens, and lost-in-the-middle still applies. Caching changes the economics of long context but not the cognitive shape of working with it.
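To see what that discount does to a long-context turn, here is a back-of-the-envelope sketch. It assumes $3 per million input tokens and a 10% rate for cached reads, and it ignores any extra charge for writing to the cache and all output-token costs. The `turn_cost` helper is made up for illustration.

```python
def turn_cost(cached_tokens: int, fresh_tokens: int,
              price_per_million: float = 3.00,
              cache_read_multiplier: float = 0.10) -> float:
    """Input cost in USD for one turn, split into cached and uncached tokens."""
    cached = cached_tokens / 1_000_000 * price_per_million * cache_read_multiplier
    fresh = fresh_tokens / 1_000_000 * price_per_million
    return cached + fresh

# 200K tokens of context, nothing cached
uncached = turn_cost(cached_tokens=0, fresh_tokens=200_000)
# Same context, with 195K tokens served from cache and 5K new tokens
mostly_cached = turn_cost(cached_tokens=195_000, fresh_tokens=5_000)

print(f"uncached:      ${uncached:.4f}")
print(f"mostly cached: ${mostly_cached:.4f}")
```

Under these assumptions the cached turn costs a small fraction of the uncached one, but the model still processes all 200K tokens, so the positioning advice is unchanged.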

STEP 3

Sampling: the model is a distribution, not a function.

The most useful single shift in mental model: the model doesn't pick a single output for a given input. For each next token, it produces a probability distribution over the whole vocabulary, and a separate sampling step picks one of those tokens. Different sampling settings, different output — same input, same model. Internalizing this dissolves a class of confusion ("why did I get a different answer this time?") and gives you a control surface for production agents.

What the model actually outputs

For each step of generation, the model emits a vector of logits — one number per vocabulary entry, typically 100,000–200,000 numbers. Softmax those and you get a probability distribution over what the next token might be:

Prompt: "The capital of France is"
Model's distribution over next token:
  " Paris"   →  0.847
  " the"     →  0.043
  " a"       →  0.022
  " located" →  0.018
  " known"   →  0.011
  " Lyon"    →  0.008
  ... 200,000 more tokens, mostly near zero ...

Sampling picks one of these. Then the process repeats for the next token,
conditioned on the new sequence.

The model's output is not " Paris" — it's the entire distribution. Sampling is what reduces the distribution to a single token. Different sampling choices, different paths down the probability tree.
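The two steps, logits to probabilities and probabilities to one token, fit in a few lines. The logits below are made up for illustration (a real model emits one per vocabulary entry), and the fixed random seed is only there to make the sketch repeatable.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = [" Paris", " the", " a", " located", " known", " Lyon"]
logits = [5.0, 2.0, 1.3, 1.1, 0.6, 0.3]  # illustrative values, not from a real model

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token!r}: {p:.3f}")

# Sampling: draw one token according to those probabilities.
rng = random.Random(0)
next_token = rng.choices(tokens, weights=probs, k=1)[0]
print("sampled:", repr(next_token))
```

Running the last two lines with different seeds will mostly return " Paris" and occasionally something else, which is the whole point: the distribution is the output, and the draw is a separate step.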

Temperature, in plain English

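Temperature is a single number, typically between 0 and 2 (some APIs, Anthropic's among them, cap it at 1), that controls how sharply concentrated the distribution is on the most probable token. Mechanically, the logits are divided by the temperature before the softmax.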

At temperature 0: take the single highest-probability token, every time. The distribution gets collapsed to a one-hot vector before sampling. Sometimes called "greedy decoding." Closest the model gets to deterministic behavior.

At temperature 1: sample from the actual distribution as-is. The model's natural variability.

At temperature 2: flatten the distribution. The high-probability tokens get sampled less often, low-probability tokens get sampled more often. Output becomes more varied, also less coherent.

[Figure: the same next-token distribution at three temperatures. Temperature 0: one peak; always picks the highest. Temperature 1: broad; the most likely token wins, but variety is real. Temperature 2: flat; anything could come out.]

The model's underlying distribution is the same in all three cases. Temperature is just a transformation applied before sampling.
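As a sketch of that transformation: temperature divides the logits before the softmax, and temperature 0 is conventionally handled as a special case that picks the argmax. The logits are made-up values, and `apply_temperature` is an illustrative name, not an API.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Return the sampling distribution for the given temperature."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the single highest logit.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    return softmax([x / temperature for x in logits])

logits = [5.0, 2.0, 1.3, 1.1, 0.6, 0.3]  # illustrative values

for t in (0, 1, 2):
    dist = apply_temperature(logits, t)
    print(f"T={t}: " + " ".join(f"{p:.3f}" for p in dist))
```

Dividing by a number above 1 shrinks the gaps between logits, so the softmax output flattens; dividing by a number below 1 widens them, so the top token dominates even more.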

Temperature 0 still isn't deterministic

Here is the most surprising thing about sampling to people coming from traditional software: even at temperature 0, the same prompt can produce different outputs. Not always, but enough to matter in production.

Why? The deterministic path "always take the highest-probability token" depends on the probability vector being identical across runs. In practice, the probability vector can differ across runs because of:

Floating-point non-determinism in batched inference. Modern inference servers batch requests together for GPU efficiency. The exact batch your request lands in affects the order of additions in the matrix multiplications, which (because floating-point addition isn't associative) can shift logits by tiny amounts. Usually meaningless. Occasionally enough to tip the top-1 from one token to another.
Model snapshot updates. Provider deploys a new minor version. Same model name in the API call, slightly different weights, slightly different distribution. Your "deterministic" runs diverge.
Server-side processing variations. Caching, routing, fallback machinery — all of these can introduce tiny perturbations even when you specify temperature=0.

The practical impact: treat temperature=0 as "low variance," not "no variance". If you need actual reproducibility for tests or eval reruns, you need to combine temperature=0 with explicit seeds (where the provider supports them) and ideally a pinned model snapshot, and even then expect occasional drift.
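The floating-point point is easy to verify on any machine. The sketch below shows that the order of additions changes the result in the last bit, and that a last-bit difference is enough to flip an argmax when two candidates are nearly tied. The "logits" are contrived so the tie is exact; real cases involve much longer sums on GPU hardware.

```python
# Same three numbers, two different groupings.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first, right_first, left_first == right_first)

def argmax(values):
    """Index of the largest value; the earliest index wins ties."""
    return max(range(len(values)), key=values.__getitem__)

rival_logit = 0.6  # a competing token whose logit happens to sit right at the boundary

# The second token's logit is "the same sum", computed in two different orders.
print("top-1 with left-first grouping: ", argmax([rival_logit, left_first]))
print("top-1 with right-first grouping:", argmax([rival_logit, right_first]))
```

Nothing about the inputs changed between the two argmax calls, only the grouping of the additions. Batched inference changes that grouping from request to request.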

When to use which temperature

The decision is task-driven, not preference-driven. Rough guide:

Task                              Temperature   Why
Classification, extraction        0 to 0.2      There's a correct answer. You want the highest-probability one. Variety hurts.
Tool-calling decisions            0 to 0.3      Want predictable behavior on the same input. Some variability is fine; randomness isn't.
Code generation                   0.2 to 0.4    Low but not zero: some creativity for novel problems, mostly the highest-probability path for familiar ones.
Summarization, rewriting          0.3 to 0.7    Variety in phrasing is genuinely desirable. Same content, different ways to express it.
Creative writing, brainstorming   0.7 to 1.0    You want surprise. The highest-probability tokens are often the most generic.

top_p (nucleus sampling): the other dial

top_p is an alternative way to constrain sampling. Where temperature reshapes the whole distribution, top_p truncates it: sample only from the smallest set of tokens whose cumulative probability exceeds top_p. At top_p = 0.9, you're sampling from the top tokens that together account for 90% of probability mass — usually a few dozen out of 200,000.

The intuition: top_p caps how unlikely a token can be and still get sampled. It guarantees that you don't accidentally pick a wildly improbable token, while still allowing diversity among the merely "probable enough" ones.

Most production agents leave top_p at its default (1.0, i.e., no truncation) and rely on temperature alone. Combining both is occasionally useful: temperature 0.7 with top_p 0.9 gives you "diverse but never wild." Rarely worth tuning unless you have a specific failure mode you're trying to mitigate.
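Nucleus truncation is short enough to write out. This sketch uses the six probabilities from the "capital of France" example earlier in this step and leaves out the long tail of the vocabulary; `top_p_filter` is an illustrative name, and real implementations operate on tensors, not dicts.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {
    " Paris": 0.847,
    " the": 0.043,
    " a": 0.022,
    " located": 0.018,
    " known": 0.011,
    " Lyon": 0.008,
}

nucleus = top_p_filter(probs, top_p=0.9)
for token, p in nucleus.items():
    print(f"{token!r}: {p:.3f}")
```

With top_p at 0.9, only three candidates survive here. Everything further down the list can no longer be sampled, however many times you draw.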

Seeds and the limits of reproducibility

OpenAI's Chat Completions API exposes a seed parameter; Anthropic's Messages API, as of this writing, does not, so check your provider's reference before planning around one. The promise: same seed + same prompt + same model + same parameters → same output. The reality is closer to "very likely the same output, but not guaranteed": where seeds exist, they are documented as best-effort, not contractual. The non-determinism sources from earlier (batching, snapshot drift) still apply at the margins.

What seeds do get you: meaningful reproducibility for tests within a short window. If you run your eval suite at 9am with a fixed seed and again at 10am, you'll almost certainly get the same results. If you run it next month, after a model snapshot has rolled out, you may not.

The practical use of seeds: set them in tests and CI runs so noise on individual examples doesn't pollute your eval signal. Don't set them in production — production benefits from the variety, and the determinism wouldn't be reliable anyway.

The mental model upgrade: stop thinking of the model as "wrong" when it produces a different answer to the same prompt. Think of it as sampling from a distribution. The right question is "is the distribution centered on the right answer with appropriate confidence?", and that's measurable on your eval set. A model that's right 80% of the time on a prompt will still land in the other 20% on some individual runs. That's not a bug, it's sampling.

Question

If temperature 0 isn't deterministic, why use it at all?

Because "low variance" is almost always what you want for production agents. The runs that diverge at temperature 0 are tipping near the boundary between two similarly-probable tokens — exactly the cases where the model is uncertain. The output isn't "random"; it's picking between two reasonable continuations. For classification or extraction tasks, this is fine. For creative tasks where you actively want variety, you'd be at a higher temperature anyway.

Don't reach for temperature 0 expecting deterministic replay. Reach for it expecting "highest-probability path, most of the time." That's the right framing.

Question

My eval scores fluctuate ±2 points between runs. Is that sampling noise?

Almost certainly yes, and chapter 3.1's noise-floor measurement quantifies exactly this. The right response isn't to chase determinism; it's to run the noisy metrics multiple times and report the mean. Treat a change as real only if it moves the score by more than about two standard deviations of the run-to-run spread; anything smaller is indistinguishable from sampling.

If your eval set has 50 examples, each example is worth 2 points. If three of them are right at the boundary where the model's top-1 token might flip from "correct" to "incorrect," then ±2 points per run is exactly what you'd expect. Multi-run measurement absorbs this.
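A minimal version of that multi-run check, using only the standard library. The baseline scores are invented, the two-sigma threshold follows the rule of thumb above, and the helper name is hypothetical. With only a handful of runs the standard deviation is itself a rough estimate, so treat the verdict as a hint, not a test of significance.

```python
import statistics

def is_real_change(baseline_scores, new_score, sigmas=2.0):
    """True if new_score differs from the baseline mean by more than sigmas standard deviations."""
    mean = statistics.mean(baseline_scores)
    spread = statistics.stdev(baseline_scores)  # sample standard deviation across runs
    return abs(new_score - mean) > sigmas * spread

# Five reruns of the same eval with no changes to the system (made-up numbers).
baseline = [78.0, 80.0, 82.0, 80.0, 80.0]

print("mean:", statistics.mean(baseline))
print("stdev:", round(statistics.stdev(baseline), 3))
print("82 is a real change:", is_real_change(baseline, 82.0))
print("84 is a real change:", is_real_change(baseline, 84.0))
```

A two-point bump sits inside the noise for this baseline; a four-point bump clears it.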

Question

Does temperature affect tool calling?

Yes — and almost always you want it low for tool calls. At higher temperatures, the model might choose to call a different tool than it would have at temperature 0, or pass slightly different arguments. For most agents this is undesirable: you want predictable behavior on the tool dispatch step. Set temperature to 0–0.3 if your loop is heavy on tool calls.

Some agent frameworks set temperature 0 for the tool-decision step and a higher value for the final synthesis step. It's a useful pattern when you want the most predictable tool selection available but more varied phrasing in the final answer.

STEP 4

Hallucinations are predictable failure modes, not random errors.

"Hallucination" is overloaded. People use it for everything from minor factual errors to confidently-invented citations to outright fabrications. Treating it as a single phenomenon means you can't debug it; treating it as four distinct mechanisms means you can name the one in front of you and choose the right fix.

Here are the four mechanisms that account for most of what gets called "hallucination" in production agents. Each one is a predictable consequence of how the model works, and each one has a distinct fix.

Mechanism 1: Continuation bias

The model is fundamentally a next-token predictor. Given a partially-written response, it generates the token that best continues the pattern. This is a feature when the pattern is good. It becomes a failure mode when the model has committed to a confident-sounding start that doesn't have factual support.

Concretely: ask the model "What year was the Peace of Westphalia signed?" and it might start "The Peace of Westphalia was signed in", at which point the next-token distribution heavily favors a specific year. If the model has been trained on enough sources to know it's 1648, you get the right answer. If the model is uncertain, the distribution still favors some year (because what else completes the sentence?), and it samples one. The wrong year sounds exactly as confident as the right year would have, because the syntactic shape is identical.

The mechanism: once the model has committed to a sentence shape, it will complete the shape even if it has to invent the content. There's no "wait, I don't actually know this" branch in the next-token computation by default.

The fix: prompts that explicitly authorize uncertainty. "If you don't know the answer, say 'I don't know'" works better than you'd expect — it gives the model a permitted continuation that isn't "invent a confident answer." Combine with grounding: "Answer based only on the provided sources. If the answer isn't in the sources, say so." Cuts continuation-bias hallucinations substantially.

Mechanism 2: Training-distribution gaps

The model knows what it's seen during training. For topics, people, products, or events that didn't appear in training (or appeared rarely), the model is operating without reliable signal. The distribution over plausible tokens flattens, sampling becomes more variable, and outputs become factually unreliable in proportion to how poorly the topic was represented in training.

The symptom: confidently wrong on niche topics, accurate on common topics, with no clear signal to the user which is which. A user asking about a major historical figure gets correct answers; the same user asking about a minor regional figure gets confident fiction. The model doesn't expose its uncertainty in either case.

The mechanism: the model interpolates plausibly from sparse training data, and "plausibly" looks identical to "factually" from the outside.

The fix: retrieval-augmented generation (chapter 1.2). Don't ask the model what it knows; give it the relevant documents and ask it to answer from those. The model is much better at "synthesize from this provided text" than "recall from training." Grounding shifts the workload from memory (unreliable) to reading (reliable).

Mechanism 3: Instruction conflict

The model has been trained to follow user instructions and to give helpful answers. When these conflict — when following the instruction would produce a less helpful-sounding answer, or vice versa — the model navigates the conflict and sometimes lands wrong.

Concrete example: a user asks "list the API endpoints in alphabetical order" and provides a list of 20 endpoints. The model sorts most of them correctly but quietly fixes a typo in one endpoint name on the way through. The user said "list them" — not "list them and silently correct spelling" — but the model's helpfulness training nudged it toward producing the "obviously corrected" version. The user might miss this; the model might be wrong about which spelling is correct; either way, the user gets an output that differs from the input in ways they didn't ask for.

The mechanism: the model is balancing multiple objectives during generation, and when they pull in different directions, the resolution can introduce content the user didn't request.

The fix: be explicit about which dimension matters. "Preserve the input exactly; do not correct or normalize." "Be terse; do not add explanations unless asked." Instructions that close off the path the helpfulness training would otherwise push toward.

Mechanism 4: Confabulation under pressure

The fourth and trickiest mechanism: the model produces text that sounds like a memory or a fact but is actually being generated on the spot, with no underlying retrieval or computation behind it. This is what people usually mean when they say "hallucination" in the alarming sense — invented citations, made-up function signatures, fabricated quotes.

This happens most often when the model is asked for specific structured factual claims — a citation, a function signature, a date, a phone number — that it doesn't actually know. Because the syntactic shape of the answer is well-defined ("Author (Year)" for citations, "function_name(arg1, arg2)" for code), the model can produce something that fits the shape perfectly without having the content.

The mechanism: the model generates outputs that conform to the requested format even when it doesn't have the content to fill them, because format-conformity is what its training rewarded.

The fix is multi-layered. Grounding (Mechanism 2's fix) helps when the source material is provided. Verification (chapter 1.2's grounding-check pattern) helps when you can validate claims against sources after generation. Explicit prompting helps marginally: "Only cite sources that exist in the provided context; if you don't have a real citation, say so." For high-stakes outputs — medical, legal, code execution — combine all three and treat any unverified claim as suspect.
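The verification layer can start very small. The sketch below assumes a convention in which the model cites provided sources with bracketed IDs such as [doc-2]. That convention, the ID format, and the `unverified_citations` helper are all hypothetical, and the check only confirms that a cited ID exists, not that the source supports the claim.

```python
import re

CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def unverified_citations(answer: str, source_ids) -> list:
    """Return cited IDs that do not correspond to any provided source."""
    cited = set(CITATION_PATTERN.findall(answer))
    return sorted(cited - set(source_ids))

provided = {"doc-1", "doc-2"}
answer = "Latency dropped 40% after the change [doc-2]. The rate limit is 10 rps [doc-9]."

missing = unverified_citations(answer, provided)
print(missing)
```

Any ID in the result points at a citation with the right shape and no source behind it, which is the signature of this mechanism. A fuller grounding check would also compare each claim against the text of the source it cites.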

Reasoning models vs chat models: the new axis

From 2024 onward, providers have shipped a distinct class of "reasoning" models: Claude with extended thinking, OpenAI's o-series and GPT-5 reasoning mode, DeepSeek-R1, and similar. These models generate "thinking" tokens before producing their visible response. Depending on the provider, the thinking is hidden, summarized, or streamed alongside the answer, but in every case it consumes real tokens and real compute.

The intuition for when reasoning helps:

Helps: multi-step logic problems, math, code review, complex planning, anything where the model benefits from "working out" an answer before committing to it. The thinking tokens are essentially the model talking to itself, exploring branches, catching its own mistakes.
Doesn't help (and costs more): simple factual lookup, classification, extraction, conversational responses. There's nothing to "reason about" — the model either knows the answer or it doesn't, and the extra thinking tokens are wasted compute.
Actively hurts (counterintuitively): tasks where the model knows the answer immediately but extended thinking introduces overthinking or second-guessing. Some calibrated, well-trained behaviors get noisier when you make the model "think more" first.

The practical impact for agent design: match the model to the task within the agent. Use a reasoning model for the planning step (deciding what to do next), use a faster non-reasoning model for the synthesis step (writing the final answer), use the cheapest model that works for the classification steps (deciding what tool to call). Chapter 2.2's cost ladder is built on exactly this insight.
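The per-step matching described above can be as simple as a lookup table. The following is a minimal sketch, assuming three step types; the tier names are placeholders rather than real model identifiers, and `pick_model` is a hypothetical helper.

```python
# Hypothetical tier names; substitute the models your provider actually offers.
MODEL_FOR_STEP = {
    "plan": "reasoning-model",        # deciding what to do next
    "synthesize": "fast-chat-model",  # writing the final answer
    "classify": "cheapest-model",     # deciding which tool to call
}

def pick_model(step: str) -> str:
    """Return the model tier for one agent step; fail loudly on unknown steps."""
    if step not in MODEL_FOR_STEP:
        raise ValueError(f"no model configured for step {step!r}")
    return MODEL_FOR_STEP[step]

for step in ("plan", "classify", "synthesize"):
    print(step, "->", pick_model(step))
```

Raising on an unknown step is a deliberate choice in this sketch: silently falling back to a default would hide the fact that a new step type was never assigned a tier.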

The honest caveat: this advice is the state of practice as of early 2026, and the boundary between "reasoning" and "chat" models is blurring. Sonnet 4.5 and GPT-5.5 both have hybrid modes where they think when a task needs it and skip thinking otherwise, controlled by a single parameter. The trend is toward "one model, adjustable thinking budget" rather than "two model families." But the underlying mental model (extended thinking helps on hard problems, hurts on easy ones) still applies.

Summarizing: hallucination triage in 30 seconds

When you see a hallucination in production, run through the four mechanisms in order:

┌─────────────────────────────────────────────────────────────┐
│  HALLUCINATION TRIAGE                                       │
│                                                             │
│  Did the model commit to a sentence shape and then          │
│  invent content to complete it?                             │
│    → Mechanism 1: continuation bias.                        │
│      Fix: authorize "I don't know" in the prompt.           │
│                                                             │
│  Is the topic obscure, niche, or post-training-cutoff?      │
│    → Mechanism 2: training-distribution gap.                │
│      Fix: retrieval-augmented generation.                   │
│                                                             │
│  Did the output differ from the input in ways the user      │
│  didn't ask for?                                            │
│    → Mechanism 3: instruction conflict.                     │
│      Fix: explicit "preserve / don't add / be terse" rules. │
│                                                             │
│  Did the model produce a specific structured claim          │
│  (citation, signature, date) with no underlying source?     │
│    → Mechanism 4: confabulation.                            │
│      Fix: grounding + post-generation verification.         │
└─────────────────────────────────────────────────────────────┘

Almost every hallucination in a production agent is one of these four, or a combination. Naming the mechanism is half the fix — it directs you to the specific intervention rather than vague "make the prompt better" guesswork.
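For teams that log incidents, the four triage questions can be encoded as data so that every hallucination report names its mechanism. This is only a sketch: the observation keys are hypothetical names for the four questions, a human still has to answer them, and the function returns every match because a single failure can combine mechanisms.

```python
# Each entry: (observation key, mechanism, matching fix), in triage order.
# The observation keys are hypothetical names for the four questions.
TRIAGE = [
    ("invented_to_complete_sentence", "Mechanism 1: continuation bias",
     'authorize "I don\'t know" in the prompt'),
    ("obscure_or_post_cutoff_topic", "Mechanism 2: training-distribution gap",
     "retrieval-augmented generation"),
    ("output_differs_unrequested", "Mechanism 3: instruction conflict",
     'explicit "preserve / don\'t add / be terse" rules'),
    ("structured_claim_without_source", "Mechanism 4: confabulation",
     "grounding + post-generation verification"),
]

def triage(observations):
    """Return every (mechanism, fix) whose question was answered yes."""
    return [(mech, fix) for key, mech, fix in TRIAGE if observations.get(key)]

# Hypothetical incident: a niche topic plus an invented citation.
report = triage({"obscure_or_post_cutoff_topic": True,
                 "structured_claim_without_source": True})
for mech, fix in report:
    print(f"{mech} -> {fix}")
```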

Question

If I just use a more powerful model, won't most of these hallucinations go away?

Some — not all. More powerful models have larger training corpora (smaller Mechanism 2 gaps), better calibration (less continuation bias in Mechanism 1), and better instruction-following (less Mechanism 3 conflict). But none of these mechanisms is fundamentally solved by scaling. Confabulation in particular — Mechanism 4 — persists across all model sizes, because the model is doing exactly what its training rewarded: producing format-compliant outputs.

The mental model: scaling reduces hallucination rates but doesn't eliminate them. Agent design has to assume hallucinations will happen and engineer around them with verification, grounding, and uncertainty acknowledgment. Don't bet on the next model release fixing what your retrieval pipeline should fix today.

Question

Is "hallucination" even the right word for these?

It's the word the field uses, so you'll keep encountering it. But it's misleading — it implies the model is doing something unusual or pathological when it produces these outputs, when actually it's doing exactly what it was trained to do (predict plausible next tokens). The outputs aren't "hallucinations" in any meaningful sense; they're predictable failure modes of plausible-text generation. The framing matters because it changes the fix: you don't fix a hallucination by telling the model not to hallucinate; you fix it by changing the conditions that produced it.

Question

When does extended thinking actually pay for itself?

Three rough patterns where the cost is justified:

Complex planning with many constraints — extended thinking lets the model enumerate, evaluate, and reject options before committing to one.
Math and logic puzzles with explicit step-by-step structure — the model uses thinking tokens to literally do the steps.
Self-correction tasks like code review or fact-checking — the model can identify issues during thinking that wouldn't surface in a single-pass response.

The marker that extended thinking is wasted: the model's "thinking" output (if you can see it via streaming) is just paraphrasing the question or rambling. That's a signal the task didn't need it and you should drop back to a faster model.

End of chapter 0.1

Deliverable

A working mental model of LLMs that lets you predict roughly what a model will do on a given input and explain failure modes when they happen. You see tokens as the real unit and tokenize inputs before optimizing them. You treat the context window as a finite, position-sensitive resource and place important information accordingly. You think of the model as a distribution and choose temperature based on the task, not preference. You recognize hallucinations by their mechanism (continuation bias, training-distribution gap, instruction conflict, or confabulation) and apply the matching fix. Every other chapter in this guide assumes these intuitions, and now they're yours.

Run a tokenizer on a representative prompt; understand where the tokens are going
Measure your effective context length on a real workload; compare to nominal limit
Place critical content at start/end of context; never bury it in the middle
Choose temperature deliberately per task (0–0.3 for tool calls, 0.5+ for creative)
Stop treating temperature 0 as "deterministic"; treat it as "low-variance"
Use seeds in tests, not in production
For every hallucination, name the mechanism before fixing it
Authorize "I don't know" in prompts to defuse continuation bias
Use RAG to handle training-distribution gaps
Add explicit "preserve / don't add" rules to defuse instruction conflict
Match model class (reasoning vs chat) to task complexity; don't reach for a reasoning model by default