Tokens & Tokenization


Tokens and tokenization: why text becomes numbers.

A language model never sees letters or words. It sees integers. The conversion step — tokenization — is invisible until it explains a bill that is higher than expected, a context window that fills faster than you counted, a model that is oddly bad at spelling, or a chatbot that costs three times as much for non-English users. This entry makes that hidden layer visible so its consequences stop being surprising.

STEP 1

Why models cannot use text directly.

Neural networks do arithmetic on numbers. They cannot multiply the letter "h" by a weight. So before any text reaches the model it passes through a tokenizer: a fixed piece of code that chops the text into pieces called tokens and maps each to an integer ID. The model only ever sees that sequence of integers; when it generates, it emits integers, and the tokenizer maps them back to text.
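As a rough illustration of that round trip, here is a minimal sketch of a tokenizer with an invented vocabulary. The pieces, the IDs, and the greedy longest-match rule are all assumptions made for the example; real tokenizers have vocabularies of tens of thousands of entries and usually apply learned merge rules instead.

```python
# Toy vocabulary: hypothetical pieces and IDs, not taken from any real tokenizer.
VOCAB = {"Hello": 0, " world": 1, "!": 2, "H": 3, "e": 4, "l": 5,
         "o": 6, " ": 7, "w": 8, "r": 9, "d": 10}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text):
    """Split text into vocabulary pieces (longest match first) and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece starting at pos, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids):
    """Map IDs back to their pieces and join them into text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode("Hello world!")
print(ids)          # [0, 1, 2]
print(decode(ids))  # Hello world!
```

The model in this picture would only ever receive the list of integers; the text exists on either side of encode and decode, never in between.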

The obvious idea — one token per word — was abandoned for good reasons. A word-level vocabulary cannot represent words it never saw (new slang, typos, product names, code identifiers) and would need to be impossibly large to cover every language. The opposite extreme — one token per character — makes sequences far too long and forces the model to relearn spelling from scratch. Modern systems sit in between, with subword tokens.
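Both extremes can be seen in a few lines. The sketch below uses a made-up six-word corpus and two hypothetical helpers, word_encode and char_encode; it is meant only to show the unknown-word problem on one side and the sequence-length problem on the other.

```python
corpus = "the cat sat on the mat"

# Word-level vocabulary built from the corpus: one ID per distinct word.
word_vocab = {word: i for i, word in enumerate(sorted(set(corpus.split())))}


def word_encode(text):
    # Words missing from the vocabulary collapse to -1 (an "unknown" marker),
    # so their content is lost entirely.
    return [word_vocab.get(word, -1) for word in text.split()]


def char_encode(text):
    # One ID per character (its Unicode code point): nothing is ever unknown,
    # but the sequence is as long as the text itself.
    return [ord(ch) for ch in text]


sentence = "the cat sat on the yoga mat"
print(word_encode(sentence))       # [4, 0, 3, 2, 4, -1, 1]
print(len(word_encode(sentence)))  # 7
print(len(char_encode(sentence)))  # 27
```

The word "yoga" was never in the corpus, so the word-level encoder cannot represent it, while the character-level encoder represents everything at nearly four times the sequence length.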

STEP 2

Subword tokens: the practical compromise.

The dominant approach (commonly byte-pair encoding, "BPE") builds its vocabulary by scanning a huge corpus and repeatedly merging the most frequent adjacent pieces. Frequent strings earn their own dedicated token; rare strings get split into smaller, reusable parts. The result, roughly:

"Hello"            -> 1 token
"Hello world"      -> 2 tokens   ( Hello | world )
"tokenization"     -> 2-3 tokens ( token | ization )
"antidisestablish" -> several pieces
"GPT"              -> often 1 token (very frequent string)
"   "  (3 spaces)  -> real tokens (whitespace is not free)
"日本語"            -> ~3 tokens (about one per character)
"naïve" / curly "  -> more tokens than the plain-ASCII form

Three durable rules of thumb follow. First, a token is roughly 3 to 4 characters of typical English, or about three quarters of a word on average, so token count usually runs somewhat above word count (roughly four tokens for every three words) and only occasionally matches it exactly. Second, common English and common code tokenize very efficiently because they were frequent in the tokenizer's training corpus. Third, anything rare for that corpus (non-English scripts, accented letters, emoji, unusual symbols) gets shattered into many small tokens.
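For quick planning, the first rule of thumb can be turned into a back-of-envelope estimator. The two function names and their default ratios below are assumptions for illustration, and the result is only a rough guide for typical English; it will undercount for the rare-text cases in the third rule.

```python
import math


def estimate_tokens_from_chars(text, chars_per_token=4.0):
    """Rough token estimate assuming about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(word_count, words_per_token=0.75):
    """Rough token estimate assuming about 0.75 words per token."""
    return math.ceil(word_count / words_per_token)


print(estimate_tokens_from_words(750))        # 1000
print(estimate_tokens_from_chars("a" * 10))   # 3
```

An estimate like this is useful for sizing a budget in advance; for anything billed or limit-sensitive, the count from the actual tokenizer is the one that matters.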

Tokenization is deterministic but not predictable by eye. Same input, same tokens, every time — yet the boundaries follow corpus statistics, not grammar. You cannot reliably guess a token count by looking; you have to run the tokenizer. Provider tokenizer playgrounds let you paste text and see the boundaries coloured.
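The reason boundaries follow corpus statistics becomes clearer from the merge loop itself. What follows is a minimal sketch of BPE-style vocabulary training on a tiny invented word-frequency table; the function names are hypothetical, and production tokenizers add byte-level handling, pre-tokenization rules, and far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the joined piece."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often its word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; on ties, the pair counted first wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges


# Invented frequencies, chosen only to make the merges easy to follow.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(train_bpe(word_freqs, 3))
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

The frequent ending "est" earns its own piece after two merges, while nothing in this corpus pushes "wid" or "new" together. Change the frequencies and the boundaries change with them, which is why they cannot be guessed from grammar.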

STEP 3

Why this controls cost and context.

Two of the most practical facts about using LLMs are direct consequences of tokenization.

You are billed per token, not per word or character. API pricing is quoted per million input tokens and per million output tokens. Your real cost is a tokenizer-output count you cannot eyeball, and it is asymmetric across languages. English averages roughly 0.75 words per token, but the same content in many other languages takes two to four times as many tokens, so an identical chatbot can cost two to four times as much per turn for non-English users while also exhausting its context budget faster for them. Compact formatting (fewer spaces, less boilerplate) is genuinely cheaper, because whitespace and punctuation are real tokens too.
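The arithmetic is simple enough to sketch. The prices below (3 and 15 US dollars per million input and output tokens) are placeholders chosen for the example, not any provider's actual rates, and the 3x multiplier is just one point in the two-to-four range described above.

```python
def request_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD of one request, given per-million-token prices."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Hypothetical prices, for illustration only.
PRICE_IN, PRICE_OUT = 3.0, 15.0

# One English turn versus the same content needing 3x the tokens.
english = request_cost(1_000, 500, PRICE_IN, PRICE_OUT)
other = request_cost(3_000, 1_500, PRICE_IN, PRICE_OUT)

print(f"English turn:     ${english:.4f}")
print(f"3x-token turn:    ${other:.4f}")
print(f"Cost ratio:       {other / english:.1f}x")
```

Because cost is linear in token count, the token multiplier for a language carries straight through to the bill.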

The context window is measured in tokens. A "200K context window" means 200,000 tokens of input plus output, not 200,000 words and not 200,000 characters. Treating the limit as a word count overestimates capacity: at about 0.75 words per token, 200,000 tokens hold roughly 150,000 English words. Estimating for non-English or code-heavy text without measuring will mislead you further. When a long document mysteriously will not fit, the explanation is almost always that its token count is well above its word count.
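A budget check along these lines makes the shared input-plus-output limit explicit. The helper names and the 200,000-token default are assumptions for the sketch; the token counts passed in would come from the provider's tokenizer, not from a word count.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window=200_000):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def remaining_for_output(prompt_tokens, context_window=200_000):
    """Tokens left for the response once the prompt is counted."""
    return max(0, context_window - prompt_tokens)


# A prompt of 150,000 English words is roughly 200,000 tokens at 0.75 words
# per token, which fills the window before any output is generated.
prompt_tokens = 200_000
print(fits_in_context(prompt_tokens, 4_000))   # False
print(remaining_for_output(prompt_tokens))     # 0
print(fits_in_context(150_000, 4_000))         # True
```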

STEP 4

The failure modes that trace back to tokens.

Several puzzling behaviours have a single root in tokenization, and recognising the signature saves hours of misdiagnosis:

Bad at spelling and character counting. Ask a model how many "r"s are in "strawberry" and it may stumble. It never saw the letters — it saw a couple of subword tokens. Character-level tasks fight against the representation the model actually operates on.
Truncated output mid-word. A response cut off at a strange point usually hit the max_tokens limit, not a content boundary. The limit is counted in tokens, so a budget that looks generous in words can run out earlier than expected, especially with non-English text.
Inconsistent behaviour on look-alike inputs. A prompt that works with plain quotes can behave differently with "smart" curly quotes because the two characters tokenize differently — and the model saw the ASCII form vastly more often during training. Inputs that look identical to a human can be different sequences to the model.
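The look-alike case is easy to demonstrate at the byte level. The sketch below assumes a byte-level tokenizer, which starts from UTF-8 bytes before applying any merges; the LOOKALIKES table and normalize_quotes helper are hypothetical and cover only four quote characters.

```python
# Curly quotes mapped to their plain-ASCII equivalents (a deliberately small table).
LOOKALIKES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def normalize_quotes(text):
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(str.maketrans(LOOKALIKES))


plain = '"naive"'
fancy = "\u201cna\u00efve\u201d"  # curly quotes and an accented i

# Same number of characters, very different number of UTF-8 bytes.
print(len(plain), len(plain.encode("utf-8")))   # 7 7
print(len(fancy), len(fancy.encode("utf-8")))   # 7 12

print(normalize_quotes(fancy) == '"na\u00efve"')  # True
```

Two strings that render almost identically start out as 7 and 12 raw units, so it is unsurprising that they end up as different token sequences. Whether to normalize input this way is a design choice: it makes prompts more uniform, at the cost of altering user text.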

The single highest-leverage habit: when a prompt is misbehaving on cost, length, or weird edge cases, look at its tokenization before changing anything else. Tokens are the real unit the model operates on; reasoning about text in words or characters is reasoning about a layer the model never sees.