What is a large language model?
You have used a chatbot, so you know what an LLM does. This entry is about what it is: the surprisingly simple core idea (predict the next chunk of text), why making that idea very large produced abilities nobody explicitly programmed, and what this tells you about when to trust it. Get this right and much of modern AI stops being mysterious.
The one-sentence definition.
A large language model (LLM) is a very large neural network trained to do one thing: given a stretch of text, predict what comes next. That is the entire training objective. Show it "The capital of France is" and it learns to assign high probability to " Paris." Do this over a substantial fraction of the public internet, books, and code, and the network is forced to absorb an enormous amount of structure about language, facts, and reasoning patterns — because predicting the next word well requires all of that.
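To make the objective concrete, here is a minimal sketch of next-word prediction using nothing but counts. It is an illustration, not how an LLM is built: a real model is a neural network conditioned on the whole preceding text, while this toy looks only at the single previous word. The function names and the three-sentence corpus are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows each word in a whitespace-split corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# Hypothetical training text: "Paris" follows "is" twice, "Rome" once.
corpus = [
    "the capital of France is Paris",
    "the capital of France is Paris",
    "the capital of Italy is Rome",
]
counts = train_bigram(corpus)
probs = next_word_probs(counts, "is")
print(probs)  # "Paris" gets 2/3 of the probability, "Rome" gets 1/3
```

The toy also shows why one word of context is not enough: it cannot tell that "Rome" belongs with Italy rather than France. Predicting well requires using everything that came before, which is the pressure that pushes a large model to absorb structure.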
"Large" is not marketing. It refers to two concrete quantities: the number of adjustable weights (the parameters, often billions) and the amount of text it was trained on (often trillions of words). The headline finding of the last decade is that scaling both, together, keeps making the model better in ways that do not plateau as early as researchers expected.
Generation is prediction in a loop.
"Predict the next word" sounds too weak to produce essays and working code. The trick is the loop. The model does not plan a whole answer; it predicts one small chunk (a token — roughly a word-piece), appends it to the text, and runs again on the now-slightly-longer text. Repeat hundreds of times and a full response emerges, one token at a time.
prompt: "Write a haiku about the sea."
step 1 → "Vast"
step 2 → "Vast blue"
step 3 → "Vast blue horizon"
... each step: feed everything so far back in, predict the next token ...
final → a complete haiku
This is why an LLM can sound coherent over long passages: every new token is chosen in light of everything written so far, including its own previous output. It is also why the model can be steered mid-stream: there is no outline stored outside the text, only a running prediction conditioned on the growing text, so changing the text changes what comes next.
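The loop in the haiku trace above can be sketched in a few lines. The "model" here is a hypothetical stand-in (`toy_next_token`) that replays a fixed script, and the `<end>` stop marker is an assumption made for the example. Only the shape of the loop is the point: predict, append, feed everything back in.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model: returns the next token
    given every token so far (here, by replaying a fixed script)."""
    script = ["Vast", " blue", " horizon", "<end>"]
    generated = len(tokens) - 1  # tokens[0] is the prompt itself
    return script[generated] if generated < len(script) else "<end>"

def generate(prompt, next_token_fn, max_tokens=50, stop="<end>"):
    """Predict one token, append it, and run again on the longer text."""
    tokens = [prompt]
    for _ in range(max_tokens):
        token = next_token_fn(tokens)  # conditioned on everything so far
        if token == stop:
            break
        tokens.append(token)
    return "".join(tokens[1:])  # the generated part only, without the prompt

result = generate("Write a haiku about the sea.", toy_next_token)
print(result)  # Vast blue horizon
```

Swapping `toy_next_token` for a real model changes how each token is chosen, not the loop around it.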
One more subtlety: the model does not output a single word. It outputs a probability for every token in its vocabulary, and a separate sampling step picks one. This is why the same prompt can give different answers, a topic covered in the entry on temperature and sampling.
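A small sketch of that final step, assuming the model has already produced its probabilities. The numbers below are invented for illustration, and the two picking functions are hypothetical helpers, not any library's API.

```python
import random

# Invented probabilities for the token after "The capital of France is".
next_token_probs = {" Paris": 0.90, " located": 0.06, " a": 0.04}

def pick_greedy(probs):
    """Always take the single most likely token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the sketch is repeatable
greedy = pick_greedy(next_token_probs)
samples = [pick_sampled(next_token_probs, rng) for _ in range(1000)]

print(greedy)
print({tok: samples.count(tok) for tok in next_token_probs})
```

Greedy picking gives the same token every time. Sampling mostly gives the likeliest token but occasionally a less likely one, and over a response hundreds of tokens long those small differences compound into different answers.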
Where the abilities came from: scale and emergence.
Here is the genuinely surprising part. Nobody wrote code for "translate French," "summarise this contract," or "debug Python." The training objective was only ever next-token prediction. Yet at sufficient scale, the model becomes able to do these things — a phenomenon often called emergent behaviour: capabilities that are weak or absent in small models appear, sometimes fairly abruptly, as scale increases.
The intuition for why: to predict the next token across the whole internet, a model that has merely memorised cannot win — the space of text is far too large. The pressure of the objective forces it to internalise reusable structure: grammar, factual associations, arithmetic patterns, the shape of an argument, the conventions of code. Those internalised structures are precisely what we later use as "abilities." Translation falls out because the training data contained parallel text; reasoning patterns fall out because the data contained a lot of reasoning. The model learned them not because it was told to, but because they help predict the next token.
A related leap is in-context learning: you can show the model a couple of examples of a task in the prompt itself and it will follow the pattern, with no retraining. That, too, was never explicitly programmed; it emerged from scale.
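In practice, "showing examples in the prompt" is just text assembly. The sketch below builds such a prompt. The `English:` / `French:` labels and the helper name are assumptions chosen for the example, since no particular layout is required.

```python
def build_few_shot_prompt(examples, query):
    """Lay out worked examples, then the new case with its answer left blank."""
    lines = []
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")  # blank line between examples
    lines.append(f"English: {query}")
    lines.append("French:")  # the model continues from here
    return "\n".join(lines)

examples = [("cat", "chat"), ("dog", "chien")]
prompt = build_few_shot_prompt(examples, "bird")
print(prompt)
```

Nothing in the model changes between the first example and the last line. The pattern lives entirely in the text, and the most plausible continuation of that text is a French word.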
"Emergent" does not mean magical or conscious. It means "not directly specified, but produced by optimising a simple objective at large scale." It is closer to how complex weather arises from simple physical laws than to anything mystical.
What the definition tells you about trust.
Hold the definition firmly — "a system optimised to produce plausible continuations of text" — and the model's strengths and failures become predictable rather than surprising.
- Fluency is reliable; truth is not. The objective rewards text that looks like a good continuation. Usually the most plausible continuation is also the correct one, which is why the model is often right. But when it lacks the knowledge, it still produces the most plausible-sounding continuation, confidently and in the same tone. This is the root of "hallucination," covered in its own entry.
- Its knowledge is frozen at training time. It learned from a snapshot of text. Events after that cutoff are simply absent unless supplied in the prompt.
- It reasons by pattern, not by proof. It is strong on problems whose shape resembles its training data and weaker on genuinely novel multi-step logic, because it is matching learned patterns, not executing a verified procedure.
- It is steerable through context. Because every token is conditioned on the preceding text, the prompt is a powerful control surface — the basis of prompting and of giving models tools and documents to work from.
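The last two points about frozen knowledge and steerability combine into a common pattern: put the missing facts in the prompt. A minimal sketch, with a hypothetical helper name, an invented instruction wording, and placeholder source text:

```python
def build_grounded_prompt(question, documents):
    """Place supplied documents ahead of the question so the model
    continues from them rather than from its training snapshot alone."""
    parts = [
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.",
        "",
    ]
    for i, doc in enumerate(documents, start=1):
        parts.append(f"[Source {i}] {doc}")
    parts.append("")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)

# Placeholder content standing in for something newer than the training cutoff.
docs = ["The library's 3.0 release removed the legacy parser."]
grounded = build_grounded_prompt("Which release removed the legacy parser?", docs)
print(grounded)
```

This does not make the answer certain. It only makes the correct continuation more plausible than an invented one, which is why such output is still worth verifying.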
So an LLM is not a database and not a reasoning engine. It is a very large, very capable next-token predictor whose fluency is reliable and whose factual accuracy must be earned through grounding, verification, and good prompt design. Every practical technique later in this wiki — retrieval, tools, agents — exists to compensate for exactly the gap this definition predicts.