Training vs Inference

Concepts · AI Foundations

Training vs inference: building the model vs using it.

"The AI learned that from my chat" is one of the most common misconceptions about LLMs, and it comes from blurring two completely separate phases. This entry draws the line cleanly: training is the expensive, one-time process that creates the model's fixed weights; inference is the cheap, repeated process of running those frozen weights to answer a request. Once you see the boundary, a lot of confusing behaviour — and a lot of cost and privacy questions — becomes obvious.

STEP 1

Training: where the weights are formed.

Training is the phase that turns a random network into a useful model by adjusting its billions of weights. It happens once (per model version), before you ever interact with the model, on a large cluster of specialised hardware over weeks or months. Modern LLM training has three broad stages:

Pretraining. The model reads an enormous corpus of text and is trained only to predict the next token. This is where the bulk of the cost and the bulk of the knowledge come from — the model absorbs grammar, facts, and reasoning patterns. The output is a "base model": fluent, knowledgeable, but not yet good at following instructions or behaving as an assistant.
Fine-tuning / instruction tuning. The base model is further trained on curated examples of instructions paired with good responses, teaching it to be a helpful assistant rather than a raw text-continuation engine.
Preference tuning (RLHF and successors). Humans (or models acting as judges) rank competing responses; the model is nudged toward the preferred ones. RLHF — reinforcement learning from human feedback — is the best-known method. This stage shapes tone, helpfulness, and safety. It refines behaviour; it does not teach much new knowledge.

The essential point: training is when learning happens, and it ends. When training finishes, the weights are frozen into a fixed file. That file is the model.

STEP 2

Inference: running the frozen model.

Inference is what happens every time you send a prompt. The frozen weights are loaded, your text flows through the network once per generated token, and a response comes out. Crucially, inference does not change the model. No weight is updated. The next user gets exactly the same weights you did.

This single fact dissolves the most common misconception:

Chatting with an LLM does not teach it. Within one conversation it appears to "remember" earlier messages only because the entire conversation is fed back in as input on every turn (that is the context window). Close the conversation and that context is gone. Your chat did not modify any weight. (Providers may separately log conversations and use them in a future training run if their policy allows — but that is a deliberate, separate data pipeline, not the model learning live.)

So an LLM during inference is effectively a very large fixed function: same weights, same prompt, same settings → same probability distribution. (It still feels non-deterministic because of the sampling step and infrastructure details — see the temperature and sampling entry — but the underlying function is fixed.)

STEP 3

Why the asymmetry matters: cost, speed, and what you control.

Training and inference have wildly different economics, and the asymmetry drives most practical decisions:

Training: enormous, one-time, centralised. Pretraining a frontier model can cost millions of dollars in compute and is done by the model provider, not by you. You almost never train an LLM from scratch.
Inference: small, repeated, per-request. Each request is comparatively cheap, but you pay it again on every single call, forever. At scale, total inference cost dwarfs the one-time training cost. This is why optimising prompt length, caching, and model choice is where engineering effort actually goes.

The asymmetry also tells you what you can and cannot change. You cannot change the weights at inference time. What you can change is the input: the prompt, the examples you include, the documents you retrieve and paste in, the tools you offer. This is the entire reason prompting, retrieval-augmented generation, and tool use exist — they are how you influence a model whose weights you are not allowed to touch.

STEP 4

"But can't it learn new things?" — the three real options.

People reasonably ask how a model ever incorporates new information if inference cannot change it. There are exactly three mechanisms, and distinguishing them prevents most confusion:

In-context (no weight change). Put the new information directly in the prompt — paste the document, give an example, attach retrieved text. The model "uses" it for this request only. Fast, cheap, reversible, forgotten when the context ends. This covers the large majority of real applications.
Fine-tuning (deliberate new weights). Run an additional, smaller training pass on your own examples to produce a new variant with permanently adjusted weights. Useful for steady style or format needs; it is a real training process with real cost, not something that happens by chatting.
A new model version (full retrain). The provider trains a new model on newer data. This is why a model's knowledge cutoff only advances when a new version ships — not gradually as people use it.

Keep the boundary crisp: training creates the brain and then stops; inference uses the brain without altering it; anything that feels like "the model learning from you in the moment" is really context being fed in, not weights being changed. Carry that distinction and questions about cost, privacy, knowledge cutoffs, and why your clever prompt did not "stick" mostly answer themselves.