Retrieval-augmented generation (RAG) explained

Concepts · Core Building Blocks

Retrieval-augmented generation (RAG) explained.

A model only knows what was in its training data and what is in the current prompt. RAG fixes the gap by fetching relevant text at query time and inserting it into the prompt, so the model answers from provided documents instead of memory. This entry covers why RAG exists, the retrieve→augment→generate flow, what each stage really does, and the failure modes that decide whether a RAG system is trustworthy.

STEP 1

The problem RAG solves.

Three structural limits of a bare LLM, all with the same root cause — the model answers from frozen training weights:

Knowledge cutoff. It cannot know events, prices, or docs created after training.
No private data. It never saw your codebase, your wiki, your customer's account.
Confident gap-filling. Asked about something it half-knows, it produces plausible, confident, wrong text. This is the training-distribution-gap hallucination mechanism.

RAG's insight: the model is far better at reading and synthesizing provided text than at recalling from training. So do not ask what it knows — hand it the relevant documents and ask it to answer from those. Grounding shifts the workload from unreliable memory to reliable reading.

STEP 2

Retrieve → Augment → Generate.

User question
   |
   v
[1 RETRIEVE]  search a knowledge source for passages
              relevant to the question  -->  top-k chunks
   |
   v
[2 AUGMENT]   build a prompt:
              system rules + retrieved chunks + the question
   |
   v
[3 GENERATE]  model answers using ONLY the supplied chunks,
              ideally citing which chunk each claim came from
   |
   v
Grounded answer (+ citations)

RAG is not one model or one library — it is this pattern. The model itself is unchanged; what changes is that you assemble its context dynamically per question. Connecting to the section: retrieval is effectively a tool the model's context is wired to, and the augment step is just prompting with freshly fetched material.

STEP 3

What each stage actually does.

Retrieve

You cannot put the whole corpus in the prompt — it would not fit and would trigger lost-in-the-middle. So an index is searched for the few passages most likely to contain the answer. Modern RAG usually uses semantic (embedding-based) search so "how do I reset my password" matches a doc titled "Account recovery steps" even with no shared keywords. Chunking and vector search are big enough to get their own entry; here, treat retrieve as "find the top-k relevant chunks."

Augment

The retrieved chunks are inserted into the prompt with explicit framing: "Answer using only the context below. If the answer is not in the context, say you don't know. Cite the chunk you used." This instruction is what converts raw retrieval into grounded generation — without it the model blends retrieved text with training memory and you lose the guarantees.

Generate

The model writes the answer conditioned on the question plus the chunks. Done well, it synthesizes across passages and attributes claims to sources, which makes answers checkable — the single biggest reason RAG is trusted in production.

STEP 4

RAG vs alternatives.

vs fine-tuning. Fine-tuning teaches behavior and style; it is a poor and expensive way to inject facts, and updating a fact means retraining. RAG updates instantly — change the document, the next answer reflects it. Use fine-tuning for "how to respond," RAG for "what is true right now."
vs long context. The 2025–2026 consensus is more nuanced than "RAG wins." For a small, stable, bounded corpus, pasting the whole thing into a long-context window can match or beat RAG on answer quality, since the model sees everything and nothing is lost to a bad retrieval. RAG stays far cheaper, lower-latency, and fresher for large or changing corpora, and it sidesteps lost-in-the-middle position bias. So the modern answer is to route: simple, dynamic, or large → RAG; a small fixed set that needs whole-document reasoning → long context; agentic systems often blend both. RAG is not dead — the framing just shifted from "which wins" to "which to route to."

STEP 5

Where RAG breaks.

"We added RAG" does not guarantee correctness. Most RAG failures are retrieval failures, not generation failures:

Retrieval miss. The right passage was never fetched, so the model answers from the wrong chunks or from memory. If retrieval misses, generation cannot recover. This is the dominant failure; evaluate retrieval recall separately from answer quality.
Bad chunking. The answer was split across two chunks and only one was retrieved, so the model sees half the story.
Stale index. The source document changed but the index was not rebuilt. RAG is only as fresh as its last indexing run.
Ungrounded generation. Chunks were retrieved correctly but the model ignored them and used training memory anyway. Mitigate with explicit grounding instructions and a post-generation check that every claim traces to a retrieved chunk.
Poisoned context. Retrieved content is untrusted data. A document containing "ignore previous instructions" is a prompt-injection vector. Delimit and label retrieved text as data, never instructions.

Debug RAG in two halves. First ask: was the right chunk retrieved at all? (Inspect the retrieved set directly.) Only if yes ask: did the model use it correctly? Conflating these two is why teams spend weeks tuning prompts when the real bug is that the retriever never returned the answer.

Takeaway

Deliverable

You can explain RAG as retrieve→augment→generate: fetch the top-k relevant passages, build a prompt that instructs the model to answer only from them and cite them, then generate a checkable grounded answer. You know it exists because models read provided text far better than they recall training, and that it beats fine-tuning for facts and beats long-context dumping on cost and quality at scale. Above all you debug it in two halves — retrieval first, generation second — because most RAG bugs are the retriever not finding the answer, which no prompt can fix.