Chunking & vector search intuition.
RAG's retrieve step rests on two mechanics: splitting documents into chunks, and finding the relevant ones by meaning rather than keywords. This entry builds the intuition for both — why we chunk and how chunk size is a tradeoff, what an embedding vector is, how similarity search works, and the practical knobs that decide whether retrieval actually finds the answer.
Why chunk at all.
You cannot index "the document" as one unit and you cannot stuff whole documents into the prompt. So before anything is searchable, every source is split into smaller passages — chunks — typically a few hundred tokens each. Two reasons:
- Retrieval precision. A user asks one narrow question. You want the one paragraph that answers it, not a 40-page PDF. Small units let search return a focused passage.
- Context budget. The prompt has finite room, and models tend to use material in the middle of a long prompt less reliably (the lost-in-the-middle effect). Retrieving 5 tight chunks beats retrieving 3 enormous documents that crowd out the answer.
Chunk size is a tradeoff, not a setting.
CHUNK TOO SMALL (e.g. 1 sentence)
+ very precise match
- answer is fragmented across many chunks; each lacks context
- "It supports this" -- "it" = ? the chunk lost the subject
CHUNK TOO LARGE (e.g. whole section)
+ each chunk is self-contained
- one chunk covers many topics -> diluted, weaker match
- wastes context budget; buries the relevant sentence
SWEET SPOT (often a paragraph / heading-bounded passage)
+ one coherent idea, enough context to stand alone
Two refinements that matter in practice. Overlap: let consecutive chunks share a sentence or two so an answer straddling a boundary is not cut in half. Structure-aware splitting: split on headings, paragraphs, or code blocks rather than every N characters, so a chunk is a coherent unit instead of an arbitrary slice. There is no universal best size — it depends on your documents and queries, and it is something you measure, not guess.
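The two refinements can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes paragraphs are separated by blank lines, uses a naive regex to find sentence ends, and measures size in characters rather than tokens. The names split_sentences and chunk_text and the parameters max_chars and overlap_sentences are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_text(text: str, max_chars: int = 200, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks, one paragraph at a time.

    Structure-aware: a chunk never crosses a blank-line paragraph boundary.
    Overlap: each follow-on chunk inside a paragraph starts with the last
    `overlap_sentences` sentences of the chunk before it.
    max_chars is a soft limit: the overlap can push a chunk slightly past it.
    """
    chunks: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text):
        current: list[str] = []
        for sentence in split_sentences(paragraph):
            candidate = " ".join(current + [sentence])
            if current and len(candidate) > max_chars:
                # Close the current chunk and carry its tail into the next one.
                chunks.append(" ".join(current))
                current = current[-overlap_sentences:] if overlap_sentences else []
            current.append(sentence)
        if current:
            chunks.append(" ".join(current))
    return chunks


sample = "Alpha one. Alpha two. Alpha three.\n\nBeta one. Beta two."
chunks = chunk_text(sample, max_chars=22, overlap_sentences=1)
for chunk in chunks:
    print(repr(chunk))
```

With these settings the first paragraph yields two chunks that share the sentence "Alpha two.", and the second paragraph becomes its own chunk with no overlap across the paragraph break. Changing max_chars and overlap_sentences and re-running a retrieval evaluation is one way to measure chunk size instead of guessing it.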
Embeddings: meaning as coordinates.
Keyword search fails when the user and the document use different words for the same idea ("reset password" vs "account recovery"). The fix is to represent meaning numerically. An embedding model turns a piece of text into a vector — a list of, say, 1,536 numbers — positioned so that texts with similar meaning land near each other in that space.
"reset my password" -> [ 0.02, -0.91, 0.33, ... ] --.
"recover account access"-> [ 0.04, -0.88, 0.31, ... ] --+-- close together
"chocolate cake recipe" -> [-0.77, 0.12, -0.50, ... ] -- far away
You do not interpret the individual numbers; no single dimension means "formality" or "topic." What matters is relative position: semantically similar text clusters, unrelated text is distant. This is the same embedding idea as in the Foundations section — here it is the engine of retrieval.
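To make "near" and "far" concrete, the sketch below scores the three illustrative vectors above with cosine similarity, one common measure of closeness. It uses only the three components shown, which is an assumption for illustration; real embeddings have far more dimensions and come from a model, not from hand-typed numbers. The function name cosine_similarity is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot product divided by the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Truncated, illustrative vectors copied from the diagram (first three components only).
reset_password = [0.02, -0.91, 0.33]
recover_account = [0.04, -0.88, 0.31]
cake_recipe = [-0.77, 0.12, -0.50]

similar = cosine_similarity(reset_password, recover_account)
unrelated = cosine_similarity(reset_password, cake_recipe)
print(f"reset vs recover: {similar:.3f}")
print(f"reset vs cake:    {unrelated:.3f}")
```

The first score comes out very close to 1 and the second is negative, which matches the picture: the two account-related phrases point in nearly the same direction, while the recipe points elsewhere.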
Vector search: nearest neighbours.
The retrieval pipeline becomes:
- Index (offline). Chunk every document, embed each chunk, store the vectors in a vector database alongside the original text.
- Query (online). Embed the user's question with the same model, then find the stored vectors closest to it.
- Return. The top-k closest chunks' original text becomes the retrieved context for RAG.
"Closest" is measured by a similarity metric — most commonly cosine similarity, which compares the direction of two vectors (1.0 = same meaning, 0 = unrelated). At scale, exact nearest-neighbour search is too slow, so vector databases use approximate nearest-neighbour (ANN) indexes: slightly less exact, dramatically faster, almost always the right trade.
Query and documents must be embedded with the same model. Vectors from different models live in incompatible spaces; comparing them produces meaningless distances. If you change embedding models, you must re-embed the entire corpus.
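The index, query, and return steps can be sketched as a tiny in-memory index. Everything here is a stand-in under stated assumptions: toy_embed is a hypothetical hashed bag-of-words function that captures word overlap only, not meaning, and takes the place of a real embedding model; TinyVectorIndex is a hypothetical class that does exact brute-force search where a real vector database would use an ANN index. Vectors are unit-normalised when created, so a plain dot product equals cosine similarity.

```python
import math
import re
import zlib

DIMS = 4096  # toy width, chosen large so hash collisions between words are rare


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model (assumption): hashed bag-of-words, unit length."""
    vec = [0.0] * DIMS
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIMS] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class TinyVectorIndex:
    """Stores (vector, original text) pairs and searches them exhaustively."""

    def __init__(self, embed):
        # One embed function for both chunks and queries: the same-model rule.
        self.embed = embed
        self.rows: list[tuple[list[float], str]] = []

    def add(self, chunk: str) -> None:
        # Index step: embed the chunk and keep its text alongside the vector.
        self.rows.append((self.embed(chunk), chunk))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
        # Query step: embed the question, score every stored vector, keep the top k.
        q = self.embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), text) for vec, text in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = TinyVectorIndex(toy_embed)
for chunk in [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first day of each month.",
    "Chocolate cake needs cocoa, flour, eggs and sugar.",
]:
    index.add(chunk)

results = index.search("how do I reset my password", k=1)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Binding the embed function to the index is a deliberate choice in this sketch: it makes it hard to embed queries with a different model from the one used for the stored chunks. Swapping in a different embed function means building a new index and calling add on every chunk again, which is the re-embedding cost described above.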
Why pure vector search is not enough.
Semantic search has real blind spots, and knowing them is what separates a demo from a system:
- Exact tokens. Product codes, error numbers, names, an exact API symbol — embeddings smear these into "approximately similar," and "ERR_4012" vs "ERR_4021" can look close. Keyword/BM25 search is precise here. Production retrieval often uses hybrid search: combine keyword and vector results.
- Top-k is a tradeoff. Too small and the answer chunk is missed; too large and you flood the context with noise and trigger lost-in-the-middle. A common starting point is k=5, which you then tune on real queries.
- Reranking. Vector search is fast but coarse. A common upgrade: retrieve a generous candidate set with ANN, then reorder it with a slower, more accurate reranker model and keep the top few. Cheap retrieval for recall, expensive reranking for precision.
- Evaluate retrieval on its own. Build a small set of questions with known correct chunks and measure: is the right chunk in the top-k? Most "the AI gave a wrong answer" bugs are actually "the right chunk was never retrieved" — and you can only see that if you measure retrieval separately from generation.
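Two of the points above lend themselves to a short sketch: merging keyword and vector results, and scoring retrieval on its own. The text does not say how the two result lists are combined; reciprocal rank fusion (RRF) is assumed here as one widely used option, with the conventional constant 60. The chunk ids, questions, and function names are all hypothetical, and the ranked lists are hard-coded instead of coming from a real BM25 or vector search.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each list adds 1 / (c + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


def hit_rate_at_k(retrieved: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose known-correct chunk id appears in the top k results."""
    hits = sum(1 for question, gold in expected.items() if gold in retrieved.get(question, [])[:k])
    return hits / len(expected)


# Hypothetical ranked results for one question from two separate searches.
keyword_hits = ["c7", "c2", "c9"]
vector_hits = ["c2", "c4", "c7"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)

# Hypothetical evaluation set: question -> id of the chunk known to hold the answer.
expected = {"q1": "c7", "q2": "c3"}
retrieved = {"q1": fused, "q2": ["c5", "c1"]}
score = hit_rate_at_k(retrieved, expected, k=2)
print(score)
```

Chunks that rank well in both lists (c2 and c7) rise to the top of the fused list, ahead of chunks that only one search found. The hit rate here is 0.5: the correct chunk for q1 is in the top 2, while q2 never retrieved its correct chunk, which is exactly the kind of failure that stays invisible if only the final generated answer is inspected.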
Deliverable
You understand why documents are chunked and that chunk size is a precision-versus-context tradeoff, managed with overlap and structure-aware splitting and tuned by measurement. You can explain an embedding as text mapped to a vector where nearby means similar in meaning, and vector search as embedding the query with the same model and returning its nearest neighbours by cosine similarity (approximate at scale). You know pure semantic search misses exact tokens, so production uses hybrid search and reranking, and that the only way to trust retrieval is to evaluate it directly, separately from the generated answer.