GraphRAG & Multi-Hop Retrieval

Deep Dive · Retrieval & RAG

GraphRAG & multi-hop retrieval: when top-k chunks cannot answer the question.

Flat vector RAG retrieves the k chunks most similar to the query. That is exactly the right tool for "what does the doc say about X" and exactly the wrong tool for "what are the main themes across the corpus" or "who approved the change that broke the service the customer complained about." Some questions need aggregation or traversal, not similarity. GraphRAG is the answer to those questions — and the cost of building it is the reason it should be the answer to only those questions.

STEP 1

Two query classes flat RAG structurally cannot serve.

Top-k similarity search has a hard ceiling that no reranker or larger k removes, because the failure is not retrieval quality — it is the shape of the operation.

Global / thematic queries. "What are the dominant themes across these 4,000 incident reports?" The answer is not in any chunk; it is a property of the whole corpus. Top-k returns the five chunks most similar to the word "themes," which is noise. Similarity cannot aggregate.
Relational multi-hop queries. "Which engineer reviewed the migration that introduced the regression the on-call paged about?" Answering this requires chaining regression → migration → reviewer across documents that share almost no surface vocabulary. Each hop's evidence lives in a different chunk, and no single chunk is similar to the full question. Similarity cannot traverse.

The diagnostic question: is the answer contained in a small set of passages, or is it a function of how many passages relate to each other? If the former, flat RAG (see advanced-rag-architectures) is correct and graphs are over-engineering. If the latter, no amount of chunk retrieval will assemble it — you need structure the index does not currently have.

STEP 2

GraphRAG: structure extracted at index time.

Microsoft's GraphRAG (released 2024, github.com/microsoft/graphrag) makes the corpus's latent structure explicit. At index time an LLM reads every chunk and extracts an entity–relationship graph; a community-detection algorithm (Leiden) partitions that graph into nested clusters of densely connected entities; then the LLM writes a natural-language summary of each community. The corpus is no longer a bag of chunks — it is a graph plus a hierarchy of summaries-of-summaries.

# graphrag/index_pipeline.py  (conceptual)
def build_graph_index(chunks):
    triples = []
    for ch in chunks:
        # LLM call per chunk: the expensive part.
        triples += llm_extract(ch, schema="(entity, relation, entity)")

    g = build_graph(triples)              # merge co-referent entities
    communities = leiden(g)               # hierarchical clusters

    summaries = {}
    for c in communities:                  # bottom-up: leaf summaries
        summaries[c.id] = llm_summarize(  # roll up into parents
            entities=c.members, edges=c.internal_edges)
    return g, communities, summaries      # the queryable artefact

Query time then splits by question class. Local search anchors on the entities named in the query, walks their neighbourhood (related entities, connecting relations, the chunks each was extracted from), and answers from that focused subgraph — this is the multi-hop relational path. Global search is a map-reduce over community summaries: each relevant community summary independently produces a partial answer (map), and those partials are reduced into one corpus-level response — this is the thematic-aggregation path that flat RAG cannot perform at all.

STEP 3

Multi-hop retrieval without a precomputed graph.

A full extracted graph is not the only way to traverse. The lighter pattern is an iterative retrieve–reason loop: retrieve for the current sub-question, let the model identify what is still missing, generate the next query from that gap, and repeat until the chain closes. The "graph" here is constructed on the fly, per query, by reasoning — nothing is built at index time.

# multihop/iterative.py
def multihop_answer(question, retrieve, llm, max_hops=4):
    facts, sub_q = [], question
    for hop in range(max_hops):
        ctx = retrieve(sub_q, k=5)         # flat vector search
        step = llm.reason(question, facts, ctx)
        facts += step.new_facts
        if step.is_sufficient:           # chain closed
            return llm.synthesize(question, facts)
        sub_q = step.next_query          # the next hop
    return llm.synthesize(question, facts)  # best effort

This buys most of the relational benefit with none of the index-time graph cost, at the price of higher query-time latency and token spend (several LLM round-trips per question). Use it when relational queries are a minority of traffic; a precomputed graph only earns its construction cost when traversal is the dominant workload. Knowledge-graph-backed retrieval more broadly — querying a curated KG you already maintain, or one materialised from structured records — sidesteps extraction entirely and is the cheapest option when the graph already exists.

STEP 4

The cost and staleness tax.

GraphRAG's power is bought at index time, and the bill is real. Extraction is one or more LLM calls per chunk, plus summarization per community and per hierarchy level — for a large corpus this is materially expensive, often orders of magnitude more than embedding the same text once. That cost is not paid once and forgotten.

Staleness is the under-discussed failure. Flat vector indexes accept an incremental upsert — new doc, new embeddings, done. A graph does not update locally: a single new document can introduce entities that re-shape community boundaries, which invalidates the summaries those communities feed. Corpora that change daily either pay continual re-extraction or serve answers from a graph that no longer matches the source. The maintenance cost, not the build cost, is what most teams underestimate.

The payoff justifies the tax only under a specific profile: a corpus that is large (so themes are not skim-able), relationally dense (entities genuinely interconnect), and relatively stable (re-extraction is occasional, not continuous), serving a real share of global or multi-hop queries. Remove any one of those and the economics invert.

STEP 5

When NOT to use it, and the decision heuristic.

Most question-answering over a collection of independent documents — product docs, a support knowledge base, policy PDFs — does not need a graph. The answers live in one or a few passages; flat hybrid retrieval with a reranker is faster, cheaper, trivially incremental, and just as accurate. Reaching for a graph because it sounds principled is the same premature-sophistication error called out in memory-stores: the graph is the highest-cost option and the one most often adopted before its cost is justified.

The crisp heuristic:

Flat RAG — the answer is contained in a small set of passages. The default. Do not move off it without evidence.
Iterative multi-hop (no precomputed graph) — some queries need traversal, but they are a minority and the corpus changes often. Pay the cost at query time, per question.
GraphRAG (precomputed graph) — global/thematic or dense multi-hop queries are a sustained share of traffic, the corpus is large and relatively stable, and you have measured that flat retrieval loses real accuracy on them.
Hybrid — the realistic production answer: route flat-answerable queries to the cheap path and only thematic/relational ones into the graph or the iterative loop. Most corpora that warrant a graph still answer the majority of questions without one.

Decide with a query-class audit, not intuition. Sample real queries, label each as passage-local, multi-hop, or thematic, and measure flat-RAG accuracy per class. If the passage-local class dominates and scores well — the common case — you do not need a graph. Build one only for the classes that demonstrably fail, and route everything else around it.