Retrieval-Augmented Memory

M4
Deep Dive · Memory & Context Engineering

Retrieval-augmented memory: embedding + retrieval as the recall mechanism.

Long-term memory is inert until something fetches the right slice and splices it into the working context. That fetch is a retrieval problem — the same machinery as RAG, but with memory-specific twists: the query is the agent's own evolving state, recency and salience matter as much as similarity, and a bad recall actively poisons reasoning rather than just being unhelpful.

STEP 1

Recall is retrieval. Reuse the stack, change the inputs.

If you have read the Field Guide's Retrieval chapter, the mechanics are familiar: chunk, embed, index, hybrid search, rerank. Memory recall reuses all of it. What changes is what you index and how you query:

  • The corpus is the agent's own history, not a static document set. It grows every turn and must stay bounded (see context-compaction).
  • The query is not a user question. It is a constructed cue derived from the current goal, the active sub-task, and the scratchpad — the agent's "state of mind," not its last utterance.
  • Ranking is not pure similarity. A memory's value is similarity and recency and salience — the Generative Agents retrieval score, adapted.
STEP 2

Construct the recall query from agent state, not the last message.

Embedding the user's last message and calling it a memory query is the most common mistake. The relevant memory is often related to the task, not the phrasing. Build an explicit cue:

# memory/recall_query.py
def build_cue(state) -> str:
    # A compact natural-language description of what the
    # agent is trying to do RIGHT NOW — this is what we embed.
    return (
        f"Goal: {state.goal}\n"
        f"Current step: {state.active_subtask}\n"
        f"Open questions: {'; '.join(state.open_loops)}"
    )

For tasks where the literal user wording also matters (a specific error string, an identifier), run hybrid retrieval — lexical over the raw utterance, dense over the constructed cue — and fuse, exactly as in the Retrieval chapter. The cue handles "what am I doing"; the lexical arm catches "the exact token the user typed."

STEP 3

Score by relevance, recency, and salience.

Pure cosine similarity retrieves the most similar memory, which is not the most useful one. A memory that is slightly less similar but was written five turns ago and was flagged important should usually win over a stale, marginal near-duplicate. The composite score:

# memory/score.py
import math

def recency(age_seconds: float, half_life: float) -> float:
    return 0.5 ** (age_seconds / half_life)

def score(mem, sim: float, now: float,
          w=(0.6, 0.25, 0.15)) -> float:
    rec = recency(now - mem.last_used, half_life=7 * 86400)
    sal = mem.salience                      # 0..1, set at write
    a, b, c = w
    return a * sim + b * rec + c * sal

def recall(store, cue, embed, now, k=5) -> list:
    cand = store.search(embed(cue), k=30)        # wide net
    ranked = sorted(
        cand, key=lambda h: score(h.mem, h.sim, now),
        reverse=True)
    for h in ranked[:k]:
        h.mem.last_used = now    # used → refreshes recency
    return ranked[:k]

The last_used refresh creates a useful dynamic: memories that keep getting retrieved stay "warm" and easy to recall; memories never retrieved cool down and become eviction candidates. This is an access-frequency signal for free, and it is the read-side counterpart to the decay logic in memory-stores.

The weights are workload-dependent. A coding agent on a long task wants recency high (the last 20 turns dominate). A personal assistant recalling user preferences wants salience high (a stated preference from a month ago still matters). Tune w with the eval harness in evaluating-memory, do not guess once and freeze it.

STEP 4

Threshold before truncate, and budget what survives.

Two non-negotiable gates between the ranked list and the prompt:

  • Absolute relevance floor. Take top-k only after dropping everything below a minimum composite score. "Top 5" with no floor will, on an off-topic turn, inject five irrelevant memories with confident framing. An empty recall is correct and safe when nothing is relevant.
  • Working-set budget. Recalled memories compete for the long_term budget from context-budgeting. If the survivors exceed it, summarize them into a single compacted block rather than dropping arbitrary ones.
recall("deploy the tier migration")
  candidates: 30
  after composite scoring + floor(0.35): 3 survive
    [sem]  prod deploys gated on staging check   0.71
    [epi]  last migration rollback was turn 41   0.52
    [proc] migration apply checklist             0.48
  18 below floor, dropped (correctly)
  3 memories, 420 tok — under long_term budget, no compaction

Retrieval drift: as the agent's state evolves mid-task, an unchanged cue keeps recalling the same early memories while the relevant ones have moved on. Rebuild the cue from current state every turn, not once at task start. A stale cue is a silent, compounding failure — it looks like the memory system is "working" because it returns results.

STEP 5

Render retrieved memory so the model trusts it correctly.

How recalled memory is formatted into the prompt changes whether the model treats it as ground truth, as a hint, or as a distractor. Three rules:

  • Label provenance and kind. "[recalled semantic memory, written turn 12]" lets the model weight it appropriately and lets you debug which memory drove a decision.
  • Mark uncertainty. A reflected conclusion is an inference, not a fact. Render it as "previously inferred" so the model does not treat a guess as gospel.
  • Keep it out of the authoritative system block. Retrieved memory is evidence, not policy. Putting a recalled (and possibly wrong) memory where the model expects immutable instructions is how memory poisoning becomes memory obedience.

Retrieval-augmented memory is the load-bearing mechanism of the whole stack: types decide what is stored, stores decide where, compaction keeps it bounded — but recall is what actually puts the right past in front of the model at the right moment. If recall is wrong, every other layer is wasted effort.