Retrieval-augmented memory: embedding + retrieval as the recall mechanism.
Long-term memory is inert until something fetches the right slice and splices it into the working context. That fetch is a retrieval problem — the same machinery as RAG, but with memory-specific twists: the query is the agent's own evolving state, recency and salience matter as much as similarity, and a bad recall actively poisons reasoning rather than just being unhelpful.
Recall is retrieval. Reuse the stack, change the inputs.
If you have read the Field Guide's Retrieval chapter, the mechanics are familiar: chunk, embed, index, hybrid search, rerank. Memory recall reuses all of it. What changes is what you index and how you query:
- The corpus is the agent's own history, not a static document set. It grows every turn and must stay bounded (see
context-compaction). - The query is not a user question. It is a constructed cue derived from the current goal, the active sub-task, and the scratchpad — the agent's "state of mind," not its last utterance.
- Ranking is not pure similarity. A memory's value is similarity and recency and salience — the Generative Agents retrieval score, adapted.
Construct the recall query from agent state, not the last message.
Embedding the user's last message and calling it a memory query is the most common mistake. The relevant memory is often related to the task, not the phrasing. Build an explicit cue:
# memory/recall_query.py def build_cue(state) -> str: # A compact natural-language description of what the # agent is trying to do RIGHT NOW — this is what we embed. return ( f"Goal: {state.goal}\n" f"Current step: {state.active_subtask}\n" f"Open questions: {'; '.join(state.open_loops)}" )
For tasks where the literal user wording also matters (a specific error string, an identifier), run hybrid retrieval — lexical over the raw utterance, dense over the constructed cue — and fuse, exactly as in the Retrieval chapter. The cue handles "what am I doing"; the lexical arm catches "the exact token the user typed."
Score by relevance, recency, and salience.
Pure cosine similarity retrieves the most similar memory, which is not the most useful one. A memory that is slightly less similar but was written five turns ago and was flagged important should usually win over a stale, marginal near-duplicate. The composite score:
# memory/score.py import math def recency(age_seconds: float, half_life: float) -> float: return 0.5 ** (age_seconds / half_life) def score(mem, sim: float, now: float, w=(0.6, 0.25, 0.15)) -> float: rec = recency(now - mem.last_used, half_life=7 * 86400) sal = mem.salience # 0..1, set at write a, b, c = w return a * sim + b * rec + c * sal def recall(store, cue, embed, now, k=5) -> list: cand = store.search(embed(cue), k=30) # wide net ranked = sorted( cand, key=lambda h: score(h.mem, h.sim, now), reverse=True) for h in ranked[:k]: h.mem.last_used = now # used → refreshes recency return ranked[:k]
The last_used refresh creates a useful dynamic: memories that keep getting retrieved stay "warm" and easy to recall; memories never retrieved cool down and become eviction candidates. This is an access-frequency signal for free, and it is the read-side counterpart to the decay logic in memory-stores.
The weights are workload-dependent. A coding agent on a long task wants recency high (the last 20 turns dominate). A personal assistant recalling user preferences wants salience high (a stated preference from a month ago still matters). Tune w with the eval harness in evaluating-memory, do not guess once and freeze it.
Threshold before truncate, and budget what survives.
Two non-negotiable gates between the ranked list and the prompt:
- Absolute relevance floor. Take top-k only after dropping everything below a minimum composite score. "Top 5" with no floor will, on an off-topic turn, inject five irrelevant memories with confident framing. An empty recall is correct and safe when nothing is relevant.
- Working-set budget. Recalled memories compete for the
long_termbudget fromcontext-budgeting. If the survivors exceed it, summarize them into a single compacted block rather than dropping arbitrary ones.
recall("deploy the tier migration")
candidates: 30
after composite scoring + floor(0.35): 3 survive
[sem] prod deploys gated on staging check 0.71
[epi] last migration rollback was turn 41 0.52
[proc] migration apply checklist 0.48
18 below floor, dropped (correctly)
3 memories, 420 tok — under long_term budget, no compaction
Retrieval drift: as the agent's state evolves mid-task, an unchanged cue keeps recalling the same early memories while the relevant ones have moved on. Rebuild the cue from current state every turn, not once at task start. A stale cue is a silent, compounding failure — it looks like the memory system is "working" because it returns results.
Render retrieved memory so the model trusts it correctly.
How recalled memory is formatted into the prompt changes whether the model treats it as ground truth, as a hint, or as a distractor. Three rules:
- Label provenance and kind. "[recalled semantic memory, written turn 12]" lets the model weight it appropriately and lets you debug which memory drove a decision.
- Mark uncertainty. A reflected conclusion is an inference, not a fact. Render it as "previously inferred" so the model does not treat a guess as gospel.
- Keep it out of the authoritative system block. Retrieved memory is evidence, not policy. Putting a recalled (and possibly wrong) memory where the model expects immutable instructions is how memory poisoning becomes memory obedience.
Retrieval-augmented memory is the load-bearing mechanism of the whole stack: types decide what is stored, stores decide where, compaction keeps it bounded — but recall is what actually puts the right past in front of the model at the right moment. If recall is wrong, every other layer is wasted effort.