Context Compaction & Hierarchical Memory

Deep Dive · Memory & Context Engineering

Context compaction: summarization, eviction, and hierarchical memory.

A long-running agent's trajectory grows without bound; the window does not. Compaction is the set of techniques that shrink history while preserving the information needed to keep going. Done well it is invisible; done badly it silently amnesia-s the agent mid-task. This is the operation that makes truly long-horizon agents possible.

STEP 1

The compaction ladder: cheapest survivable technique first.

Compaction is not one move. It is a ladder of increasingly lossy techniques; apply the gentlest one that keeps you under budget, and only escalate when it is not enough.

Truncation — drop the oldest turns wholesale. Cheapest, lossiest, and only safe if the dropped content was already persisted to long-term memory. Never truncate as your only mechanism.
Deduplication / pruning — remove redundant tool outputs, repeated retrievals, superseded plan versions. High value, near-zero risk: you are deleting noise, not signal.
Summarization — replace a span of turns with a model-generated synopsis. The workhorse. Lossy but information-preserving if scoped correctly.
Hierarchical summarization — summaries of summaries, so even very old history survives as a thin trace. The technique that unlocks effectively unbounded horizons.

Order matters. Prune before you summarize (do not pay a model to summarize duplicate junk). Summarize before you truncate (do not delete what you have not yet distilled). Truncate only what is both summarized and persisted.

STEP 2

Scoped summarization that preserves what the agent needs.

A generic "summarize the conversation" prompt loses exactly the things an agent needs: open loops, decisions and their reasons, error causes, hard constraints. Compaction summarization must be task-structured, not prose-structured.

# memory/compact.py
COMPACT = """Compress the following agent turns into a
structured state summary. Preserve, do not narrate:

DECISIONS:  choices made and the reason for each
FACTS:      durable facts established (mark source turn)
OPEN:       unfinished sub-tasks / unanswered questions
ERRORS:     failures seen and their root cause
CONSTRAINTS:rules that must still hold

Drop: pleasantries, superseded plans, raw tool dumps
already reflected in FACTS. Be terse. No prose.

TURNS:
{turns}"""

def summarize_span(turns: list[dict], llm) -> str:
    body = "\n".join(fmt(t) for t in turns)
    return llm(COMPACT.format(turns=body), max_tokens=600)

The structured headers are the contract. A summary that keeps OPEN and CONSTRAINTS verbatim-faithful can replace 40 turns of history and the agent continues correctly. A summary that paraphrases them loses the thread — the classic "the agent forgot it was not allowed to touch prod" failure.

Never summarize the current open loop. Compaction operates on resolved or aged history. The active sub-task and the last few turns stay verbatim in the working set — summarizing the thing you are mid-way through is how agents lose their place.

STEP 3

Hierarchical memory: the MemGPT-style tiering.

The decisive idea, articulated by Packer et al. in MemGPT (2023): treat the context window like RAM and external storage like disk, with the agent itself (via tools) managing paging between tiers. Three tiers:

TIER 0  WORKING   in-window, verbatim, fully attended
                  → current task, last k turns, scratchpad
TIER 1  RECALL    out-of-window, fast retrieval
                  → recent summaries, indexed episodics
TIER 2  ARCHIVE   out-of-window, cold, summary-of-summaries
                  → old-task traces, thin durable residue

paging: WORKING --evict--> summarize --> RECALL --age--> ARCHIVE
        ARCHIVE --recall on demand--> RECALL --promote--> WORKING

Crucially, the agent has tools to move data between tiers — it can choose to "remember this" (write to recall), "what did we decide about X" (search archive), or "page this back in." Memory management becomes part of the agent's action space, not just an invisible framework behind it. That is what makes the horizon effectively unbounded: the agent can always reach further back, it just pays a retrieval round-trip to do so.

STEP 4

Triggering: compact on pressure, not on a timer.

Compaction is expensive (a model call) and lossy. Trigger it from budget pressure, with hysteresis so you do not thrash on the boundary.

# memory/compactor.py
class Compactor:
    def __init__(self, budget, hi=0.85, lo=0.60):
        self.budget = budget
        self.hi, self.lo = hi, lo   # hysteresis band

    def maybe_compact(self, history) -> list:
        used = ntok(history) / self.budget.working
        if used < self.hi:
            return history          # under pressure threshold
        keep_recent = tail_until(history, self.lo)
        old = history[:-len(keep_recent)]
        if not old:
            return history          # nothing safe to compact
        persist_to_longterm(old)    # write BEFORE you shrink
        digest = summarize_span(old, llm)
        return [{"role": "system",
                 "content": "[compacted history]\n" + digest}] \
               + keep_recent

The hysteresis band (compact at 85% full, down to 60%) prevents the pathological case where every turn nudges over the line and triggers a fresh, expensive summarization that barely buys headroom. Compact in chunks that buy real slack.

STEP 5

Verify the compaction, because lossy is not the same as broken.

Every compaction is a place where the agent can silently lose a fact and you will only find out three turns later when it does something forbidden. Treat compaction like any other lossy transform: verify it.

Constraint survival check. Maintain an explicit list of hard constraints out-of-band. After compaction, assert each one still appears (semantically) in the digest; if not, re-inject it verbatim.
Open-loop survival check. Count open sub-tasks before and after. A drop is a red flag, not a feature.
Round-trip eval. Offline, ask known questions answerable only from pre-compaction history, against the post-compaction state. Track the answer-rate as a first-class metric (see evaluating-memory).

The cheapest robust safeguard: keep hard constraints and the current open-loop list in a small, never-compacted pinned block at the top of the working set. Compaction then physically cannot eat them, no matter how aggressive it gets. Spend tokens here — it is the highest-leverage budget you have.