Durable State & Resumability

O1
Operation · AgentOps: Deploy & Operate

Durable state and resumability: an agent that survives the process that ran it.

A production agent will be killed mid-loop — by a deploy, an OOM, a spot reclaim, a 3am pod eviction. The only question is whether it resumes where it was or starts a fresh, half-completed task that now double-charges a customer. This essay is about making the agent loop a durable computation: state that outlives the process, replayable history, and a crisp line between what must persist and what you recompute.

STEP 1

The agent loop is a long-lived computation pretending to be a request.

The default mental model — "call the agent, await the answer" — is a lie that holds only until the first crash. A real agent loop is minutes-to-hours of think → call tool → observe → repeat, with side effects landing in the world along the way. If the holding process dies at step 14 of 30, an in-memory loop loses everything: the plan, the scratchpad, the fact that it already filed the refund. Durability is not a feature you add later; it is the data model you choose on day one.

The fork is: recompute or persist. Recomputing the whole trajectory from the original prompt is appealing (no storage) and wrong — LLM calls are non-deterministic and side-effecting, so a "replay" re-asks the model and re-fires tools. The only sound design persists the loop's decisions and observations as they happen, so resume reads history rather than re-deriving it.

STEP 2

Event-sourced history is the natural representation.

Model the loop as an append-only log of events, not a mutable blob you overwrite each turn. Each model decision, each tool invocation, each observation is one immutable record. State is a fold over the log; resume is "replay the log into memory, then continue." This is the same insight as event sourcing, and it is why durable-execution engines (Temporal, Restate, DBOS, AWS Step Functions) all converge on it.

# runtime/journal.py — append-only, fsync'd, monotonic seq
def record(run_id, seq, kind, payload):
    row = {"run_id": run_id, "seq": seq,
           "kind": kind,           # PLAN | TOOL_CALL | OBS | DONE
           "payload": payload,
           "ts": now()}
    db.append("journal", row)      # durable BEFORE the effect

def load_state(run_id):
    events = db.scan("journal", run_id, order="seq")
    return reduce(apply_event, events, State.empty())

Write the journal entry for an intended tool call before you execute the tool, not after. On resume you then know "we intended call N and don't have its result" — which is exactly the state an idempotent retry needs (see idempotency-and-retries). Logging only completed calls loses the most dangerous in-flight ones.

STEP 3

What must persist vs. what you recompute.

Persisting everything is slow and expensive; persisting too little loses the task. The discriminator is determinism and cost-to-rederive:

  • Must persist: every LLM output, every tool call's arguments and result, the resolved plan, human approvals, and the seq counter. These are non-deterministic or side-effecting — they cannot be honestly recomputed.
  • Recompute freely: derived views, the rendered prompt string, token counts, embeddings of already-stored text. Pure functions of persisted state; storing them is just a cache.
  • Persist a pointer, not the bytes: large tool payloads (a 40MB CSV) go to object storage; the journal holds the URI and a content hash. The log stays small and replay stays fast.

The rule of thumb: if regenerating it would call a model or touch the outside world, persist it; if it is a pure function of what you already persisted, recompute it.

STEP 4

Resume is replay-up-to-the-frontier, then continue.

Crash recovery is not "start over" and not "guess." It is: load the journal, fold it into state, find the frontier (highest completed seq), and re-enter the loop at the next step. The subtle case is a journal that ends with an intended-but-unconfirmed tool call — the process died between "I will call refund()" and recording its result.

# runtime/resume.py
def resume(run_id):
    st = load_state(run_id)
    if st.pending_call:                 # intent logged, result not
        # DO NOT blindly re-run: reconcile via idempotency key
        res = tool_status(st.pending_call.idem_key)
        if res is None:                   # provably never happened
            res = execute(st.pending_call)
        record(run_id, st.seq + 1, "OBS", res)
    return continue_loop(load_state(run_id))

The dangerous bug is a resume that re-issues a side-effecting call because its result was not journaled. Durable state without idempotency keys turns every crash into a duplicate action. The two essays are a single design: journal-before-effect plus idempotent-effect is the contract.

STEP 5

Redeploys are crashes you schedule — design for them.

The most common "crash" in production is your own deploy. Treat in-flight runs as a first-class migration problem. Three workable strategies, in order of preference:

  • Drain: stop scheduling new runs, let in-flight ones reach a checkpoint boundary, then redeploy. Cleanest; needs bounded step latency and a max-drain timeout after which you fall back to resume.
  • Checkpoint-and-resume: the journal already makes any pod fungible. New code picks up the run via resume(). Requires the journal schema to be forward/backward compatible across the deploy window.
  • Version-pin the run: a run started under prompt/model version v7 resumes under v7, not whatever just shipped — otherwise the agent's "memory" and its current brain disagree (see rollout-and-versioning).
STEP 6

When durable execution is overkill.

Not every agent needs an event-sourced runtime. A sub-30-second, read-only, single-tool agent (a RAG question-answerer) can be a plain stateless request: if it dies, the user retries, nothing was written, nobody is double-charged. The machinery here earns its keep precisely when loops are long, side-effecting, or expensive to restart. Adopting Temporal-grade durability for a 5-second classifier is cargo-culting; skipping it for a multi-hour agent that moves money is negligence. Durability cost should track the cost of losing the run, not the sophistication of the framework you admire.