Durable state and resumability: an agent that survives the process that ran it.
A production agent will be killed mid-loop: by a deploy, an OOM kill, a spot reclaim, a 3am pod eviction. The only question is whether it resumes where it left off or restarts a half-completed task from scratch and double-charges a customer. This essay is about making the agent loop a durable computation: state that outlives the process, replayable history, and a crisp line between what must persist and what you recompute.
The agent loop is a long-lived computation pretending to be a request.
The default mental model — "call the agent, await the answer" — is a lie that holds only until the first crash. A real agent loop is minutes-to-hours of think → call tool → observe → repeat, with side effects landing in the world along the way. If the holding process dies at step 14 of 30, an in-memory loop loses everything: the plan, the scratchpad, the fact that it already filed the refund. Durability is not a feature you add later; it is the data model you choose on day one.
The fork is: recompute or persist. Recomputing the whole trajectory from the original prompt is appealing (no storage) and wrong: LLM calls are non-deterministic and tool calls have side effects, so a "replay" re-asks the model, may get a different answer, and re-fires tools. The only sound design persists the loop's decisions and observations as they happen, so resume reads history rather than re-deriving it.
Event-sourced history is the natural representation.
Model the loop as an append-only log of events, not a mutable blob you overwrite each turn. Each model decision, each tool invocation, each observation is one immutable record. State is a fold over the log; resume is "replay the log into memory, then continue." This is the same insight as event sourcing, and it is why durable-execution engines (Temporal, Restate, DBOS, AWS Step Functions) all converge on it.
    # runtime/journal.py: append-only, fsync'd, monotonic seq
    # db, now(), apply_event and State are provided elsewhere in the runtime.
    from functools import reduce

    def record(run_id, seq, kind, payload):
        row = {"run_id": run_id, "seq": seq,
               "kind": kind,              # PLAN | TOOL_CALL | OBS | DONE
               "payload": payload, "ts": now()}
        db.append("journal", row)         # durable BEFORE the effect

    def load_state(run_id):
        # State is a fold over the ordered event log.
        events = db.scan("journal", run_id, order="seq")
        return reduce(apply_event, events, State.empty())
Write the journal entry for an intended tool call before you execute the tool, not after. On resume you then know "we intended call N and don't have its result" — which is exactly the state an idempotent retry needs (see idempotency-and-retries). Logging only completed calls loses the most dangerous in-flight ones.
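To make "state is a fold over the log" concrete, here is a minimal sketch of what a reducer might look like. The State fields, the apply_event body, and the in-memory list standing in for the journal are all illustrative assumptions, not the API of any engine named above; only the event kinds (PLAN, TOOL_CALL, OBS, DONE) come from the journal snippet.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class State:
    seq: int = 0                          # highest seq folded so far
    plan: Optional[str] = None
    pending_call: Optional[dict] = None   # intent journaled, result not yet
    observations: tuple = ()
    done: bool = False


def apply_event(state: State, event: dict) -> State:
    """Pure reducer: (state, event) -> new state. No I/O, no clock reads."""
    kind, payload = event["kind"], event["payload"]
    if kind == "PLAN":
        state = replace(state, plan=payload)
    elif kind == "TOOL_CALL":
        state = replace(state, pending_call=payload)
    elif kind == "OBS":
        # An observation closes out the pending intent.
        state = replace(state, pending_call=None,
                        observations=state.observations + (payload,))
    elif kind == "DONE":
        state = replace(state, done=True)
    return replace(state, seq=event["seq"])


def fold(events: list) -> State:
    ordered = sorted(events, key=lambda e: e["seq"])
    return reduce(apply_event, ordered, State())


# A journal that ends with an intent but no observation: the crash case.
journal = [
    {"seq": 1, "kind": "PLAN", "payload": "refund order 42"},
    {"seq": 2, "kind": "TOOL_CALL",
     "payload": {"tool": "refund", "idem_key": "run1-2"}},
]
state = fold(journal)
```

Because the TOOL_CALL was journaled before the effect, the folded state carries a non-empty pending_call, which is the signal a resume path needs in order to reconcile rather than guess.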
What must persist vs. what you recompute.
Persisting everything is slow and expensive; persisting too little loses the task. The discriminator is determinism and cost-to-rederive:
- Must persist: every LLM output, every tool call's arguments and result, the resolved plan, human approvals, and the seq counter. These are non-deterministic or side-effecting — they cannot be honestly recomputed.
- Recompute freely: derived views, the rendered prompt string, token counts, embeddings of already-stored text. Pure functions of persisted state; storing them is just a cache.
- Persist a pointer, not the bytes: large tool payloads (a 40MB CSV) go to object storage; the journal holds the URI and a content hash. The log stays small and replay stays fast.
The rule of thumb: if regenerating it would call a model or touch the outside world, persist it; if it is a pure function of what you already persisted, recompute it.
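The "pointer, not the bytes" rule can be sketched as follows. This is an illustration under stated assumptions: OBJECT_STORE is an in-memory dict standing in for real object storage, the mem:// URI scheme and the 4096-byte INLINE_LIMIT are made up for the example, and small results are assumed to be UTF-8 text.

```python
import hashlib
import json

OBJECT_STORE = {}      # hypothetical stand-in for object storage: uri -> bytes
INLINE_LIMIT = 4096    # bytes; an arbitrary threshold chosen for this sketch


def put_blob(data: bytes) -> dict:
    """Store bytes out of line and return the pointer the journal keeps."""
    digest = hashlib.sha256(data).hexdigest()
    uri = f"mem://blobs/{digest}"   # content-addressed, so re-puts are harmless
    OBJECT_STORE[uri] = data
    return {"uri": uri, "sha256": digest, "size": len(data)}


def to_journal_payload(result: bytes) -> dict:
    """Small results go inline; large ones become a URI plus content hash."""
    if len(result) <= INLINE_LIMIT:
        return {"inline": result.decode("utf-8")}
    return {"ref": put_blob(result)}


def from_journal_payload(payload: dict) -> bytes:
    """Rehydrate a result on replay, verifying the hash for out-of-line blobs."""
    if "inline" in payload:
        return payload["inline"].encode("utf-8")
    ref = payload["ref"]
    data = OBJECT_STORE[ref["uri"]]
    if hashlib.sha256(data).hexdigest() != ref["sha256"]:
        raise ValueError("blob does not match the journaled hash")
    return data


big = b"id,amount\n" + b"1,9.99\n" * 100_000   # roughly 700 KB of CSV
payload = to_journal_payload(big)
journal_bytes = len(json.dumps(payload))        # what the log actually stores
```

The hash check on read is what lets replay trust the blob: if the object was overwritten or truncated after the run journaled it, resume fails loudly instead of continuing from a different observation than the one the model originally saw.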
Resume is replay-up-to-the-frontier, then continue.
Crash recovery is not "start over" and not "guess." It is: load the journal, fold it into state, find the frontier (highest completed seq), and re-enter the loop at the next step. The subtle case is a journal that ends with an intended-but-unconfirmed tool call — the process died between "I will call refund()" and recording its result.
    # runtime/resume.py
    def resume(run_id):
        st = load_state(run_id)
        if st.pending_call:                       # intent logged, result not
            # DO NOT blindly re-run: reconcile via idempotency key
            res = tool_status(st.pending_call.idem_key)
            if res is None:                       # provably never happened
                res = execute(st.pending_call)
            record(run_id, st.seq + 1, "OBS", res)
        return continue_loop(load_state(run_id))
The dangerous bug is a resume that re-issues a side-effecting call because its result was not journaled. Durable state without idempotency keys turns every crash into a duplicate action. The two essays are a single design: journal-before-effect plus idempotent-effect is the contract.
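The contract can be exercised end to end with a small simulation. Everything here is a hypothetical stand-in: JOURNAL and LEDGER are in-memory structures playing the durable journal and the payment provider's idempotency table, and refund is an invented tool. The scenario is the dangerous one: the intent was journaled, the effect landed, and the process died before the observation was recorded.

```python
JOURNAL = []    # in-memory stand-in for the durable journal
LEDGER = {}     # the tool's side: idem_key -> result (hypothetical)
EFFECTS = []    # every time money actually moves, for counting


def record(seq, kind, payload):
    JOURNAL.append({"seq": seq, "kind": kind, "payload": payload})


def refund(idem_key, amount):
    """Idempotent effect: a repeated key returns the first result."""
    if idem_key not in LEDGER:
        EFFECTS.append((idem_key, amount))
        LEDGER[idem_key] = {"refunded": amount}
    return LEDGER[idem_key]


def tool_status(idem_key):
    return LEDGER.get(idem_key)    # None means the effect never landed


def pending_call():
    """The trailing TOOL_CALL with no OBS after it, or None."""
    if JOURNAL and JOURNAL[-1]["kind"] == "TOOL_CALL":
        return JOURNAL[-1]
    return None


def resume():
    call = pending_call()
    if call is not None:
        p = call["payload"]
        res = tool_status(p["idem_key"])     # reconcile before acting
        if res is None:
            res = refund(p["idem_key"], p["amount"])
        record(call["seq"] + 1, "OBS", res)


# Intent journaled, effect landed, process died before recording OBS.
record(1, "TOOL_CALL", {"idem_key": "run1-1", "amount": 50})
refund("run1-1", 50)
# (crash happens here)
resume()
resume()    # a second resume finds nothing pending and does nothing
```

After both resumes the journal holds exactly one OBS and EFFECTS holds exactly one refund. Removing either half of the contract breaks this: without the journaled intent, resume does not know a call was in flight; without the idempotency key, it cannot tell whether that call landed.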
Redeploys are crashes you schedule — design for them.
The most common "crash" in production is your own deploy. Treat in-flight runs as a first-class migration problem. Three workable strategies, in order of preference:
- Drain: stop scheduling new runs, let in-flight ones reach a checkpoint boundary, then redeploy. Cleanest; needs bounded step latency and a max-drain timeout after which you fall back to resume.
- Checkpoint-and-resume: the journal already makes any pod fungible. New code picks up the run via resume(). Requires the journal schema to be forward/backward compatible across the deploy window.
- Version-pin the run: a run started under prompt/model version v7 resumes under v7, not whatever just shipped; otherwise the agent's "memory" and its current brain disagree (see rollout-and-versioning).
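Version pinning, the last strategy above, might look like the following sketch. The registry, the RunHeader type, and the model and prompt strings are invented for illustration; the one idea taken from the text is that the version is fixed when the run starts and looked up again on resume, rather than read from whatever is currently deployed.

```python
from dataclasses import dataclass

# Hypothetical registry: any version with in-flight runs must stay loadable.
AGENT_VERSIONS = {
    "v7": {"model": "model-a", "prompt": "You are a refunds agent (v7)."},
    "v8": {"model": "model-b", "prompt": "You are a refunds agent (v8)."},
}
CURRENT_VERSION = "v8"    # what new runs get after the deploy


@dataclass(frozen=True)
class RunHeader:
    run_id: str
    agent_version: str    # pinned at start; journaled with the run


def start_run(run_id: str) -> RunHeader:
    """New runs pin whatever version is current at start time."""
    return RunHeader(run_id, CURRENT_VERSION)


def config_for(header: RunHeader) -> dict:
    """Resume under the pinned version, never under CURRENT_VERSION."""
    try:
        return AGENT_VERSIONS[header.agent_version]
    except KeyError:
        raise RuntimeError(
            f"version {header.agent_version} was retired with runs in flight"
        ) from None


old = RunHeader("run-1", "v7")    # started before the deploy
new = start_run("run-2")          # started after it
```

The failure branch matters as much as the lookup: retiring a version while runs still reference it should be a loud error, not a silent fallback to the newest prompt.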
When durable execution is overkill.
Not every agent needs an event-sourced runtime. A sub-30-second, read-only, single-tool agent (a RAG question-answerer) can be a plain stateless request: if it dies, the user retries, nothing was written, nobody is double-charged. The machinery here earns its keep precisely when loops are long, side-effecting, or expensive to restart. Adopting Temporal-grade durability for a 5-second classifier is cargo-culting; skipping it for a multi-hour agent that moves money is negligence. Durability cost should track the cost of losing the run, not the sophistication of the framework you admire.