Idempotency, Retries & Side-Effect Safety

O3
Operation · AgentOps: Deploy & Operate

Idempotency, retries and side-effect safety: the agent will do it twice.

An agent loop retries — on transient model errors, on tool timeouts, on crash-and-resume, and because the model itself sometimes just calls the same tool again. Every one of those is a chance to send the email twice, charge the card twice, file the ticket twice. The "exactly-once" you want does not exist at the network layer; it is something you construct with idempotency keys, retry classification, and a side-effect ledger. This essay is that construction.

STEP 1

Every write tool is at-least-once until you make it idempotent.

Distributed systems give you at-most-once (might be lost) or at-least-once (might be duplicated). Agents make the duplication path far more likely than ordinary services because there are four independent retry sources stacked on top of each other: the HTTP client's retry, the loop's tool-error retry, the durable runtime's crash-resume, and the model re-emitting a tool call it already made. An unguarded write tool is not "usually fine"; it is a duplicate waiting for the first timeout.

The only robust answer is to make the effect idempotent: the second identical call is a no-op that returns the first call's result. That requires a stable identity for "this intended action," carried as an idempotency key the tool layer enforces — not hope that retries won't happen.

STEP 2

The idempotency key is derived from intent, not generated at call time.

A key generated fresh inside the tool wrapper changes on every retry and protects nothing. The key must be a deterministic function of the logical action, so that the retried call computes the same key and collides with the first attempt's ledger entry.

# tools/idem.py — key is a function of intent, not of the attempt
def idem_key(run_id, step_seq, tool, args):
    # run_id + step pins it to one decision in one run;
    # arg-hash defends against the model re-deciding identically
    h = sha256(canonical(args)).hexdigest()[:16]
    return f"{run_id}:{step_seq}:{tool}:{h}"

def do_write(key, fn, args):
    hit = ledger.get(key)
    if hit and hit.status == "DONE":
        return hit.result            # replay, do NOT re-execute
    ledger.put(key, status="PENDING")  # claim before effect
    res = fn(**args, idempotency_key=key)  # pass downstream too
    ledger.put(key, status="DONE", result=res)
    return res

Two layers of defense in the key: run_id:step_seq ties it to one decision in the durable journal so crash-resume reuses it; the args hash defends against the model independently re-emitting an identical call. Pass the key downstream too — Stripe, payment, and messaging APIs accept idempotency keys; let theirs and yours agree.

Canonicalize args before hashing (sorted keys, normalized numbers, stripped volatile fields like a client timestamp). A key that flips because the model reformatted whitespace in an argument is not an idempotency key — it is a duplicate generator with extra steps.

STEP 3

Classify failures: retryable, poison, or ambiguous.

"Just retry on error" is how a poison input gets attempted 50 times and how an ambiguous timeout becomes a duplicate. Every tool failure must be classified before you decide to retry:

  • Retryable — transient and provably without effect: connection refused, 429, 503, DNS blip. Retry with capped exponential backoff and jitter, bounded attempts.
  • Poison — deterministic and will never succeed: 400 validation error, "account not found," schema mismatch. Retrying is pure waste and burns budget. Stop, surface to the agent as a real observation, do not retry.
  • Ambiguous — the dangerous class: a timeout or dropped connection after the request may have been processed. You do not know if the effect happened. Never blindly retry; reconcile against the ledger or the downstream system's status endpoint first.

The ambiguous case is where money gets lost or duplicated. A 200 you never received looks identical to a request that never landed. The only safe move is reconciliation by idempotency key — "did action K already take effect?" — not a retry and not a guess.

STEP 4

The side-effect ledger is the source of truth, not the model's memory.

The agent's context is a terrible record of what it did to the world — it gets compacted, summarized, and sometimes hallucinated. The authoritative record of external effects is a separate, durable side-effect ledger: one row per attempted write, keyed by idempotency key, with status (PENDING / DONE / FAILED), the result, and a timestamp. It answers three operationally critical questions the context cannot: did this happen, exactly once?, and what was the result on the first try? On crash-resume the ledger — not the LLM — decides whether to execute or replay.

STEP 5

Dedupe windows and the honest limits of "exactly-once."

True exactly-once delivery is impossible across an unreliable network; what you build is exactly-once effect via idempotent operations plus a dedupe window. The window is the retention horizon of the ledger: collide within it and you replay; fall outside it and a very-late retry can duplicate.

  • Size the dedupe window to comfortably exceed your worst-case end-to-end retry horizon (including crash-resume after a multi-hour outage), not the average.
  • Prefer downstream-native idempotency where it exists (payment, email providers): their window and durability beat anything you bolt on.
  • For effects with no idempotency support (a fire-and-forget webhook to a partner), make the operation logically idempotent — "set state to X," not "increment" — or gate it behind human approval.
STEP 6

When this machinery is more than the action deserves.

Idempotency keys, classification, and a ledger are real cost: a write before every effect, a reconciliation path, retention to operate. For read-only tools (search, fetch, query) it is mostly unnecessary — a duplicated read wastes tokens, not money, so naive retry is fine. The full apparatus is mandatory exactly when an action is externally visible, irreversible, or moves value. Spend idempotency where a duplicate is an incident; skip it where a duplicate is just a wasted GET — and never let the cost of doing it right be the reason a payment tool ships without a key.