AgentOps: Deploy & Operate

Running agents in production: rollout, versioning, scaling, idempotent retries, cost control, incident response.

  1. Durable State & Resumability
    Make the agent loop a durable computation: event-sourced history, journal-before-effect, and a resume path that replays recorded steps rather than re-deriving them, so a crash or redeploy never restarts a half-done task.
  2. Concurrency, Queues & Scaling
    Agents are batch jobs, not requests: a queue with leased workers, per-tenant concurrency caps, journal-as-state for horizontal scale, and bounded fan-out are what survive production load.
  3. Idempotency, Retries & Side-Effect Safety
    Four stacked retry sources mean every write tool will fire twice unless you construct exactly-once with intent-derived idempotency keys, failure classification, and a durable side-effect ledger.
  4. Cost Control at the Loop Level
    Agent cost is unbounded by default; treat the per-task token/step/dollar ceiling as a fail-closed circuit breaker, then tune model cascades, prompt and tool caching, and early-exit against a quality metric.
  5. Rollout, Versioning & Pinning
    Behavior is the (model, prompt, tools) triple; pin it to dated snapshots, stamp it on every run, and promote new versions only through shadow/canary plus an eval gate with instant config-flip rollback.
  6. Incident Response & Runaway Containment
    A runaway agent fails open and keeps acting; detect from rate and progress, contain with in-loop fail-closed kill switches the resume path respects, rely on pre-installed blast-radius bounds, and turn every incident into a regression test.
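
As a small illustration of topic 3, the sketch below derives an idempotency key from the intent of a tool call (task, tool, arguments) and checks a side-effect ledger before firing the effect. All names here (`idempotency_key`, `SideEffectLedger`, `run_once`, the `send_email` tool) are hypothetical, and the ledger is an in-memory dict standing in for a durable store. It is not a full exactly-once construction: a crash between the effect and the ledger write would still need the downstream service to honor the same key.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def idempotency_key(task_id: str, tool: str, args: Dict[str, Any]) -> str:
    """Derive a key from the intent (task, tool, arguments), not from the attempt."""
    # Canonical JSON so that argument order does not change the key.
    canonical = json.dumps(
        {"task": task_id, "tool": tool, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SideEffectLedger:
    """In-memory stand-in for a durable ledger (assumption: a real one is a database)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> Any:
        # A retry with the same key returns the recorded result instead of re-firing.
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


# Usage: two attempts at the same intent produce a single side effect.
sent: List[str] = []


def send_email() -> str:
    sent.append("welcome")
    return "sent"


ledger = SideEffectLedger()
key = idempotency_key("task-42", "send_email", {"to": "a@example.com"})
first = ledger.run_once(key, send_email)
second = ledger.run_once(key, send_email)  # simulated retry
```

Because the key depends only on the intent, any of the stacked retry layers can re-issue the call and land on the same ledger entry.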