AgentOps: Deploy & Operate
Running agents in production: rollout, versioning, scaling, idempotent retries, cost control, incident response.
- Durable State & Resumability: Make the agent loop a durable computation: event-sourced history, journal-before-effect, and resume that replays rather than re-derives, so a crash or redeploy never restarts a half-done task.
- Concurrency, Queues & Scaling: Agents are batch jobs, not requests: a queue with leased workers, per-tenant concurrency caps, journal-as-state for horizontal scale, and bounded fan-out are what survive production load.
- Idempotency, Retries & Side-Effect Safety: Four stacked retry sources mean every write tool will fire twice unless you construct exactly-once with intent-derived idempotency keys, failure classification, and a durable side-effect ledger.
- Cost Control at the Loop Level: Agent cost is unbounded by default; treat the per-task token/step/dollar ceiling as a fail-closed circuit breaker, then tune model cascades, prompt and tool caching, and early-exit against a quality metric.
- Rollout, Versioning & Pinning: Behavior is the (model, prompt, tools) triple; pin it to dated snapshots, stamp it on every run, and promote new versions only through shadow/canary plus an eval gate with instant config-flip rollback.
- Incident Response & Runaway Containment: A runaway agent fails open and keeps acting; detect from rate and progress, contain with in-loop fail-closed kill switches the resume path respects, rely on pre-installed blast-radius bounds, and turn every incident into a regression test.
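
As a rough illustration of two ideas from the list above, the sketch below pairs an intent-derived idempotency key and a side-effect ledger with a fail-closed per-task budget. Every name here (`Budget`, `BudgetExceeded`, `idempotency_key`, `SideEffectLedger`, `run_once`) is hypothetical and chosen for this example only. The ledger is an in-memory dict standing in for a durable store; a production version would presumably persist the intent before performing the effect and pass the key to the downstream API.

```python
import hashlib
import json


class BudgetExceeded(RuntimeError):
    """Raised when a task goes past its step or token ceiling."""


class Budget:
    """Per-task ceiling that fails closed: once exceeded, every charge raises."""

    def __init__(self, max_steps, max_tokens):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens):
        # Counters only ever grow, so a tripped budget stays tripped.
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps or self.tokens > self.max_tokens:
            raise BudgetExceeded(f"steps={self.steps} tokens={self.tokens}")


def idempotency_key(task_id, tool, args):
    """Derive the key from the intent (task, tool, arguments), not the attempt."""
    payload = json.dumps({"task": task_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SideEffectLedger:
    """In-memory stand-in (assumption) for a durable ledger of completed effects."""

    def __init__(self):
        self._results = {}

    def run_once(self, key, effect):
        # A retry with the same key replays the recorded result
        # instead of firing the side effect a second time.
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result
```

A retry that carries the same task, tool, and arguments produces the same key, so the second `run_once` call returns the stored result without invoking the effect again. Calling `Budget.charge` once per loop step, before the model call, means the loop stops with an exception rather than continuing to spend.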