Concurrency, queues and scaling: agents are jobs, not requests.
An HTTP handler that holds a connection open for a 40-minute agent loop is a design error you will discover under load. Agents are long-running, bursty, stateful, and expensive per unit of work — that profile is a batch job, and the operational shape that survives production is a queue with workers, per-tenant limits, and fan-out you can afford. This essay is about that shape.
The request/response framing breaks at the first concurrency spike.
Synchronous "POST /agent, await result" couples three things that have wildly different time constants: the client's patience (seconds), the agent's runtime (minutes), and your capacity (fixed). Under a burst, every slow agent pins a worker thread and a socket; the pool exhausts; healthy short requests queue behind 30-minute ones; load balancers time out and clients retry, multiplying the very load that is failing you. The fix is not a bigger thread pool; it is decoupling acceptance from execution.
The durable submission pattern: the API call enqueues a run and returns a run_id immediately. Workers pull from the queue and execute against the durable journal (see durable-state-and-resumability). The client polls or subscribes for completion. Acceptance is O(milliseconds) and never blocked by execution depth.
Queue + workers is the reference architecture.
# api: accept fast, never block on the loop def submit(req): run_id = new_id() journal.record(run_id, 0, "PLAN", req) queue.enqueue("agent-runs", run_id, tenant=req.tenant, priority=req.tier) return {"run_id": run_id, "status": "queued"} # worker: bounded concurrency, lease + heartbeat async def worker(slot): while True: job = await queue.lease("agent-runs", ttl=120) async with heartbeat(job): # renew lease while alive await run_loop(job.run_id) # resumable; safe to redeliver queue.ack(job)
The non-negotiable properties: leases with heartbeats (a dead worker's job becomes redeliverable, not lost — and because the run is resumable, redelivery is correct, not a duplicate), bounded worker concurrency (a fixed number of in-flight loops per worker, sized to provider rate limits and memory, not "as many as arrive"), and visibility (queue depth and oldest-message-age are your true load signal — not CPU).
Scale workers off queue depth and age, not CPU. An agent worker is mostly blocked on model and tool latency; CPU stays low while latency explodes. Autoscaling on CPU will under-provision exactly when you are drowning. Alert on oldest-message-age crossing your SLA.
Per-tenant concurrency limits, or one customer is everyone's outage.
A single global queue is a noisy-neighbor incident waiting to happen: one tenant submitting 5,000 runs starves every other tenant and burns the shared provider rate limit, turning their isolated burst into your platform-wide latency spike. You need fairness, enforced as a per-tenant in-flight cap, not just a global one.
# fair dispatch: cap concurrent runs per tenant def dispatchable(job): inflight = counter.get("inflight", job.tenant) cap = plan_limit(job.tenant) # e.g. free=2, pro=20 if inflight >= cap: queue.requeue(job, delay=backoff) # yield, don't drop return False counter.incr("inflight", job.tenant) return True
Implement this as work-stealing across per-tenant sub-queues, or a global queue with an admission gate as above. The cap is a product decision (plan tiers) and a safety device: it bounds the blast radius of one tenant's runaway loop (see incident-response-for-agents).
Statefulness is the real scaling tax — pay it with the journal.
Horizontal scaling is trivial for stateless services and hard for agents because an agent loop has state: plan, history, scratchpad. The trap is keeping that state in worker memory, which sticks a run to a pod and makes every restart a data loss. The discipline that buys horizontal scale: workers are stateless; the journal is the state. Any worker can pick up any run by replaying its journal. That single decision converts agents from sticky, un-rebalanceable workloads into fungible jobs you can autoscale, drain, and bin-pack like anything else.
In-memory caches for "the conversation so far" are the most common scaling regression. They work in dev (one process), pass review, and then sticky-route every run, defeat autoscaling, and lose state on deploy. If it must survive a process, it lives in the journal — not a process-local dict.
Fan-out is a cost multiplier with no natural backpressure.
Multi-agent and parallel-subtask patterns spawn N children per parent. The arithmetic is unforgiving: a top-level agent that fans out to 8 researchers, each making 15 model calls, is 120 calls for one user request — and a recursive planner that fans out at each level is exponential. Fan-out has no built-in backpressure: nothing stops a planner from deciding it needs 200 sub-agents.
- Bound the fan-out factor structurally (max children per node) and the recursion depth — both as hard runtime limits, not prompt requests the model can ignore.
- Charge children against the parent's budget, not a fresh budget each, so fan-out depletes a shared per-task ceiling (see
cost-control-in-the-loop). - Enqueue children as jobs through the same fair dispatcher — so 200 sub-agents queue behind the per-tenant cap instead of all hitting the provider at once.
When a queue is the wrong tool.
Queues add latency (enqueue, lease, poll) and operational surface (dead-letter handling, redelivery semantics, the polling API). A genuinely interactive agent where the human is staring at the screen waiting for a 4-second reply should stream synchronously — a queue there just adds a second of dead air for no durability the user cares about. The architecture in this essay earns its complexity when runs are long, bursty, or fanned-out. Use a queue when work outlives the requester's attention span; stream when it does not — and never block a request thread on a loop that can outlive the load balancer's timeout.