Concurrency, Queues & Scaling

Operation · AgentOps: Deploy & Operate

Concurrency, queues and scaling: agents are jobs, not requests.

An HTTP handler that holds a connection open for a 40-minute agent loop is a design error you will discover under load. Agents are long-running, bursty, stateful, and expensive per unit of work — that profile is a batch job, and the operational shape that survives production is a queue with workers, per-tenant limits, and fan-out you can afford. This essay is about that shape.

STEP 1

The request/response framing breaks at the first concurrency spike.

Synchronous "POST /agent, await result" couples three things that have wildly different time constants: the client's patience (seconds), the agent's runtime (minutes), and your capacity (fixed). Under a burst, every slow agent pins a worker thread and a socket; the pool exhausts; healthy short requests queue behind 30-minute ones; load balancers time out and clients retry, multiplying the very load that is failing you. The fix is not a bigger thread pool; it is decoupling acceptance from execution.

The durable submission pattern: the API call enqueues a run and returns a run_id immediately. Workers pull from the queue and execute against the durable journal (see durable-state-and-resumability). The client polls or subscribes for completion. Acceptance is O(milliseconds) and never blocked by execution depth.

STEP 2

Queue + workers is the reference architecture.

# api: accept fast, never block on the loop
def submit(req):
    run_id = new_id()
    journal.record(run_id, 0, "PLAN", req)
    queue.enqueue("agent-runs", run_id,
                  tenant=req.tenant, priority=req.tier)
    return {"run_id": run_id, "status": "queued"}

# worker: bounded concurrency, lease + heartbeat
async def worker(slot):
    while True:
        job = await queue.lease("agent-runs", ttl=120)
        async with heartbeat(job):       # renew lease while alive
            await run_loop(job.run_id)   # resumable; safe to redeliver
        queue.ack(job)

The non-negotiable properties: leases with heartbeats (a dead worker's job becomes redeliverable, not lost — and because the run is resumable, redelivery is correct, not a duplicate), bounded worker concurrency (a fixed number of in-flight loops per worker, sized to provider rate limits and memory, not "as many as arrive"), and visibility (queue depth and oldest-message-age are your true load signal — not CPU).

Scale workers off queue depth and age, not CPU. An agent worker is mostly blocked on model and tool latency; CPU stays low while latency explodes. Autoscaling on CPU will under-provision exactly when you are drowning. Alert on oldest-message-age crossing your SLA.

STEP 3

Per-tenant concurrency limits, or one customer is everyone's outage.

A single global queue is a noisy-neighbor incident waiting to happen: one tenant submitting 5,000 runs starves every other tenant and burns the shared provider rate limit, turning their isolated burst into your platform-wide latency spike. You need fairness, enforced as a per-tenant in-flight cap, not just a global one.

# fair dispatch: cap concurrent runs per tenant
def dispatchable(job):
    inflight = counter.get("inflight", job.tenant)
    cap = plan_limit(job.tenant)        # e.g. free=2, pro=20
    if inflight >= cap:
        queue.requeue(job, delay=backoff)  # yield, don't drop
        return False
    counter.incr("inflight", job.tenant)
    return True

Implement this as work-stealing across per-tenant sub-queues, or a global queue with an admission gate as above. The cap is a product decision (plan tiers) and a safety device: it bounds the blast radius of one tenant's runaway loop (see incident-response-for-agents).

STEP 4

Statefulness is the real scaling tax — pay it with the journal.

Horizontal scaling is trivial for stateless services and hard for agents because an agent loop has state: plan, history, scratchpad. The trap is keeping that state in worker memory, which sticks a run to a pod and makes every restart a data loss. The discipline that buys horizontal scale: workers are stateless; the journal is the state. Any worker can pick up any run by replaying its journal. That single decision converts agents from sticky, un-rebalanceable workloads into fungible jobs you can autoscale, drain, and bin-pack like anything else.

In-memory caches for "the conversation so far" are the most common scaling regression. They work in dev (one process), pass review, and then sticky-route every run, defeat autoscaling, and lose state on deploy. If it must survive a process, it lives in the journal — not a process-local dict.

STEP 5

Fan-out is a cost multiplier with no natural backpressure.

Multi-agent and parallel-subtask patterns spawn N children per parent. The arithmetic is unforgiving: a top-level agent that fans out to 8 researchers, each making 15 model calls, is 120 calls for one user request — and a recursive planner that fans out at each level is exponential. Fan-out has no built-in backpressure: nothing stops a planner from deciding it needs 200 sub-agents.

Bound the fan-out factor structurally (max children per node) and the recursion depth — both as hard runtime limits, not prompt requests the model can ignore.
Charge children against the parent's budget, not a fresh budget each, so fan-out depletes a shared per-task ceiling (see cost-control-in-the-loop).
Enqueue children as jobs through the same fair dispatcher — so 200 sub-agents queue behind the per-tenant cap instead of all hitting the provider at once.

STEP 6

When a queue is the wrong tool.

Queues add latency (enqueue, lease, poll) and operational surface (dead-letter handling, redelivery semantics, the polling API). A genuinely interactive agent where the human is staring at the screen waiting for a 4-second reply should stream synchronously — a queue there just adds a second of dead air for no durability the user cares about. The architecture in this essay earns its complexity when runs are long, bursty, or fanned-out. Use a queue when work outlives the requester's attention span; stream when it does not — and never block a request thread on a loop that can outlive the load balancer's timeout.