Cost control at the loop level: the per-task ceiling is a circuit breaker.
A traditional service has a roughly fixed cost per request. An agent does not: the same input can cost $0.02 or $40 depending on how many loop iterations, sub-agents, and retries it spawns. Cost is not a billing concern you reconcile monthly — it is a runtime safety property you enforce inside the loop, because an unbounded loop is simultaneously a financial incident and a runaway. This essay treats the per-task cost ceiling as a first-class circuit breaker.
The loop is an unbounded cost amplifier by default.
Three structural properties make agent cost unbounded unless you bound it: the loop iterates an undetermined number of times, the context grows every turn (so each turn is more expensive than the last — cost is super-linear in steps), and fan-out multiplies all of that. A planner stuck in a "reflect → re-plan → reflect" cycle does not crash; it quietly bills you every iteration until someone notices the graph. The default behavior of an agent without a budget is to spend without limit.
So budget is not an optimization to do later. It is the same control as a timeout in any other system: a hard bound on resource consumption that fails the operation closed rather than letting it consume without limit.
Budget the task in tokens, steps, and dollars — and trip the breaker.
A per-task budget object travels with the run and is decremented by every model and tool call. When any dimension is exhausted the loop does not continue "just one more step" — it trips: stop, return partial work, escalate.
# cost/budget.py — one budget per task, checked every step class Budget: def __init__(self, usd, steps, tokens): self.usd, self.steps, self.tokens = usd, steps, tokens def charge(self, call): self.usd -= call.cost_usd self.tokens -= call.total_tokens self.steps -= 1 if self.usd <= 0 or self.steps <= 0 or self.tokens <= 0: raise BudgetExceeded(self.snapshot()) # trip # children share the PARENT budget — fan-out can't escape it def spawn_child(parent_budget): return parent_budget # same object, not a fresh one
The single most common cost incident: each sub-agent gets a fresh budget. Then an 8-wide, 3-deep tree multiplies your intended ceiling by ~24 and the "limit" limited nothing. Children must draw down the parent's budget — the ceiling is per task, not per agent.
Model tiering and cascade: do not pay frontier price for easy steps.
Most loop iterations are mechanical: route this, extract that, decide whether to continue. Spending top-tier model price on every one is the largest avoidable line item. A cascade routes each call to the cheapest model that can do it, escalating only on low confidence or hard sub-tasks:
- Tier the work, not the agent. Routing, classification, and extraction go to a small model; planning and synthesis go to the frontier model. The decision is per-call, driven by step type.
- Cascade with a confidence gate. Try the cheap model; if its self-rated confidence or a cheap validator is below threshold, escalate. You pay frontier price only on the fraction that needs it.
- Measure the realized blended cost, not the headline per-token price. A cascade that escalates 80% of the time is a frontier model with extra latency — instrument the escalation rate.
Caching is the highest-ROI lever — prompt and tool results both.
Agents are unusually cacheable because they re-send a large, stable prefix (system prompt, tool schemas, retrieved corpus) every single turn, and they re-issue identical read tool calls across retries and sub-agents.
- Prompt prefix caching. Order the context stable-first (system → tools → durable context → volatile turns) so the cacheable prefix is maximal. On a 30-turn loop this is often a 5–10× cost reduction on input tokens for near-zero effort — the single biggest win in this essay.
- Tool-result caching. Deterministic read tools (a doc fetch, a price lookup) keyed by canonical args, with a TTL matched to data volatility. Cuts both cost and latency, and de-noises retries.
- Semantic/result cache for whole sub-tasks. If two runs ask the same sub-question, the second can replay the first's answer. Highest savings, but validate staleness — a wrong cached answer is more expensive than the call you saved.
Early-exit: the cheapest token is the one you do not generate.
Loops over-run. The model keeps "improving" an answer that was done three turns ago, or re-verifies something already verified. Early-exit conditions end the loop as soon as the task is actually complete, rather than when the step budget runs out:
- Goal-satisfied check — a cheap, explicit "is the original task now answered?" test. If yes, exit; do not let the model gild it.
- No-progress detection — if the last K steps did not change state or reduce open loops, the agent is spinning; stop and escalate rather than spend the rest of the budget confirming it is stuck.
- Diminishing-returns gate — if successive iterations improve a measurable quality signal by less than ε, the marginal tokens are not worth it.
No-progress detection is also a runaway detector. The loop that "isn't improving" and the loop that "is burning money in a cycle" are the same loop viewed through cost vs. safety lenses — one mechanism, two incident classes prevented (see incident-response-for-agents).
When aggressive cost control quietly degrades the product.
Every lever here trades quality for money, and over-tuned they fail the user invisibly: a cascade that under-escalates ships subtly worse answers; an over-eager early-exit returns half-finished work that looks complete; an over-long cache TTL serves stale facts confidently. Cost control without a quality eval in the loop is just a slower path to a worse product nobody flagged. Set the per-task ceiling as a hard fail-closed circuit breaker — non-negotiable — but tune cascades, caches, and early-exit against a quality metric, never against the invoice alone.