You cannot manage a cost you cannot attribute, and a budget without a circuit breaker is a wish.
An undifferentiated monthly model bill is operationally useless: it tells you that you spent money, not which feature, tenant, or user spent it, and it arrives a month after the spending decision was irreversible. Cost attribution is the discipline of tagging every model and tool call with the dimensions you make decisions on, so spend becomes a per-feature, per-tenant signal — and budgets only protect you if exceeding them does something at runtime, not in a report.
The provider bill is at the wrong granularity to act on.
The invoice is aggregated by API key and month — the two dimensions you can do nothing with. The questions that drive decisions are: which feature is unprofitable, which tenant is subsidized by the others, which user is a cost outlier, which code path regressed after the last deploy. None of these are answerable from the bill. Attribution means carrying the dimensions you decide on through every call, so spend is queryable along feature, tenant, user, and version — the axes of your actual decisions.
Stamp a cost context on every call and propagate it through fan-out.
A cost-context object is created when a task starts and travels with it — including into every sub-agent and tool call. Each model response's usage is recorded against that context. The hard part is propagation: a sub-agent that loses the context attributes its spend to nobody, and fan-out is exactly where the expensive spend hides.
# cost/attribution.py — context travels with the task, into children class CostContext: def __init__(self, tenant, feature, user, version): self.tenant, self.feature = tenant, feature self.user, self.version = user, version def record(self, call): emit("spend", tenant=self.tenant, feature=self.feature, user=self.user, version=self.version, usd=call.cost_usd) def spawn_child(ctx): return ctx # SAME context — unattributed spend is the bug
The classic attribution hole: sub-agents and async tool calls drop the context, so 30–60% of spend lands in an "unknown" bucket — and the unknown bucket is disproportionately the expensive deep work. Untagged spend is not a rounding error; it is usually the part you most needed to see.
Attribute to the dimension you can act on, not the one easy to log.
Logging the model name is easy and nearly useless for decisions. Attribute along the axes that map to a lever: feature (kill or reprice it), tenant (renegotiate or rate-limit), user cohort (detect abuse or a power user to learn from), version (catch a deploy that doubled cost). The test for a useful attribution dimension: a clear bad reading on it triggers a specific action this week. If nothing would happen, the tag is telemetry theater.
A budget is a runtime control or it is a wish.
A monthly budget you compare to actuals in a dashboard does nothing — by the time the dashboard is red the money is gone. A real budget is a counter, scoped per tenant or feature, decremented in the request path, that changes behavior when crossed. The enforcement, not the number, is the budget.
- Per-tenant ceilings so one customer's runaway cannot consume the shared pool and starve everyone else.
- Soft then hard thresholds — at the soft line, degrade (cheaper model, smaller context, disable optional tools); at the hard line, fail closed and alert.
- Enforced in the request path, checked before the expensive call, not reconciled after it.
The circuit breaker stops the bleed; alerting just narrates it.
Alerting tells a human about a cost incident already in progress; by the time anyone reads it, an unbounded loop has billed for an hour. A circuit breaker — trip on spend velocity, on a per-tenant ceiling, on cost-per-task exceeding N× the historical median — halts the spend automatically and degrades or fails the request closed. Alerting and breaking are complementary: the breaker bounds the loss, the alert explains it. A system with only alerting has chosen to pay for every incident in full.
When fine-grained attribution costs more than it saves.
Per-call attribution and a budget check on the hot path add latency, storage, and code surface; for a low-volume internal tool with a flat budget this instrumentation can cost more than the spend it governs. Attribute at the granularity your decisions actually use, and no finer. Always run the per-tenant circuit breaker — it is cheap and bounds catastrophe — but reserve per-user, per-call attribution for products where pricing and margin decisions genuinely depend on it.