The unit that matters is cost per successful task, and the success rate is in the denominator.
Teams instrument cost per token and cost per API call because those are easy to read off the bill. Both are the wrong unit. A token is not a deliverable; a successfully completed task is. The metric that decides whether an agent is a business or a money fire is cost per successful outcome. Because failed attempts still cost full price while delivering nothing, the success rate sits in the denominator, where small changes swing the unit economics violently.
Cost per token hides the only number that matters.
An agent that costs $0.40 per run looks cheap until you learn it succeeds 60% of the time. The customer only pays you for successes, but you paid for every run — so your true cost per delivered outcome is $0.40 / 0.60 = $0.67, and that is before retries. Cost per token tells you nothing about this; it is a measure of effort, not of value delivered. The industry term converging in 2025–2026 is cost per successful task (CPST): total cost of all attempts divided by the count of attempts that actually produced the outcome.
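The definition can be sketched in a few lines of Python. The `Attempt` record and its field names are assumptions made for illustration, not part of any billing API; the numbers reproduce the $0.40 per run, 60% success example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One agent run. Field names are hypothetical."""
    cost_usd: float   # what the run cost, win or lose
    succeeded: bool   # verdict from whatever success check you trust


def cost_per_successful_task(attempts: list[Attempt]) -> float:
    """Total cost of all attempts divided by the number that succeeded."""
    wins = sum(1 for a in attempts if a.succeeded)
    if wins == 0:
        return float("inf")  # money spent, nothing delivered
    return sum(a.cost_usd for a in attempts) / wins


# 100 runs at $0.40 each, 60 of which succeed.
runs = [Attempt(0.40, i < 60) for i in range(100)]
cpst = cost_per_successful_task(runs)
print(f"${cpst:.2f} per successful task")
```

Returning infinity when nothing succeeds is a design choice for this sketch: it keeps a zero-win period from looking like a zero-cost one.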
The success-rate denominator is the dominant lever.
Write the unit out and the leverage is obvious. Raising success rate from 70% to 85% cuts cost per win by ~18% with no change to model price at all — usually a far larger move than any token optimization, and it improves the product simultaneously.
    # cost per successful task: the denominator does the work
    attempts = 1000
    cost_per_run = 0.40      # you pay this every attempt, win or lose
    success_rate = 0.70
    retries_per_win = 1.3    # failed tries before a win still bill
    cost_per_win = (cost_per_run * retries_per_win) / success_rate
    # = 0.52 / 0.70 = $0.74, not the $0.40 on the dashboard
Before optimizing the model, optimize the denominator. A reliability fix that lifts success rate compounds: it lowers cost per win, raises delivered value, and shrinks the retry tax, three wins from one change. Token tuning touches only one term.
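To make the leverage concrete, here is a small sketch comparing the two levers. It holds cost per run at the $0.40 used above and ignores retries for simplicity; the 10% token saving is an assumed figure chosen only for comparison.

```python
def cost_per_win(cost_per_run: float, success_rate: float) -> float:
    """Cost per successful task when every attempt bills in full."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


baseline = cost_per_win(0.40, 0.70)
better_reliability = cost_per_win(0.40, 0.85)     # success rate 70% -> 85%
cheaper_tokens = cost_per_win(0.40 * 0.90, 0.70)  # assumed 10% token saving

reliability_cut = 1 - better_reliability / baseline
token_cut = 1 - cheaper_tokens / baseline

print(f"reliability fix: {reliability_cut:.1%} cheaper per win")
print(f"token tuning:    {token_cut:.1%} cheaper per win")
```

The reliability reduction depends only on the ratio of the two success rates (1 - 0.70/0.85, about 17.6%), so it holds at any model price.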
Define "success" before you measure cost, or you will measure nothing.
CPST is only meaningful if "success" is an objective, automatable verdict the customer would agree with — a resolved ticket, a merged PR, an approved invoice — not "the model returned a fluent response." If success is judged by the same model that did the work, you are measuring confidence, not correctness, and your denominator is inflated by silent failures. The eval that defines success is a prerequisite for the unit economics existing at all; without it you have a cost number with no unit.
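A rough sketch of how a lenient success definition distorts the number. The counts are invented for illustration, and `self_reported` and `verified` are hypothetical labels for a model-graded verdict and an objective check such as a merged PR.

```python
def cpst(total_cost: float, successes: int) -> float:
    """Cost per successful task for a given success count."""
    if successes <= 0:
        return float("inf")
    return total_cost / successes


total_cost = 1000 * 0.40   # 1,000 runs at $0.40 each
self_reported = 900        # runs the agent's own model graded as successful
verified = 700             # runs that passed the objective check

reported_cpst = cpst(total_cost, self_reported)
true_cpst = cpst(total_cost, verified)
understatement = 1 - reported_cpst / true_cpst

print(f"reported ${reported_cpst:.2f}, verified ${true_cpst:.2f}")
print(f"dashboard understates cost per win by {understatement:.0%}")
```

The spend is identical in both cases; only the verdict changes, which is why the success definition has to be settled before the cost figure means anything.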
Margin is price minus fully-loaded cost per win, not minus model cost.
Fully-loaded cost per successful task includes the failed attempts, the retries, the human escalations a fraction of tasks trigger, the eval and observability overhead, and a buffer for the expensive tail. Margin is the price the customer pays minus that, not minus the raw model spend.
- Include the escalation cost. If 8% of tasks fall back to a human at $6 of loaded labor, that adds ~$0.48 (0.08 × $6) to the average cost per task, and more per win once divided by the success rate. That is often larger than the model line.
- Include the eval/observability tax. The judge calls, traces, and monitoring that make the agent safe are real per-task cost, not overhead to ignore.
- Price off the loaded number. A product priced against raw model cost has negative margin the first time the tail or escalation rate moves.
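The items above can be combined into one loaded figure. This is a minimal sketch under stated assumptions: escalation and eval costs are spread across all tasks and then divided by the success rate, the tail buffer is a flat uplift, and the $0.05 eval cost, 15% buffer, and $3.00 price are hypothetical values, not figures from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCosts:
    """Per-task inputs. All values used below are illustrative assumptions."""
    model_cost_per_attempt: float
    attempts_per_task: float   # average attempts, including retries
    success_rate: float        # share of tasks that end in a verified win
    escalation_rate: float     # share of tasks handed to a human
    escalation_cost: float     # loaded labor cost per escalation
    eval_cost_per_task: float  # judge calls, traces, monitoring
    tail_buffer: float         # fractional uplift for the expensive tail


def loaded_cost_per_win(c: UnitCosts) -> float:
    """Average cost per task, uplifted for the tail, divided by success rate."""
    per_task = (
        c.model_cost_per_attempt * c.attempts_per_task
        + c.escalation_rate * c.escalation_cost
        + c.eval_cost_per_task
    )
    return per_task * (1 + c.tail_buffer) / c.success_rate


costs = UnitCosts(0.40, 1.3, 0.70, 0.08, 6.00, 0.05, 0.15)
price = 3.00  # hypothetical price charged per successful task

raw_margin = price - costs.model_cost_per_attempt
loaded_margin = price - loaded_cost_per_win(costs)
print(f"margin vs raw model cost: ${raw_margin:.2f}")
print(f"margin vs loaded cost:    ${loaded_margin:.2f}")
```

With these inputs the escalation term (0.08 × $6.00) is nearly as large as the model term (0.40 × 1.3), which is the pattern the list warns about.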
The distribution, not the average, decides survival.
Agent cost per task is heavily right-skewed: most tasks are cheap, a few burn 50–100× the median through long loops, deep fan-out, and retry storms. The mean can look healthy while the 95th-percentile task destroys the month's margin. Underwrite the unit economics against a p95-cost month, not the average month; a business that is only profitable at the mean loses money exactly when it is busiest.
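A short sketch of why the average hides the tail, using Python's standard `statistics` module. The cost list is synthetic and assumed purely for illustration: 90 cheap tasks plus a 10-task tail at 50–100× the median.

```python
import statistics

# Synthetic per-task costs in USD (assumed, not measured).
cheap = [0.20] * 60 + [0.40] * 30
tail = [10.0, 10.0, 12.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 20.0]
costs = cheap + tail

mean = statistics.fmean(costs)
median = statistics.median(costs)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(costs, n=100, method="inclusive")[94]
tail_share = sum(tail) / sum(costs)

print(f"median ${median:.2f}  mean ${mean:.2f}  p95 ${p95:.2f}")
print(f"the 10 tail tasks account for {tail_share:.0%} of total spend")
```

Tracking the median, the mean, and p95 side by side shows how far a typical task sits from the tasks that drive most of the spend.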
When cost per token is the right metric after all.
Token cost is the correct lens for one job only: comparing two implementations of the same task at the same success rate — there, cheaper tokens are pure margin. The error is using it as the headline business metric. Optimize tokens only after success rate, retry rate, and the tail are instrumented; a cheaper token on a task that fails is a faster way to lose money.