The Latency Budget

The latency budget: where the sub-second turn actually goes.

In text, a slow agent is annoying. In voice, a slow agent is broken — humans read silence as a dropped call, a misunderstanding, or a dumb machine, and they start talking over it. The entire perceived intelligence of a voice agent is gated by one number: the gap between the caller finishing and the agent starting to speak. This essay is an accounting exercise: every millisecond, where it goes, and how to spend it.

STEP 1

The number that matters is response latency, and it is small.

Human turn-taking in natural conversation has a typical gap of roughly 200 ms; people notice and get uncomfortable well before a full second. The working target for a voice agent's response latency — caller stops → first audio out — is sub-second end-to-end, and the best native stacks push the model's contribution toward a few hundred milliseconds.

Be precise about what you are measuring. Response latency runs from end-of-user-speech to first-audio-out. It is not time-to-full-response, and it is not model time-to-first-token (TTFT) in isolation: it includes endpointing delay, which is often the single biggest slice and the one teams forget to count.
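
To make the definition concrete, here is a minimal sketch of the measurement, assuming you can stamp four moments per turn with a monotonic clock. The class and field names are hypothetical, not taken from any particular stack.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    """Monotonic timestamps in seconds for one turn (hypothetical field names)."""
    user_speech_end: float     # last audio frame that contained caller speech
    endpoint_fired: float      # moment the endpointer declared the turn over
    model_first_output: float  # first token/frame out of the model
    first_audio_out: float     # first agent audio frame sent toward the caller


def response_latency_ms(t: TurnTimestamps) -> float:
    """End-of-user-speech to first-audio-out: the number being budgeted."""
    return (t.first_audio_out - t.user_speech_end) * 1000.0


def model_only_ms(t: TurnTimestamps) -> float:
    """Model TTFT in isolation. It starts at the endpoint, so it hides endpointing delay."""
    return (t.model_first_output - t.endpoint_fired) * 1000.0


# Illustrative timestamps, not measurements.
turn = TurnTimestamps(
    user_speech_end=10.00,
    endpoint_fired=10.50,
    model_first_output=10.80,
    first_audio_out=10.95,
)
print(round(response_latency_ms(turn)))  # 950
print(round(model_only_ms(turn)))        # 300
```

In this made-up turn the model looks fast at 300 ms, while the caller waited 950 ms. The gap between the two numbers is exactly what a TTFT-only dashboard would miss.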

STEP 2

The budget, line by line.

The turn is a sum. You cannot improve a total you have not decomposed, so decompose it.

# response_latency = sum of these, ms
network_in        # mic/carrier → your edge
endpoint_detect   # "is the user done?" — often the biggest
stt_finalize      # cascade only; 0 for native S2S
model_ttft        # prompt → first output token/frame
tts_first_audio   # cascade only; first synthesized chunk
network_out       # your edge → speaker/carrier
playout_buffer    # jitter buffer before audio plays

For a cascade, every line is live. For native speech-to-speech, stt_finalize and tts_first_audio collapse into model_ttft. The structural reason native stacks are faster is not a better model but two fewer serial hops.
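
The sum can be written down directly. The sketch below reuses the line names from the budget listing; every number in it is illustrative and chosen only to show the arithmetic, not a measurement of any real stack.

```python
from dataclasses import dataclass, fields


@dataclass
class LatencyBudget:
    """One turn's budget in milliseconds, one field per budget line."""
    network_in: float
    endpoint_detect: float
    stt_finalize: float      # 0 for native speech-to-speech
    model_ttft: float
    tts_first_audio: float   # 0 for native speech-to-speech
    network_out: float
    playout_buffer: float

    def total(self) -> float:
        """Response latency is the plain sum of the serial stages."""
        return sum(getattr(self, f.name) for f in fields(self))

    def biggest(self) -> str:
        """Name of the largest line: the first place to look for savings."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Hypothetical numbers for illustration only.
cascade = LatencyBudget(40, 300, 100, 250, 120, 40, 50)
native = LatencyBudget(40, 300, 0, 280, 0, 40, 50)

print(cascade.total(), cascade.biggest())  # 900 endpoint_detect
print(native.total(), native.biggest())    # 710 endpoint_detect
```

Note that in this example the native stack's model_ttft is slightly larger than the cascade's, and the native total is still lower, because two serial lines dropped to zero.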

STEP 3

Endpointing is latency you are choosing to spend.

The agent cannot start until it believes the caller has finished. Naive silence detection waits a fixed timeout (say 700 ms of quiet) — that timeout is added directly to every single turn's latency, whether the caller paused mid-thought or actually finished.

This is why semantic endpointing matters for latency, not just for correctness: a model that predicts "this utterance is complete" from the words can fire in ~100–300 ms instead of waiting out a worst-case silence timer. You are trading a fixed pessimistic delay for a variable, usually-shorter one. See turn-taking-and-barge-in for the correctness side; here it is pure budget.
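
One common way to combine the two signals is a hybrid rule: fire early when a completion model is confident, and keep the silence timer as a fallback. The sketch below assumes a hypothetical classifier that returns p_complete, the probability that the utterance is finished; the thresholds are placeholder values, not recommendations.

```python
def should_end_turn(
    silence_ms: float,
    p_complete: float,
    fast_silence_ms: float = 200.0,      # assumed short wait when the model is confident
    fallback_silence_ms: float = 700.0,  # the pessimistic fixed timeout from the text
    threshold: float = 0.8,              # assumed confidence cutoff
) -> bool:
    """Decide whether the caller's turn is over.

    p_complete is assumed to come from a semantic endpointing model
    that scores the transcript so far; it is not computed here.
    """
    # Fallback: enough silence ends the turn no matter what the model thinks.
    if silence_ms >= fallback_silence_ms:
        return True
    # Fast path: a confident "complete" prediction plus a short confirming pause.
    return p_complete >= threshold and silence_ms >= fast_silence_ms


print(should_end_turn(silence_ms=250, p_complete=0.9))  # True: fires after 250 ms
print(should_end_turn(silence_ms=250, p_complete=0.3))  # False: keeps listening
print(should_end_turn(silence_ms=700, p_complete=0.3))  # True: fallback timer
```

The fast path is where the latency is recovered: a confident prediction spends about 200 ms of silence instead of 700 ms, while an uncertain one still gets the full timer.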

Tuning endpointing down to shave latency directly increases the rate at which you cut the caller off mid-sentence. This dial trades response latency against interruption rate. It has no free setting — measure both, pick the operating point per use case, and never tune one blind to the other.
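
A minimal way to see both sides of the dial at once is to sweep the timeout over recorded mid-utterance pauses. The sketch assumes a fixed silence timeout and a list of pause durations taken from labeled calls; the sample data here is invented for illustration.

```python
def interruption_rate(mid_turn_pauses_ms, timeout_ms):
    """Fraction of mid-utterance pauses long enough to trigger a false endpoint."""
    if not mid_turn_pauses_ms:
        return 0.0
    cut_off = sum(1 for pause in mid_turn_pauses_ms if pause >= timeout_ms)
    return cut_off / len(mid_turn_pauses_ms)


def sweep(mid_turn_pauses_ms, timeouts_ms):
    """Pair each candidate timeout with the interruption rate it would cause."""
    return [(t, interruption_rate(mid_turn_pauses_ms, t)) for t in timeouts_ms]


# Hypothetical pauses (ms) where the caller had NOT finished speaking.
pauses = [120, 180, 250, 320, 400, 450, 600, 650, 800, 1100]

for timeout, rate in sweep(pauses, [300, 500, 700, 900]):
    # With a fixed timer, the timeout itself is the latency added to every turn.
    print(f"timeout={timeout} ms  added_latency={timeout} ms  interruption_rate={rate:.0%}")
```

On this toy data, dropping the timeout from 700 ms to 300 ms saves 400 ms per turn and raises the interruption rate from 20% to 70%. Your own curve will differ, which is the reason to plot it before picking an operating point.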

STEP 4

Streaming and partials hide latency you cannot remove.

Some latency is irreducible. The trick is not always to remove it but to cover it so the caller never experiences silence.

  • Stream output. Start TTS / emit audio on the first tokens, not the full response. First-audio-out is the metric; total length is not.
  • Stream input. Run STT and begin reasoning on partial transcripts so the model is already "warm" when the caller stops.
  • Acknowledge fast. A 150 ms "mm-hm" or "let me check that" is perceived as responsive even when the real answer is two seconds out. Perceived latency, not wall-clock, is the user's reality.
  • Speculate. Begin drafting a response on the partial; if the final transcript matches, you have already paid the model cost.
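
The speculation bullet can be sketched as a small cache keyed on the normalized transcript. This is an illustration under assumptions: generate stands in for whatever produces a reply, the class name is hypothetical, and a real system would run the draft in the background rather than inline as done here.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, so cosmetic STT revisions still match."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split())


class SpeculativeDraft:
    """Drafts a reply on each new partial and reuses it if the final transcript matches."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical callable: transcript -> reply
        self._key = None
        self._draft = None

    def on_partial(self, partial: str) -> None:
        key = normalize(partial)
        if key != self._key:  # only redraft when the words actually changed
            self._key = key
            self._draft = self._generate(partial)

    def on_final(self, final: str):
        """Return (reply, hit). A hit means the model cost was already paid."""
        if normalize(final) == self._key:
            return self._draft, True
        return self._generate(final), False


calls = []


def fake_generate(transcript: str) -> str:
    calls.append(transcript)
    return f"reply to: {transcript}"


spec = SpeculativeDraft(fake_generate)
spec.on_partial("what's my")
spec.on_partial("what's my balance")
reply, hit = spec.on_final("What's my balance?")
print(hit, len(calls))  # True 2
```

The cost of speculating is wasted generation on partials that get revised, so this trades compute for latency rather than removing work.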

STEP 5

The tool-call cliff.

The budget above describes a turn with no tool call. The moment the agent must hit a database or an API, the sub-second budget is gone: a 400 ms API call alone is roughly twice the natural conversational gap, and it lands on top of everything the no-tool turn already costs. Voice makes tool latency a first-class conversational problem, not a backend detail.

no-tool turn:   ~500–900 ms   acceptable
tool turn:      ~500 ms + API + 2nd model pass
                = often 2–4 s   → DEAD AIR unless covered

The fix is conversational, not just engineering: speak before and during the tool call ("one sec, pulling that up…"), run independent tool calls in parallel, and never let the audio channel go silent while a backend works. This is developed fully in voice-tooling-and-state.
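
Here is a minimal asyncio sketch of that pattern: start the independent tool calls together, speak a filler line while they run, then answer. The speak and fake_tool functions are hypothetical stand-ins for real TTS playout and real backends, and the delays are simulated.

```python
import asyncio


async def speak(text: str, spoken: list) -> None:
    """Stand-in for streaming TTS playout; records what the caller would hear."""
    spoken.append(text)


async def fake_tool(name: str, delay_s: float) -> str:
    """Hypothetical backend call, simulated with a sleep."""
    await asyncio.sleep(delay_s)
    return f"{name}-result"


async def tool_turn(spoken: list) -> list:
    # Start both independent tool calls at once; gather schedules them concurrently.
    pending = asyncio.gather(
        fake_tool("account", 0.05),
        fake_tool("orders", 0.08),
    )
    # Cover the wait: the filler goes out before any backend has answered.
    await speak("One sec, pulling that up.", spoken)
    results = await pending  # results come back in the order the calls were passed
    await speak(f"Got it: {', '.join(results)}", spoken)
    return results


spoken = []
results = asyncio.run(tool_turn(spoken))
print(spoken[0])  # One sec, pulling that up.
print(results)    # ['account-result', 'orders-result']
```

Run concurrently, the two calls cost roughly the slower one's delay rather than the sum, and the caller hears speech for the whole wait.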

STEP 6

The honest tradeoff.

Every millisecond you reclaim from endpointing or model size is paid for with a higher interruption rate or a dumber answer. There is no latency win that is free, only a latency cost you have decided is worth it.