The latency budget: where the sub-second turn actually goes.
In text, a slow agent is annoying. In voice, a slow agent is broken — humans read silence as a dropped call, a misunderstanding, or a dumb machine, and they start talking over it. The entire perceived intelligence of a voice agent is gated by one number: the gap between the caller finishing and the agent starting to speak. This essay is an accounting exercise: every millisecond, where it goes, and how to spend it.
The number that matters is response latency, and it is small.
Human turn-taking in natural conversation has a typical gap of roughly 200 ms; people notice and get uncomfortable well before a full second. The working target for a voice agent's response latency — caller stops → first audio out — is sub-second end-to-end, and the best native stacks push the model's contribution toward a few hundred milliseconds.
Be precise about what you are measuring. Response latency runs from end-of-user-speech to first-audio-out. It is not time-to-full-response, and it is not the model's time-to-first-token (TTFT) in isolation: it includes endpointing delay, which is often the single biggest slice and the one teams forget to count.
The budget, line by line.
The turn is a sum. You cannot improve a total you have not decomposed, so decompose it.
    # response_latency = sum of these, ms
    network_in        # mic/carrier → your edge
    endpoint_detect   # "is the user done?", often the biggest
    stt_finalize      # cascade only; 0 for native S2S
    model_ttft        # prompt → first output token/frame
    tts_first_audio   # cascade only; first synthesized chunk
    network_out       # your edge → speaker/carrier
    playout_buffer    # jitter buffer before audio plays
For a cascade, every line is live. For native speech-to-speech, stt_finalize and tts_first_audio collapse into model_ttft — the structural reason native stacks are faster is not a better model, it is two fewer serial hops.
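As a rough illustration of the decomposition, the sketch below sums the same seven lines for a cascade and a native stack. Every millisecond value is a made-up placeholder rather than a measurement, and the `TurnBudget` class is a hypothetical helper introduced only for this example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TurnBudget:
    """Per-turn latency components in ms. All values used below are illustrative."""
    network_in: int
    endpoint_detect: int
    stt_finalize: int
    model_ttft: int
    tts_first_audio: int
    network_out: int
    playout_buffer: int

    def total(self) -> int:
        # Response latency is the plain sum of the serial stages.
        return sum(getattr(self, f.name) for f in fields(self))

    def biggest(self) -> str:
        # Name of the largest slice, i.e. where to look first.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Hypothetical numbers: a cascade pays for STT and TTS as separate hops.
cascade = TurnBudget(network_in=40, endpoint_detect=300, stt_finalize=100,
                     model_ttft=250, tts_first_audio=120,
                     network_out=40, playout_buffer=50)

# Hypothetical numbers: native S2S folds both hops into model_ttft.
native = TurnBudget(network_in=40, endpoint_detect=300, stt_finalize=0,
                    model_ttft=280, tts_first_audio=0,
                    network_out=40, playout_buffer=50)

print(cascade.total(), native.total())  # 900 710
print(cascade.biggest())                # endpoint_detect
```

With these placeholder numbers the native stack is faster even though its model_ttft is larger, because the two serial hops are gone. Endpointing is still the biggest slice in both.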
Endpointing is latency you are choosing to spend.
The agent cannot start until it believes the caller has finished. Naive silence detection waits a fixed timeout (say 700 ms of quiet) — that timeout is added directly to every single turn's latency, whether the caller paused mid-thought or actually finished.
This is why semantic endpointing matters for latency, not just for correctness: a model that predicts "this utterance is complete" from the words can fire in ~100–300 ms instead of waiting out a worst-case silence timer. You are trading a fixed pessimistic delay for a variable, usually-shorter one. See turn-taking-and-barge-in for the correctness side; here it is pure budget.
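One way to picture the trade is a hybrid rule: fire early when a completion score is high, and fall back to the fixed timer otherwise. The sketch below is a minimal version under stated assumptions. The `p_complete` score is assumed to come from some utterance-completion classifier that is not shown here, and the thresholds are illustrative defaults, not recommendations.

```python
def should_end_turn(silence_ms: float,
                    p_complete: float,
                    min_silence_ms: float = 150.0,
                    max_silence_ms: float = 700.0,
                    threshold: float = 0.8) -> bool:
    """Decide whether to treat the caller's turn as finished.

    silence_ms: trailing silence observed so far.
    p_complete: hypothetical score in [0, 1] that the words so far
                form a complete utterance (classifier not shown).
    """
    if silence_ms >= max_silence_ms:
        # Fallback: the pessimistic fixed timer always fires eventually.
        return True
    if silence_ms < min_silence_ms:
        # Too early to tell a pause from a breath; never fire here.
        return False
    # In between, let the semantic signal decide.
    return p_complete >= threshold


print(should_end_turn(200, 0.9))  # True: confident completion, fire early
print(should_end_turn(200, 0.3))  # False: likely mid-thought, keep waiting
print(should_end_turn(700, 0.0))  # True: timer fallback
```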
Tuning endpointing down to shave latency directly increases the rate at which you cut the caller off mid-sentence. This dial trades response latency against interruption rate. It has no free setting — measure both, pick the operating point per use case, and never tune one blind to the other.
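Measuring both sides of the dial can start as a simple sweep. The sketch below assumes a fixed silence timer and a list of mid-utterance pause lengths taken from labeled calls. The pause values are invented for illustration, and counting per pause rather than per utterance is a simplification.

```python
def sweep(timeouts_ms, mid_utterance_pauses_ms):
    """Return (timeout_ms, added_latency_ms, cutoff_rate) per candidate timeout.

    A fixed timer adds its full timeout to every turn, and it cuts the
    caller off whenever a mid-utterance pause lasts at least that long.
    """
    n = len(mid_utterance_pauses_ms)
    rows = []
    for t in timeouts_ms:
        cut = sum(1 for p in mid_utterance_pauses_ms if p >= t)
        rows.append((t, t, cut / n))
    return rows


# Made-up pause lengths (ms) where the caller had NOT finished speaking.
pauses = [120, 180, 250, 300, 420, 510, 650, 900]

for timeout, latency, cutoff in sweep([200, 400, 700], pauses):
    print(f"timeout={timeout} ms  added_latency={latency} ms  cutoff_rate={cutoff:.3f}")
# timeout=200 ms  added_latency=200 ms  cutoff_rate=0.750
# timeout=400 ms  added_latency=400 ms  cutoff_rate=0.500
# timeout=700 ms  added_latency=700 ms  cutoff_rate=0.125
```

Reading the two columns side by side is the point: each row is one operating point, and none of them is good on both axes.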
Streaming and partials hide latency you cannot remove.
Some latency is irreducible. The trick is not always to remove it but to cover it so the caller never experiences silence.
- Stream output. Start TTS / emit audio on the first tokens, not the full response. First-audio-out is the metric; total length is not.
- Stream input. Run STT and begin reasoning on partial transcripts so the model is already "warm" when the caller stops.
- Acknowledge fast. A 150 ms "mm-hm" or "let me check that" is perceived as responsive even when the real answer is two seconds out. Perceived latency, not wall-clock, is the user's reality.
- Speculate. Begin drafting a response on the partial; if the final transcript matches, you have already paid the model cost.
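The speculation item can be sketched with asyncio: start drafting on the partial, then keep the draft only if the final transcript matches. This is a minimal sketch, not a production design. The `generate` callable and the `final_transcript` awaitable are hypothetical stand-ins for a model call and an STT finalization, and exact match after normalization is a deliberately strict reuse rule.

```python
import asyncio
from typing import Any, Awaitable, Callable, Coroutine


def normalize(text: str) -> str:
    """Case- and whitespace-insensitive comparison key."""
    return " ".join(text.lower().split())


async def speculative_reply(
    partial: str,
    final_transcript: Awaitable[str],
    generate: Callable[[str], Coroutine[Any, Any, str]],
) -> str:
    # Start drafting on the partial while the caller is still finishing.
    draft = asyncio.create_task(generate(partial))

    final = await final_transcript
    if normalize(final) == normalize(partial):
        # Speculation paid off: the model cost overlapped with the wait.
        return await draft

    # Speculation missed: discard the draft and pay the model cost now.
    draft.cancel()
    return await generate(final)
```

A miss costs one wasted model call, so the hit rate on partials decides whether this is worth running.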
The tool-call cliff.
The budget above describes a turn with no tool call. The moment the agent must hit a database or an API, you have blown straight through the sub-second budget: a 400 ms API call alone is about twice the natural ~200 ms conversational gap, and that is before the second model pass that turns the result into speech. Voice makes tool latency a first-class conversational problem, not a backend detail.
    no-tool turn:  ~500-900 ms  (acceptable)
    tool turn:     ~500 ms + API + 2nd model pass
                   = often 2-4 s → DEAD AIR unless covered
The fix is conversational, not just engineering: speak before and during the tool call ("one sec, pulling that up…"), run independent tool calls in parallel, and never let the audio channel go silent while a backend works. This is developed fully in voice-tooling-and-state.
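A minimal sketch of that pattern with asyncio: the filler line starts immediately, and the independent tool calls run concurrently instead of back to back. The `speak` callable and the entries in `tool_calls` are hypothetical placeholders for a TTS sink and backend requests.

```python
import asyncio


async def run_tool_turn(speak, tool_calls, filler="One sec, pulling that up."):
    """Cover tool latency with speech and run independent tools in parallel.

    speak: async callable that sends text to the audio channel (assumed).
    tool_calls: zero-argument async callables with no dependencies between them.
    """
    # Start talking first so the audio channel is never silent.
    filler_task = asyncio.create_task(speak(filler))

    # Independent calls overlap; the wait is the slowest call, not the sum.
    results = await asyncio.gather(*(call() for call in tool_calls))

    # Make sure the filler has been handed off before the real answer follows.
    await filler_task
    return results
```

`asyncio.gather` returns results in the order the calls were passed, regardless of which finishes first, so the second model pass can rely on positions.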
The honest tradeoff.
Every millisecond you reclaim from endpointing or model size is paid for with a higher interruption rate or a dumber answer. No latency win is free; there is only a cost you have decided is worth paying.