Voice Agent Failure Modes

Playbook · Voice & Realtime Agents

Voice agent failure modes: how they break, and what good degradation sounds like.

Text agents fail on a screen the user can re-read and a button they can re-click. Voice agents fail in real time, to a human who cannot scroll back, often on a phone, frequently when something has already gone wrong in their day. The failures are specific, recurrent, and recognizable, and the difference between a usable voice product and an infuriating one is almost entirely how it behaves when it is failing, not when it is working.

STEP 1

Hallucinated hearing: confidently answering the wrong question.

This is the compounding failure of any cascaded pipeline. STT mishears, the LLM never sees the uncertainty, and the agent confidently answers a question the caller did not ask, then acts on it. The caller hears a fluent, assured response to the wrong thing and concludes the agent is either not listening or not intelligent.

The defense is not a better model; it is structural humility. Read back high-stakes slots (speech-stack), confirm before mutating (voice-tooling-and-state), and when STT confidence is low or the answer hinges on one slot, ask rather than assume. An agent that says "did you say four-four-seven?" beats one that ships to the wrong address with total confidence.
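A minimal sketch of the "ask rather than assume" rule, in Python. The `Slot` type, the `decide` and `prompt_for` helpers, and both thresholds are hypothetical illustrations, not part of any particular STT or agent framework; it also assumes the STT engine reports a per-slot confidence between 0 and 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlotAction(Enum):
    ACCEPT = "accept"        # use the value without confirmation
    READ_BACK = "read_back"  # "did you say ...?"
    ASK_AGAIN = "ask_again"  # too uncertain to even read back


@dataclass
class Slot:
    name: str
    value: str
    stt_confidence: float  # assumed: 0.0-1.0 score from the STT engine
    high_stakes: bool      # e.g. an address, an amount, an account number


# Hypothetical thresholds; tune them against real call data.
ASK_AGAIN_BELOW = 0.5
READ_BACK_BELOW = 0.85


def decide(slot: Slot) -> SlotAction:
    """Choose how much confirmation a heard slot needs before acting on it."""
    if slot.stt_confidence < ASK_AGAIN_BELOW:
        return SlotAction.ASK_AGAIN
    if slot.high_stakes or slot.stt_confidence < READ_BACK_BELOW:
        return SlotAction.READ_BACK
    return SlotAction.ACCEPT


def prompt_for(slot: Slot) -> Optional[str]:
    """Return the confirmation prompt to speak, or None if none is needed."""
    action = decide(slot)
    if action is SlotAction.ASK_AGAIN:
        return f"Could you repeat the {slot.name}?"
    if action is SlotAction.READ_BACK:
        return f"Did you say {slot.value}?"
    return None
```

In this sketch a high-stakes slot is always read back, even when STT confidence is high, because the cost of acting on a misheard value outweighs one extra turn.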

STEP 2

Dead air: the silence that reads as a dropped call.

The most common and most lethal voice failure. The backend is working, the model is thinking, a tool is in flight — and the line is silent. The caller's mental model has no "loading spinner"; silence means the call dropped, so they say "hello? … hello?", start over, or hang up. Every silent gap over roughly a second is a dropped-call risk.

FAILURE                CALLER'S READ      CALLER'S REACTION
2s silent tool call    "call dropped"     "hello?"
8s silent backend      "it's broken"      hangs up
silent after barge-in  "did it hear me?"  repeats louder
RULE: the channel is never silent > ~1s. ever.

The fix is the verbal cover discipline from voice-tooling-and-state, applied as a hard invariant: there is no code path that can leave the audio channel silent for longer than a second without a spoken acknowledgement.
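One way to make that invariant structural is to route every slow operation through a single wrapper. The sketch below uses Python's `asyncio`; `with_verbal_cover`, the `speak` callback, and the filler phrases are hypothetical names, and it assumes `speak` returns once the audio has been queued for playback.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")

MAX_SILENCE_S = 1.0  # the ~1s limit from the rule above


async def with_verbal_cover(
    work: Awaitable[T],
    speak: Callable[[str], Awaitable[None]],
    fillers: List[str],
    max_silence_s: float = MAX_SILENCE_S,
) -> T:
    """Await `work`, speaking a filler each time it outlasts the silence limit."""
    task = asyncio.ensure_future(work)
    spoken = 0
    while True:
        # Wait for the work, but only for as long as silence is tolerable.
        done, _pending = await asyncio.wait({task}, timeout=max_silence_s)
        if done:
            return task.result()  # re-raises if the work failed
        # Still running: acknowledge out loud, reusing the last filler if needed.
        await speak(fillers[min(spoken, len(fillers) - 1)])
        spoken += 1
```

A call site might look like `await with_verbal_cover(lookup_order(order_id), speak, ["One moment.", "Still checking."])`, where `lookup_order` stands in for any slow tool call. This simplified version does not account for how long the filler itself takes to play, which a production implementation would need to measure.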

STEP 3

The infinite apology loop.

A distinctive voice pathology: the agent misunderstands, apologizes, the caller rephrases (often louder and angrier), the agent misunderstands the now-distorted input, apologizes again. Each turn degrades the audio (frustrated speech is harder for STT) and the caller's patience simultaneously. Politeness without progress is a failure mode, not a mitigation.

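# Detect the loop; escalate, do not apologize again.
# Check the higher threshold first so that a third failure hands off
# instead of also offering the constrained choice one more time.
if consecutive_failed_turns >= 3:
    handoff_to_human(reason="repeated_nlu_failure")
elif consecutive_failed_turns >= 2:
    # Stop apologizing. Change strategy.
    offer_constrained_choice()     # e.g. "press 1 for billing"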

Track consecutive non-progressing turns explicitly. After two, change strategy (offer a constrained choice, narrow the question, switch to DTMF). After three, escalate. The agent's job at that point is not to keep trying — it is to get the caller to someone or something that can actually help.
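Tracking the counter explicitly could look like the following sketch. `ProgressTracker` and its return labels are hypothetical; what counts as "progress" (a slot filled, an intent resolved) is left to the caller of `record_turn`.

```python
from dataclasses import dataclass


@dataclass
class ProgressTracker:
    """Counts consecutive turns that made no progress toward the caller's goal."""
    change_strategy_at: int = 2  # thresholds taken from the text above
    escalate_at: int = 3
    consecutive_failed_turns: int = 0

    def record_turn(self, made_progress: bool) -> str:
        """Update the counter and return the next action for the dialog manager."""
        if made_progress:
            self.consecutive_failed_turns = 0  # any progress resets the streak
            return "continue"
        self.consecutive_failed_turns += 1
        if self.consecutive_failed_turns >= self.escalate_at:
            return "handoff"
        if self.consecutive_failed_turns >= self.change_strategy_at:
            return "constrain"  # constrained choice, narrower question, or DTMF
        return "retry"
```

Resetting on any progressing turn keeps an otherwise healthy long call from being escalated because of scattered, unrelated misrecognitions.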

STEP 4

Latency stalls and the death spiral.

A slow turn invites the caller to start talking into the gap. That speech arrives mid-generation and triggers a barge-in; the agent stops, the caller's words overlap the tail of the old response, STT garbles the mix, and the agent misunderstands. Recovery takes longer still, so each failure makes the next more likely. Latency does not just annoy; it actively manufactures the turn-taking and recognition failures described in the rest of this group.

This is why the latency budget (latency-budget) is a reliability concern, not a polish concern. Below the budget the conversation is stable; above it, the failures compound into a spiral the agent cannot talk its way out of.
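Treating the budget as a reliability signal means measuring it per turn and alerting on breaches. The sketch below is illustrative only: `TurnTimings`, the stage names, and the 800 ms figure in the usage note are placeholders, not values taken from the latency-budget page.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TurnTimings:
    """Per-turn latency record, checked against a single end-to-end budget."""
    budget_ms: float
    stages_ms: Dict[str, float] = field(default_factory=dict)

    def record(self, stage: str, elapsed_ms: float) -> None:
        self.stages_ms[stage] = elapsed_ms

    @property
    def total_ms(self) -> float:
        return sum(self.stages_ms.values())

    def over_budget(self) -> bool:
        return self.total_ms > self.budget_ms

    def slowest_stage(self) -> str:
        """Name of the stage that contributed the most latency this turn."""
        return max(self.stages_ms, key=lambda stage: self.stages_ms[stage])
```

For example, with a placeholder budget of 800 ms, recording `stt=200`, `llm=700`, and `tts=150` gives a total of 1050 ms, so `over_budget()` is true and `slowest_stage()` points at `llm` as the first place to look.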

STEP 5

Escalation and handoff: the failure mode you must design for.

Every voice agent will encounter calls it cannot handle — out of scope, repeatedly misheard, an angry or distressed caller, or simply a task that needs a human. The measure of a serious voice product is not that this never happens; it is that when it does, the handoff is clean.

Recognize the trigger. Explicit request ("agent" / "representative"), repeated failure, detected distress or anger, or an out-of-scope intent — each should route to a human, not loop.
Carry the context. Hand the human a summary, the verified slots, and what was attempted. Forcing the caller to repeat everything to a person is its own infuriating failure.
Fail toward a human, not toward silence. If the agent is unsure whether to escalate, escalate. A handoff is a graceful degradation; a stuck loop is an outage with a friendly voice.
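The three rules above can be sketched as a trigger check plus a context packet. Everything here is hypothetical: `TurnSignals`, `escalation_reason`, `HandoffPacket`, and the reason labels are illustrative names, the keyword match is deliberately naive, and distress and scope detection are assumed to come from upstream classifiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

EXPLICIT_REQUESTS = ("agent", "representative")  # the examples from the text


@dataclass
class TurnSignals:
    transcript: str
    consecutive_failed_turns: int = 0
    distress_detected: bool = False  # assumed output of an upstream classifier
    in_scope: bool = True            # assumed output of intent classification
    unsure: bool = False             # the agent cannot tell whether to escalate


def escalation_reason(signals: TurnSignals) -> Optional[str]:
    """Return why this turn should route to a human, or None to continue."""
    words = [w.strip(".,!?") for w in signals.transcript.lower().split()]
    if any(w in EXPLICIT_REQUESTS for w in words):
        return "explicit_request"
    if signals.consecutive_failed_turns >= 3:
        return "repeated_failure"
    if signals.distress_detected:
        return "distress_or_anger"
    if not signals.in_scope:
        return "out_of_scope"
    if signals.unsure:
        return "unsure_default_to_human"  # fail toward a human
    return None


@dataclass
class HandoffPacket:
    """What the human receives so the caller does not have to repeat anything."""
    reason: str
    summary: str
    verified_slots: Dict[str, str]
    attempted: List[str]
```

A bare keyword match will misfire on phrases such as "travel agent", so a real system would likely use intent classification for the explicit-request trigger; the point of the sketch is that every trigger resolves to a handoff, never to another loop iteration.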

Treat every production failure as a regression test, exactly as in text agents: capture the call audio, replay it against the stack, assert the agent now recovers or escalates instead of looping. A voice failure you cannot replay is a voice failure you will ship again.
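A replay-based regression check might be structured like the sketch below. `RecordedCall`, `replay`, and `recovers_or_escalates` are hypothetical names, and the agent is reduced to a callable that maps one turn of caller audio to an outcome label; a real harness would drive the full STT, LLM, and TTS stack.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed interface: one turn of caller audio in, one outcome label out.
# Labels used here: "progress", "retry", "handoff".
AgentStep = Callable[[bytes], str]


@dataclass
class RecordedCall:
    call_id: str
    audio_turns: List[bytes]  # captured caller audio, one chunk per turn


def replay(call: RecordedCall, agent_step: AgentStep) -> List[str]:
    """Feed the recorded caller audio through the agent, turn by turn."""
    return [agent_step(chunk) for chunk in call.audio_turns]


def recovers_or_escalates(outcomes: List[str], max_retries_in_a_row: int = 2) -> bool:
    """True if the agent escalated, or made progress, before it started looping."""
    streak = 0
    for outcome in outcomes:
        if outcome == "handoff":
            return True  # escalated instead of looping
        streak = streak + 1 if outcome == "retry" else 0
        if streak > max_retries_in_a_row:
            return False  # looped past the allowed number of retries
    return streak == 0  # the call ended on a progressing turn
```

Each captured production failure then becomes one assertion, for example `assert recovers_or_escalates(replay(call, agent_step))`, which fails again if a later change reintroduces the loop.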

STEP 6

The honest tradeoff.

You cannot make a voice agent that never fails; you can only choose how it fails — and a product that escalates early and honestly will always beat one that hides its failures behind a confident voice and an apology loop.