Turn-Taking & Barge-In

V3
Playbook · Voice & Realtime Agents

Turn-taking and barge-in: deciding who speaks, and stopping when interrupted.

Humans negotiate turns in a conversation without thinking — they hear the end of a thought, they yield to an interruption, they tolerate a brief overlap. A voice agent has to do all of this explicitly, in real time, from an audio stream, with no shared social model. Get it wrong and the agent either bulldozes the caller or sits frozen waiting for a sentence that already ended. This is the hardest interaction problem in voice, and it is mostly not a model problem.

STEP 1

VAD is not endpointing, and confusing them is the root bug.

Two distinct jobs, constantly conflated:

  • VAD (voice activity detection) answers "is there speech in this audio frame right now?" — a low-level, sub-100 ms signal. It does not know if a turn is over; a thinking pause is silence too.
  • Endpointing (end-of-turn detection) answers "is the caller done, such that the agent should now respond?" — a decision built on top of VAD plus, ideally, transcript and meaning.

Endpointing on raw VAD silence alone is the canonical voice-agent failure. A 600 ms silence threshold cuts off anyone who pauses to think ("my account number is… 4 4 7…"), and a 1500 ms one makes the agent feel asleep. There is no fixed silence value that is both responsive and patient — that is why fixed timeouts lose.
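To see the failure concretely, here is a minimal sketch of a fixed-silence endpointer run over a stream of per-frame VAD decisions. The 20 ms frame size, the helper names, and the call timeline are illustrative assumptions, not taken from any particular VAD library.

```python
from typing import Iterable, List, Optional, Tuple

FRAME_MS = 20  # assumed frame size; real VADs commonly use 10-30 ms frames


def fixed_silence_endpoint(vad_frames: Iterable[bool], threshold_ms: int) -> Optional[int]:
    """Return the time (ms) at which a fixed-silence endpointer fires, or None.

    vad_frames holds one VAD decision per frame (True = speech detected).
    The endpointer fires once silence following speech exceeds threshold_ms.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0          # any speech resets the silence timer
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms > threshold_ms:
                return (i + 1) * FRAME_MS
    return None


def frames(*segments: Tuple[bool, int]) -> List[bool]:
    """Build a VAD frame list from (is_speech, duration_ms) segments."""
    out: List[bool] = []
    for is_speech, duration_ms in segments:
        out.extend([is_speech] * (duration_ms // FRAME_MS))
    return out


# "my account number is" (1.0 s), thinking pause (0.9 s), "4 4 7" (0.8 s), then silence
call = frames((True, 1000), (False, 900), (True, 800), (False, 2000))

early = fixed_silence_endpoint(call, 600)    # fires at 1620 ms, inside the thinking pause
late = fixed_silence_endpoint(call, 1500)    # fires at 4220 ms, 1.5 s after the caller finished
```

The 600 ms setting interrupts the caller before the digits arrive; the 1500 ms setting gets the whole utterance but leaves a second and a half of dead air. The VAD did its job in both cases; the endpointing policy is what failed.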

STEP 2

Semantic endpointing: use the words, not just the silence.

The 2025-era answer, shipped in some form by LiveKit, AssemblyAI, and OpenAI (as semantic-VAD turn detection), is to feed the partial transcript into a small model that asks "does this sound like a complete utterance?" and then scale the silence threshold by the answer.

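# turn/endpoint.py: silence threshold is dynamic
def end_of_turn(partial: str, silence_ms: float) -> bool:
    """Decide whether the caller's turn is over, given the transcript so far."""
    # complete_utterance_prob is the small end-of-turn model, assumed to be
    # provided elsewhere; it returns P(utterance is complete) in [0, 1].
    p = complete_utterance_prob(partial)
    if p > 0.85:
        return silence_ms > 120    # sounds done: fire fast
    if p < 0.30:
        return silence_ms > 1400   # "uh, my number is...": wait
    return silence_ms > 600        # unsure: middle ground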

"What's the weather", a complete question usually spoken with falling intonation, ends the turn after roughly 120 ms of silence; "my card number is", which trails off mid-phrase, should buy the caller over a second of thinking room. Same silence, opposite decision, because the words carry the intent that silence alone cannot.
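A runnable way to try the thresholds above is to plug in a stand-in scorer. The keyword heuristic below is purely a toy placeholder for a trained end-of-turn model, and the `score` parameter is a hypothetical hook added for this sketch so the scorer can be swapped out.

```python
from typing import Callable


def toy_complete_utterance_prob(partial: str) -> float:
    """Crude stand-in for a trained end-of-turn model (illustration only)."""
    text = partial.strip().lower()
    if not text:
        return 0.0
    dangling = ("is", "and", "the", "to", "my", "uh", "um", "so", "but")
    if text.split()[-1].strip(".,?!") in dangling:
        return 0.1   # ends on a word that expects a continuation
    if text.startswith(("what", "how", "when", "where", "who")):
        return 0.9   # looks like a finished question
    return 0.5       # no strong evidence either way


def end_of_turn(partial: str, silence_ms: float,
                score: Callable[[str], float] = toy_complete_utterance_prob) -> bool:
    """Same thresholds as the endpoint sketch, with the scorer injectable."""
    p = score(partial)
    if p > 0.85:
        return silence_ms > 120
    if p < 0.30:
        return silence_ms > 1400
    return silence_ms > 600


# Identical 300 ms of silence, opposite decisions:
done = end_of_turn("What's the weather", 300)      # True: respond now
waiting = end_of_turn("my card number is", 300)    # False: keep listening
```

Keeping the scorer injectable means the same turn logic can be tested against canned probabilities and later pointed at a real model without touching the thresholds.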

STEP 3

Barge-in: the caller can always interrupt, and the agent must yield.

If the caller starts talking while the agent is speaking, the agent must stop within a couple hundred milliseconds — not finish its sentence. An agent that talks over an interruption is experienced as rude and, worse, broken; people will hang up. Barge-in is non-negotiable in any serious voice product.

agent speaking ............ caller starts talking
                            |
                            +-- detect speech (VAD, <100ms)
                            +-- STOP agent audio immediately
                            +-- FLUSH unsent TTS / cancel response
                            +-- truncate the killed turn in state to what was played
                            +-- listen; the caller now has the floor

The subtle part is state: the agent said three of ten sentences, then got cut off. What it actually spoke — not what it generated — is what the caller heard. Your conversation state must reflect the truncated reality, or the agent will reference things it never actually said aloud.
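A minimal sketch of that bookkeeping, assuming playback is tracked per sentence. The class and method names are hypothetical, and production stacks often track at word or audio-timestamp granularity instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class AgentTurn:
    """One agent response, split into sentences queued for TTS playback."""
    sentences: List[str]
    played: int = 0              # sentences whose audio finished playing
    interrupted: bool = False

    def mark_played(self) -> None:
        """Call when playback of the next queued sentence completes."""
        if not self.interrupted and self.played < len(self.sentences):
            self.played += 1

    def barge_in(self) -> List[str]:
        """Stop the turn and return the sentences that were never spoken."""
        self.interrupted = True
        unsent = self.sentences[self.played:]
        self.sentences = self.sentences[:self.played]   # keep only what was heard
        return unsent


@dataclass
class Conversation:
    history: List[Dict[str, Union[str, bool]]] = field(default_factory=list)

    def commit_agent_turn(self, turn: AgentTurn) -> None:
        """Record what the caller actually heard, not what was generated."""
        if turn.sentences:
            self.history.append({
                "role": "agent",
                "text": " ".join(turn.sentences),
                "interrupted": turn.interrupted,
            })


# The agent generated ten sentences and was cut off after speaking three.
turn = AgentTurn([f"Sentence {i}." for i in range(1, 11)])
for _ in range(3):
    turn.mark_played()
dropped = turn.barge_in()        # seven sentences flushed, never spoken
convo = Conversation()
convo.commit_agent_turn(turn)    # history holds only sentences 1-3
```

The `interrupted` flag is worth keeping in history: it lets the next response acknowledge the cut-off rather than assume the caller heard the full answer.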

STEP 4

Echo cancellation is a turn-taking prerequisite, not an audio nicety.

The agent's own voice comes back through the caller's microphone (especially on speakerphone). Without acoustic echo cancellation, VAD sees the agent's own speech as "the caller is talking," triggers a false barge-in, and the agent interrupts itself into a stuttering loop. Robust AEC is load-bearing for turn-taking — a turn-taking bug report is very often an echo bug.
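To illustrate why AEC has to sit in front of the VAD, here is a toy normalized-LMS echo canceller in pure Python. It is a sketch of the general technique under simplifying assumptions (a short, fixed, made-up echo path, no caller speech, no noise), not a substitute for a production AEC such as the one in WebRTC's audio processing.

```python
import random
from typing import List


def nlms_echo_cancel(far_end: List[float], mic: List[float],
                     taps: int = 8, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Return the residual after subtracting the estimated echo of far_end from mic."""
    weights = [0.0] * taps
    history = [0.0] * taps            # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate          # what remains once the estimated echo is removed
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def energy(samples: List[float]) -> float:
    """Mean squared amplitude, the quantity a simple energy VAD thresholds."""
    return sum(s * s for s in samples) / max(len(samples), 1)


rng = random.Random(0)
agent_audio = [rng.uniform(-1.0, 1.0) for _ in range(4000)]   # stand-in for TTS output
echo_path = [0.6, 0.3, 0.1]                                    # hypothetical room response
mic = [sum(echo_path[k] * agent_audio[n - k] for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(agent_audio))]

residual = nlms_echo_cancel(agent_audio, mic)
raw_energy = energy(mic[-500:])          # what the VAD sees without AEC
clean_energy = energy(residual[-500:])   # what the VAD sees after AEC converges
```

On the raw microphone signal an energy-based VAD would report continuous "speech" that is only the agent's own echo; on the residual there is almost nothing left to trigger on, so a genuine caller interruption stands out.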

STEP 5

Overlap, backchannels, and full-duplex.

Real conversation is not strictly half-duplex. People say "mm-hm" and "right" while the other is talking — backchannels — and these are not barge-ins; treating every "uh-huh" as an interruption makes the agent stop constantly. A good turn manager distinguishes a continuer ("mm-hm", keep going) from a real bid for the floor ("wait — actually…", yield now).
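One simple way to draw that line is a length-plus-lexicon check on caller speech that overlaps the agent. The continuer list, the 700 ms cutoff, and the function name below are illustrative assumptions; a production system would tune or learn them.

```python
import re

# Assumed continuer lexicon; extend per language and domain.
CONTINUERS = {"mm-hm", "mhm", "uh-huh", "yeah", "yep", "right", "ok", "okay", "sure", "i see"}
MAX_BACKCHANNEL_MS = 700   # assumed cutoff: longer overlaps are treated as bids for the floor


def classify_overlap(transcript: str, speech_ms: int) -> str:
    """Label caller speech that overlaps agent speech.

    Returns "backchannel" (agent keeps talking) or "barge_in" (agent yields).
    """
    text = re.sub(r"[^\w\s'-]", "", transcript.lower())   # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()
    if speech_ms <= MAX_BACKCHANNEL_MS and text in CONTINUERS:
        return "backchannel"
    return "barge_in"


keep_going = classify_overlap("Mm-hm.", 300)          # "backchannel"
yield_now = classify_overlap("wait, actually", 400)   # "barge_in"
```

Anything not recognized as a continuer defaults to a barge-in, on the reasoning that wrongly stopping costs less than talking over a caller. The catch is latency: waiting for a transcript delays the stop, so one option is to pause or duck the agent's audio on VAD onset and resume if the overlap turns out to be a backchannel.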

Newer speech-native (speech-to-speech) models move toward genuine full-duplex operation: listening and speaking simultaneously, the way humans do. It is more natural and considerably harder to get right; until it is robust, most production stacks run a disciplined half-duplex turn manager with fast barge-in and explicit backchannel handling, because a predictable agent beats an unpredictable one.

STEP 6

The honest tradeoff.

Every turn-taking parameter is a setting of the same single dial, eagerness versus patience, and there is no global optimum: a debt-collection IVR wants fast and decisive, a grief-support line wants slow and patient, and the mistake to avoid is shipping one tuning for both.
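In configuration terms, that means turn-taking parameters belong in a per-deployment profile rather than in constants. The sketch below reuses the 120 / 600 / 1400 ms thresholds from Step 2 as a baseline; the profile names and the 0.75 and 1.5 scale factors are made-up placeholders to be tuned against real calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnProfile:
    """Silence thresholds (ms) for one deployment, keyed by model confidence."""
    confident_silence_ms: int    # utterance sounds complete
    unsure_silence_ms: int       # model is undecided
    incomplete_silence_ms: int   # utterance sounds unfinished

    def threshold_ms(self, p_complete: float) -> int:
        """Pick the silence threshold for a given completeness probability."""
        if p_complete > 0.85:
            return self.confident_silence_ms
        if p_complete < 0.30:
            return self.incomplete_silence_ms
        return self.unsure_silence_ms


def scaled(profile: TurnProfile, factor: float) -> TurnProfile:
    """Turn the single dial: factor < 1 is more eager, factor > 1 more patient."""
    return TurnProfile(
        round(profile.confident_silence_ms * factor),
        round(profile.unsure_silence_ms * factor),
        round(profile.incomplete_silence_ms * factor),
    )


BASELINE = TurnProfile(120, 600, 1400)   # thresholds from the endpoint sketch
EAGER = scaled(BASELINE, 0.75)           # placeholder tuning, e.g. a transactional IVR
PATIENT = scaled(BASELINE, 1.5)          # placeholder tuning, e.g. a support line
```

Expressing the profiles as one baseline plus a scale factor keeps the "single dial" framing honest: a deployment chooses a position on the dial, and the individual thresholds follow from it.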