Voice & Realtime Agents

Realtime voice agents: the speech stack, turn-taking, barge-in, latency budgets, and voice-specific tooling and state.

  1. Realtime Agent Architecture
    Cascade (STT→LLM→TTS) vs native speech-to-speech, the stateful audio transport, and the one decision everything else hangs on: where the agent loop lives.
  2. The Latency Budget
    The sub-second turn accounted for line by line: where the milliseconds go, why endpointing is the biggest slice, and perceived vs actual latency.
  3. Turn-Taking & Barge-In
    VAD vs endpointing, semantic end-of-turn detection, mandatory barge-in, echo cancellation as a prerequisite, and backchannels vs real interruptions.
  4. STT, TTS & Speech-to-Speech
    Streaming STT, the transcription-error tax, TTS time-to-first-audio, native audio models, and why 8 kHz telephony changes every benchmark.
  5. Tool Use & State in Voice
    Calling tools without dead air: preambles, async/parallel tool runs, confirm-by-ear before mutating, and slot state across an interruptible call.
  6. Voice Agent Failure Modes
    Hallucinated hearing, dead air, the infinite apology loop, the latency death spiral, and the escalation/handoff you must design for.