Voice & Realtime Agents
Realtime voice agents — speech stack, turn-taking, barge-in, latency budgets, voice-specific tooling and state.
- Realtime Agent Architecture: Cascade (STT→LLM→TTS) vs native speech-to-speech, the stateful audio transport, and the one decision everything else hangs on: where the agent loop lives.
- The Latency Budget: The sub-second turn accounted for line by line: where the milliseconds go, why endpointing is the biggest slice, and perceived vs actual latency.
- Turn-Taking & Barge-In: VAD vs endpointing, semantic end-of-turn detection, mandatory barge-in, echo cancellation as a prerequisite, and backchannels vs real interruptions.
- STT, TTS & Speech-to-Speech: Streaming STT, the transcription-error tax, TTS time-to-first-audio, native audio models, and why 8 kHz telephony changes every benchmark.
- Tool Use & State in Voice: Calling tools without dead air: preambles, async/parallel tool runs, confirm-by-ear before mutating, and slot state across an interruptible call.
- Voice Agent Failure Modes: Hallucinated hearing, dead air, the infinite apology loop, the latency death spiral, and the escalation/handoff you must design for.