Voice & Realtime Agents

Realtime voice agents: the speech stack, turn-taking, barge-in, latency budgets, and voice-specific tooling and state.

Realtime Agent Architecture

Cascade (STT→LLM→TTS) vs native speech-to-speech, the stateful audio transport, and the one decision everything else hangs on: where the agent loop lives.

The Latency Budget

The sub-second turn accounted for line by line: where the milliseconds go, why endpointing is the biggest slice, and perceived vs actual latency.

Turn-Taking & Barge-In

VAD vs endpointing, semantic end-of-turn detection, mandatory barge-in, echo cancellation as a prerequisite, and backchannels vs real interruptions.

STT, TTS & Speech-to-Speech

Streaming STT, the transcription-error tax, TTS time-to-first-audio, native audio models, and why 8 kHz telephony changes every benchmark.

Tool Use & State in Voice

Calling tools without dead air: preambles, async/parallel tool runs, confirm-by-ear before mutating, and slot state across an interruptible call.

Voice Agent Failure Modes

Hallucinated hearing, dead air, the infinite apology loop, the latency death spiral, and the escalation/handoff you must design for.