Realtime Agent Architecture

Playbook · Voice & Realtime Agents

Realtime agent architecture: the cascade, the native model, and where the loop lives.

A voice agent is the same agent loop you already know, wrapped in a hard real-time audio contract. The first architectural choice is the consequential one: chain three models in a pipeline (STT → LLM → TTS), or run one model that hears and speaks directly. Everything else in this group (latency, turn-taking, tooling, failure) is downstream of which one you picked and where you put the reasoning.

STEP 1

The cascade: three models in a streaming pipeline.

The classic architecture is a cascade: a streaming speech-to-text model transcribes the caller, the transcript drives a normal text LLM agent loop, and a text-to-speech model voices the reply. It is the same architecture as a chat agent with a microphone on the front and a speaker on the back.

Transparent. The transcript and the text turn are both inspectable — you can log, eval, and guardrail them with the exact tooling you already have for text agents.
Composable. Swap STT vendor, LLM, or TTS voice independently. Best-of-breed at each stage.
Lossy at the seams. Prosody, emotion, hesitation, and overlap are flattened to text and gone. The model never heard the caller; it read a transcript of them.
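
As a minimal sketch of one cascade turn, assuming each stage is just a callable: the names `run_cascade_turn`, `CascadeTurn`, and the `fake_*` stand-ins are hypothetical and do not correspond to any vendor SDK. The point is that the transcript and the text reply both exist as plain values you can log and guardrail, and that the caller's audio never reaches the agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class CascadeTurn:
    """One caller turn, with every intermediate artifact kept for logging."""
    transcript: str = ""
    reply_text: str = ""
    audio_out: list[bytes] = field(default_factory=list)


def run_cascade_turn(
    audio_in: Iterable[bytes],
    stt: Callable[[Iterable[bytes]], str],
    agent: Callable[[str], Iterator[str]],
    tts: Callable[[str], bytes],
) -> CascadeTurn:
    turn = CascadeTurn()
    turn.transcript = stt(audio_in)        # seam 1: audio is flattened to text
    for chunk in agent(turn.transcript):   # the ordinary text agent loop
        turn.reply_text += chunk
        turn.audio_out.append(tts(chunk))  # seam 2: text is voiced chunk by chunk
    return turn


# Hypothetical stand-ins for real STT, LLM, and TTS vendors.
def fake_stt(frames: Iterable[bytes]) -> str:
    return " ".join(f.decode() for f in frames)


def fake_agent(text: str) -> Iterator[str]:
    yield "You said: "
    yield text


def fake_tts(text: str) -> bytes:
    return text.encode()


turn = run_cascade_turn([b"hello", b"there"], fake_stt, fake_agent, fake_tts)
```

Because each stage is an argument, swapping a vendor means passing a different callable. Nothing in `agent` can see prosody or overlap, which is the lossy seam in code form.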

STEP 2

The native model: one speech-to-speech model.

A speech-to-speech model (OpenAI's Realtime API with gpt-realtime, Google's Gemini Live) takes audio in and emits audio out through a single model. There is no intermediate transcript on the critical path. It hears tone, interruption, and pacing directly, and can respond expressively because it was never reduced to text.

# cascade: three hops, three failure points
mic → STT → LLM (agent loop) → TTS → speaker

# native: one model, audio in / audio out
mic → speech-to-speech model (loop inside) → speaker

Native does not mean transcript-free in practice. You still want a transcript for logging, eval, and compliance — most realtime APIs emit one as a side channel. The difference is that it is observability, not the reasoning substrate. Do not put a guardrail on a transcript and assume it gates audio the model already spoke.
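
The timing problem can be sketched as follows, assuming a session that interleaves audio and transcript events. The `Event` type, the event names `audio_out` and `transcript`, and the `late_flags` helper are hypothetical and not any API's actual schema. The check runs on the transcript, so every flag it raises arrives with a count of audio frames that already went out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    kind: str      # "audio_out" or "transcript" (hypothetical event names)
    payload: str


def late_flags(
    events: list[Event],
    is_violation: Callable[[str], bool],
) -> list[tuple[str, int]]:
    """Return each flagged transcript with the number of frames spoken before the flag."""
    frames_spoken = 0
    flagged: list[tuple[str, int]] = []
    for ev in events:
        if ev.kind == "audio_out":
            frames_spoken += 1  # this audio has already left for the caller
        elif ev.kind == "transcript" and is_violation(ev.payload):
            flagged.append((ev.payload, frames_spoken))
    return flagged


events = [
    Event("audio_out", "frame-1"),
    Event("audio_out", "frame-2"),
    Event("transcript", "your PIN is 1234"),
    Event("audio_out", "frame-3"),
    Event("transcript", "anything else?"),
]
flags = late_flags(events, lambda text: "PIN" in text)
```

A nonzero frame count next to a flag means the guardrail observed a violation rather than prevented one, which is why it belongs in the side channel and not on the critical path.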

STEP 3

Where the agent loop lives.

This is the question that decides your architecture. The agent loop — plan, call tool, observe, continue — has to run somewhere, and voice makes its placement load-bearing.

Cascade: the loop is yours, in your orchestrator, between STT and TTS. Full control of tools, memory, and policy; you own the latency too.
Native, model-driven: the loop runs inside the realtime model — it decides when to call your tools via function calling over the session. Lowest latency, least control.
Native, externally-driven: the model handles speech and turn-taking, but hands reasoning-heavy steps to an external text agent. A hybrid that keeps audio quality and reclaims tool/policy control — at the cost of a round-trip you must hide.
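
The two native placements can be sketched with one tool dispatcher, under the assumption that the realtime model emits a function-call event carrying a name, a call id, and JSON-encoded arguments. The event shape, the `TOOLS` registry, and both tool names are hypothetical. In the model-driven case the model calls `lookup_order` directly; in the externally-driven case the tool it calls is itself a handoff to a text agent.

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., dict]] = {}


def tool(fn: Callable[..., dict]) -> Callable[..., dict]:
    """Register a function so the realtime model can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def lookup_order(order_id: str) -> dict:
    # Model-driven: the realtime model decides when to call this.
    return {"order_id": order_id, "status": "shipped"}  # stub data


@tool
def ask_text_agent(question: str) -> dict:
    # Externally-driven: the tool is a handoff to your own text agent loop.
    # A real implementation would run that loop here; this is a stub.
    return {"answer": f"(text agent reply to: {question})"}


def handle_function_call(event: dict) -> dict:
    """Run the requested tool and build the result event to send back."""
    fn = TOOLS.get(event["name"])
    if fn is None:
        output = {"error": f"unknown tool {event['name']}"}
    else:
        output = fn(**json.loads(event["arguments"]))
    return {
        "type": "function_result",
        "call_id": event["call_id"],
        "output": json.dumps(output),
    }


result = handle_function_call({
    "type": "function_call",
    "call_id": "c1",
    "name": "lookup_order",
    "arguments": '{"order_id": "A17"}',
})
```

The dispatcher is identical in both cases. What changes is how long the tool takes, and that is the round-trip the hybrid has to hide.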

STEP 4

The transport is part of the architecture.

Text agents move JSON over HTTP. Voice agents move continuous audio frames over a persistent connection — WebRTC or a WebSocket — for the entire call. The session is stateful and long-lived, not request/response.

SESSION  open WebRTC/WS ----------------------------- hang up
         |  audio frames in (20ms)  audio frames out  |
         |  events: speech_started, transcript, ...    |
         |  function_call / function_result           |
         |  one connection, one conversation state     |
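
To make "one connection, one conversation state" concrete, here is a small sketch of a session object that folds every event from the diagram into a single state. The `Session` class and the event dictionaries are hypothetical, with event names borrowed from the diagram rather than from a specific API.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """State that lives for the whole call, not for one request."""
    open: bool = True
    frames_in: int = 0
    transcript: list[str] = field(default_factory=list)
    pending_calls: set[str] = field(default_factory=set)

    def apply(self, event: dict) -> None:
        if not self.open:
            raise RuntimeError("session closed")
        kind = event["type"]
        if kind == "audio_in":
            self.frames_in += 1
        elif kind == "transcript":
            self.transcript.append(event["text"])
        elif kind == "function_call":
            self.pending_calls.add(event["call_id"])      # tool is now in flight
        elif kind == "function_result":
            self.pending_calls.discard(event["call_id"])  # tool finished
        elif kind == "hang_up":
            self.open = False


session = Session()
for ev in [
    {"type": "audio_in"},
    {"type": "audio_in"},
    {"type": "transcript", "text": "where is my order"},
    {"type": "function_call", "call_id": "c1"},
    {"type": "function_result", "call_id": "c1"},
    {"type": "hang_up"},
]:
    session.apply(ev)
```

Unlike an HTTP handler, nothing here can be rebuilt from a single message: drop the connection mid-call and the pending tool calls and transcript go with it unless you persist them yourself.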

Telephony adds another hop: a SIP/PSTN gateway (Twilio, Telnyx, LiveKit SIP) bridges the phone network to your media stream, usually as 8 kHz μ-law. The architecture now spans carrier → gateway → your media server → model, and every hop is latency and a failure domain.
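
The frame arithmetic for that telephony leg is worth working through once. The sketch below assumes 8 kHz μ-law at one byte per sample and the 20 ms frame size from the session diagram; the function names are illustrative. The decoder is the standard G.711 μ-law expansion to 16-bit linear PCM, written out by hand so it does not depend on any audio library.

```python
SAMPLE_RATE_HZ = 8_000
FRAME_MS = 20
BIAS = 0x84  # 132, the G.711 mu-law bias


def samples_per_frame(rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    return rate_hz * frame_ms // 1000


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_frame(frame: bytes) -> list[int]:
    return [ulaw_to_pcm16(b) for b in frame]


frame_bytes = samples_per_frame()                # 160 samples, 1 byte each
frames_per_second = 1000 // FRAME_MS             # 50
bits_per_second = SAMPLE_RATE_HZ * 8             # 64000
silence = decode_frame(bytes([0xFF]) * frame_bytes)
```

So each 20 ms frame from the gateway is 160 bytes, arriving 50 times a second at 64 kbit/s. If the model expects a higher sample rate or linear PCM, the decode and resample step belongs to the media server and adds to the latency of that hop.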

STEP 5

The reference shape most production stacks converge on.

In practice, robust 2025-era stacks converge on five parts: a media server holding the call, a turn detector deciding who speaks, the brain (a native model or a cascade orchestrator) producing the reply, a tool runtime running your functions off the audio path, and a side channel doing transcript logging and async guardrails.

# the load-bearing components, not a framework
media_server   # holds WebRTC/SIP, jitter buffer, echo cancel
turn_detector  # VAD + semantic endpointing (see V3)
brain          # native S2S model OR cascade orchestrator
tool_runtime   # your functions, run off the audio path (V5)
side_channel   # transcript, eval, recording, async checks

Decide brain placement first; everything else follows from it. Cascade when you need transcript-level control, eval parity with text agents, or heavy tool/policy logic. Native when conversational quality and sub-second response are the product. Hybrid when you genuinely need both, but only then: the round-trip is real.
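
The rule of thumb above can be written down as a tiny decision function. This is a sketch only: the `Brain` enum and `choose_brain` are hypothetical names, the two boolean inputs compress the criteria in the text, and falling back to cascade when neither need applies is an assumption of this example, not a rule stated here.

```python
from enum import Enum


class Brain(Enum):
    CASCADE = "cascade"
    NATIVE = "native"
    HYBRID = "hybrid"


def choose_brain(needs_transcript_control: bool, needs_conversational_quality: bool) -> Brain:
    """Pick a brain placement from the two needs named in the text."""
    if needs_transcript_control and needs_conversational_quality:
        return Brain.HYBRID   # genuinely both: accept the round-trip
    if needs_conversational_quality:
        return Brain.NATIVE   # quality and sub-second response are the product
    # Assumption of this sketch: default to cascade, which reuses text-agent tooling.
    return Brain.CASCADE


support_line = choose_brain(needs_transcript_control=True, needs_conversational_quality=False)
companion_app = choose_brain(needs_transcript_control=False, needs_conversational_quality=True)
```

Writing the rule as code makes the hybrid branch visibly the narrowest one: it is reached only when both inputs are true.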

STEP 6

When NOT to build a voice agent.

If the task tolerates a typed turn, voice buys you nothing and costs you a real-time deadline, an audio failure surface, and worse observability. Go text-first, and add voice only when the channel itself is the requirement.