STT, TTS & Speech-to-Speech

Playbook · Voice & Realtime Agents

STT, TTS, and speech-to-speech: the stack between the ear and the mouth.

In a cascade, two models bracket your agent: speech-to-text on the way in, text-to-speech on the way out. Each is a lossy transducer with its own latency, accuracy, and expressiveness tradeoffs, and the errors they make do not stay contained: they propagate into the agent's reasoning. Native speech-to-speech removes the brackets but trades them for a different set of constraints. This essay is about choosing well at each layer and paying the transcription-error tax with eyes open.

STEP 1

STT: streaming vs batch, and why streaming wins for agents.

Batch STT transcribes a finished audio clip with maximum accuracy. Streaming STT emits partial hypotheses as the caller talks, revising them as more audio arrives. For an agent, streaming is not optional: it is what makes endpointing (deciding the caller has finished), speculation (starting work on a likely transcript), and warm-starting downstream stages possible. You accept slightly lower final accuracy in exchange for meeting a latency budget you could not otherwise hit.
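One way to consume a partial/final stream is sketched below: partials feed only work you can throw away, and side effects wait for the final. The `SttEvent` type and its `is_final` flag are hypothetical stand-ins; real streaming STT APIs expose the same idea under different names.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SttEvent:
    text: str
    is_final: bool  # hypothetical flag; real STT APIs name this differently


@dataclass
class TurnHandler:
    warm_starts: List[str] = field(default_factory=list)  # reversible, speculative work
    committed: List[str] = field(default_factory=list)    # irreversible side effects

    def on_event(self, event: SttEvent) -> None:
        if event.is_final:
            # Only a finalized transcript may trigger side effects.
            self.committed.append(event.text)
        else:
            # Partials may still change; use them only for work you can discard.
            self.warm_starts.append(event.text)


def run(events: Iterable[SttEvent]) -> TurnHandler:
    handler = TurnHandler()
    for event in events:
        handler.on_event(event)
    return handler


events = [
    SttEvent("I want to", False),
    SttEvent("I want to cancel", False),
    SttEvent("I want to cancel my upgrade", True),
]
result = run(events)
print(result.committed)
```

In this sketch the two partials only populate `warm_starts`; nothing is committed until the final event arrives.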

  • Partials are unstable. "I want to" → "I want to cancel" → "I want to cancel my… upgrade". Never act irreversibly on a partial; gate side effects on the finalized transcript.
  • Domain accuracy > benchmark WER. Word error rate on clean read speech tells you little about your customers' account numbers, product names, and accents on an 8 kHz phone line.
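A small sketch of why an aggregate WER can hide the failure that matters. It computes word-level edit distance for WER and, separately, whether a critical slot value survived verbatim. The `slot_accuracy` helper and the order ID are illustrative assumptions, not a standard metric or real data.

```python
from typing import List, Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(reference), 1)


def slot_accuracy(expected: List[str], transcripts: List[str]) -> float:
    """Fraction of utterances whose critical slot value survived verbatim."""
    hits = sum(1 for slot, text in zip(expected, transcripts) if slot in text.split())
    return hits / max(len(expected), 1)


reference = "please cancel order A1042 today".split()
hypothesis = "please cancel order A1024 today".split()  # two digits transposed
wer = word_error_rate(reference, hypothesis)
slots = slot_accuracy(["A1042"], [" ".join(hypothesis)])
print(f"WER={wer:.2f} slot_accuracy={slots:.2f}")
```

Four of five words are right, yet the one word the agent needs is wrong, which is the case a slot-level metric on your own traffic is meant to surface.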

STEP 2

The transcription-error tax.

This is the defining liability of any cascade. STT does not return "I'm not sure"; it returns its best guess as confident text, and the LLM then reasons over that text as if it were ground truth. Even where an STT API exposes per-word confidence, a pipeline that forwards only the text discards that signal. A misheard digit, a wrong drug name, a dropped negation: the agent now confidently acts on something the caller never said.

# the tax: confident text from uncertain audio
caller said:   "don't ship it to the old address"
STT produced:  "do ship it to the old address"
LLM reasons:   confidently. wrong order. no flag raised.

An LLM cannot recover information STT destroyed — it can only guess plausibly, which on names and numbers means guessing wrong with confidence. Do not "let the model sort it out." For high-stakes slots — amounts, IDs, yes/no, addresses — read the value back and get explicit confirmation. The confirmation is not politeness; it is the error-correcting code for a lossy channel.
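The read-back rule can be sketched as a small gate in the dialog logic. Everything here is an illustrative assumption: the slot names, the `AFFIRMATIVE` phrase list, and the helper functions are hypothetical, and because the caller's reply itself passes through STT, a production system would want something sturdier than exact phrase matching.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative list of slots that must be read back before acting.
HIGH_STAKES_SLOTS = {"amount", "account_id", "yes_no", "address"}

# Illustrative set of replies accepted as explicit confirmation.
AFFIRMATIVE = {"yes", "yeah", "correct", "that's right"}


@dataclass
class SlotState:
    name: str
    value: str
    confirmed: bool = False


def next_prompt(slot: SlotState) -> Optional[str]:
    """Return a read-back prompt if the slot still needs confirmation, else None."""
    if slot.name in HIGH_STAKES_SLOTS and not slot.confirmed:
        label = slot.name.replace("_", " ")
        return f"I heard {slot.value} for {label}. Is that correct?"
    return None


def apply_reply(slot: SlotState, reply: str) -> SlotState:
    """Mark the slot confirmed only on an explicit affirmative; anything else re-asks."""
    normalized = reply.strip().lower().rstrip(".!")
    return SlotState(slot.name, slot.value, confirmed=normalized in AFFIRMATIVE)


slot = SlotState("address", "12 Old Mill Road")
print(next_prompt(slot))
slot = apply_reply(slot, "Yes.")
print(next_prompt(slot))  # None once the caller has confirmed
```

The design choice worth noting is the default: any reply that is not an explicit affirmative leaves the slot unconfirmed, so an ambiguous or misheard answer causes a re-ask instead of an action.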

STEP 3

TTS: latency, prosody, and the streaming requirement.

TTS quality is no longer the bottleneck — naturalness is largely solved by 2025 models. The agent-relevant axes are different:

  • Time-to-first-audio. The only TTS latency number that matters for response latency. A model with great quality but 800 ms to first sample is unusable in conversation.
  • Streaming synthesis. The TTS must accept text incrementally and emit audio incrementally, so you start speaking on the LLM's first clause, not its last token.
  • Prosody and control. Pacing, emphasis, and the ability to be interrupted cleanly mid-utterance matter more than another notch of fidelity.
  • Pronunciation control. Order IDs, currencies, and proper nouns need explicit handling — "$1,204.50" and "ACC-0042" must be spoken correctly every time, which usually means normalizing text before synthesis.
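A minimal sketch of normalizing text before synthesis, covering only the two examples above: US dollar amounts and IDs shaped like `ACC-0042`. The patterns and the spoken forms are assumptions for illustration; a real deployment would need locale rules and may prefer the TTS vendor's own markup where one exists.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0..999,999 in English (enough for this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def speak_currency(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2) or 0)
    words = number_to_words(dollars) + " dollars"
    if cents:
        words += " and " + number_to_words(cents) + " cents"
    return words


def speak_id(match: re.Match) -> str:
    # Read IDs character by character so "0042" is never spoken as "forty-two".
    spoken = []
    for ch in match.group(0):
        if ch == "-":
            spoken.append("dash")
        elif ch.isdigit():
            spoken.append(ONES[int(ch)])
        else:
            spoken.append(ch)
    return " ".join(spoken)


def normalize_for_tts(text: str) -> str:
    text = re.sub(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?", speak_currency, text)
    text = re.sub(r"\b[A-Z]{2,}-\d+\b", speak_id, text)
    return text


print(normalize_for_tts("Order ACC-0042 totals $1,204.50."))
```

Running the normalizer on the LLM's output before it reaches the synthesizer keeps pronunciation deterministic, at the cost of maintaining the patterns yourself.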

STEP 4

Native speech-to-speech: what you gain and what you give up.

A native audio model (OpenAI's gpt-realtime, Gemini Live) hears prosody and emotion directly and speaks expressively, with no transcription tax on the critical path and two fewer serial hops of latency. The costs are real and different:

cascade           native speech-to-speech
+ inspectable     - audio reasoning is opaque
+ swap any stage  - vendor-coupled, fewer voices
+ text guardrails - guardrail the audio, async, harder
- transcript tax  + hears tone, no STT tax
- 2 extra hops    + lowest latency

OpenAI reports gpt-realtime at 30.5% on the MultiChallenge audio instruction-following benchmark, up from 20.6% for its December-2024 predecessor — real progress, and also a reminder that audio-native instruction following still trails text. Cost-sensitive paths can drop to gpt-realtime-mini; you trade headroom for price.

STEP 5

The telephony reality: 8 kHz changes everything.

A phone call is narrowband: 8 kHz μ-law (G.711), a codec designed in the 1970s for human listeners, not for STT. At an 8 kHz sampling rate nothing above 4 kHz survives, and the traditional telephone passband is narrower still, roughly 300 to 3,400 Hz. Models that score brilliantly on 16 kHz studio audio degrade on a real PSTN call with background noise, codec artifacts, and a caller on a Bluetooth headset in a car. Always evaluate your speech stack on audio that matches the channel you will actually ship on, not on clean samples.
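When real call recordings are not yet available, a rough stand-in is to degrade clean 16 kHz test audio before feeding it to STT. The sketch below is an approximation under stated assumptions: it uses the continuous μ-law formula, not the segmented encoding that G.711 specifies, uses a crude pair-averaging filter for the downsample, and models no noise, packet loss, or headset effects. Treat it as a smoke test, not a substitute for recordings from your actual channel.

```python
import math
from typing import List

MU = 255.0  # mu-law companding constant


def downsample_16k_to_8k(samples: List[float]) -> List[float]:
    """Halve the sample rate by averaging adjacent pairs (crude low-pass, then decimate)."""
    return [(samples[i] + samples[i + 1]) / 2.0 for i in range(0, len(samples) - 1, 2)]


def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an 8-bit code using the continuous mu-law curve."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255.0))


def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to a sample in [-1, 1]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def simulate_phone_channel(samples_16k: List[float]) -> List[float]:
    """Approximate what a narrowband call does to clean 16 kHz audio."""
    narrowband = downsample_16k_to_8k(samples_16k)
    return [mulaw_decode(mulaw_encode(s)) for s in narrowband]


# 10 ms of a 440 Hz tone at 16 kHz, as a stand-in for real speech samples.
tone = [0.8 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
degraded = simulate_phone_channel(tone)
print(len(tone), "->", len(degraded), "samples")
```

Scoring the same test set before and after this transform gives a first estimate of how much accuracy the channel costs for a given model.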

STEP 6

The honest tradeoff.

Cascade buys you inspectability and per-stage swappability at the price of a permanent transcription-error tax. Native buys you expressiveness and the lowest latency at the price of opaque audio reasoning and harder guardrails. There is no stack without a tax, only a tax you have chosen to pay.