Tool use and state in voice: calling tools without dead air, confirming by ear.
A text agent can think for four seconds and nobody notices. A voice agent that goes silent for four seconds has, as far as the caller is concerned, frozen or hung up. Tool calls are where voice agents most visibly break, because the backend latency that is invisible in chat becomes an audible hole in the conversation. The job is to make tool use conversational: never silent, confirmable by ear, and recoverable across a stateful call.
Speak before, during, and after the tool call.
The single highest-leverage technique in voice tooling: emit a verbal acknowledgement before the tool latency lands, so the caller hears a human-shaped pause, not a dead line.
caller: "what's my balance?"
t+0.2s agent: "Let me pull that up..." <-- cover starts
t+0.3s [tool: get_balance() dispatched]
t+1.6s [tool returns]
t+1.8s agent: "Your balance is $2,481.10."
silence the caller actually experienced: ~0ms
OpenAI's Realtime API documentation calls this pattern a preamble: the model speaks a short filler while a function call is in flight. The preamble must be generic enough to be true regardless of the result ("let me check that"), never a guess at the answer. Saying "looks like you're all paid up" before the tool returns is how agents lie.
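One way to keep preambles result-neutral is to choose them from a fixed table keyed by tool name, so the wording can never depend on what the tool returns. The sketch below is illustrative only: `GENERIC_PREAMBLES`, `DEFAULT_PREAMBLE`, and `preamble_for` are hypothetical names, not part of any vendor API.

```python
# Hypothetical table of result-neutral fillers, keyed by tool name.
# Every phrase stays true whether the tool succeeds, fails, or returns bad news.
GENERIC_PREAMBLES = {
    "get_balance": "Let me pull that up.",
    "cancel_reservation": "One moment while I do that.",
}

# Fallback for any tool that has no dedicated phrase.
DEFAULT_PREAMBLE = "Let me check that."


def preamble_for(tool_name: str) -> str:
    """Return a filler phrase that makes no claim about the tool's result."""
    return GENERIC_PREAMBLES.get(tool_name, DEFAULT_PREAMBLE)
```

Because the lookup only sees the tool name, never the arguments or the result, a phrase such as "looks like you're all paid up" cannot be produced by construction.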
Tool calls are async; the conversation does not block on them.
In a text agent the loop is naturally serial: call tool, wait, continue. In voice you cannot freeze the audio channel while waiting. The tool call runs as an async task; the agent stays live on the line, able to acknowledge, answer a clarifying sub-question, or absorb a barge-in while the backend works.
# voice/tool_runtime.py: never block the audio loop
import asyncio

# filler_for() and run() are assumed to be defined elsewhere in the runtime
async def on_tool_call(call, session):
    await session.say(filler_for(call))        # cover now
    task = asyncio.create_task(run(call))      # off the audio path
    while not task.done():
        if session.user_started_talking():
            await session.yield_floor()        # caller wins
        await asyncio.sleep(0.05)              # poll without blocking the loop
    return task.result()
Run independent tool calls in parallel, never sequentially — three serial 400 ms calls is a 1.2 s hole; three parallel ones is 400 ms. Voice removes the slack that hides serial tool latency in text agents, so the parallelism you could be lazy about there is mandatory here.
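The serial-versus-parallel arithmetic can be checked with a small sketch. It assumes the tool calls are independent coroutines; `fake_tool`, `run_serial`, `run_parallel`, and `timed` are hypothetical helpers that simulate backend latency with `asyncio.sleep`.

```python
import asyncio
import time


async def fake_tool(name: str, latency_s: float) -> str:
    """Stand-in for a backend call; sleeps to simulate network latency."""
    await asyncio.sleep(latency_s)
    return f"{name}:ok"


async def run_serial(calls):
    # Each call waits for the previous one, so total latency is the sum.
    return [await fake_tool(name, latency) for name, latency in calls]


async def run_parallel(calls):
    # All calls start at once, so total latency is roughly the slowest one.
    return await asyncio.gather(
        *(fake_tool(name, latency) for name, latency in calls)
    )


async def timed(runner, calls):
    """Run a runner over the calls and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = await runner(calls)
    return results, time.perf_counter() - start
```

With three calls of 0.4 s each, `asyncio.run(timed(run_serial, calls))` should take roughly 1.2 s and `asyncio.run(timed(run_parallel, calls))` roughly 0.4 s. `asyncio.gather` returns results in the order the calls were passed, so downstream code does not need to change.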
Confirm side effects by ear, before they happen.
A voice agent cannot show a confirmation dialog. The confirmation is the spoken turn, and for any irreversible action it is mandatory: read the consequential parameters back in the caller's terms and require an explicit yes before the tool fires.
# confirm BEFORE the mutating call, not after
agent:  "So that's cancelling the 7pm reservation for four at Nopa tonight. Shall I?"
caller: "yes"
→ only NOW dispatch cancel_reservation(...)
This is the same read-back that fixes the STT tax (see speech-stack): on a lossy audio channel, the confirmation turn is the error-correcting code for both mishearing and wrong intent. Skip it only for read-only or easily reversible actions.
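A confirmation gate can enforce the "explicit yes before the tool fires" rule in code rather than leaving it to the prompt. This is a minimal sketch under stated assumptions: `PendingAction`, `ConfirmationGate`, and the `AFFIRMATIVE` phrase set are hypothetical, and a real system would use a more robust yes/no classifier than exact string matching.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed set of phrases that count as an explicit yes (illustrative only).
AFFIRMATIVE = {"yes", "yeah", "yep", "correct", "go ahead", "please do"}


@dataclass
class PendingAction:
    tool_name: str
    args: dict
    read_back: str  # the sentence the agent speaks to confirm the parameters


class ConfirmationGate:
    """Holds a mutating action until the caller explicitly confirms it."""

    def __init__(self, dispatch: Callable[[str, dict], object]):
        self._dispatch = dispatch
        self._pending: Optional[PendingAction] = None

    def propose(self, action: PendingAction) -> str:
        """Store the action and return the read-back for the agent to speak."""
        self._pending = action
        return action.read_back

    def on_caller_reply(self, utterance: str):
        """Dispatch only on an explicit yes; any other reply drops the action."""
        action, self._pending = self._pending, None
        if action is None:
            return None
        normalized = utterance.strip().lower().rstrip(".!")
        if normalized in AFFIRMATIVE:
            return self._dispatch(action.tool_name, action.args)
        return None
```

The gate clears the pending action on every reply, so a later stray "yes" cannot fire an action the caller already declined or changed.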
State must survive a stateful, interruptible call.
A voice call is one long stateful session, and the caller can interrupt, change their mind, and revisit a slot ("actually, make it 8pm") at any point. The agent's state is not "the transcript so far" — it is the structured set of slots and pending actions, and it must track what was confirmed, what is tentative, and what the caller actually heard.
- Slot state, not transcript state. Track {party: 4, time: 8pm (was 7pm), date: today}, not a wall of text to re-parse every turn.
- Confirmed vs tentative. A value the caller corrected after you read it back is confirmed; one only mentioned in passing is tentative and must be confirmed before any mutation.
- Spoken vs generated. If a barge-in truncated the agent mid-sentence, state must reflect what was said aloud, or the agent will reference a number the caller never heard.
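The three distinctions above can be sketched as one small state object. All names here (`Status`, `Slot`, `CallState`) are hypothetical, and the `chars_played` argument assumes the speech layer can report how much of an utterance was played before a barge-in, which varies by stack.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Status(Enum):
    TENTATIVE = "tentative"
    CONFIRMED = "confirmed"


@dataclass
class Slot:
    value: Any
    status: Status = Status.TENTATIVE
    previous: Optional[Any] = None  # last value before a correction, if any


@dataclass
class CallState:
    slots: Dict[str, Slot] = field(default_factory=dict)
    spoken: List[str] = field(default_factory=list)

    def mention(self, name: str, value: Any) -> None:
        """Caller mentioned a value in passing: record it as tentative."""
        old = self.slots.get(name)
        self.slots[name] = Slot(value, Status.TENTATIVE, old.value if old else None)

    def confirm(self, name: str, value: Any = None) -> None:
        """Caller accepted or corrected a read-back: mark the slot confirmed."""
        slot = self.slots[name]
        if value is not None and value != slot.value:
            slot.previous, slot.value = slot.value, value
        slot.status = Status.CONFIRMED

    def record_spoken(self, generated: str, chars_played: int) -> None:
        """Keep only the part of the utterance played before any barge-in."""
        self.spoken.append(generated[:chars_played])

    def ready_to_mutate(self, required: List[str]) -> bool:
        """True only when every required slot exists and is confirmed."""
        return all(
            name in self.slots and self.slots[name].status is Status.CONFIRMED
            for name in required
        )
```

Checking `ready_to_mutate` before any mutating tool call keeps tentative values from reaching the backend, and the `spoken` list gives the agent a record of what the caller actually heard rather than what was generated.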
Slow and failed tools are conversational events.
A backend that takes eight seconds, or fails, is not an exception to log and swallow — the caller is on the line and will hear it. Give every tool call a conversational timeout and a spoken contingency.
- Timeout with a spoken exit. If the tool exceeds its budget, say so honestly ("that's taking longer than usual — can I call you back, or hold a moment?") rather than narrating filler forever.
- Degrade, don't freeze. On failure, offer the next best path (a callback, a human, a retry) — never silence, never an infinite "still working on that."
- Idempotency under interruption. A barge-in or hang-up mid-tool must not double-charge or double-book; mutating calls need idempotency keys exactly as in text agents, but the failure here is also audible.
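A conversational timeout and an idempotency key can be combined in one wrapper. The sketch below uses `asyncio.wait_for` for the budget; `call_with_budget`, the `say` callback, the spoken phrases, and `FakeBookingBackend` are hypothetical, and the backend is assumed to deduplicate on the key it receives.

```python
import asyncio
import uuid


async def call_with_budget(tool, args, say, budget_s=3.0, idempotency_key=None):
    """Run a tool under a time budget; speak a contingency on timeout or failure."""
    # Reusing the same key on a retry lets the backend deduplicate the mutation.
    key = idempotency_key or str(uuid.uuid4())
    try:
        return await asyncio.wait_for(
            tool(args, idempotency_key=key), timeout=budget_s
        )
    except asyncio.TimeoutError:
        # wait_for cancels the local task, but the backend may still have applied
        # the change, which is why a retry must carry the same key.
        await say("That's taking longer than usual. Can I call you back, or would you like to hold?")
        return None
    except Exception:
        await say("I couldn't complete that just now. I can try again or connect you to a person.")
        return None


class FakeBookingBackend:
    """Illustrative backend that applies each idempotency key at most once."""

    def __init__(self):
        self._seen = {}
        self.bookings = 0

    async def book(self, args, idempotency_key):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay the stored result
        self.bookings += 1
        result = {"booking_id": self.bookings, **args}
        self._seen[idempotency_key] = result
        return result
```

Deriving the key from the call and the confirmed action (rather than generating a fresh one per attempt) is what makes a retry after a barge-in or timeout safe; the random default here is only a fallback for one-shot calls.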
The honest tradeoff.
Every cover phrase you add to hide tool latency is conversational time the caller did not ask for, and padding the call with filler is its own failure mode. The goal is fast tools: verbal cover is the price you pay for slowness, not a substitute for fixing it.