AI voice agents in your PBX: streaming pipeline under 600ms
How we built the SIVO STT → LLM → TTS pipeline with real barge-in, JSON output and end-to-end latency ~600ms. Technical decisions and pitfalls avoided.
A usable conversational AI voice agent must hit three things at once:
- Human latency (≤700ms from end-of-utterance to first bot audio).
- Real barge-in (caller interrupts the bot, bot aborts immediately).
- Structured output to integrate with IVR, webhooks and CRMs.
This is the recipe that got us to ~600ms end-to-end in production.
General architecture
Caller audio
↓ mod_audio_fork (UDP → bidirectional WebSocket)
ai-voice service (Node.js + drachtio)
├→ VAD (client) → speech_start / speech_end
├→ Streaming STT (Deepgram / ElevenLabs Scribe / Whisper)
├→ Streaming LLM (OpenAI compatible: Groq, OpenAI, Cerebras...)
└→ Streaming TTS (ElevenLabs v2/v3 WS, OpenAI TTS)
↓ PCM base64 chunks
HTTP POST → app backend
↓ writes .r8 to shared volume
ESL → uuid_execute playback to the channel
We use mod_audio_fork in one direction only: FreeSWITCH reads the channel audio and forks it out to ai-voice. TTS playback takes a separate route: ai-voice POSTs PCM to the backend, which writes it as a .r8 file and triggers playback via ESL. This adds indirection, but it removes the dependency on a less stable bidirectional module.
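The backend side of that hop can be sketched roughly as below. This is a minimal sketch under stated assumptions, not SIVO's code: it assumes the chunks are base64 strings of 16-bit signed mono PCM at 8 kHz (the layout conventionally associated with a .r8 raw file), and the names `decodePcmChunks` and `pcmDurationMs` are hypothetical. Writing the file and issuing the ESL command are left out.

```javascript
// Assumed audio layout (not stated in the article): 16-bit signed mono PCM at 8 kHz.
const SAMPLE_RATE_HZ = 8000;
const BYTES_PER_SAMPLE = 2;

// Join the base64 chunks emitted by the TTS stream into one raw PCM buffer.
function decodePcmChunks(base64Chunks) {
  return Buffer.concat(base64Chunks.map((chunk) => Buffer.from(chunk, 'base64')));
}

// Playback duration of a raw PCM buffer, in milliseconds.
function pcmDurationMs(pcm) {
  return (pcm.length * 1000) / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ);
}

// Example: two chunks of 800 silent samples each (100 ms apiece at 8 kHz).
const silentChunk = Buffer.alloc(800 * BYTES_PER_SAMPLE).toString('base64');
const pcm = decodePcmChunks([silentChunk, silentChunk]);
console.log(pcm.length, pcmDurationMs(pcm)); // 3200 200
```

Knowing the duration of what was written is useful later, because the barge-in cooldown depends on how long each playback lasts.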
STT: why streaming, not batch
Batch Whisper yields better quality but adds 1.5-2s of latency, so we discarded it for live agents. The serious streaming options:
- Deepgram Nova-2/3: WebSocket auth via header, first partial in ~250ms. Best cost/quality for production.
- ElevenLabs Scribe v2 Realtime: WebSocket auth via xi-api-key header (not query params). Shines in noisy environments and non-native accents.
Pitfall that cost us two days: at some point ElevenLabs switched auth from the ?api_key= query parameter to a header. If you still pass the key in the query string, you get auth_error with no clear hint about the cause.
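One way to keep that pitfall from recurring is to build the handshake headers in a single place and reject URLs that still carry the key. The sketch below is illustrative only: the helper names and the example URL are hypothetical, the `xi-api-key` header comes from the text above, and the `Authorization: Token <key>` form is the scheme Deepgram documents for API keys.

```javascript
// Build WebSocket handshake headers for a streaming STT provider.
function sttAuthHeaders(provider, apiKey) {
  switch (provider) {
    case 'deepgram':
      return { Authorization: `Token ${apiKey}` };
    case 'elevenlabs':
      return { 'xi-api-key': apiKey };
    default:
      throw new Error(`Unknown STT provider: ${provider}`);
  }
}

// Guard against the legacy pattern: refuse URLs that carry the key as a query param.
function assertNoKeyInQuery(url) {
  if (new URL(url).searchParams.has('api_key')) {
    throw new Error('API key found in query string; send it as a header instead');
  }
  return url;
}

console.log(sttAuthHeaders('elevenlabs', 'demo-key')); // { 'xi-api-key': 'demo-key' }
```

With a client such as the `ws` package, the returned object would be passed as the `headers` option when opening the socket.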
LLM: the key decision is TTFT
Time-To-First-Token is the metric that decides perceived latency, not total throughput. Real production measurements:
| Provider | TTFT | Tokens/s |
|---|---|---|
| OpenAI GPT-4o | 667-2400ms | 80 |
| OpenAI GPT-4o-mini | 350-800ms | 120 |
| Groq Llama 3.1 70B | ~120ms | 320 |
| Cerebras Llama 3.1 70B | ~150ms | 450 |
Groq is our default for voice in production. Cerebras is the alternative if you need higher sustained throughput.
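For readers who want to take comparable measurements, here is a minimal sketch of timing TTFT and throughput over any async-iterable token stream. It is not the harness behind the table: `measureStream`, `startStream` and the injectable `now` clock are hypothetical names, and throughput is computed over the intervals after the first token.

```javascript
// Measure time-to-first-token and throughput for an async-iterable token stream.
// `startStream` must issue the request itself, so the timer starts before it.
async function measureStream(startStream, now = () => performance.now()) {
  const startedAt = now();
  let firstTokenAt = null;
  let count = 0;
  for await (const _token of await startStream()) {
    if (firstTokenAt === null) firstTokenAt = now();
    count += 1;
  }
  const finishedAt = now();
  if (firstTokenAt === null) return { ttftMs: null, tokensPerSecond: null, count };
  const generationMs = finishedAt - firstTokenAt;
  return {
    ttftMs: firstTokenAt - startedAt,
    // (count - 1) gaps between tokens were observed after the first one arrived.
    tokensPerSecond: generationMs > 0 ? ((count - 1) * 1000) / generationMs : null,
    count,
  };
}

// Stand-in stream for local experiments; a real one would wrap the provider's SSE response.
async function* fakeTokens() {
  yield 'Got';
  yield ' it';
  yield '.';
}
```

Calling `measureStream(() => fakeTokens())` returns an object with `ttftMs`, `tokensPerSecond` and `count`; collecting many such samples per provider gives the ranges a table like the one above reports.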
TTS: choice depends on whether you want audio tags
Two serious ElevenLabs models:
- eleven_multilingual_v2: WebSocket streaming (stream-input endpoint). 30+ languages. No audio tags.
- eleven_v3: HTTP streaming only (WS returns 403). Supports [laughs], [sighs], [whispers]. Superior quality.
System prompt is conditional on TTS model:
if TTS model contains "v3":
    "You may use [laughs], [sighs], [whispers] for natural inflection."
otherwise:
    "Don't use brackets. Express emotion through punctuation: '...!?'"
Without this, v2 ends up literally reading “open bracket laughs close bracket” aloud. Embarrassing.
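A runnable version of that rule might look like the sketch below. The function names are hypothetical. The `stripAudioTags` step is an extra safety net that the article does not describe: it assumes you would rather drop a stray tag than have a model without tag support read it aloud.

```javascript
// Pick the prompt fragment based on the TTS model name.
function ttsStyleInstruction(ttsModel) {
  return ttsModel.includes('v3')
    ? 'You may use [laughs], [sighs], [whispers] for natural inflection.'
    : "Don't use brackets. Express emotion through punctuation: '...!?'";
}

// Assumption, not from the article: remove bracketed tags that slip through anyway.
function stripAudioTags(text) {
  return text.replace(/\s*\[[^\]\n]*\]\s*/g, ' ').replace(/\s+/g, ' ').trim();
}

// Only models with tag support receive the text untouched.
function prepareForTts(text, ttsModel) {
  return ttsModel.includes('v3') ? text : stripAudioTags(text);
}

console.log(prepareForTts('Sure [laughs] no problem!', 'eleven_multilingual_v2'));
// Sure no problem!
```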
Sentence-level pipelining
The trick that shaves ~40% off latency: don’t wait for the LLM’s full response before starting TTS. SIVO uses a JsonResponseExtractor that parses the JSON incrementally as it streams from the LLM and fires TTS upon detecting .!?\n inside the response field:
{
  "response": "Got it, can you confirm your ID number?| ← sentence boundary
  "action": "continue",
  "variables": { ... }
}
While sentence 1 plays, sentence 2 is being synthesized. While sentence 2 plays, sentence 3 is being generated by the LLM. We overlap the three stages.
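To make the idea concrete, here is a simplified sketch of incremental sentence extraction from a streaming JSON payload. It is not SIVO's JsonResponseExtractor: the class name is hypothetical, it assumes the first `"response"` key in the stream is the one to read, and it only treats `.`, `!` or `?` as a boundary once whitespace follows, so that decimals such as 3.5 stay intact. That last choice delays the flush by one character compared with firing on the punctuation itself.

```javascript
const SIMPLE_ESCAPES = { n: '\n', t: '\t', r: '\r', b: '\b', f: '\f', '"': '"', '\\': '\\', '/': '/' };

class SketchResponseExtractor {
  constructor(onSentence) {
    this.onSentence = onSentence;
    this.state = 'seek'; // 'seek' -> looking for the key, 'value' -> inside its string, 'done'
    this.seen = '';
    this.sentence = '';
    this.pendingEscape = null; // null, or the characters collected after a backslash
  }

  // Feed raw LLM output, in chunks of any size.
  push(chunk) {
    for (const ch of chunk) this.consume(ch);
  }

  consume(ch) {
    if (this.state === 'done') return;
    if (this.state === 'seek') {
      this.seen = (this.seen + ch).slice(-64);
      if (/"response"\s*:\s*"$/.test(this.seen)) this.state = 'value';
      return;
    }
    if (this.pendingEscape !== null) {
      this.pendingEscape += ch;
      if (this.pendingEscape[0] === 'u') {
        if (this.pendingEscape.length < 5) return; // wait for all four hex digits
        this.append(String.fromCharCode(parseInt(this.pendingEscape.slice(1), 16)));
      } else {
        this.append(SIMPLE_ESCAPES[ch] ?? ch);
      }
      this.pendingEscape = null;
      return;
    }
    if (ch === '\\') { this.pendingEscape = ''; return; }
    if (ch === '"') { this.flush(); this.state = 'done'; return; } // end of the string value
    this.append(ch);
  }

  // Add one decoded character; flush on a newline or on whitespace after . ! ?
  append(ch) {
    const endsSentence = /[.!?]$/.test(this.sentence);
    if (ch === '\n' || (endsSentence && /\s/.test(ch))) { this.flush(); return; }
    this.sentence += ch;
  }

  flush() {
    const text = this.sentence.trim();
    this.sentence = '';
    if (text) this.onSentence(text);
  }
}

const spoken = [];
const extractor = new SketchResponseExtractor((s) => spoken.push(s));
for (const chunk of ['{"resp', 'onse": "Got it. Can you con', 'firm your ID number?", "action"']) {
  extractor.push(chunk);
}
console.log(spoken); // [ 'Got it.', 'Can you confirm your ID number?' ]
```

In a real pipeline the `onSentence` callback would enqueue the sentence for TTS while the LLM keeps streaming.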
Barge-in: the cooldown detail
Naive barge-in implementation:
VAD detects speech_start → abort in-flight TTS + LLM
Problem: the bot’s own playback feeds back into the caller’s mic (especially over mobile speakers). VAD detects speech_start, aborts the bot, the bot processes “its own echo” as an utterance, replies “I didn’t catch that”… infinite loop.
Three-layer mitigation:
- greetingPlaying flag during the first 150ms after uuid_audio_fork start (settling time).
- bargeInCooldownUntil: after each playback, ignore speech_start for playMs + 300ms.
- speechMinMs: 400: ignore utterances under 400ms (taps, sighs, ambient noise).
After these three, false positives drop from ~30% to less than 2%.
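The three layers can be combined into one small gate, sketched below. This is an interpretation of the rules as stated, not SIVO's implementation: the class and method names are hypothetical, timestamps are passed in explicitly so the logic is easy to test, and the cooldown is counted from the moment a playback is triggered.

```javascript
class BargeInGate {
  constructor({ settleMs = 150, cooldownPadMs = 300, speechMinMs = 400 } = {}) {
    this.settleMs = settleMs;
    this.cooldownPadMs = cooldownPadMs;
    this.speechMinMs = speechMinMs;
    this.forkStartedAt = null;
    this.bargeInCooldownUntil = 0;
  }

  // Layer 1: remember when the audio fork started, to ignore the settling window.
  onForkStart(nowMs) {
    this.forkStartedAt = nowMs;
  }

  // Layer 2: each playback pushes the cooldown to playMs + pad after it was triggered.
  onPlayback(nowMs, playMs) {
    this.bargeInCooldownUntil = Math.max(
      this.bargeInCooldownUntil,
      nowMs + playMs + this.cooldownPadMs,
    );
  }

  // Should a speech_start at nowMs be allowed to abort the bot?
  acceptSpeechStart(nowMs) {
    const settling = this.forkStartedAt === null || nowMs - this.forkStartedAt < this.settleMs;
    return !settling && nowMs >= this.bargeInCooldownUntil;
  }

  // Layer 3: drop utterances shorter than speechMinMs.
  acceptUtterance(startMs, endMs) {
    return endMs - startMs >= this.speechMinMs;
  }
}

const gate = new BargeInGate();
gate.onForkStart(0);
gate.onPlayback(1000, 2000); // cooldown now runs until 1000 + 2000 + 300 = 3300
console.log(gate.acceptSpeechStart(3000), gate.acceptSpeechStart(3300)); // false true
```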
Handling LLM abort on barge-in
When TTS aborts, the LLM also has to abort (cancel the fetch with AbortController). Important: don’t treat the resulting exception as a fatal error. The catch block must check state.ttsAborted:
try {
  await streamLLM(...);
} catch (err) {
  if (state.ttsAborted) {
    logger.debug('LLM aborted by barge-in (expected)');
    return;
  }
  logger.error('LLM real failure', err);
}
Without this, the log fills with “errors” that are normal barge-ins.
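For completeness, this is one way the abort side could be wired. It is a self-contained sketch with hypothetical names: `fakeStreamLLM` stands in for the real streaming call (which would pass the signal to `fetch`), `state.llmAbort` is an assumed field, and `runTurn` returns a status string only to make the two outcomes visible.

```javascript
// Stand-in for the real streaming call: resolves after `ms`, rejects if aborted first.
function fakeStreamLLM(signal, ms = 1000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('done'), ms);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      const err = new Error('The operation was aborted');
      err.name = 'AbortError';
      reject(err);
    }, { once: true });
  });
}

// Run one LLM turn; a barge-in abort is an expected outcome, anything else is rethrown.
async function runTurn(state, streamLLM = fakeStreamLLM) {
  state.llmAbort = new AbortController();
  state.ttsAborted = false;
  try {
    await streamLLM(state.llmAbort.signal);
    return 'completed';
  } catch (err) {
    if (state.ttsAborted) return 'barge-in';
    throw err;
  }
}

// Called when the VAD accepts a speech_start while the bot is talking.
function onBargeIn(state) {
  state.ttsAborted = true; // set the flag first so the catch block sees it
  if (state.llmAbort) state.llmAbort.abort();
}
```

Setting the flag before calling `abort()` matters: the rejection can be handled as soon as the signal fires, and the catch block needs to find the flag already set.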
The hard rule: where NOT to start transcription
This deserves triple underline. Audio capture (uuid_audio_fork) only starts on:
- bridge-agent-start (when the call reaches the human agent).
- Entering an IVR ai_agent node.
Never during IVR / DTMF menu / queue / hold music. Why?
- Cost: transcribing “Press 1 for X, 2 for Y” on every call multiplies STT spend by N.
- Privacy: the caller hasn’t consented yet (consent comes when reaching the agent or entering the bot).
- Compliance: GDPR Art. 6 — no clear legal basis to transcribe menu prompts.
It’s a hard rule, not an optimization. SIVO fails a CI test if the IVR codepath can start transcription.
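A check of that kind could be as small as the sketch below. The event shapes and the function name are hypothetical, since the article does not show SIVO's actual test. The point is only that the allowlist lives in one function and the test enumerates contexts that must never pass it.

```javascript
// Hypothetical event shapes; only these two triggers may start uuid_audio_fork.
function mayStartAudioFork(event) {
  if (event.type === 'bridge-agent-start') return true;
  if (event.type === 'ivr-node-enter' && event.nodeType === 'ai_agent') return true;
  return false;
}

// Contexts where transcription must never begin.
const forbiddenContexts = [
  { type: 'ivr-node-enter', nodeType: 'menu' },
  { type: 'dtmf' },
  { type: 'queue-enter' },
  { type: 'hold-start' },
];

// Fail loudly (and therefore fail CI) if any forbidden context gets through.
const leaks = forbiddenContexts.filter(mayStartAudioFork);
if (leaks.length > 0) {
  throw new Error(`Transcription could start on: ${JSON.stringify(leaks)}`);
}
```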
Production result
With Groq Llama 3.1 70B + Deepgram Nova-2 + ElevenLabs v2:
- LLM TTFT: ~120ms
- First TTS audio from last token arrival: ~80ms
- VAD detect → first bot audio: ~600ms (P50), 850ms (P95)
- Barge-in false positives: less than 2%
- Cost per minute of conversation: $0.012 (Groq) + $0.013 (Deepgram) + $0.18 (ElevenLabs v2) ≈ **$0.20/min**
For a “FAQ + lead qualification + transfer to human on doubt” use case it’s operationally solid.
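The cost figure is simple arithmetic over the three per-minute prices quoted above. The sketch below reproduces it and shows how much of the total is TTS. The object layout and function names are hypothetical, and the four-minute call is only an example input.

```javascript
// Per-minute prices in USD, taken from the figures quoted above.
const costPerMinuteUsd = { llm: 0.012, stt: 0.013, tts: 0.18 };

// Sum of all per-minute components.
function totalPerMinute(costs) {
  return Object.values(costs).reduce((sum, cost) => sum + cost, 0);
}

// Estimated cost of a call of the given length.
function callCostUsd(costs, minutes) {
  return totalPerMinute(costs) * minutes;
}

const perMinute = totalPerMinute(costPerMinuteUsd);
const ttsShare = costPerMinuteUsd.tts / perMinute;
console.log(perMinute.toFixed(3));                        // 0.205
console.log((ttsShare * 100).toFixed(0) + '%');           // 88%
console.log(callCostUsd(costPerMinuteUsd, 4).toFixed(2)); // 0.82
```

TTS accounts for most of the per-minute cost, so the choice of voice model moves the total far more than the choice of LLM or STT provider.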
What’s left
- Speaker separation in STT (used for human agent transcription, pending in AI agents for warm-transfer scenarios).
- Local Whisper on EU GPUs for customers with strict compliance requirements who can’t send audio to the USA.
- Token/cost metrics in Grafana broken down by AI agent (we have per-call metrics; aggregate cost dashboards are still missing).
→ If your team is building a voice agent, talk to us — we share the full benchmarks.