A usable conversational AI voice agent must hit three things at once:

Human-like latency (≤700 ms from end of utterance to first bot audio).
Real barge-in (caller interrupts the bot, bot aborts immediately).
Structured output to integrate with IVR, webhooks and CRMs.

This is the recipe that got us to ~600ms end-to-end in production.

General architecture

Caller audio
  ↓ mod_audio_fork (UDP → WebSocket, used one-way)
ai-voice service (Node.js + drachtio)
  ├→ VAD (client) → speech_start / speech_end
  ├→ Streaming STT (Deepgram / ElevenLabs Scribe / Whisper)
  ├→ Streaming LLM (OpenAI compatible: Groq, OpenAI, Cerebras...)
  └→ Streaming TTS (ElevenLabs v2/v3 WS, OpenAI TTS)
       ↓ PCM base64 chunks
HTTP POST → app backend
  ↓ writes .r8 to shared volume
ESL → uuid_execute playback to the channel

mod_audio_fork is unidirectional (FreeSWITCH only reads the channel audio). TTS playback takes a separate route: ai-voice POSTs the PCM to the backend, which writes it as an .r8 file and triggers playback via ESL. That adds a layer of indirection, but it removes the dependency on a less stable bidirectional module.
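
The backend half of that route can be sketched roughly as follows. This is a minimal illustration, not SIVO's code: the function name and its arguments are hypothetical, and it assumes the chunks are base64-encoded, headerless 16-bit mono PCM at 8 kHz (which is how FreeSWITCH is expected to read a .r8 file). The ESL call that triggers playback is left out.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: concatenates base64-encoded PCM chunks and writes
// them as a headerless .r8 file on the shared volume.
function writePlaybackFile(base64Chunks, dir, name) {
  const pcm = Buffer.concat(base64Chunks.map((c) => Buffer.from(c, 'base64')));
  const filePath = path.join(dir, `${name}.r8`);
  fs.writeFileSync(filePath, pcm);
  // Assumption: 16-bit mono at 8 kHz, so 2 bytes per sample, 8000 samples/s.
  const playMs = Math.round((pcm.length / 2 / 8000) * 1000);
  return { filePath, playMs };
}
```

Returning the playback duration alongside the path is a design choice for this sketch: the caller needs that value later to know how long the bot will be speaking.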

STT: why streaming, not batch

Whisper in batch mode yields better quality but adds 1.5-2 s of latency, so we discarded it for live agents. The serious streaming options:

Deepgram Nova-2/3: WebSocket auth via header, first partial in ~250ms. Best cost/quality for production.
ElevenLabs Scribe v2 Realtime: WebSocket auth via xi-api-key header (not query params). Shines in noisy environments and non-native accents.

Pitfall that cost us two days: at some point ElevenLabs switched auth from the ?api_key= query parameter to a header. If you still pass the key in the query string, you get auth_error with no clear hint.
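
To keep that pitfall from recurring, the auth headers can live in one place. The helper below is a hypothetical sketch: the function name and provider keys are made up for illustration, and the Deepgram format shown (Authorization: Token ...) is the one its API documents for header auth.

```javascript
// Hypothetical helper: returns the HTTP headers to send with the WebSocket
// upgrade request for each STT provider. Keys never go in the query string.
function sttAuthHeaders(provider, apiKey) {
  switch (provider) {
    case 'deepgram':
      return { Authorization: `Token ${apiKey}` };
    case 'elevenlabs':
      return { 'xi-api-key': apiKey };
    default:
      throw new Error(`Unknown STT provider: ${provider}`);
  }
}
```

With the ws package, the result would be passed as the headers option, for example new WebSocket(url, { headers: sttAuthHeaders('elevenlabs', key) }). The endpoint URL is deliberately left out here.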

LLM: the key decision is TTFT

Time-To-First-Token is the metric that decides perceived latency, not total throughput. Real production measurements:

Provider	TTFT	Tokens/s
OpenAI GPT-4o	667-2400ms	80
OpenAI GPT-4o-mini	350-800ms	120
Groq Llama 3.1 70B	~120ms	320
Cerebras Llama 3.1 70B	~150ms	450

Groq is our default for voice in production. Cerebras is the alternative if you need higher sustained throughput.
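
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing TTFT and throughput over a token stream. It is not the harness behind the numbers above: the function name is hypothetical, it accepts any async iterable of tokens, and throughput is computed over the tokens that arrive after the first one.

```javascript
// Measures time-to-first-token and throughput for any async iterable of tokens.
// The clock is injectable so the function can be tested deterministically.
async function measureStream(tokenStream, now = () => performance.now()) {
  const start = now();
  let firstTokenAt = null;
  let count = 0;
  for await (const _token of tokenStream) {
    if (firstTokenAt === null) firstTokenAt = now();
    count += 1;
  }
  const end = now();
  if (firstTokenAt === null) return { ttftMs: null, tokensPerSec: 0, count: 0 };
  const genSec = (end - firstTokenAt) / 1000;
  return {
    ttftMs: firstTokenAt - start,
    // Tokens after the first, divided by the time spent generating them.
    tokensPerSec: genSec > 0 ? (count - 1) / genSec : 0,
    count,
  };
}
```

Start the timer right before the request is sent, so that network and queueing time count towards TTFT. That is the delay the caller actually hears.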

TTS: choice depends on whether you want audio tags

Two serious ElevenLabs models:

eleven_multilingual_v2: WebSocket streaming (stream-input endpoint). 30+ languages. No audio tags.
eleven_v3: HTTP streaming only (WS returns 403). Supports [laughs], [sighs], [whispers]. Superior quality.

System prompt is conditional on TTS model:

// ttsModel and tagRule are illustrative names; ttsModel holds the configured model ID
const tagRule = ttsModel.includes('v3')
  ? 'You may use [laughs], [sighs], [whispers] for natural inflection.'
  : "Don't use brackets. Express emotion through punctuation: '...!?'";

Without this, v2 ends up literally reading “open bracket laughs close bracket” aloud. Embarrassing.

Sentence-level pipelining

The trick that shaves ~40% off latency: don’t wait for the LLM’s full response before starting TTS. SIVO uses a JsonResponseExtractor that parses the JSON incrementally as it streams from the LLM and fires TTS upon detecting .!?\n inside the response field:

{
  "response": "Got it, can you confirm your ID number?|       ← sentence boundary
  "action": "continue",
  "variables": { ... }
}

While sentence 1 plays, sentence 2 is being synthesized. While sentence 2 plays, sentence 3 is being generated by the LLM. We overlap the three stages.
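
The incremental extraction step can be sketched as below. This is a hypothetical stand-in, not SIVO's JsonResponseExtractor: the class name is invented, it assumes the field is called "response" and appears once, and it deliberately uses a slightly stricter boundary rule (a newline, or whitespace right after . ! ?) so that "3.5" and "..." do not split a sentence.

```javascript
// Feed raw LLM chunks in; get back the sentences of the "response" field
// as soon as each one is complete.
class ResponseSentenceExtractor {
  constructor() {
    this.state = 'seek'; // 'seek' -> 'in' -> 'done'
    this.raw = '';       // raw JSON text not yet consumed
    this.sentence = '';  // decoded text of the sentence being built
  }

  push(chunk) {
    const out = [];
    if (this.state === 'done') return out;
    this.raw += chunk;
    if (this.state === 'seek') {
      const m = /"response"\s*:\s*"/.exec(this.raw);
      if (!m) return out;
      this.raw = this.raw.slice(m.index + m[0].length);
      this.state = 'in';
    }
    let i = 0;
    while (i < this.raw.length) {
      let ch = this.raw[i];
      if (ch === '"') { // unescaped quote: the field is complete
        this.flush(out);
        this.state = 'done';
        break;
      }
      if (ch === '\\') {
        const esc = this.raw[i + 1];
        if (esc === undefined) break; // escape split across chunks, wait
        if (esc === 'u') {
          if (i + 6 > this.raw.length) break; // \uXXXX split across chunks
          ch = String.fromCharCode(parseInt(this.raw.slice(i + 2, i + 6), 16));
          i += 6;
        } else {
          ch = { n: '\n', t: '\t', r: '\r', b: '\b', f: '\f' }[esc] ?? esc;
          i += 2;
        }
      } else {
        i += 1;
      }
      // A sentence closes at a newline, or at whitespace following . ! or ?
      if (ch === '\n' || (/\s/.test(ch) && /[.!?]$/.test(this.sentence))) {
        this.flush(out);
      } else {
        this.sentence += ch;
      }
    }
    this.raw = this.state === 'done' ? '' : this.raw.slice(i);
    return out;
  }

  flush(out) {
    const s = this.sentence.trim();
    if (s) out.push(s);
    this.sentence = '';
  }
}
```

Each sentence returned by push() would be handed to TTS immediately, while the LLM keeps streaming the rest of the JSON.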

Barge-in: the cooldown detail

Naive barge-in implementation:

VAD detects speech_start → abort in-flight TTS + LLM

Problem: the bot’s own playback feeds back into the caller’s mic (especially over mobile speakers). VAD detects speech_start, aborts the bot, the bot processes “its own echo” as an utterance, replies “I didn’t catch that”… infinite loop.

Three-layer mitigation:

greetingPlaying flag during the first 150ms after uuid_audio_fork start (settling time).
bargeInCooldownUntil: after each playback, ignore speech_start for playMs + 300ms.
speechMinMs (400): ignore utterances shorter than 400 ms (taps, sighs, ambient noise).

After these three, false positives drop from ~30% to less than 2%.
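
The three layers can be combined into a single gate, sketched below. The class and method names are hypothetical; only greetingPlaying, bargeInCooldownUntil and speechMinMs come from the list above. It reads the cooldown as starting when playback is triggered and lasting playMs + 300 ms, which is one interpretation of the rule.

```javascript
// Hypothetical gate combining the three layers. All times are in milliseconds;
// the clock is injectable so the logic can be tested without real delays.
class BargeInGate {
  constructor({ settleMs = 150, cooldownMs = 300, speechMinMs = 400, now = Date.now } = {}) {
    Object.assign(this, { settleMs, cooldownMs, speechMinMs, now });
    this.greetingUntil = 0;
    this.bargeInCooldownUntil = 0;
  }

  // Layer 1: call when uuid_audio_fork starts.
  onForkStart() {
    this.greetingUntil = this.now() + this.settleMs;
  }

  // Layer 2: call when a playback lasting playMs is triggered.
  onPlayback(playMs) {
    this.bargeInCooldownUntil = this.now() + playMs + this.cooldownMs;
  }

  get greetingPlaying() {
    return this.now() < this.greetingUntil;
  }

  // Should a speech_start event abort the bot right now?
  shouldAbortOnSpeechStart() {
    return !this.greetingPlaying && this.now() >= this.bargeInCooldownUntil;
  }

  // Layer 3: is a finished utterance long enough to send to the LLM?
  acceptUtterance(durationMs) {
    return durationMs >= this.speechMinMs;
  }
}
```

Keeping all three checks in one object makes it easy to log which layer rejected an event, which helps when tuning the thresholds.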

Handling LLM abort on barge-in

When TTS aborts, the LLM has to abort too (cancel the fetch with AbortController). Important: don’t treat that abort as a fatal error. The catch must check state.ttsAborted:

try {
  await streamLLM(...);
} catch (err) {
  if (state.ttsAborted) {
    logger.debug('LLM aborted by barge-in (expected)');
    return;
  }
  logger.error('LLM real failure', err);
}

Without this, the log fills with “errors” that are normal barge-ins.
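
One way to wire the AbortController to that catch is sketched below. The startTurn function and its return shape are hypothetical; streamLLM is assumed to accept an AbortSignal and reject when the signal fires, which is how fetch behaves.

```javascript
// Hypothetical turn runner. streamLLM(signal) must reject when the signal aborts.
function startTurn(streamLLM, logger = console) {
  const controller = new AbortController();
  const state = { ttsAborted: false };

  const done = (async () => {
    try {
      await streamLLM(controller.signal);
      return 'completed';
    } catch (err) {
      if (state.ttsAborted) {
        logger.debug('LLM aborted by barge-in (expected)');
        return 'aborted';
      }
      logger.error('LLM real failure', err);
      return 'failed';
    }
  })();

  // Called by the barge-in handler: mark the abort as intentional first,
  // then cancel the in-flight request.
  const bargeIn = () => {
    state.ttsAborted = true;
    controller.abort();
  };

  return { done, bargeIn };
}
```

The order inside bargeIn matters: the flag is set before abort() so the catch always sees it by the time the rejection arrives.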

The hard rule: where NOT to start transcription

This deserves triple underline. Audio capture (uuid_audio_fork) only starts on:

bridge-agent-start (when the call reaches the human agent).
Entering an IVR ai_agent node.

Never during IVR / DTMF menu / queue / hold music. Why?

Cost: transcribing “Press 1 for X, 2 for Y” on every call multiplies STT spend by N.
Privacy: the caller hasn’t consented yet (consent comes when reaching the agent or entering the bot).
Compliance: GDPR Art. 6 — no clear legal basis to transcribe menu prompts.

It’s a hard rule, not an optimization. In SIVO, a CI test fails if the IVR codepath can start transcription.
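
A guard of this kind can be as small as an allowlist that every capture start must pass through. The sketch below is illustrative only: the function names are invented, and apart from bridge-agent-start the event names are hypothetical placeholders.

```javascript
// Hypothetical allowlist: the only call events allowed to start uuid_audio_fork.
const CAPTURE_TRIGGERS = new Set(['bridge-agent-start', 'ivr-ai-agent-enter']);

function canStartTranscription(eventName) {
  return CAPTURE_TRIGGERS.has(eventName);
}

// Single entry point for starting capture. startFork is whatever function
// actually issues uuid_audio_fork; it is only called for allowed events.
function startTranscription(eventName, startFork) {
  if (!canStartTranscription(eventName)) {
    throw new Error(`Transcription must not start on "${eventName}"`);
  }
  return startFork();
}
```

A CI test can then walk every event the IVR codepath emits and assert that startTranscription throws for each one outside the allowlist.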

Production result

With Groq Llama 3.1 70B + Deepgram Nova-2 + ElevenLabs v2:

LLM TTFT: ~120ms
First TTS audio from last token arrival: ~80ms
VAD detect → first bot audio: ~600ms (P50), 850ms (P95)
Barge-in false positives: less than 2%
Cost per conversation minute: $0.012 (Groq) + $0.013 (Deepgram) + $0.18 (ElevenLabs v2) = $0.205/min, roughly $0.20

For a “FAQ + lead qualification + transfer to human on doubt” use case it’s operationally solid.

What’s left

Speaker separation in STT (used for human agent transcription, pending in AI agents for warm-transfer scenarios).
Local Whisper on EU GPUs for customers with strict compliance requirements who can’t send audio to the US.
Aggregate token/cost metrics broken down by AI agent (we have per-call metrics; the aggregate dashboards are still missing).

→ If your team is building a voice agent, talk to us — we share the full benchmarks.

Written by

Iván Jerez

SIVO team.

AI voice agents in your PBX: streaming pipeline under 600ms