Voice Agent Architecture: Realtime API and On-Device ASR/TTS
A systematic deep dive into Voice Agent four-layer architecture, latency optimization (streaming ASR, streaming TTS, prompt caching), echo cancellation and interruption handling, on-device ASR/TTS deployment (whisper.cpp, piper, llama.cpp), and a comparison of OpenAI Realtime API, LiveKit Agents, and Pipecat frameworks.
Voice Agents are one of the hottest AI application tracks in 2024-2025. The maturity of frameworks like OpenAI Realtime API, LiveKit Agents, and Pipecat has moved "talking AI assistants" from demo to product. But Voice Agents are far more complex than text Agents -- latency, echo, interruption, streaming synthesis, network jitter, on-device deployment -- any of these can break the user experience. This article provides a production-engineering deep dive into the layered architecture, latency optimization, echo/interruption handling, and on-device deployment strategies for Voice Agents.
Capability Layers
A complete Voice Agent has four layers:
Layer 1: Audio I/O
- Microphone capture (VAD to detect speech start/end)
- Speaker playback (TTS streaming synthesis)
- Audio format conversion (PCM/Opus/AAC)
- Audio preprocessing: echo cancellation (AEC), noise suppression (NS), automatic gain control (AGC)
Layer 2: ASR (Automatic Speech Recognition)
- Real-time streaming recognition
- Multi-language switching
- Speaker diarization
- Wake word detection
Layer 3: Agent Reasoning
- LLM understanding and decision
- Tool calling
- Memory and context
Layer 4: TTS (Text-to-Speech)
- Streaming synthesis
- Emotion and tone control
- Multi-voice switching
- Real-time speed adjustment
All four layers must be designed around "low latency plus streaming"; a stall in any layer makes the whole interaction feel unnatural.
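To make the "streaming between layers" idea concrete, here is a minimal sketch that chains the layers with `asyncio.Queue` objects. The functions `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins rather than real engines. The only point is that each layer forwards an item as soon as it has one, instead of waiting for the previous layer to finish the whole utterance.

```python
import asyncio

END = object()  # sentinel marking the end of the stream


async def stage(inbox, outbox, transform):
    """Read items from inbox, transform each one, and forward it immediately."""
    while True:
        item = await inbox.get()
        if item is END:
            await outbox.put(END)
            return
        await outbox.put(transform(item))


# Hypothetical stand-ins for real ASR / LLM / TTS engines.
def fake_asr(frame: bytes) -> str:
    return frame.decode()


def fake_llm(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return f"<audio:{text}>".encode()


async def run_pipeline(frames):
    mic, asr_out, llm_out, speaker = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(mic, asr_out, fake_asr)),
        asyncio.create_task(stage(asr_out, llm_out, fake_llm)),
        asyncio.create_task(stage(llm_out, speaker, fake_tts)),
    ]
    for frame in frames:
        await mic.put(frame)
    await mic.put(END)

    played = []
    while (chunk := await speaker.get()) is not END:
        played.append(chunk)
    await asyncio.gather(*tasks)
    return played
```

In a real agent each `transform` would be an async call into an engine, but the queue-per-layer shape stays the same.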
Latency: The Lifeline
Latency standards (human ear sensitivity):
- < 200ms: seamless, feels like talking to a person
- 200-500ms: acceptable, but with a "machine" feel
- 500-1000ms: clearly choppy, breaks conversation flow
- > 1000ms: unusable
Typical latency breakdown:
| Stage | Latency |
|---|---|
| Microphone to ASR | 100-300ms |
| ASR to LLM | 50-200ms |
| LLM first token (streaming) | 200-500ms |
| TTS streaming synthesis | 100-300ms |
| TTS to speaker | 50-100ms |
| Total | 500-1400ms |
Optimization target: P50 latency below 600ms, P95 below 1200ms.
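The arithmetic behind the table and the P50/P95 target can be checked with a short script. This is a sketch: the stage ranges are copied from the table above, while the helper names `total_budget` and `meets_target` are made up for illustration.

```python
import statistics

# Per-stage latency ranges in milliseconds, copied from the table above.
STAGES_MS = {
    "mic_to_asr": (100, 300),
    "asr_to_llm": (50, 200),
    "llm_first_token": (200, 500),
    "tts_synthesis": (100, 300),
    "tts_to_speaker": (50, 100),
}


def total_budget(stages):
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def meets_target(samples_ms, p50_target=600, p95_target=1200):
    """Check measured end-to-end latencies against the P50/P95 targets."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    return p50 < p50_target and p95 < p95_target, p50, p95
```

`total_budget(STAGES_MS)` reproduces the 500-1400ms total row. `meets_target` expects a list of measured end-to-end latencies collected from real conversations.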
OpenAI Realtime API Architecture
OpenAI released the Realtime API in late 2024; it is currently the easiest-to-use Voice Agent interface:
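import asyncio
import base64
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

# query_order_tool, send_email_tool, microphone_stream, play_audio and
# execute_tool are application-supplied placeholders, not SDK functions.


async def voice_agent():
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as conn:
        # Configure the session: modalities, voice, tools, server-side VAD.
        await conn.session.update(session={
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "instructions": "You are a customer service assistant.",
            "tools": [query_order_tool, send_email_tool],
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
            },
        })

        async def send_mic():
            # The API expects base64-encoded audio; this assumes
            # microphone_stream() yields raw PCM16 bytes.
            async for audio_chunk in microphone_stream():
                await conn.input_audio_buffer.append(
                    audio=base64.b64encode(audio_chunk).decode("ascii")
                )

        async def receive_response():
            async for event in conn:
                if event.type == "response.audio.delta":
                    # Audio deltas arrive base64-encoded.
                    play_audio(base64.b64decode(event.delta))
                elif event.type == "response.function_call_arguments.done":
                    result = await execute_tool(event)
                    await conn.conversation.item.create(
                        item={
                            "type": "function_call_output",
                            "call_id": event.call_id,
                            "output": json.dumps(result),
                        }
                    )
                    # Ask the model to continue now that the tool result is available.
                    await conn.response.create()

        await asyncio.gather(send_mic(), receive_response())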
Realtime API strengths:
- High integration: ASR + LLM + TTS in a single API
- Server-side VAD: auto-detect speech end without client logic
- Streaming response: audio is returned token by token, low latency
- Built-in tool calling: Function Calling supported
Realtime API limitations:
- Network dependent: must connect to OpenAI, no offline support
- High cost: billed on audio tokens (which scale with audio duration), 10-100x more expensive than text
- Latency bound: affected by network round-trip, best case around 500-800ms
- Hard to self-host: audio data must be sent to OpenAI
LiveKit Agents Framework
LiveKit is an open-source real-time communication framework; LiveKit Agents is its Agent framework:
from livekit import agents
from livekit.agents import Agent, AgentSession
# STT/LLM/TTS/VAD implementations ship as separate plugin packages
# (livekit-plugins-openai, livekit-plugins-silero).
from livekit.plugins import openai, silero


class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a smart assistant")

    async def on_enter(self):
        # Greet the user as soon as the agent joins the session.
        self.session.say("Hi, how can I help?")


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(voice="alloy"),
        vad=silero.VAD.load(),
    )
    await session.start(
        room=ctx.room,
        agent=MyAgent(),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
LiveKit strengths:
- Fully open source: deployable on private infrastructure
- WebRTC transport: low latency, network-jitter resilient
- Pluggable components: STT/LLM/TTS/VAD replaceable independently
- Multi-Agent rooms: supports multi-party multi-Agent collaboration
Pipecat Framework
Pipecat is Daily.co's Voice Agent framework, emphasizing the "pipeline" pattern:
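import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
# Service module paths have moved between Pipecat releases; these follow the
# pipecat.services.openai.* layout and may need adjusting for your version.
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.services.openai.tts import OpenAITTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)


async def run_bot(websocket):
    # `websocket` is the connection accepted by your FastAPI endpoint.
    # A frame serializer is usually configured in the params as well.
    transport = FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    api_key = os.environ["OPENAI_API_KEY"]
    stt = OpenAISTTService(api_key=api_key)
    llm = OpenAILLMService(api_key=api_key, model="gpt-4o")
    tts = OpenAITTSService(api_key=api_key, voice="alloy")

    # Frames flow through the processors in list order.
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    # The runner executes a PipelineTask, not the Pipeline itself.
    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))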
Pipecat strengths:
- Pipeline pattern: clear data flow, easy to debug
- Daily.co integration: built-in WebRTC transport
- Rich services: 50+ pre-integrated STT/TTS/LLM services
- Real-time visualization: pipeline debugger
On-Device ASR/TTS Deployment
For privacy-sensitive scenarios (healthcare, legal, internal enterprise), on-device deployment is mandatory:
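import wave

from llama_cpp import Llama
from piper import PiperVoice
from pywhispercpp.model import Model

# ASR: whisper.cpp through the pywhispercpp bindings.
model = Model("base.en", n_threads=4)
segments = model.transcribe("audio.wav")
for segment in segments:
    print(segment.text)

# TTS: piper writes into a wave.Wave_write object rather than a file path.
# (Recent piper releases name this method synthesize_wav.)
voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello, world!", wav_file)

# LLM: llama.cpp through llama-cpp-python, using a local GGUF file.
llm = Llama(model_path="llama-3.1-8b-instruct.gguf")
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hi"}],
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])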
On-device tradeoffs:
| Dimension | Cloud | On-device |
|---|---|---|
| Latency | 500-1000ms | 200-500ms |
| Cost | Pay per use | One-time hardware |
| Privacy | Data uploaded | Data local |
| Voice quality | High (commercial) | Medium (open source) |
| Maintenance | Zero | High (model updates) |
Typical on-device config:
- Hardware: Apple Silicon M2 or newer, Qualcomm Snapdragon 8 Gen 2 or newer, Intel 12th Gen or newer
- ASR: whisper.cpp base model, ~1GB memory
- TTS: piper, ~100MB memory
- LLM: Qwen 2.5 3B / Llama 3.1 8B, ~5GB memory
- Total memory: ~6-8GB
Echo and Interruption: UX Challenges
Problem 1: TTS playback captures itself (Echo)
When the Agent plays TTS, the microphone picks it up again, and ASR treats it as "user continues speaking."
Solutions:
- AEC (Acoustic Echo Cancellation): hardware or software echo cancellation
- Mute mic during playback: pause recording while TTS plays
- Higher VAD threshold: raise VAD threshold during playback
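class EchoCancellation:
    """Half-duplex gate: pause STT while TTS is playing.

    This mutes recognition rather than cancelling echo, so it also blocks
    barge-in; true AEC is needed for full-duplex conversation.
    """

    def __init__(self, stt):
        self.stt = stt  # STT client assumed to expose async pause()/resume()
        self.is_playing = False

    async def on_tts_start(self):
        self.is_playing = True
        await self.stt.pause()

    async def on_tts_end(self):
        self.is_playing = False
        await self.stt.resume()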
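The third option in the list above, raising the VAD threshold during playback, keeps the microphone open so that a loud user can still be heard over residual echo. The sketch below uses a plain RMS energy gate on 16-bit mono PCM frames. The class name `PlaybackAwareVAD` and both threshold values are illustrative assumptions, and a production system would more likely use a trained VAD model than raw energy.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class PlaybackAwareVAD:
    """Energy gate whose threshold rises while the agent is speaking."""

    def __init__(self, idle_threshold=500.0, playback_threshold=3000.0):
        # Both thresholds are illustrative values, not tuned constants.
        self.idle_threshold = idle_threshold
        self.playback_threshold = playback_threshold
        self.is_playing = False

    def is_speech(self, frame: bytes) -> bool:
        threshold = self.playback_threshold if self.is_playing else self.idle_threshold
        return rms(frame) > threshold
```

Set `is_playing` from the same TTS start/end callbacks used for muting; quiet echo is then ignored during playback while clearly louder speech still passes.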
Problem 2: User interrupts Agent
User experience: while the Agent is speaking, the user wants to interject.
Solutions:
- Barge-in detection: stop TTS immediately when new voice detected
- VAD duration: treat 300ms of continuous voice as interruption
- Priority switching: user voice takes priority over Agent voice
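class BargeInHandler:
    def __init__(self, vad, tts):
        self.vad = vad
        self.tts = tts  # TTS player assumed to expose async interrupt()

    async def on_voice_detected(self, confidence: float):
        # Stop playback only on a confident detection to avoid false triggers.
        if confidence > 0.7:
            await self.tts.interrupt()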
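The "300ms of continuous voice" rule from the list above can be sketched as a small debouncer that counts consecutive speech frames. The class name `DebouncedBargeIn` and the 20ms frame size are assumptions for illustration; the per-frame speech decision would come from whatever VAD is in use.

```python
class DebouncedBargeIn:
    """Signal an interruption only after a sustained run of speech frames."""

    def __init__(self, frame_ms=20, min_speech_ms=300):
        self.frame_ms = frame_ms
        self.min_speech_ms = min_speech_ms
        self.speech_ms = 0
        self.fired = False

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision per frame; return True once when TTS should stop."""
        if not is_speech:
            # Any gap resets the run, so coughs and clicks do not accumulate.
            self.speech_ms = 0
            self.fired = False
            return False
        self.speech_ms += self.frame_ms
        if self.speech_ms >= self.min_speech_ms and not self.fired:
            self.fired = True
            return True
        return False
```

With 20ms frames the handler fires on the 15th consecutive speech frame. A shorter `min_speech_ms` makes interruption feel snappier at the cost of more false triggers.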
Performance Optimization
1. Streaming ASR
Do not wait for the user to finish speaking before starting recognition. Streaming ASR starts understanding mid-speech:
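from livekit.plugins import openai

# interim_results is kept as given in the source; whether this option exists
# depends on the STT plugin, so check the signature of the one you use.
stt = openai.STT(
    model="whisper-1",
    interim_results=True,
)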
Streaming ASR can cut recognition latency from 1-2s to 200-500ms.
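Streaming recognizers typically consume short, fixed-size audio frames rather than whole recordings, while capture callbacks deliver buffers of arbitrary size. A minimal sketch of the re-chunking step is shown below. The format (16 kHz, 16-bit mono, 20ms frames) and the class name `FrameChunker` are assumptions, not requirements of any particular engine.

```python
class FrameChunker:
    """Re-slice arbitrary-size PCM buffers into fixed-duration frames."""

    def __init__(self, sample_rate=16000, frame_ms=20, bytes_per_sample=2):
        # 16000 samples/s * 0.020 s * 2 bytes = 640 bytes per frame by default.
        self.frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
        self._buffer = bytearray()

    def feed(self, data: bytes) -> list[bytes]:
        """Add captured audio and return every complete frame now available."""
        self._buffer.extend(data)
        frames = []
        while len(self._buffer) >= self.frame_bytes:
            frames.append(bytes(self._buffer[: self.frame_bytes]))
            del self._buffer[: self.frame_bytes]
        return frames
```

Each returned frame can be sent to the recognizer immediately, so recognition proceeds while the user is still speaking; leftover bytes wait in the buffer for the next call.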
2. LLM First Token Latency
LLM TTFT (Time To First Token) directly determines Agent "reaction speed":
- Use smaller models: 3B / 7B models are 5-10x faster than 70B
- Prompt caching: cache identical system prompts, avoid recomputation
- Speculative decoding: a small draft model proposes tokens that the large model then verifies in parallel
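Before choosing among these options it helps to measure TTFT directly. The sketch below times the first item from any async token stream. `fake_llm_stream` is a hypothetical stand-in for a real streaming client, used here only to make the example self-contained.

```python
import asyncio
import time


async def measure_ttft(token_stream):
    """Return (seconds until the first token, full text) for an async token stream."""
    start = time.perf_counter()
    ttft = None
    parts = []
    async for token in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(token)
    return ttft, "".join(parts)


async def fake_llm_stream(first_token_delay=0.05):
    """Hypothetical stand-in for a streaming LLM client."""
    await asyncio.sleep(first_token_delay)
    for token in ["Hello", ",", " world"]:
        yield token
```

Running `asyncio.run(measure_ttft(fake_llm_stream()))` returns the delay before the first token together with the assembled text. Wrapping the real client's stream the same way gives a baseline to compare model sizes and caching settings against.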
3. TTS Streaming Synthesis
TTS should not wait for the whole sentence. Stream while generating:
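from livekit.plugins import openai

# streaming=True is kept as given in the source; whether this option exists
# depends on the TTS plugin, so check the signature of the one you use.
tts = openai.TTS(
    model="tts-1",
    voice="alloy",
    streaming=True,
)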
Streaming TTS can cut first-byte latency from 500ms to 100ms.
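One common way to "stream while generating" is to forward LLM output to TTS at clause boundaries instead of waiting for the full reply. The sketch below buffers tokens and flushes at punctuation once a minimum length is reached. The class name `ClauseChunker`, the punctuation set, and the `min_chars` default are illustrative assumptions, and the splitter is deliberately naive (it would also split inside a decimal such as "3.5").

```python
import re

# Punctuation treated as a safe place to hand text to TTS.
_BOUNDARY = re.compile(r"[.!?;:,]")


class ClauseChunker:
    """Group streamed LLM tokens into clause-sized pieces for TTS."""

    def __init__(self, min_chars=12):
        self.min_chars = min_chars  # avoid synthesizing tiny fragments
        self._buffer = ""

    def feed(self, token: str) -> list[str]:
        """Add one token and return any chunks that are ready to synthesize."""
        self._buffer += token
        chunks = []
        while True:
            match = next(
                (m for m in _BOUNDARY.finditer(self._buffer) if m.end() >= self.min_chars),
                None,
            )
            if match is None:
                return chunks
            chunks.append(self._buffer[: match.end()].strip())
            self._buffer = self._buffer[match.end():]

    def flush(self) -> list[str]:
        """Return whatever text remains once the LLM stream ends."""
        rest, self._buffer = self._buffer.strip(), ""
        return [rest] if rest else []
```

Each chunk returned by `feed` can be sent to the TTS engine right away, so audio for the first clause starts playing while the LLM is still producing the rest.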
Failure Modes and Handling
| Failure mode | Symptom | Handling |
|---|---|---|
| Network jitter | Audio choppy | WebRTC plus adaptive bitrate |
| ASR misrecognition | Text error | Keyword confidence filter |
| LLM timeout | Long silence | Shorten prompt, retry |
| TTS failure | Silent | Fall back to pre-recorded audio |
| Interruption failure | Poor UX | Shorten the VAD trigger window, speed up TTS cancellation |
| Multi-speaker confusion | Wrong recognition | Speaker ID plus keyword lock |
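The "LLM timeout" row can be sketched as a timeout wrapper that retries and then falls back to a short filler phrase rather than leaving the user in silence. The helper name `reply_with_timeout`, the filler text, and the default timeout are assumptions for illustration; `generate` stands for any async function that returns the reply text.

```python
import asyncio

FALLBACK_TEXT = "Sorry, give me a moment."  # illustrative filler phrase


async def reply_with_timeout(generate, prompt, timeout_s=2.0, retries=1):
    """Call generate(prompt); retry on timeout, then fall back to a canned phrase."""
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(generate(prompt), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # the timed-out call is cancelled by wait_for
    return FALLBACK_TEXT
```

The same shape works for the "TTS failure" row: wrap the synthesis call and return a pre-recorded clip instead of a text phrase.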
Implementation Path
- Week 1: Pick a framework (OpenAI Realtime API / LiveKit / Pipecat), get a demo running.
- Week 2: Implement a basic Voice Agent (STT -> LLM -> TTS), measure latency baseline.
- Week 3: Optimize latency (streaming ASR, streaming TTS, prompt cache).
- Week 4: Implement echo cancellation and interruption.
- Week 5: Build on-device ASR/TTS solution (if privacy required).
- Week 6: Performance monitoring (latency distribution, error rate, user satisfaction).
Summary
The core challenges of Voice Agents are latency and UX. The gap between a 500ms and a 1500ms response is the difference between "fluid conversation" and "obvious choppiness." OpenAI Realtime API suits quick validation; LiveKit Agents suits private deployment; Pipecat suits complex pipeline scenarios.
Echo and interruption are UX pain points: mute the mic during TTS playback, stop TTS immediately on new voice detection, shorten VAD time. These details determine whether users will keep using it.
On-device ASR/TTS is mandatory for privacy-sensitive scenarios, but requires trading off voice quality and hardware cost.
Reference tools: LiveKit Agents (open-source real-time Agent framework), Pipecat (Daily.co's pipeline framework), OpenAI Realtime Agents (OpenAI's official Realtime examples), ElevenLabs Python (high-quality TTS), and Microsoft VibeVoice (Microsoft's voice tooling) cover the core nodes of the Voice Agent toolchain.
Projects in this article
LiveKit Agents
11.2k ⭐ LiveKit Agents is LiveKit's real-time voice and multimodal agent framework for phone, assistant, and interactive use cases that need low-latency experiences.
Pipecat
13.1k ⭐ Pipecat is an open-source framework for voice and multimodal conversational AI, enabling real-time voice assistants, video bots, and multimodal agents with integrated TTS, STT, and LLM services.
OpenAI Realtime Agents
6.9k ⭐ A demonstration of advanced agentic patterns built on top of OpenAI's Realtime API, showcasing real-time voice interaction and multi-agent collaboration.
ElevenLabs Python SDK
3.0k ⭐ Official Python SDK for ElevenLabs voice AI services: text-to-speech, voice cloning, real-time streaming, and Conversational AI agents.
VibeVoice
49.8k ⭐ Open-source frontier voice AI from Microsoft, providing high-quality speech synthesis and recognition for building real-time conversational voice agent applications.