Voice Agent Architecture: Realtime API and On-Device ASR/TTS

Voice Agents are one of the hottest AI application tracks in 2024-2025. The maturity of frameworks like OpenAI Realtime API, LiveKit Agents, and Pipecat has moved "talking AI assistants" from demo to product. But Voice Agents are far more complex than text Agents -- latency, echo, interruption, streaming synthesis, network jitter, on-device deployment -- any of these can break the user experience. This article provides a production-engineering deep dive into the layered architecture, latency optimization, echo/interruption handling, and on-device deployment strategies for Voice Agents.

Capability Layers

A complete Voice Agent has four layers:

Layer 1: Audio I/O

Microphone capture (VAD to detect speech start/end)
Speaker playback (TTS streaming synthesis)
Audio format conversion (PCM/Opus/AAC)
Signal cleanup: echo cancellation (AEC), noise suppression (NS), automatic gain control (AGC)

Layer 2: ASR (Automatic Speech Recognition)

Real-time streaming recognition
Multi-language switching
Speaker diarization
Wake word detection

Layer 3: Agent Reasoning

LLM understanding and decision
Tool calling
Memory and context

Layer 4: TTS (Text-to-Speech)

Streaming synthesis
Emotion and tone control
Multi-voice switching
Real-time speed adjustment

All four layers must be designed around "low latency plus streaming"; a stall in any layer makes the whole interaction feel unnatural.
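
To make the streaming requirement concrete, here is a minimal sketch of the four layers as chained async generators. Every stage (`mic_frames`, `asr`, `agent`, `tts`) is a hypothetical stub rather than a real engine, and strings stand in for audio. The point is only that each layer hands partial output downstream as soon as it has some.

```python
import asyncio
from typing import AsyncIterator


async def mic_frames() -> AsyncIterator[str]:
    """Layer 1 stub: yield captured frames (words stand in for PCM audio)."""
    for word in ["what", "is", "my", "order", "status"]:
        await asyncio.sleep(0)  # yield control, as real audio I/O would
        yield word


async def asr(frames: AsyncIterator[str]) -> AsyncIterator[str]:
    """Layer 2 stub: emit a growing partial transcript after every frame."""
    words: list[str] = []
    async for frame in frames:
        words.append(frame)
        yield " ".join(words)


async def agent(partials: AsyncIterator[str]) -> AsyncIterator[str]:
    """Layer 3 stub: keep the latest transcript, then stream reply tokens."""
    final = ""
    async for partial in partials:
        final = partial
    for token in ["Checking", "on:", final]:
        yield token


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Layer 4 stub: 'synthesize' each token as soon as it arrives."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_pipeline() -> list[bytes]:
    """Chain the four layers and collect the resulting audio chunks."""
    return [chunk async for chunk in tts(agent(asr(mic_frames())))]
```

Calling `asyncio.run(run_pipeline())` drives the whole chain. In a real system each stub would be replaced by an engine that yields at a similar granularity.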

Latency: The Lifeline

Latency standards (human ear sensitivity):

< 200ms: seamless, feels like talking to a person
200-500ms: acceptable, but with a "machine" feel
500-1000ms: clearly choppy, breaks conversation flow
> 1000ms: unusable

Typical latency breakdown:

Stage	Latency
Microphone to ASR	100-300ms
ASR to LLM	50-200ms
LLM first token (streaming)	200-500ms
TTS streaming synthesis	100-300ms
TTS to speaker	50-100ms
Total	500-1400ms
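
The total row can be checked by summing the per-stage ranges. The sketch below does that and maps a latency onto the perceptual bands listed earlier. The stage keys are illustrative names, and placing exactly 500ms in the "choppy" band is an assumption, since the bands share their boundary values.

```python
# Per-stage latency in milliseconds as (best case, worst case),
# copied from the breakdown table.
STAGES = {
    "mic_to_asr": (100, 300),
    "asr_to_llm": (50, 200),
    "llm_first_token": (200, 500),
    "tts_synthesis": (100, 300),
    "tts_to_speaker": (50, 100),
}


def total_budget(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def perceived_quality(latency_ms: float) -> str:
    """Map an end-to-end latency onto the perceptual bands."""
    if latency_ms < 200:
        return "seamless"
    if latency_ms < 500:
        return "acceptable"
    if latency_ms <= 1000:
        return "choppy"
    return "unusable"
```

Even the best case of an unoptimized pipeline sits at the edge of the "acceptable" band, which is why every stage has to be streamed and overlapped rather than run back to back.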

Optimization target: P50 latency below 600ms, P95 below 1200ms.
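
One way to check this target against measured data is to compute percentiles over per-turn latency samples. This is a minimal sketch using only the standard library; the function names and the choice of the "inclusive" quantile method are assumptions, not part of any framework.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """Return (P50, P95) of the samples using the inclusive quantile method."""
    # quantiles(n=100) returns 99 cut points; index 49 is P50, index 94 is P95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def meets_target(
    samples_ms: list[float],
    p50_max_ms: float = 600.0,
    p95_max_ms: float = 1200.0,
) -> bool:
    """True when both P50 and P95 are below their limits."""
    p50, p95 = latency_percentiles(samples_ms)
    return p50 < p50_max_ms and p95 < p95_max_ms
```

Tracking P95 alongside P50 matters because a small share of slow turns can dominate how users remember the conversation.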

OpenAI Realtime API Architecture

OpenAI released the Realtime API in beta in October 2024. It is arguably the easiest Voice Agent interface to get started with:

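import asyncio
import json

from openai import AsyncOpenAI

# query_order_tool, send_email_tool, microphone_stream, play_audio, and
# execute_tool are application-defined helpers that are not shown here.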

client = AsyncOpenAI()

async def voice_agent():
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as conn:
        await conn.session.update(session={
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "instructions": "You are a customer service assistant.",
            "tools": [query_order_tool, send_email_tool],
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
            },
        })
        
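        async def send_mic():
            # The API expects each chunk as base64-encoded audio.
            async for audio_chunk in microphone_stream():
                await conn.input_audio_buffer.append(audio=audio_chunk)
        
        async def receive_response():
            async for event in conn:
                if event.type == "response.audio.delta":
                    # event.delta is a base64-encoded audio chunk.
                    play_audio(event.delta)
                elif event.type == "response.function_call_arguments.done":
                    result = await execute_tool(event)
                    # Hand the tool result back to the conversation.
                    await conn.conversation.item.create(
                        item={
                            "type": "function_call_output",
                            "call_id": event.call_id,
                            "output": json.dumps(result),
                        }
                    )
                    # Ask the model to continue; it does not resume on its own.
                    await conn.response.create()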
        
        await asyncio.gather(send_mic(), receive_response())
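
The Realtime API carries audio as base64 strings in both directions, so the microphone and playback helpers need a conversion step. The sketch below assumes 16-bit little-endian PCM, which to my knowledge is the default format. The helper names are hypothetical and not part of the OpenAI SDK.

```python
import base64
import struct


def pcm16_to_b64(samples: list[int]) -> str:
    """Pack signed 16-bit samples as little-endian PCM and base64-encode them."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(raw).decode("ascii")


def b64_to_pcm16(payload: str) -> list[int]:
    """Decode a base64 payload back into signed 16-bit samples."""
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))
```

A microphone helper would call `pcm16_to_b64` on each captured chunk before appending it to the input buffer, and a playback helper would call `b64_to_pcm16` on each audio delta.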

Realtime API strengths:

High integration: ASR + LLM + TTS in a single API
Server-side VAD: auto-detect speech end without client logic
Streaming response: audio is returned in incremental chunks as it is generated, keeping latency low
Built-in tool calling: Function Calling supported

Realtime API limitations:

Network dependent: must connect to OpenAI, no offline support
High cost: billed per audio token (roughly proportional to audio duration), on the order of 10-100x more expensive than text
Latency floor: bounded by the network round trip, roughly 500-800ms at best
Hard to deploy privately: audio data must be sent to OpenAI

LiveKit Agents Framework

LiveKit is an open-source real-time communication framework; LiveKit Agents is its Agent framework:

from livekit import agents
from livekit.agents import Agent, AgentSession
# Provider integrations ship as separate plugin packages
# (livekit-plugins-openai, livekit-plugins-silero).
from livekit.plugins import openai, silero

class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a smart assistant")
    
    async def on_enter(self):
        # Greet the user as soon as the agent joins the session.
        self.session.say("Hi, how can I help?")

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(voice="alloy"),
        vad=silero.VAD.load(),
    )
    
    await session.start(
        room=ctx.room,
        agent=MyAgent(),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

LiveKit strengths:

Fully open source: deployable on private infrastructure
WebRTC transport: low latency, network-jitter resilient
Pluggable components: STT/LLM/TTS/VAD replaceable independently
Multi-Agent rooms: supports multi-party multi-Agent collaboration

Pipecat Framework

Pipecat is Daily.co's Voice Agent framework, emphasizing the "pipeline" pattern:

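import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
# Service module paths have moved between Pipecat releases; these match
# recent versions and may need adjusting for older ones.
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.services.openai.tts import OpenAITTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport

async def run_bot(websocket):
    # `websocket` is the FastAPI WebSocket accepted by the route handler.
    # In practice the transport also needs a params object, omitted here.
    transport = FastAPIWebsocketTransport(websocket=websocket)

    stt = OpenAISTTService(api_key=os.environ["OPENAI_API_KEY"])
    llm = OpenAILLMService(model="gpt-4o")
    tts = OpenAITTSService(voice="alloy")

    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    # The runner executes a PipelineTask, and run() is a coroutine.
    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))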

Pipecat strengths:

Pipeline pattern: clear data flow, easy to debug
Daily.co integration: built-in WebRTC transport
Rich services: 50+ pre-integrated STT/TTS/LLM services
Real-time visualization: pipeline debugger
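
The debugging benefit of the pipeline pattern is easy to see in a framework-free sketch. The code below is not Pipecat's API: `stt_stub`, `llm_stub`, `tts_stub`, `tap`, and `run_stages` are all hypothetical names, and strings stand in for frames. It only illustrates that a pass-through stage can be dropped between any two processors to inspect the data flowing there.

```python
from typing import Callable, Iterable, Iterator

Frame = str
Stage = Callable[[Iterator[Frame]], Iterator[Frame]]


def stt_stub(frames: Iterator[Frame]) -> Iterator[Frame]:
    """Stand-in for speech-to-text."""
    for frame in frames:
        yield f"text({frame})"


def llm_stub(frames: Iterator[Frame]) -> Iterator[Frame]:
    """Stand-in for the language model."""
    for frame in frames:
        yield f"reply({frame})"


def tts_stub(frames: Iterator[Frame]) -> Iterator[Frame]:
    """Stand-in for text-to-speech."""
    for frame in frames:
        yield f"audio({frame})"


def tap(name: str, log: list[str]) -> Stage:
    """Build a pass-through stage that records every frame it sees."""
    def stage(frames: Iterator[Frame]) -> Iterator[Frame]:
        for frame in frames:
            log.append(f"{name}: {frame}")
            yield frame
    return stage


def run_stages(stages: list[Stage], source: Iterable[Frame]) -> list[Frame]:
    """Feed the source through each stage in order and collect the output."""
    stream: Iterator[Frame] = iter(source)
    for stage in stages:
        stream = stage(stream)
    return list(stream)
```

For example, `run_stages([stt_stub, tap("after_stt", log), llm_stub, tts_stub], ["hi"])` leaves the intermediate transcript in `log` without changing the final output.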

On-Device ASR/TTS Deployment

For privacy-sensitive scenarios (healthcare, legal, internal enterprise), on-device deployment is mandatory:

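import wave

from llama_cpp import Llama
from piper import PiperVoice
from pywhispercpp.model import Model

# ASR: whisper.cpp through the pywhispercpp bindings.
model = Model("base.en", n_threads=4)
segments = model.transcribe("audio.wav")
for segment in segments:
    print(segment.text)

# TTS: Piper writes into an open wave file object, not a path string.
# (Newer Piper releases name this method synthesize_wav.)
voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello, world!", wav_file)

# LLM: llama.cpp through llama-cpp-python.
llm = Llama(model_path="llama-3.1-8b-instruct.gguf")
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hi"}],
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])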

On-device tradeoffs:

Dimension	Cloud	On-device
Latency	500-1000ms	200-500ms
Cost	Pay per use	One-time hardware
Privacy	Data uploaded	Data local
Voice quality	High (commercial)	Medium (open source)
Maintenance	Zero	High (model updates)

Typical on-device config:

Hardware: Apple Silicon M2+, Qualcomm Snapdragon 8 Gen 2+, Intel 12th Gen+
ASR: whisper.cpp base model, under 1GB memory
TTS: piper, ~100MB memory
LLM: Qwen 2.5 3B / Llama 3.1 8B, ~5GB memory
Total memory: ~6-8GB
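
A quick budget check shows whether this stack fits a given device. The sketch treats the component figures above as upper bounds. The 2GB headroom for the operating system and audio buffers is an assumption, not a measured value.

```python
# Approximate upper-bound memory per component, in GB, from the list above.
COMPONENTS_GB = {
    "asr": 1.0,
    "tts": 0.1,
    "llm": 5.0,
}


def fits(
    components_gb: dict[str, float],
    device_ram_gb: float,
    headroom_gb: float = 2.0,
) -> bool:
    """True when all components plus headroom fit in device RAM."""
    return sum(components_gb.values()) + headroom_gb <= device_ram_gb
```

Under these assumptions an 8GB device is marginal and a 16GB device is comfortable. Swapping the 8B model for a 3B one is the obvious lever when memory is tight.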

Echo and Interruption: UX Challenges

Problem 1: Microphone picks up TTS playback (echo)

When the Agent plays TTS, the microphone picks the audio up again, and ASR treats it as the user continuing to speak.

Solutions:

AEC (Acoustic Echo Cancellation): cancel the echo in hardware or software
Mute mic during playback: pause recording while TTS plays
Higher VAD threshold: raise the VAD threshold during playback

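class EchoCancellation:
    """Avoid self-capture by pausing STT while TTS is playing.

    This is microphone gating rather than true AEC. `stt` is any object
    with async pause() and resume() methods.
    """

    def __init__(self, stt):
        self.stt = stt
        self.is_playing = False
    
    async def on_tts_start(self):
        self.is_playing = True
        await self.stt.pause()
    
    async def on_tts_end(self):
        self.is_playing = False
        await self.stt.resume()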
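
Muting the microphone entirely makes interruption impossible, so the third option, raising the VAD threshold during playback, is often a better compromise. Here is a minimal sketch of that idea. The class name and both threshold values are assumptions, and the VAD probability is expected to come from whatever detector is in use.

```python
from dataclasses import dataclass


@dataclass
class PlaybackAwareVad:
    """Require a stronger speech signal while the Agent itself is talking."""

    idle_threshold: float = 0.5
    playback_threshold: float = 0.8
    is_playing: bool = False

    def on_tts_start(self) -> None:
        self.is_playing = True

    def on_tts_end(self) -> None:
        self.is_playing = False

    @property
    def threshold(self) -> float:
        """Active threshold, depending on whether TTS is playing."""
        return self.playback_threshold if self.is_playing else self.idle_threshold

    def is_speech(self, probability: float) -> bool:
        """Decide whether a VAD probability counts as user speech."""
        return probability >= self.threshold
```

With these values, faint echo at probability 0.6 is ignored during playback, while a user speaking clearly at 0.9 still gets through.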

Problem 2: User interrupts Agent

User experience: while the Agent is speaking, the user wants to interject.

Solutions:

Barge-in detection: stop TTS immediately when new voice is detected
VAD duration: treat 300ms of continuous voice as an interruption
Priority switching: user voice takes priority over Agent voice

class BargeInHandler:
    def __init__(self, vad, tts):
        self.vad = vad
        self.tts = tts
    
    async def on_voice_detected(self, confidence: float):
        if confidence > 0.7:
            await self.tts.interrupt()
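
The handler above reacts to a single confident frame. The 300ms rule from the list can be added with a small frame counter, sketched below. The 20ms frame size is an assumption, and `BargeInDetector` is a hypothetical name.

```python
class BargeInDetector:
    """Report an interruption only after sustained voice activity."""

    def __init__(
        self,
        frame_ms: int = 20,
        min_voice_ms: int = 300,
        threshold: float = 0.7,
    ) -> None:
        # Number of consecutive voiced frames required (ceiling division).
        self.frames_needed = -(-min_voice_ms // frame_ms)
        self.threshold = threshold
        self.voiced_frames = 0

    def push(self, confidence: float) -> bool:
        """Feed one VAD frame; return True once voice has lasted long enough."""
        if confidence > self.threshold:
            self.voiced_frames += 1
        else:
            # Any unvoiced frame resets the run, so coughs and clicks are ignored.
            self.voiced_frames = 0
        return self.voiced_frames >= self.frames_needed
```

A barge-in handler would call `push` for every VAD frame and interrupt TTS only when it returns True. The duration trades responsiveness against false triggers.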

Performance Optimization

1. Streaming ASR

Do not wait for the user to finish speaking before starting recognition. Streaming ASR starts understanding mid-speech:

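from livekit.plugins import deepgram

# Deepgram is swapped in here because its plugin exposes interim (partial)
# transcripts; OpenAI's whisper-1 transcribes complete audio segments.
stt = deepgram.STT(
    interim_results=True,
)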

Streaming ASR can cut recognition latency from 1-2s to 200-500ms.

2. LLM First Token Latency

LLM TTFT (Time To First Token) directly determines Agent "reaction speed":

Use smaller models: 3B / 7B models are 5-10x faster than 70B
Prompt caching: cache identical system prompts, avoid recomputation
Speculative decoding: small model predicts large model output
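
Before applying any of these, it helps to measure TTFT directly. The sketch below times any token iterator, which could be a streaming LLM response or a test stub. The function name and the injectable `clock` parameter are assumptions added to keep it testable.

```python
import time
from typing import Callable, Iterable, Optional


def timed_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[Optional[float], float, str]:
    """Consume a token stream and return (ttft_s, total_s, full_text).

    ttft_s is None when the stream yields no tokens.
    """
    start = clock()
    ttft: Optional[float] = None
    tokens: list[str] = []
    for token in stream:
        if ttft is None:
            # The first token marks the moment the Agent can start responding.
            ttft = clock() - start
        tokens.append(token)
    total = clock() - start
    return ttft, total, "".join(tokens)
```

Comparing `ttft_s` with `total_s` shows how much of the wait streaming hides: TTS can start after the first value, not the second.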

3. TTS Streaming Synthesis

TTS should not wait for the whole sentence. Stream while generating:

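from livekit.plugins import openai

# The plugin takes no `streaming` flag. Inside an AgentSession, LLM output
# is passed to TTS incrementally and audio is played as chunks arrive.
tts = openai.TTS(
    model="tts-1",
    voice="alloy",
)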

Streaming TTS can cut first-byte latency from 500ms to 100ms.
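
A common way to stream into TTS is to cut the LLM's token stream at sentence boundaries and synthesize each sentence as soon as it is complete. Here is a minimal sketch of that chunking step. The regex is a simplification that will split incorrectly on abbreviations such as "Dr.", and the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can go to the TTS engine immediately, so playback of the first sentence overlaps with generation of the rest.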

Failure Modes and Handling

Failure mode	Symptom	Handling
Network jitter	Audio choppy	WebRTC plus adaptive bitrate
ASR misrecognition	Text error	Keyword confidence filter
LLM timeout	Long silence	Shorten prompt, retry
TTS failure	Silent	Fall back to pre-recorded audio
Interruption failure	Poor UX	Shorten VAD, optimize TTS cancel
Multi-speaker confusion	Wrong recognition	Speaker ID plus keyword lock
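
The LLM timeout row can be sketched with `asyncio.wait_for`: bound each attempt, retry a fixed number of times, then speak a canned line instead of leaving the user in silence. The canned fallback is an assumption beyond what the table lists, and `slow_llm` and `fast_llm` are stubs standing in for a real model call.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_retry(
    make_call: Callable[[], Awaitable[str]],
    timeout_s: float,
    retries: int,
    fallback: str,
) -> str:
    """Run make_call with a per-attempt timeout; return fallback if all fail."""
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # the attempt was cancelled; try again
    return fallback


async def slow_llm() -> str:
    """Stub that always exceeds a short timeout."""
    await asyncio.sleep(1.0)
    return "late reply"


async def fast_llm() -> str:
    """Stub that answers immediately."""
    return "ok"
```

A short phrase such as "One moment, please" works as the fallback because it keeps the turn alive while the system recovers.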

Implementation Path

Week 1: Pick a framework (OpenAI Realtime API / LiveKit / Pipecat), get a demo running.
Week 2: Implement a basic Voice Agent (STT -> LLM -> TTS), measure latency baseline.
Week 3: Optimize latency (streaming ASR, streaming TTS, prompt cache).
Week 4: Implement echo cancellation and interruption.
Week 5: Build on-device ASR/TTS solution (if privacy required).
Week 6: Performance monitoring (latency distribution, error rate, user satisfaction).

Summary

The core challenge of Voice Agents is latency and UX. The gap between 500ms and 1500ms is the difference between "fluid conversation" and "obvious choppiness." OpenAI Realtime API suits quick validation; LiveKit Agents suits private deployment; Pipecat suits complex pipeline scenarios.

Echo and interruption are UX pain points: mute the mic during TTS playback, stop TTS immediately on new voice detection, shorten VAD time. These details determine whether users will keep using it.

On-device ASR/TTS is mandatory for privacy-sensitive scenarios, but requires trading off voice quality and hardware cost.

Reference tools: LiveKit Agents (open-source real-time Agent framework), Pipecat (Daily.co's pipeline framework), OpenAI Realtime Agents (OpenAI's official Realtime examples), ElevenLabs Python (high-quality TTS), and Microsoft VibeVoice (Microsoft's voice tooling) cover the core nodes of the Voice Agent toolchain.
