Voice Agent Architecture: Realtime API and On-Device ASR/TTS

A systematic deep dive into the four-layer Voice Agent architecture, latency optimization (streaming ASR, streaming TTS, prompt caching), echo cancellation and interruption handling, on-device ASR/TTS deployment (whisper.cpp, piper, llama.cpp), and a comparison of the OpenAI Realtime API, LiveKit Agents, and Pipecat frameworks.

AgentList · July 1, 2026
Voice Agent · Realtime API · ASR · TTS · WebRTC

Voice Agents are one of the hottest AI application tracks in 2024-2025. The maturity of frameworks like OpenAI Realtime API, LiveKit Agents, and Pipecat has moved "talking AI assistants" from demo to product. But Voice Agents are far more complex than text Agents -- latency, echo, interruption, streaming synthesis, network jitter, on-device deployment -- any of these can break the user experience. This article provides a production-engineering deep dive into the layered architecture, latency optimization, echo/interruption handling, and on-device deployment strategies for Voice Agents.

Capability Layers

A complete Voice Agent has four layers:

Layer 1: Audio I/O

  • Microphone capture (VAD to detect speech start/end)
  • Speaker playback (TTS streaming synthesis)
  • Audio format conversion (PCM/Opus/AAC)
  • Audio preprocessing (AEC echo cancellation, NS noise suppression, AGC gain control)
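
The VAD step in this layer can be illustrated with a deliberately simple energy-based detector. This is a sketch only: production systems usually rely on a trained model such as Silero VAD, and the threshold value and the names `frame_rms` and `is_speech` are assumptions made for this example.

```python
import math
import struct


def frame_rms(pcm16: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit mono PCM frame."""
    n = len(pcm16) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def is_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its energy reaches the (assumed) threshold."""
    return frame_rms(pcm16) >= threshold


# Two synthetic 20ms frames at 16kHz: digital silence and a constant loud signal.
silence = b"\x00\x00" * 320
loud = struct.pack("<320h", *([2000] * 320))
```

A fixed threshold breaks down in noisy rooms, which is one reason model-based VADs are preferred in practice.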

Layer 2: ASR (Automatic Speech Recognition)

  • Real-time streaming recognition
  • Multi-language switching
  • Speaker diarization
  • Wake word detection

Layer 3: Agent Reasoning

  • LLM understanding and decision
  • Tool calling
  • Memory and context

Layer 4: TTS (Text-to-Speech)

  • Streaming synthesis
  • Emotion and tone control
  • Multi-voice switching
  • Real-time speed adjustment

All four layers must be designed around "low latency plus streaming"; a stall in any layer makes the whole interaction feel unnatural.
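
To make the streaming requirement concrete, the sketch below chains the layers as async generators, so each stage starts working on the first item it receives instead of waiting for the upstream stage to finish. The stages are stand-ins (`fake_asr`, `fake_llm`, and `fake_tts` are hypothetical names, not real engines); only the chaining pattern is the point.

```python
import asyncio
from typing import AsyncIterator, List


async def mic(n_frames: int) -> AsyncIterator[bytes]:
    """Stand-in for Layer 1: yields frames of increasing length."""
    for i in range(1, n_frames + 1):
        yield b"\x00" * i


async def fake_asr(frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in for Layer 2: emits one 'word' per audio frame."""
    async for frame in frames:
        yield f"word{len(frame)}"


async def fake_llm(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in for Layer 3: transforms each word as soon as it arrives."""
    async for word in words:
        yield word.upper()


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for Layer 4: turns each token into an 'audio' chunk."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_pipeline(n_frames: int) -> List[bytes]:
    """Collect the output chunks of the chained stages."""
    chunks = []
    async for chunk in fake_tts(fake_llm(fake_asr(mic(n_frames)))):
        chunks.append(chunk)
    return chunks
```

Calling `asyncio.run(run_pipeline(3))` produces one output chunk per input frame, and the first chunk is available before the last frame has been captured.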

Latency: The Lifeline

Latency standards (human ear sensitivity):

  • < 200ms: seamless, feels like talking to a person
  • 200-500ms: acceptable, but with a "machine" feel
  • 500-1000ms: clearly choppy, breaks conversation flow
  • > 1000ms: unusable

Typical latency breakdown:

Stage | Latency
Microphone to ASR | 100-300ms
ASR to LLM | 50-200ms
LLM first token (streaming) | 200-500ms
TTS streaming synthesis | 100-300ms
TTS to speaker | 50-100ms
Total | 500-1400ms

Optimization target: P50 latency below 600ms, P95 below 1200ms.
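
The budget and the targets above can be checked mechanically. The following sketch sums the per-stage ranges from the breakdown and compares measured end-to-end samples against the P50/P95 targets. The function names and the way samples are collected are assumptions for illustration.

```python
import statistics
from typing import Dict, List, Tuple

# Per-stage (min, max) latency in milliseconds, copied from the breakdown above.
STAGE_BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "mic_to_asr": (100, 300),
    "asr_to_llm": (50, 200),
    "llm_first_token": (200, 500),
    "tts_synthesis": (100, 300),
    "tts_to_speaker": (50, 100),
}


def total_budget_ms() -> Tuple[int, int]:
    """Sum the best-case and worst-case latency over all stages."""
    lows, highs = zip(*STAGE_BUDGET_MS.values())
    return sum(lows), sum(highs)


def latency_report(
    samples_ms: List[float],
    p50_target: float = 600.0,
    p95_target: float = 1200.0,
) -> Dict[str, object]:
    """Compute P50/P95 from per-turn latency samples and compare to targets."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 49 is P50, index 94 is P95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    return {
        "p50": p50,
        "p95": p95,
        "meets_target": p50 < p50_target and p95 < p95_target,
    }
```

Tracking percentiles rather than averages matters here because a small share of slow turns is enough to make the conversation feel broken.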

OpenAI Realtime API Architecture

OpenAI released the Realtime API in public beta in October 2024. It is arguably the simplest way to get a Voice Agent running:

import asyncio
import base64
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

# microphone_stream(), play_audio(), execute_tool(), query_order_tool and
# send_email_tool are application-specific placeholders, not part of the SDK.

async def voice_agent():
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as conn:
        await conn.session.update(session={
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "instructions": "You are a customer service assistant.",
            "tools": [query_order_tool, send_email_tool],
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
            },
        })

        async def send_mic():
            # The API expects base64-encoded audio (16-bit PCM by default).
            async for audio_chunk in microphone_stream():
                await conn.input_audio_buffer.append(
                    audio=base64.b64encode(audio_chunk).decode("ascii")
                )

        async def receive_response():
            async for event in conn:
                if event.type == "response.audio.delta":
                    # Audio deltas also arrive base64-encoded.
                    play_audio(base64.b64decode(event.delta))
                elif event.type == "response.function_call_arguments.done":
                    result = await execute_tool(event)
                    await conn.conversation.item.create(
                        item={
                            "type": "function_call_output",
                            "call_id": event.call_id,
                            "output": json.dumps(result),
                        }
                    )
                    # Ask the model to continue now that the tool result is available.
                    await conn.response.create()

        await asyncio.gather(send_mic(), receive_response())
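
The snippet above references `query_order_tool` without defining it. A minimal sketch of what such a definition and its dispatcher might look like follows. The tool name, its parameters, the stubbed `query_order` backend, and the `dispatch_tool` helper are all hypothetical; only the general shape of a function tool (a name, a description, and a JSON Schema for the parameters) is taken from the Realtime API.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool definition: a function tool described by a JSON Schema.
query_order_tool = {
    "type": "function",
    "name": "query_order",
    "description": "Look up the status of a customer order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def query_order(order_id: str) -> Dict[str, Any]:
    """Stubbed backend lookup; a real system would query an order database."""
    return {"order_id": order_id, "status": "shipped"}


# Maps tool names to the Python callables that implement them.
TOOL_HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "query_order": query_order,
}


def dispatch_tool(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments and return a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(handler(**json.loads(arguments_json)))
```

Returning an error payload for unknown tools, instead of raising, lets the model recover verbally rather than leaving the caller in silence.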

Realtime API strengths:

  • High integration: ASR + LLM + TTS in a single API
  • Server-side VAD: auto-detect speech end without client logic
  • Streaming response: audio is returned in small incremental chunks as it is generated, keeping latency low
  • Built-in tool calling: Function Calling supported

Realtime API limitations:

  • Network dependent: must connect to OpenAI, no offline support
  • High cost: billed per audio token (roughly proportional to audio duration), 10-100x more expensive than text
  • Latency bound: affected by network round-trip, optimum 500-800ms
  • Hard to privatize: data must go to OpenAI

LiveKit Agents Framework

LiveKit is an open-source real-time communication framework; LiveKit Agents is its Agent framework:

from livekit import agents
from livekit.agents import Agent, AgentSession
# Provider integrations ship as separate plugin packages
# (for example livekit-plugins-openai and livekit-plugins-silero).
from livekit.plugins import openai, silero

class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a smart assistant")

    async def on_enter(self):
        # Greet the user as soon as the agent joins the session.
        self.session.say("Hi, how can I help?")

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(voice="alloy"),
        vad=silero.VAD.load(),
    )
    
    await session.start(
        room=ctx.room,
        agent=MyAgent(),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

LiveKit strengths:

  • Fully open source: deployable on private infrastructure
  • WebRTC transport: low latency, network-jitter resilient
  • Pluggable components: STT/LLM/TTS/VAD replaceable independently
  • Multi-Agent rooms: supports multi-party multi-Agent collaboration

Pipecat Framework

Pipecat is Daily.co's Voice Agent framework, emphasizing the "pipeline" pattern:

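import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
# Service module paths differ between Pipecat releases; the ones below follow
# the pipecat.services.openai package layout and should be checked against
# the installed version.
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.services.openai.tts import OpenAITTSService
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport

async def run_bot(websocket):
    # `websocket` is the connection accepted by a FastAPI endpoint. Depending
    # on the release, the transport may also require a `params` argument.
    transport = FastAPIWebsocketTransport(websocket=websocket)

    stt = OpenAISTTService(api_key=os.environ["OPENAI_API_KEY"])
    llm = OpenAILLMService(model="gpt-4o")
    tts = OpenAITTSService(voice="alloy")

    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    # The runner executes a PipelineTask that wraps the pipeline.
    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))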

Pipecat strengths:

  • Pipeline pattern: clear data flow, easy to debug
  • Daily.co integration: built-in WebRTC transport
  • Rich services: 50+ pre-integrated STT/TTS/LLM services
  • Real-time visualization: pipeline debugger

On-Device ASR/TTS Deployment

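For privacy-sensitive scenarios (healthcare, legal, internal enterprise), on-device deployment is often a hard requirement. A minimal local stack combines whisper.cpp for ASR, piper for TTS, and llama.cpp for the LLM: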

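import wave

from llama_cpp import Llama
from piper import PiperVoice
from pywhispercpp.model import Model

# ASR: whisper.cpp through the pywhispercpp bindings.
model = Model("base.en", n_threads=4)
segments = model.transcribe("audio.wav")
for segment in segments:
    print(segment.text)

# TTS: piper writes into an open wave file object rather than a path.
# Some piper releases name this method synthesize_wav; check the installed version.
voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello, world!", wav_file)

# LLM: llama.cpp through llama-cpp-python, using a local GGUF model file.
llm = Llama(model_path="llama-3.1-8b-instruct.gguf")
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hi"}],
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])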

On-device tradeoffs:

Dimension | Cloud | On-device
Latency | 500-1000ms | 200-500ms
Cost | Pay per use | One-time hardware
Privacy | Data uploaded | Data local
Voice quality | High (commercial) | Medium (open source)
Maintenance | Zero | High (model updates)

Typical on-device config:

  • Hardware: Apple Silicon M2+, Qualcomm Snapdragon 8 Gen 2+, Intel 12th Gen+
  • ASR: whisper.cpp base model, ~1GB memory
  • TTS: piper, ~100MB memory
  • LLM: Qwen 2.5 3B / Llama 3.1 8B (4-bit quantized), up to ~5GB memory
  • Total memory: ~6-8GB
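
A quick way to sanity-check a target device against this configuration is to add up the component footprints and leave headroom for the operating system and audio buffers. The figures below are the approximate ones from the list above; the 2GB headroom and the function names are assumptions for illustration.

```python
from typing import Dict

# Approximate memory footprints in GB, taken from the list above.
COMPONENTS_GB: Dict[str, float] = {"asr": 1.0, "tts": 0.1, "llm": 5.0}


def total_memory_gb(components: Dict[str, float]) -> float:
    """Sum the memory footprint of all on-device components."""
    return sum(components.values())


def fits_on_device(
    components: Dict[str, float],
    device_ram_gb: float,
    headroom_gb: float = 2.0,
) -> bool:
    """True when the models plus an assumed OS headroom fit in device RAM."""
    return total_memory_gb(components) + headroom_gb <= device_ram_gb
```

With these numbers a 16GB machine has room to spare, while an 8GB device is already over budget once headroom is counted, which points toward the smaller 3B model on constrained hardware.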

Echo and Interruption: UX Challenges

Problem 1: TTS playback captures itself (Echo)

When the Agent plays TTS, the microphone picks it up again, and ASR treats it as the user continuing to speak.

Solutions:

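  • AEC (Acoustic Echo Cancellation): hardware or software echo cancellation
  • Mute mic during playback: pause recording while TTS plays (simple, but it also blocks barge-in)
  • Higher VAD threshold: raise the VAD threshold during playback

class EchoCancellation:
    """Pauses speech recognition while the Agent's own TTS audio is playing."""

    def __init__(self, stt):
        # `stt` is any recognizer object exposing async pause() and resume().
        self.stt = stt
        self.is_playing = False

    async def on_tts_start(self):
        self.is_playing = True
        await self.stt.pause()

    async def on_tts_end(self):
        self.is_playing = False
        await self.stt.resume()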
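
The third option, raising the VAD threshold during playback, keeps the microphone open so that the user can still interrupt. A minimal sketch follows, assuming the VAD reports a speech probability between 0 and 1. The class name and both threshold values are illustrative assumptions, not tuned settings.

```python
class PlaybackAwareVad:
    """Applies a stricter speech threshold while TTS audio is playing."""

    def __init__(self, idle_threshold: float = 0.5, playback_threshold: float = 0.8):
        self.idle_threshold = idle_threshold
        self.playback_threshold = playback_threshold
        self.is_playing = False

    def on_tts_start(self) -> None:
        self.is_playing = True

    def on_tts_end(self) -> None:
        self.is_playing = False

    def is_user_speech(self, speech_prob: float) -> bool:
        # Residual echo tends to score lower than direct speech, so a higher
        # bar during playback filters it while still letting clear speech through.
        threshold = self.playback_threshold if self.is_playing else self.idle_threshold
        return speech_prob >= threshold
```

This approach works best combined with AEC, since a loud speaker close to the microphone can still exceed the raised threshold.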

Problem 2: User interrupts Agent

User experience: while the Agent is speaking, the user wants to interject.

Solutions:

  • Barge-in detection: stop TTS immediately when new voice is detected
  • VAD duration: treat 300ms of continuous voice as an interruption
  • Priority switching: user voice takes priority over Agent voice

class BargeInHandler:
    def __init__(self, vad, tts):
        self.vad = vad
        self.tts = tts  # any TTS player exposing an async interrupt()

    async def on_voice_detected(self, confidence: float):
        # Only interrupt on confident detections to avoid reacting to noise.
        if confidence > 0.7:
            await self.tts.interrupt()
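
The 300ms rule can be implemented as a small debouncer that counts consecutive voiced frames. The sketch below assumes the VAD emits one confidence value per 20ms frame; the class name, frame size, and reset policy are assumptions for illustration.

```python
import math


class BargeInDebouncer:
    """Signals an interruption only after enough consecutive voiced frames."""

    def __init__(self, min_voice_ms: int = 300, frame_ms: int = 20, min_confidence: float = 0.7):
        self.frames_needed = math.ceil(min_voice_ms / frame_ms)
        self.min_confidence = min_confidence
        self.voiced_run = 0

    def push(self, confidence: float) -> bool:
        """Feed one VAD frame; returns True once the voiced run is long enough."""
        if confidence > self.min_confidence:
            self.voiced_run += 1
        else:
            # Any unvoiced frame breaks the run, so coughs and clicks are ignored.
            self.voiced_run = 0
        return self.voiced_run >= self.frames_needed

    def reset(self) -> None:
        """Call after the TTS has been interrupted to start a fresh count."""
        self.voiced_run = 0
```

A longer window reduces false interruptions but makes the Agent slower to yield, so the value is worth tuning against real recordings.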

Performance Optimization

1. Streaming ASR

Do not wait for the user to finish speaking before starting recognition. Streaming ASR starts understanding mid-speech:

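# OpenAI's whisper-1 endpoint transcribes complete audio segments, so interim
# results need a streaming provider; the Deepgram plugin is used here as one option.
from livekit.plugins import deepgram

stt = deepgram.STT(
    model="nova-2",
    interim_results=True,
)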

Streaming ASR can cut recognition latency from 1-2s to 200-500ms.

2. LLM First Token Latency

LLM TTFT (Time To First Token) directly determines Agent "reaction speed":

  • Use smaller models: 3B / 7B models are 5-10x faster than 70B
  • Prompt caching: cache identical system prompts, avoid recomputation
  • Speculative decoding: small model predicts large model output
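
Before applying any of these optimizations it helps to measure TTFT directly. The sketch below times the first item of any async token stream. `fake_llm_stream` is a stand-in that simulates model latency with a sleep, and both function names are assumptions for illustration.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def measure_ttft(token_stream: AsyncIterator[str]) -> Tuple[Optional[float], str]:
    """Return (seconds until the first token, full text) for a token stream."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    tokens = []
    async for token in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens.append(token)
    return ttft, "".join(tokens)


async def fake_llm_stream(delay_s: float = 0.05) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: waits, then emits a few tokens."""
    await asyncio.sleep(delay_s)
    for token in ["Hel", "lo", "!"]:
        yield token
```

Running `asyncio.run(measure_ttft(fake_llm_stream()))` returns the measured delay together with the assembled text. Wrapping a real streaming client the same way gives a baseline to compare model sizes and caching settings against.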

3. TTS Streaming Synthesis

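TTS should not wait for the full LLM response. Start synthesizing as soon as the first sentence or clause is available, and stream audio while the rest is still being generated: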

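from livekit.plugins import openai

# No explicit streaming flag is passed here: inside an AgentSession the text is
# forwarded to TTS incrementally and audio is played as chunks arrive.
tts = openai.TTS(
    model="tts-1",
    voice="alloy",
)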

Streaming TTS can cut first-byte latency from 500ms to 100ms.
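
One common way to feed a TTS engine incrementally is to cut the LLM token stream at sentence boundaries and hand over each sentence as soon as it is complete. The following is a minimal sketch of that idea, not the chunking logic of any particular framework; the function name and the boundary rule (punctuation followed by whitespace) are assumptions.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace. Requiring the
# whitespace avoids splitting inside numbers such as "3.5".
_BOUNDARY = re.compile(r"[.!?]+\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail
```

For the token sequence `["Hel", "lo there", ". How", " are you", "? Fine", "."]` this yields `"Hello there."`, `"How are you?"`, and `"Fine."`, so synthesis of the first sentence can begin while the model is still producing the second.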

Failure Modes and Handling

Failure mode | Symptom | Handling
Network jitter | Choppy audio | WebRTC plus adaptive bitrate
ASR misrecognition | Text errors | Keyword confidence filter
LLM timeout | Long silence | Shorten prompt, retry
TTS failure | Silence | Fall back to pre-recorded audio
Interruption failure | Poor UX | Shorten VAD window, optimize TTS cancel
Multi-speaker confusion | Wrong recognition | Speaker ID plus keyword lock
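
The "LLM timeout" row can be handled with a bounded wait, a limited number of retries, and a spoken fallback so the user never sits in silence. The sketch below assumes the LLM call is exposed as an async callable; the helper names, the timeout value, and the fallback text are illustrative.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(
    make_call: Callable[[], Awaitable[str]],
    timeout_s: float = 2.0,
    retries: int = 1,
    fallback: str = "Sorry, could you say that again?",
) -> str:
    """Await make_call() with a timeout, retry on timeout, then fall back."""
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # try again until the retry budget is used up
    return fallback


async def slow_llm() -> str:
    """Stand-in for a stalled LLM call."""
    await asyncio.sleep(1.0)
    return "late answer"


async def fast_llm() -> str:
    """Stand-in for a healthy LLM call."""
    return "ok"
```

The same wrapper shape applies to the "TTS failure" row, with the fallback replaced by a pre-recorded audio clip.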

Implementation Path

  • Week 1: Pick a framework (OpenAI Realtime API / LiveKit / Pipecat), get a demo running.
  • Week 2: Implement a basic Voice Agent (STT -> LLM -> TTS), measure the latency baseline.
  • Week 3: Optimize latency (streaming ASR, streaming TTS, prompt caching).
  • Week 4: Implement echo cancellation and interruption handling.
  • Week 5: Build an on-device ASR/TTS solution (if privacy is required).
  • Week 6: Set up performance monitoring (latency distribution, error rate, user satisfaction).

Summary

The core challenge of Voice Agents is latency and UX. From 500ms to 1500ms, the experience gap is the difference between "fluid conversation" and "obvious choppiness." OpenAI Realtime API suits quick validation; LiveKit Agents suits private deployment; Pipecat suits complex pipeline scenarios.

Echo and interruption are UX pain points: mute the mic during TTS playback, stop TTS immediately on new voice detection, shorten VAD time. These details determine whether users will keep using it.

On-device ASR/TTS is mandatory for privacy-sensitive scenarios, but requires trading off voice quality and hardware cost.

Reference tools: LiveKit Agents (open-source real-time Agent framework), Pipecat (Daily.co's pipeline framework), OpenAI Realtime Agents (OpenAI's official Realtime examples), ElevenLabs Python (high-quality TTS), and Microsoft VibeVoice (Microsoft's voice tooling) cover the core nodes of the Voice Agent toolchain.