Voice Agent Production Guide: LiveKit Agents from Prototype to Millions of Concurrent Calls
Voice AI agents are the next frontier. LiveKit Agents (11.1k stars, and the framework behind ChatGPT's Advanced Voice mode) offers a complete framework for building them. This article breaks down the voice pipeline and walks through building a production-ready agent.
Why Voice Agents Are Different
Voice agents need end-to-end latency of roughly 500ms. On a phone call, a 3-second silence reads as a dropped connection, whereas a text chatbot can take 2-5s to reply without anyone minding.
Pipeline Architecture
User Audio → VAD → STT → Agent (LLM) → TTS → User Audio
↑_______________↓
Turn Detection (Interruption Handling)
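Per-stage budget: VAD < 50ms, STT < 300ms, LLM < 200ms, TTS < 200ms. End-to-end target: 400-600ms. The stage figures are ceilings rather than terms in a sum: in a streaming pipeline each stage starts consuming the previous stage's output before it finishes, so the latencies overlap instead of adding up.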
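The budget arithmetic is worth checking by hand. Below is a minimal sketch that uses only the per-stage figures and the target range quoted above. The function names are illustrative, and the "overlap" figure is simply the gap between the serial sum and the target, not a measured quantity.

```python
# Per-stage latency ceilings from the budget above, in milliseconds.
STAGE_BUDGET_MS = {"vad": 50, "stt": 300, "llm": 200, "tts": 200}
TARGET_RANGE_MS = (400, 600)  # end-to-end target range from the article


def serial_latency_ms(budget: dict[str, int]) -> int:
    """Worst case if every stage waits for the previous one to finish."""
    return sum(budget.values())


def overlap_needed_ms(budget: dict[str, int], target_ms: int) -> int:
    """Time that streaming or faster-than-ceiling stages must hide to hit the target."""
    return max(0, serial_latency_ms(budget) - target_ms)


if __name__ == "__main__":
    print("serial:", serial_latency_ms(STAGE_BUDGET_MS))
    for target in TARGET_RANGE_MS:
        print("target", target, "needs overlap of", overlap_needed_ms(STAGE_BUDGET_MS, target))
```

Run strictly in sequence, the ceilings add up to 750ms, so between 150ms and 350ms has to be recovered through streaming overlap or through stages finishing under their ceilings.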
LiveKit Agents Framework
livekit/agents (11.1k stars, Apache 2.0) powers ChatGPT's Advanced Voice mode. It is built from three pieces: the LiveKit WebRTC SFU, the Agent SDK (Python and Node.js), and a plugin ecosystem.
- STT plugins: Deepgram, OpenAI Whisper, Azure, and more.
- TTS plugins: Cartesia, ElevenLabs, OpenAI, Azure, Deepgram, and more.
- LLM plugins: OpenAI Realtime, GPT, Claude, Groq, Together, Ollama.
- Tool / MCP support: Bring tools and MCP servers into the conversation via function calling.
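Function calling generally follows one loop: the LLM emits a tool name plus JSON-encoded arguments, the runtime runs the matching function, and the result goes back to the model. The sketch below shows that loop in plain Python so the mechanism is visible. It is not LiveKit's API: `tool`, `dispatch`, and `lookup_order` are hypothetical names, and the canned return value stands in for a real backend call.

```python
import json
from typing import Any, Callable

# Registry mapping tool names (as the model sees them) to Python callables.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so the model can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def lookup_order(order_id: str) -> dict:
    # Hypothetical tool: a real one would query an order system.
    return {"order_id": order_id, "status": "shipped"}


def dispatch(call: dict) -> str:
    """Run the tool named in a model-issued call and return a JSON result string."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        # Report unknown tools back to the model instead of raising.
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return json.dumps(fn(**args))


if __name__ == "__main__":
    print(dispatch({"name": "lookup_order", "arguments": '{"order_id": "A1"}'}))
```

Returning an error payload for an unknown tool, rather than raising, lets the model recover within the conversation, which matters more on a live call than in a text chat.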
Quick Start
pip install livekit-agents livekit-plugins-openai livekit-plugins-deepgram livekit-plugins-cartesia
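from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent, inference

server = AgentServer()


# Entrypoint for sessions dispatched to the agent named "support-agent".
@server.rtc_session(agent_name="support-agent")
async def entrypoint(ctx: agents.JobContext):
    # Wire the STT -> LLM -> TTS pipeline by model name.
    session = AgentSession(
        stt=inference.STT(model="deepgram/nova-3", language="multi"),
        llm=inference.LLM(model="openai/chat-latest"),
        tts=inference.TTS(model="cartesia/sonic-3"),
    )
    # `instructions` is the agent's system prompt.
    await session.start(room=ctx.room, agent=Agent(instructions="Hello, how can I help?"))


if __name__ == "__main__":
    # Start the worker through the framework CLI.
    agents.cli.run_app(server)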
Production Considerations
- Interruption handling: Combine acoustic VAD with semantic turn detection (inference.TurnDetector()).
- Agent dispatch: Use lk dispatch create or the Python Server SDK to route calls.
- SIP integration: Connect to the PSTN via LiveKit Phone Numbers or a SIP trunk, covering inbound and outbound calls, DTMF, and recording.
- Observability: Transcripts, OpenTelemetry traces, turn-by-turn telemetry.
- Keep-alive: Prompt the caller after 15s of silence so the line does not seem dropped.
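Two of the items above, turn detection and keep-alive, come down to small timing decisions. Here is a minimal sketch, assuming the VAD reports trailing silence in milliseconds and a semantic model reports an end-of-utterance probability between 0 and 1. The function names and every threshold except the 15s keep-alive are assumptions for illustration, not LiveKit defaults.

```python
def is_end_of_turn(
    silence_ms: float,
    eou_probability: float,
    *,
    min_silence_ms: float = 300,   # assumed: shortest pause worth acting on
    max_silence_ms: float = 3000,  # assumed: pause that always ends the turn
    eou_threshold: float = 0.6,    # assumed: semantic confidence cut-off
) -> bool:
    """Combine an acoustic signal (silence) with a semantic one (EOU probability)."""
    # A long enough silence ends the turn whatever the semantic model says.
    if silence_ms >= max_silence_ms:
        return True
    # Otherwise require both a pause and an utterance that sounds complete,
    # so a mid-sentence hesitation does not trigger a reply.
    return silence_ms >= min_silence_ms and eou_probability >= eou_threshold


def should_prompt(silence_s: float, keepalive_after_s: float = 15.0) -> bool:
    """True once the caller has been silent long enough to need a keep-alive prompt."""
    return silence_s >= keepalive_after_s


if __name__ == "__main__":
    print(is_end_of_turn(500, 0.9))  # pause plus a complete-sounding utterance
    print(is_end_of_turn(500, 0.2))  # pause, but the caller is probably mid-thought
    print(should_prompt(15.0))
```

Keeping these as pure functions makes the thresholds easy to tune against recorded calls without touching the audio path.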
Deployment
- LiveKit Cloud: Managed service with global edge nodes and 50 free hours per month
- Self-hosted: Docker Compose for LiveKit Server + agents (data sovereignty)
Open-Source Comparison
| Feature | LiveKit | Pipecat | Vocode |
|---|---|---|---|
| Stars | 11.1k | ~13k | ~3.8k |
| MCP | Native | Community | None |
| SIP | Native | DIY | Limited |
| Cloud | Yes | No | No |
Summary
Three decisions matter most: STT/TTS selection (Deepgram + Cartesia for the best latency/quality balance), interruption strategy (semantic turn detection is required), and deployment path (LiveKit Cloud for validation, self-hosting for scale).