LLM Routing and Multi-Model Gateways in Practice: A Production-Grade Multi-Model Architecture
Four LLM gateways compared, with production patterns for fallback, smart routing, cost observability, and scheduling.
A production-grade LLM application rarely talks to a single model. When you run GPT-4o for complex reasoning, GPT-4o-mini for summarization, Claude Sonnet for code, and Llama 3.1 70B as a local fallback, you are not "calling one API": you are managing several providers times N models times dozens of failure modes. This is why LLM gateways exist: they put every model call behind a single interface and add a per-request cost and latency observability layer. This article compares four mainstream solutions (Portkey, AgentGateway, OpenRouter Agents, Claude Code Router) and lays out a production architecture you can copy and adapt.
Why you need an LLM gateway
import openai; import anthropic; ... works, but it has three hard failure modes that show up in production:
- No provider fail-over. When OpenAI returns 5xx, you have to wrap your call in try/except in your own code. Every multi-provider path needs its own wrapper.
- Costs are invisible. Which user is using which model, consuming how many tokens, and why did last week's bill double — you cannot answer without unified instrumentation.
- Routing rules are scattered. Class A requests to Sonnet, class B to mini, class C to local — hardcoding these in business code turns refactoring into a nightmare.
LLM gateways lift all three concerns out of the business layer and turn them into configuration and middleware.
Quick overview of the four options
| Dimension | Portkey | AgentGateway | OpenRouter Agents | Claude Code Router |
|---|---|---|---|---|
| Deployment | SaaS + self-host | Self-host | SaaS | Local CLI |
| Provider count | 200+ | Generic (HTTP/MCP) | 100+ | Mostly Claude |
| Routing strategy | Tags + conditions | Generic middleware | Auto + manual | Programming-scenario rules |
| Cost observability | Built-in dashboard | External integration | Built-in | Basic logging |
| Integration | SDK / HTTP proxy | HTTP proxy | SDK / HTTP | CLI config |
| Strongest at | Vendor-neutral + complete observability | Multi-protocol adaptation (MCP/HTTP) | Automatic routing | Code agent scenario tuning |
| Weakest at | Self-hosting complexity | Sparse documentation | OpenRouter lock-in | Claude Code only |
Pattern 1: Multi-model fallback — "do not put all eggs in one basket"
The most basic need: OpenAI goes down, Claude takes over; GPT-4o hits rate limits, mini takes over. Portkey's config looks like this:
# portkey-config.yaml
targets:
  - provider: openai
    model: gpt-4o
  - provider: openai
    model: gpt-4o-mini
  - provider: anthropic
    model: claude-sonnet-4-20250514
# routing strategy: primary + backup + auto-degrade
strategy:
  mode: fallback
  conditions:
    - if: "status_code == 429"   # rate limit
      then: jump_to_next
    - if: "latency_ms > 5000"    # too slow
      then: jump_to_next
    - if: "status_code >= 500"   # server error
      then: jump_to_next
AgentGateway expresses the same logic with middleware:
# agentgateway.yaml
routes:
  - match: { path: /v1/chat }
    chain:
      - retry: { max: 2, backoff: exponential }
      - fallback:
          - { provider: openai, model: gpt-4o }
          - { provider: anthropic, model: claude-sonnet-4-20250514 }
          - { provider: ollama, model: llama3.1:70b }   # local fallback
The key to fallback is not "writing the config" but "defining the degradation chain." A robust chain follows three principles:
- Strong to weak: Sonnet → GPT-4o → mini → local model.
- Fast to slow: Sonnet → GPT-4o-mini (high frequency, low latency) → local (last resort, independent of any cloud provider).
- Expensive to cheap: degrading under load also saves money.
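The degradation chain can be sketched in a few lines of Python. This is a minimal illustration, not any gateway's implementation: `Reply`, `Target`, and `run_chain` are hypothetical names, the providers are fakes, and the latency condition is assumed to surface as a timeout exception raised by the provider call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Reply:
    status_code: int
    text: str = ""


@dataclass
class Target:
    # Hypothetical target: a label plus a callable that performs the request.
    name: str
    call: Callable[[str], Reply]


def should_fall_through(reply: Reply) -> bool:
    """Rate limits (429) and server errors (5xx) move on to the next target."""
    return reply.status_code == 429 or reply.status_code >= 500


def run_chain(prompt: str, chain: List[Target]) -> Tuple[str, Reply, list]:
    """Try targets in order; return (winning target, reply, attempt log)."""
    attempts = []
    for target in chain:
        try:
            reply = target.call(prompt)
        except Exception:
            # Timeouts and transport errors are treated like a 5xx response.
            reply = Reply(status_code=599)
        attempts.append((target.name, reply.status_code))
        if not should_fall_through(reply):
            return target.name, reply, attempts
    raise RuntimeError(f"all targets failed: {attempts}")


# Fake providers standing in for real API clients.
def rate_limited(prompt: str) -> Reply:
    return Reply(429)


def timed_out(prompt: str) -> Reply:
    raise TimeoutError("simulated timeout")


def local_model(prompt: str) -> Reply:
    return Reply(200, "ok from local")


chain = [
    Target("gpt-4o", rate_limited),
    Target("claude-sonnet", timed_out),
    Target("local-llama", local_model),
]
name, reply, attempts = run_chain("hello", chain)
print(name, attempts)
```

The attempt log is the useful part: it records which targets were skipped and why, which is the raw material for the failure-rate metric discussed later.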
Pattern 2: Smart routing — pick a model by request features
A more advanced pattern is to select the model by the request content — GPT-4o for long reasoning, GPT-4o-mini for short summarization, local models for PII-sensitive internal requests. OpenRouter Agents' "auto" mode has this built in:
# Illustrative client interface; check the SDK documentation for exact names.
from openrouter import OpenRouter

client = OpenRouter(auto_route=True)
resp = client.chat(
    messages=[{"role": "user", "content": "Summarize this 200-word document"}],
    # internally picks a model based on token count, complexity, availability
)
print(resp.model)        # e.g. 'gpt-4o-mini'
print(resp.routing_log)  # the routing decision trace
Custom routing rules with Portkey's conditional mode:
# portkey-config.yaml
strategy:
  mode: conditional
  conditions:
    # PII check comes first so that a token-count rule can never override it
    - if: "request.metadata.contains_pii == true"
      then: { provider: ollama, model: llama3.1:70b }
    - if: "request.tokens < 500"
      then: { provider: openai, model: gpt-4o-mini }
    - if: "request.tokens >= 500 && request.tokens < 4000"
      then: { provider: openai, model: gpt-4o }
    - else: { provider: openai, model: gpt-4o-mini }
Production tips for smart routing:
- A/B test first. Do not pick routing rules on intuition. Run a one-week test with Langfuse or Phoenix to find the cost-quality Pareto winner.
- Rules in config, not in code. Business code should only say "I need an LLM," never "use model X."
- Routing must be observable. Every request must log "why model X was selected." Otherwise, in a month, no one can explain it.
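The last two tips, rules as data and an explainable decision for every request, can be sketched as follows. This is a hedged illustration: `Decision`, `RULES`, and `route` are hypothetical names, and the thresholds loosely follow the conditional config, with the PII check placed first so that it always wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    provider: str
    model: str
    reason: str  # logged with every request so the choice can be explained later


# Rules are data, evaluated top to bottom; the first match wins.
# In production this list would be loaded from a versioned config file.
RULES = [
    ("contains_pii", lambda r: r.get("contains_pii", False), ("ollama", "llama3.1:70b")),
    ("tokens<500", lambda r: r["tokens"] < 500, ("openai", "gpt-4o-mini")),
    ("500<=tokens<4000", lambda r: 500 <= r["tokens"] < 4000, ("openai", "gpt-4o")),
]
DEFAULT = ("openai", "gpt-4o-mini")


def route(request: dict) -> Decision:
    """Return the target for a request together with the rule that chose it."""
    for name, predicate, (provider, model) in RULES:
        if predicate(request):
            return Decision(provider, model, reason=f"matched rule '{name}'")
    return Decision(*DEFAULT, reason="no rule matched; default")


for req in ({"tokens": 200}, {"tokens": 200, "contains_pii": True}, {"tokens": 9000}):
    print(req, "->", route(req))
```

Business code only calls `route(request)`; changing a threshold or a model name touches the rule table, not the callers.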
Pattern 3: Cost observability — a bill for the CFO
The biggest hidden value of an LLM gateway is unified cost observability. Portkey's dashboard gives you out of the box:
- Daily / weekly / monthly token consumption.
- Breakdown by provider / model / user / tag.
- Slow / failed / retried request ratios.
- Full trace of an individual request.
AgentGateway exports the raw metrics to Prometheus:
# agentgateway.yaml
observability:
  metrics:
    export: prometheus
    endpoint: /metrics
  traces:
    export: otlp
    endpoint: http://otel-collector:4317
Four key metrics for cost observability:
- Cost per request in tokens (aggregated by model + business tag).
- Provider failure rate (ratio of requests that triggered fallback).
- Routing hit rate (share of requests served by a cheap model).
- Month-end forecast error (linear extrapolation vs. actual bill).
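As a rough sketch, the four metrics can be computed from a per-request log like this. The record fields (`model`, `tag`, `tokens`, `fell_back`) and the set of models counted as "cheap" are assumptions made for the example, not a gateway's export schema.

```python
from collections import defaultdict

CHEAP_MODELS = {"gpt-4o-mini", "llama3.1:70b"}  # assumption: what counts as cheap


def summarize(records):
    """Compute tokens per request, fallback rate, and cheap-model hit rate."""
    if not records:
        raise ValueError("no records to summarize")
    tokens_by_key = defaultdict(list)
    for r in records:
        tokens_by_key[(r["model"], r["tag"])].append(r["tokens"])
    total = len(records)
    return {
        "avg_tokens_per_request": {k: sum(v) / len(v) for k, v in tokens_by_key.items()},
        "fallback_rate": sum(r["fell_back"] for r in records) / total,
        "cheap_hit_rate": sum(r["model"] in CHEAP_MODELS for r in records) / total,
    }


def forecast_error(spend_to_date, day_of_month, days_in_month, actual_bill):
    """Relative error of a linear month-end extrapolation against the real bill."""
    forecast = spend_to_date / day_of_month * days_in_month
    return (forecast - actual_bill) / actual_bill


records = [
    {"model": "gpt-4o", "tag": "search", "tokens": 1000, "fell_back": False},
    {"model": "gpt-4o", "tag": "search", "tokens": 3000, "fell_back": True},
    {"model": "gpt-4o-mini", "tag": "summary", "tokens": 200, "fell_back": False},
    {"model": "gpt-4o-mini", "tag": "summary", "tokens": 400, "fell_back": False},
]
summary = summarize(records)
print(summary)
print(forecast_error(spend_to_date=1000, day_of_month=10, days_in_month=30, actual_bill=2500))
```

A positive forecast error means the extrapolation overshot the real bill; a negative one means spend accelerated late in the month.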
Wire these four into Grafana with Slack alerts. In the weekly meeting you can tell the PM "this feature consumed X dollars last week, Y% above forecast." That is hard data to convince a PM to downsize a model.
Pattern 4: Scenario scheduling — different tasks, different models
In a code agent scenario, model choice directly determines token cost — one complex refactor could burn $5 with Sonnet or $0.20 with mini, with little quality difference. Claude Code Router was designed for exactly this:
// ~/.claude-code-router/config.json
{
  "default": "anthropic,claude-sonnet-4-20250514",
  "background": "ollama,llama3.1:8b",               // background completion on local
  "longContext": "openai,gpt-4o",                   // long context -> GPT
  "think": "anthropic,claude-sonnet-4-20250514",    // reasoning -> strongest
  "webSearch": "openai,gpt-4o-mini",                // simple search -> cheapest
  "routes": {
    "explain_code": "openai,gpt-4o-mini",
    "write_test": "anthropic,claude-sonnet-4-20250514"
  }
}
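To see where a gap like the $5 versus $0.20 figure can come from, here is a back-of-the-envelope estimate. The per-million-token prices and the token counts are assumptions for illustration; check current provider pricing before relying on them.

```python
# Assumed list prices in USD per 1M tokens (input, output); verify before use.
PRICES = {
    "claude-sonnet": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task given its total input and output tokens."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1_000_000 * price_in + output_tokens / 1_000_000 * price_out


# Hypothetical agentic refactor: many turns that re-send context add up to
# roughly 1M input tokens and 130k output tokens.
for model in PRICES:
    print(model, round(request_cost(model, 1_000_000, 130_000), 2))
```

Under these assumptions the same task costs about $4.95 on the larger model and about $0.23 on the smaller one, so the routing decision matters more than any prompt-level token trimming.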
The core of scenario scheduling is to use the expensive model only where it pays off and the cheap model everywhere else. A few rules of thumb:
- Sub-1k token tasks (short summary, classification, rewrite) -> mini or local.
- 4k-32k token reasoning tasks -> Sonnet or GPT-4o.
- 32k+ long document tasks -> long-context models (GPT-4o at 128k tokens, Claude Sonnet 4.5 at 200k).
- Code generation -> Sonnet or Qwen-Coder.
- Multi-turn dialog / function call -> flagship of each family.
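These rules of thumb can be written as a small tier picker. The sketch below is one possible reading: the task labels and tier names are hypothetical, and sending anything the rules do not cover to the flagship tier is an assumption, not something the rules state.

```python
SMALL_TASKS = {"summary", "classification", "rewrite"}
FLAGSHIP_TASKS = {"reasoning", "multi_turn", "function_call"}


def pick_tier(task: str, tokens: int) -> str:
    """Return a tier name; a separate config maps tiers to concrete models."""
    if tokens > 32_000:
        return "long_context"
    if task == "code_generation":
        return "code"
    if task in FLAGSHIP_TASKS:
        return "flagship"
    if task in SMALL_TASKS and tokens < 1_000:
        return "small"
    # Assumption: uncovered cases (e.g. a 5k-token summary) favor quality over cost.
    return "flagship"


for task, tokens in [("summary", 300), ("reasoning", 8_000), ("summary", 50_000)]:
    print(task, tokens, "->", pick_tier(task, tokens))
```

Returning a tier instead of a model name keeps the model choice in configuration, so swapping the model behind "small" does not require a code change.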
Decision framework: how to pick
- Are your requests primarily from coding agents (Claude Code, Continue, Cursor)?
  - Yes -> Claude Code Router (the most granular scenario scheduling).
- Do you need a SaaS-grade cost dashboard?
  - Yes -> Portkey (the most mature dashboard).
- Do you need unified traffic management for AI agents / MCP scenarios?
  - Yes -> AgentGateway (multi-protocol adaptation).
- Do you want to avoid self-hosting and pay per token?
  - Yes -> OpenRouter Agents (works out of the box).
Three common failure modes
Failure 1: fallback chain without a "floor." When all three cloud models are down, what then? The end of the chain must be a local model or a cached response — never 502 directly to the user.
Failure 2: routing rules without gradual rollout. Switching 100% of traffic to a new model at once can cause a massive regression. Use 5%-10% canary -> 30% -> 100%, observing one hour at each step before scaling up.
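A staged rollout like this is usually implemented with deterministic bucketing, so a given user stays on the same side of the split as the percentage grows. The sketch below assumes a stable request key such as a user or session id; `bucket` and `use_new_route` are hypothetical helper names.

```python
import hashlib

STAGES = [5, 30, 100]  # percent of traffic on the new rule at each step


def bucket(request_key: str) -> int:
    """Map a stable key (user or session id) to a bucket in 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_route(request_key: str, stage_percent: int) -> bool:
    """True if this key falls inside the current canary percentage."""
    return bucket(request_key) < stage_percent


keys = [f"user-{i}" for i in range(10_000)]
for stage in STAGES:
    share = sum(use_new_route(k, stage) for k in keys) / len(keys)
    print(f"stage {stage}%: observed share {share:.3f}")
```

Because the bucket depends only on the key, every user included at 5% is still included at 30%, which keeps before-and-after comparisons clean.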
Failure 3: cost dashboard with no one looking at it. Building a dashboard that no one opens means the bill explodes before anyone notices. Wire "per-user daily token cost" alerts directly to Slack so on-call gets paged on threshold breach.
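The per-user daily check behind such an alert is small. The following sketch assumes cost records arrive as `(user_id, day, cost_usd)` tuples and leaves the delivery step (a Slack webhook or pager integration) out; `users_over_budget` is a hypothetical helper.

```python
from collections import defaultdict


def users_over_budget(records, daily_budget_usd):
    """Return sorted (user, day, spend) entries whose daily spend exceeds the budget."""
    spend = defaultdict(float)
    for user_id, day, cost_usd in records:
        spend[(user_id, day)] += cost_usd
    return sorted(
        (user, day, round(cost, 2))
        for (user, day), cost in spend.items()
        if cost > daily_budget_usd
    )


records = [
    ("alice", "2025-06-01", 30.0),
    ("alice", "2025-06-01", 25.0),
    ("bob", "2025-06-01", 10.0),
    ("alice", "2025-06-02", 5.0),
]
breaches = users_over_budget(records, daily_budget_usd=50.0)
print(breaches)
```

Running this on a schedule and sending only the non-empty result keeps the alert channel quiet until a threshold is actually crossed.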
Summary
- LLM gateway = unified interface + multi-model fallback + smart routing + cost observability.
- The fallback chain must include a local model or cache as the floor; never 502 the user directly.
- Routing rules in config, not in code. Business code should not know which model is selected.
- Cost observability must be alertable. A dashboard no one looks at is not observability.
A practical next step is to start with Portkey SaaS: wire up one OpenAI and one Anthropic provider, configure a fallback chain plus a routing rule, send one request, watch cost and latency land in the dashboard, and after a week decide whether to self-host.
Two case studies
Case 1: a SaaS company cutting LLM costs by 70%. A B2B SaaS company was running every request through GPT-4o. After adding Portkey, they routed summarization to GPT-4o-mini, kept GPT-4o only for the 30% of requests that needed it, and pushed the long tail of small classification calls to a local Llama 3.1 8B. The 30/70 split was identified by inspecting Portkey's per-tag dashboard over two weeks. Net savings: $42,000 per month at the same quality score, because the 30% of traffic that truly needed GPT-4o still got it. The investment was one engineer-week to set up the routing rules.
Case 2: a coding agent platform with traffic spikes. A platform that hosts multiple coding agents (Cursor, Continue, custom) saw 10x traffic spikes during US business hours. They configured Claude Code Router with three tiers: "background" (autocomplete, on local Llama), "default" (Sonnet, for chat), and "think" (Sonnet again, for explicit reasoning requests). During peak hours the 60% of traffic that was background completions ran locally, cutting cloud API spend by $18,000 per month. The 40% that hit the cloud was the traffic that genuinely needed it. Failover to OpenAI gpt-4o was configured for Sonnet outages, and was triggered twice in three months — both times invisible to end users.
A useful checklist for production rollout: (1) start with a single provider, add the second only after you have observability on the first, (2) write routing rules in a versioned config file, not in application code, (3) alert on per-tag daily cost before the bill closes, (4) rehearse the failover path quarterly — many teams discover their fallback config has rotted, (5) keep the "floor" of the chain (local model or cache) in active maintenance, because it is the only thing that works when both clouds are down.
Real-world ROI numbers
Most teams underestimate the cost impact of an LLM gateway. A few real numbers from production deployments in 2025-2026:
- A 50-engineer company running internal coding agents added Portkey with three tiers (background -> mini, default -> Sonnet, think -> Sonnet). They cut their monthly OpenAI bill from $34,000 to $11,000 in 30 days, with zero measurable change in the user satisfaction score from the coding agent outputs. The investment was 1.5 engineer-weeks.
- A customer-support chatbot routed PII-sensitive requests to a local Llama 3.1 70B and kept GPT-4o for the rest. PII-sensitive traffic was 15% of total volume but consumed 40% of cost (because GPT-4o is used for the "explain to user" turn). Routing that 15% to local saved $9,000 per month.
- A research firm running daily market scans needed a Sonnet fallback because OpenAI rate limits hit them at 9am every day. They added an Anthropic provider as the second tier in Portkey. During rate limit events, the fallback was invisible to end users; the dashboard showed the 12% of requests that had been rerouted.
- A coding-agent platform set up Claude Code Router with three tiers and cut peak-hour API spend by 60% by routing background completions to a local model. The 40% of traffic that needed the cloud still got it. The 60% that did not need it stopped hitting OpenAI.
- A consulting firm running heavy GPT-4o usage added Portkey cost tags by client. They discovered one client was costing them 38% of total spend on a single high-volume integration. They renegotiated the contract with a usage cap. Without per-client tags, this would have been invisible.
The pattern: once you have per-request tags and a dashboard, the 80/20 cost culprits jump out within a week. Most teams find that 20% of their traffic drives 60% of their bill, and a routing rule with five lines of YAML can capture most of the savings.
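Finding those culprits is a simple aggregation once requests carry tags. A minimal sketch, assuming records are `(tag, cost_usd)` pairs; `top_cost_tags` is a hypothetical helper and the figures are made up for the example.

```python
from collections import defaultdict


def top_cost_tags(records, share=0.6):
    """Smallest set of tags, largest first, whose cost reaches `share` of the total."""
    by_tag = defaultdict(float)
    for tag, cost_usd in records:
        by_tag[tag] += cost_usd
    total = sum(by_tag.values())
    picked, running = [], 0.0
    for tag, cost in sorted(by_tag.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        picked.append((tag, cost / total))
        running += cost
    return picked


records = [
    ("client-a", 380.0),
    ("client-b", 250.0),
    ("client-c", 200.0),
    ("client-d", 170.0),
]
culprits = top_cost_tags(records)
print(culprits)
```

The tags returned here are the first candidates for a routing rule or a usage cap.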
Projects in this article
Portkey AI Gateway
12.2k ⭐ Portkey AI Gateway is a blazing fast AI gateway with integrated guardrails, routing to 200+ LLMs with 50+ AI guardrails through a single fast and friendly API.
AgentGateway
3.4k ⭐ Next generation agentic proxy for AI agents and MCP servers. Provides unified traffic management, routing, and security control.
OpenRouter Agents
3.0k ⭐ OpenRouter Agents is OpenRouter's platform capability for multi-model agent use cases, focused on routing, tool calling, and unified access layers.
Claude Code Router
4.5k ⭐ Claude Code Router is a model routing tool for coding-agent scenarios, unifying requests across providers to optimize cost, latency, and task-specific routing strategies.