LLM Routing and Multi-Model Gateways in Practice: A Production-Grade Multi-Model Architecture

A production-grade LLM application rarely talks to a single model. When you run GPT-4o for complex reasoning, GPT-4o-mini for summarization, Claude Sonnet for code, and Llama 3.1 70B as a local fallback, you are not "calling one API"; you are managing several providers, N models, and dozens of failure modes. This is exactly why LLM gateways exist: they put every model call behind a single interface and add a per-request cost and latency observability layer. This article compares four mainstream solutions (Portkey, AgentGateway, OpenRouter Agents, Claude Code Router) and lays out a copy-pasteable production architecture.

Why you need an LLM gateway

Calling each SDK directly (import openai; import anthropic; ...) works, but it has three hard failure modes that show up in production:

No provider fail-over. When OpenAI returns 5xx, you have to wrap your call in try/except in your own code. Every multi-provider path needs its own wrapper.
Costs are invisible. Which user is using which model, consuming how many tokens, and why did last week's bill double — you cannot answer without unified instrumentation.
Routing rules are scattered. Class A requests to Sonnet, class B to mini, class C to local — hardcoding these in business code turns refactoring into a nightmare.

LLM gateways lift all three concerns out of the business layer and turn them into configuration and middleware.

Quick overview of the four options

Dimension	Portkey	AgentGateway	OpenRouter Agents	Claude Code Router
Deployment	SaaS + self-host	Self-host	SaaS	Local CLI
Provider count	200+	Generic (HTTP/MCP)	100+	Mostly Claude
Routing strategy	Tags + conditions	Generic middleware	Auto + manual	Programming-scenario rules
Cost observability	Built-in dashboard	External integration	Built-in	Basic logging
Integration	SDK / HTTP proxy	HTTP proxy	SDK / HTTP	CLI config
Strongest at	Vendor-neutral + complete observability	Multi-protocol adaptation (MCP/HTTP)	Automatic routing	Code agent scenario tuning
Weakest at	Self-hosting complexity	Sparse documentation	OpenRouter lock-in	Claude Code only

Pattern 1: Multi-model fallback — "do not put all eggs in one basket"

The most basic need: OpenAI goes down, Claude takes over; GPT-4o hits rate limits, mini takes over. Portkey's config looks like this:

# portkey-config.yaml
targets:
  - provider: openai
    model: gpt-4o
  - provider: openai
    model: gpt-4o-mini
  - provider: anthropic
    model: claude-sonnet-4-20250514

# routing strategy: primary + backup + auto-degrade
strategy:
  mode: fallback
  conditions:
    - if: "status_code == 429"   # rate limit
      then: jump_to_next
    - if: "latency_ms > 5000"    # too slow
      then: jump_to_next
    - if: "status_code >= 500"   # server error
      then: jump_to_next

AgentGateway expresses the same logic with middleware:

# agentgateway.yaml
routes:
  - match: { path: /v1/chat }
    chain:
      - retry: { max: 2, backoff: exponential }
      - fallback:
          - { provider: openai, model: gpt-4o }
          - { provider: anthropic, model: claude-sonnet-4-20250514 }
          - { provider: ollama, model: llama3.1:70b }   # local fallback

The key to fallback is not "writing the config" but "defining the degradation chain." A robust chain follows three principles:

Strong to weak: Sonnet → GPT-4o → mini → local model.
Fast to slow: Sonnet → GPT-4o-mini (high frequency, low latency) → local (last resort; it does not depend on any cloud provider).
Expensive to cheap: degrading under load also saves cost as a side effect.
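The degradation chain can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Portkey or AgentGateway implement it: `Reply`, `Target`, `call_with_fallback`, and the stub providers are hypothetical names, and the three fall-through conditions (429, 5xx, latency above 5000 ms) simply mirror the config shown earlier.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Reply:
    status_code: int
    text: str = ""


@dataclass
class Target:
    name: str                      # label only, e.g. "openai/gpt-4o"
    call: Callable[[str], Reply]   # hypothetical provider call


def should_fall_through(reply: Reply, latency_ms: float, max_latency_ms: float) -> bool:
    """Rate limit, server error, or too slow: move to the next target."""
    return (
        reply.status_code == 429
        or reply.status_code >= 500
        or latency_ms > max_latency_ms
    )


def call_with_fallback(
    chain: List[Target], prompt: str, max_latency_ms: float = 5000.0
) -> Tuple[str, Reply, List[str]]:
    """Try targets in order; return (winning target, reply, skipped targets)."""
    skipped: List[str] = []
    for target in chain:
        start = time.monotonic()
        try:
            reply = target.call(prompt)
        except Exception:
            # Treat transport errors like a server error and move on.
            skipped.append(target.name)
            continue
        latency_ms = (time.monotonic() - start) * 1000.0
        if should_fall_through(reply, latency_ms, max_latency_ms):
            skipped.append(target.name)
            continue
        return target.name, reply, skipped
    raise RuntimeError(f"all targets failed: {skipped}")


# Stub providers standing in for real SDK calls.
def rate_limited(prompt: str) -> Reply:
    return Reply(status_code=429)


def server_error(prompt: str) -> Reply:
    return Reply(status_code=503)


def healthy(prompt: str) -> Reply:
    return Reply(status_code=200, text=f"echo: {prompt}")


chain = [
    Target("openai/gpt-4o", rate_limited),
    Target("anthropic/claude-sonnet", server_error),
    Target("ollama/llama3.1:70b", healthy),
]
name, reply, skipped = call_with_fallback(chain, "hello")
print(name, skipped)
```

Note that this sketch measures latency after the call returns, so a slow target still costs the full wait before the next one is tried. A real gateway would more likely enforce a timeout on the call itself.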

Pattern 2: Smart routing — pick a model by request features

A more advanced pattern is to select the model by the request content — GPT-4o for long reasoning, GPT-4o-mini for short summarization, local models for PII-sensitive internal requests. OpenRouter Agents' "auto" mode has this built in:

from openrouter import OpenRouter

client = OpenRouter(auto_route=True)
resp = client.chat(
    messages=[{"role": "user", "content": "Summarize this 200-word document"}],
    # internally picks a model based on token count, complexity, availability
)
print(resp.model)        # 'gpt-4o-mini'
print(resp.routing_log)  # the routing decision trace

Custom routing rules with Portkey's conditional mode:

# portkey-config.yaml
strategy:
  mode: conditional
  conditions:
    # Assuming rules are evaluated top to bottom and the first match wins,
    # the PII rule must come first so a short PII request never matches a token rule.
    - if: "request.metadata.contains_pii == true"
      then: { provider: ollama, model: llama3.1:70b }
    - if: "request.tokens < 500"
      then: { provider: openai, model: gpt-4o-mini }
    - if: "request.tokens >= 500 && request.tokens < 4000"
      then: { provider: openai, model: gpt-4o }
    - else: { provider: openai, model: gpt-4o-mini }

Production tips for smart routing:

A/B test first. Do not pick routing rules on intuition. Run a one-week test with Langfuse or Phoenix to find the cost-quality Pareto winner.
Rules in config, not in code. Business code should only say "I need an LLM," never "use model X."
Routing must be observable. Every request must log "why model X was selected." Otherwise, in a month, no one can explain it.
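To make "routing must be observable" concrete, here is a minimal Python sketch of a rule-based router that returns its reason along with its choice. It is an illustration only: `Decision`, `route`, and `route_and_log` are hypothetical names, the thresholds are the ones from the conditional config in this section, and the PII check is placed first so that a short PII request cannot match a size rule.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_router")


@dataclass(frozen=True)
class Decision:
    provider: str
    model: str
    reason: str   # recorded so the choice can be explained later


def route(token_count: int, contains_pii: bool = False) -> Decision:
    """First matching rule wins; PII is checked before any size rule."""
    if contains_pii:
        return Decision("ollama", "llama3.1:70b", "contains_pii")
    if token_count < 500:
        return Decision("openai", "gpt-4o-mini", "tokens<500")
    if token_count < 4000:
        return Decision("openai", "gpt-4o", "500<=tokens<4000")
    return Decision("openai", "gpt-4o-mini", "default")


def route_and_log(request_id: str, token_count: int, contains_pii: bool = False) -> Decision:
    """Route a request and log why the model was selected."""
    decision = route(token_count, contains_pii)
    logger.info(
        "request=%s provider=%s model=%s reason=%s",
        request_id, decision.provider, decision.model, decision.reason,
    )
    return decision


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    route_and_log("req-1", token_count=120)
    route_and_log("req-2", token_count=120, contains_pii=True)
    route_and_log("req-3", token_count=2500)
```

Business code calls `route_and_log` and never names a model itself. Moving the thresholds into a config file keeps the rules out of application code.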

Pattern 3: Cost observability — a bill for the CFO

The biggest hidden value of an LLM gateway is unified cost observability. Portkey's dashboard gives you out of the box:

Daily / weekly / monthly token consumption.
Breakdown by provider / model / user / tag.
Slow / failed / retried request ratios.
Full trace of an individual request.

AgentGateway exports the raw metrics to Prometheus:

# agentgateway.yaml
observability:
  metrics:
    export: prometheus
    endpoint: /metrics
  traces:
    export: otlp
    endpoint: http://otel-collector:4317

Four key metrics for cost observability:

Cost per request in tokens (aggregated by model + business tag).
Provider failure rate (ratio of requests that triggered fallback).
Routing hit rate (share of requests served by a cheap model).
Month-end forecast error (linear extrapolation vs. actual bill).

Wire these four into Grafana with Slack alerts. In the weekly meeting you can tell the PM "this feature consumed X dollars last week, Y% above forecast." That is the kind of hard data that convinces a PM to downsize a model.
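The four metrics are simple aggregations once every request is logged. The sketch below shows one way to compute them in Python. The `RequestLog` fields, the `CHEAP_MODELS` set, and all function names are assumptions made for illustration; they are not a Portkey or AgentGateway schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RequestLog:
    model: str
    tag: str             # business tag, e.g. a feature name
    tokens: int
    used_fallback: bool


# Assumption: which models count as "cheap" for the routing hit rate.
CHEAP_MODELS = {"gpt-4o-mini", "llama3.1:70b"}


def tokens_per_request(logs: List[RequestLog]) -> Dict[Tuple[str, str], float]:
    """Mean tokens per request, aggregated by (model, tag)."""
    sums: Dict[Tuple[str, str], int] = defaultdict(int)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for log in logs:
        key = (log.model, log.tag)
        sums[key] += log.tokens
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def fallback_rate(logs: List[RequestLog]) -> float:
    """Share of requests that triggered fallback."""
    return sum(log.used_fallback for log in logs) / len(logs) if logs else 0.0


def cheap_hit_rate(logs: List[RequestLog]) -> float:
    """Share of requests served by a cheap model."""
    return sum(log.model in CHEAP_MODELS for log in logs) / len(logs) if logs else 0.0


def month_end_forecast(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Linear extrapolation of spend so far to the full month."""
    return spend_to_date / day_of_month * days_in_month


def forecast_error(forecast: float, actual: float) -> float:
    """Relative error of the forecast against the actual bill."""
    return (forecast - actual) / actual


logs = [
    RequestLog("gpt-4o", "search", 1200, False),
    RequestLog("gpt-4o-mini", "summary", 300, False),
    RequestLog("gpt-4o-mini", "summary", 500, True),
    RequestLog("gpt-4o", "search", 800, False),
]
print(tokens_per_request(logs))
print(fallback_rate(logs), cheap_hit_rate(logs))
print(month_end_forecast(1000.0, day_of_month=10, days_in_month=30))
```

In a real deployment these numbers would more likely come from Prometheus queries than from in-process lists; the arithmetic is the same.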

Pattern 4: Scenario scheduling — different tasks, different models

In a code agent scenario, model choice directly determines token cost — one complex refactor could burn $5 with Sonnet or $0.20 with mini, with little quality difference. Claude Code Router was designed for exactly this:

// ~/.claude-code-router/config.json
{
  "default": "anthropic,claude-sonnet-4-20250514",
  "background": "ollama,llama3.1:8b",   // background completion on local
  "longContext": "openai,gpt-4o",         // long context -> GPT
  "think": "anthropic,claude-sonnet-4-20250514",  // reasoning -> strongest
  "webSearch": "openai,gpt-4o-mini",     // simple search -> cheapest
  "routes": {
    "explain_code": "openai,gpt-4o-mini",
    "write_test": "anthropic,claude-sonnet-4-20250514"
  }
}

The core of scenario scheduling is to use the expensive model only where it pays off and the cheap model everywhere else. A few rules of thumb:

Sub-1k token tasks (short summary, classification, rewrite) -> mini or local.
4k-32k token reasoning tasks -> Sonnet or GPT-4o.
32k+ long document tasks -> long-context models (GPT-4o at 128k, Claude Sonnet 4.5 at 200k).
Code generation -> Sonnet or Qwen-Coder.
Multi-turn dialog / function call -> flagship of each family.
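These rules of thumb can be written down as a small lookup function. The Python sketch below is illustrative only: `Tier`, `pick_tier`, and the task labels are hypothetical. It also makes two assumptions the list does not state: task type is checked before token count, and the unassigned 1k-4k range is sent to the reasoning tier.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "mini-or-local"
    REASONING = "sonnet-or-gpt-4o"
    LONG_CONTEXT = "long-context-model"
    CODE = "code-model"
    FLAGSHIP = "family-flagship"


def pick_tier(task: str, tokens: int) -> Tier:
    """Map a task label and token count to a model tier."""
    # Assumption: the task type takes precedence over the token count.
    if task == "code_generation":
        return Tier.CODE
    if task in {"multi_turn", "function_call"}:
        return Tier.FLAGSHIP
    if tokens >= 32_000:
        return Tier.LONG_CONTEXT
    if tokens < 1_000:
        return Tier.SMALL
    # Assumption: the list leaves 1k-4k unassigned; this sketch treats the
    # whole 1k-32k range as reasoning work.
    return Tier.REASONING


for task, tokens in [
    ("summary", 800),
    ("reasoning", 10_000),
    ("summary", 50_000),
    ("code_generation", 2_000),
]:
    print(task, tokens, pick_tier(task, tokens).value)
```

The gateway then resolves each tier to a concrete model in config, so a model swap does not touch this function.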

Decision framework: how to pick

Are your requests primarily from coding agents (Claude Code, Continue, Cursor)?
- Yes -> Claude Code Router (the most granular scenario scheduling).
Do you need a SaaS-grade cost dashboard?
- Yes -> Portkey (the most mature dashboard).
Do you need unified traffic management for AI agents / MCP scenarios?
- Yes -> AgentGateway (multi-protocol adaptation).
Do you want to avoid self-hosting and pay per token?
- Yes -> OpenRouter Agents (out-of-the-box).

Three common failure modes

Failure 1: fallback chain without a "floor." When all three cloud models are down, what then? The end of the chain must be a local model or a cached response — never 502 directly to the user.
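A floor can be as simple as a response cache consulted after every cloud target has failed. The Python sketch below assumes an in-memory dict as the cache, and it adds a canned reply as a final step, which goes beyond what the text prescribes. `answer_with_floor`, `cache_key`, and the stub providers are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Assumption: a canned reply is acceptable when even the cache has no answer.
CANNED_REPLY = "The assistant is temporarily unavailable. Please try again shortly."


def cache_key(prompt: str) -> str:
    """Stable key for a prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def answer_with_floor(
    prompt: str,
    cloud_calls: List[Callable[[str], str]],
    cache: Dict[str, str],
) -> Tuple[str, str]:
    """Return (text, source), where source is 'cloud', 'cache', or 'canned'."""
    key = cache_key(prompt)
    for call in cloud_calls:
        try:
            text = call(prompt)
        except Exception:
            continue              # this provider is down, try the next one
        cache[key] = text         # remember successful answers for the floor
        return text, "cloud"
    if key in cache:
        return cache[key], "cache"
    return CANNED_REPLY, "canned"


# Stub providers standing in for real SDK calls.
def working(prompt: str) -> str:
    return f"answer to: {prompt}"


def down(prompt: str) -> str:
    raise ConnectionError("provider unavailable")


cache: Dict[str, str] = {}
first = answer_with_floor("What is a gateway?", [down, working], cache)
second = answer_with_floor("What is a gateway?", [down, down], cache)
third = answer_with_floor("Unseen question", [down, down], cache)
print(first[1], second[1], third[1])
```

The point is that the function always returns something usable. Whether a stale cached answer is acceptable depends on the product.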

Failure 2: routing rules without gradual rollout. Switching 100% of traffic to a new model at once can cause a massive regression. Use 5%-10% canary -> 30% -> 100%, observing one hour at each step before scaling up.
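One common way to implement a gradual rollout is a deterministic hash split: each user or request key maps to a stable bucket from 0 to 99, and the canary percentage decides which buckets get the new model. The Python sketch below assumes that approach; `bucket` and `pick_model` are hypothetical names, not a gateway API.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a key (for example a user ID) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model(request_key: str, canary_percent: int,
               stable_model: str, canary_model: str) -> str:
    """Send roughly canary_percent of keys to the canary model."""
    if bucket(request_key) < canary_percent:
        return canary_model
    return stable_model


# Walk through the rollout steps for a sample of keys.
keys = [f"user-{i}" for i in range(1000)]
for percent in (5, 30, 100):
    on_canary = sum(
        pick_model(k, percent, "gpt-4o", "claude-sonnet") == "claude-sonnet"
        for k in keys
    )
    print(f"{percent}% step: {on_canary} of {len(keys)} keys on the canary model")
```

Because the bucket depends only on the key, a user who is on the canary at 5% stays on it at 30% and 100%, which keeps before/after comparisons clean.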

Failure 3: cost dashboard with no one looking at it. Building a dashboard that no one opens means the bill explodes before anyone notices. Wire "per-user daily token cost" alerts directly to Slack so on-call gets paged on threshold breach.
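A per-user daily cost alert needs only usage records, a price table, and a threshold. The Python sketch below is a minimal illustration: the prices are made up, and `daily_cost_by_user`, `users_over_budget`, and `slack_payload` are hypothetical names. Posting the payload to a Slack incoming webhook is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical prices in dollars per 1,000 tokens; replace with your real rates.
PRICE_PER_1K_TOKENS = {"big-model": 0.01, "small-model": 0.001}

UsageRecord = Tuple[str, str, int]   # (user, model, tokens) for one day


def daily_cost_by_user(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Sum the dollar cost of one day's usage per user."""
    costs: Dict[str, float] = defaultdict(float)
    for user, model, tokens in records:
        costs[user] += tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    return dict(costs)


def users_over_budget(costs: Dict[str, float], threshold: float) -> List[str]:
    """Users whose daily cost exceeds the threshold, sorted by name."""
    return sorted(user for user, cost in costs.items() if cost > threshold)


def slack_payload(user: str, cost: float, threshold: float) -> Dict[str, str]:
    """Message body in the simple {"text": ...} shape used by incoming webhooks."""
    return {
        "text": f"LLM cost alert: {user} spent ${cost:.2f} today "
                f"(limit ${threshold:.2f})"
    }


records = [
    ("alice", "big-model", 200_000),
    ("alice", "small-model", 1_000_000),
    ("bob", "small-model", 50_000),
]
costs = daily_cost_by_user(records)
for user in users_over_budget(costs, threshold=2.5):
    print(slack_payload(user, costs[user], 2.5)["text"])
```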

Summary

LLM gateway = unified interface + multi-model fallback + smart routing + cost observability.
The fallback chain must include a local model or cache as the floor; never 502 the user directly.
Routing rules in config, not in code. Business code should not know which model is selected.
Cost observability must be alertable. A dashboard no one looks at is not observability.

A practical next step is to start with Portkey SaaS: wire up one OpenAI and one Anthropic provider, configure a fallback chain plus a routing rule, send one request, watch cost and latency land in the dashboard, and after a week decide whether to self-host.

Two case studies

Case 1: a SaaS company cutting LLM costs by 70%. A B2B SaaS company was running every request through GPT-4o. After adding Portkey, they routed summarization to GPT-4o-mini, kept GPT-4o only for the 30% of requests that needed it, and pushed the long tail of small classification calls to a local Llama 3.1 8B. The 30/70 split was identified by inspecting Portkey's per-tag dashboard over two weeks. Net savings: $42,000 per month at the same quality score, because the 30% of traffic that truly needed GPT-4o still got it. The investment was one engineer-week to set up the routing rules.

Case 2: a coding agent platform with traffic spikes. A platform that hosts multiple coding agents (Cursor, Continue, custom) saw 10x traffic spikes during US business hours. They configured Claude Code Router with three tiers: "background" (autocomplete, on local Llama), "default" (Sonnet, for chat), and "think" (Sonnet again, for explicit reasoning requests). During peak hours the 60% of traffic that was background completions ran locally, cutting cloud API spend by $18,000 per month. The 40% that hit the cloud was the traffic that genuinely needed it. Failover to OpenAI gpt-4o was configured for Sonnet outages, and was triggered twice in three months — both times invisible to end users.

A useful checklist for production rollout: (1) start with a single provider, add the second only after you have observability on the first, (2) write routing rules in a versioned config file, not in application code, (3) alert on per-tag daily cost before the bill closes, (4) rehearse the failover path quarterly — many teams discover their fallback config has rotted, (5) keep the "floor" of the chain (local model or cache) in active maintenance, because it is the only thing that works when both clouds are down.

Real-world ROI numbers

Most teams underestimate the cost impact of an LLM gateway. A few real numbers from production deployments in 2025-2026:

A 50-engineer company running internal coding agents added Portkey with three tiers (background -> mini, default -> Sonnet, think -> Sonnet). They cut their monthly OpenAI bill from $34,000 to $11,000 in 30 days, with zero measurable change in the user satisfaction score from the coding agent outputs. The investment was 1.5 engineer-weeks.
A customer-support chatbot routed PII-sensitive requests to a local Llama 3.1 70B and kept GPT-4o for the rest. PII-sensitive traffic was 15% of total volume but consumed 40% of cost (because GPT-4o was used for the "explain to user" turn). Routing that 15% to local saved $9,000 per month.
A research firm running daily market scans needed a Sonnet fallback because OpenAI rate limits hit them at 9am every day. They added an Anthropic provider as the second tier in Portkey. During rate limit events, the fallback was invisible to end users; the dashboard showed the 12% of requests that had been rerouted.
A coding-agent platform set up Claude Code Router with three tiers and cut peak-hour API spend by 60% by routing background completions to a local model. The 40% of traffic that needed the cloud still got it. The 60% that did not need it stopped hitting the cloud API.
A consulting firm running heavy GPT-4o usage added Portkey cost tags by client. They discovered one client was costing them 38% of total spend on a single high-volume integration. They renegotiated the contract with a usage cap. Without per-client tags, this would have been invisible.

The pattern: once you have per-request tags and a dashboard, the 80/20 cost culprits jump out within a week. Most teams find that 20% of their traffic drives 60% of their bill, and a routing rule with five lines of YAML can capture most of the savings.
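Finding the cost culprits is a ranking exercise over per-tag spend. The Python sketch below assumes you can export a mapping from tag to cost (here in whole dollars). The tags and amounts are made-up sample data, and `cost_ranking` and `top_culprits` are hypothetical names.

```python
from typing import Dict, List, Tuple


def cost_ranking(cost_by_tag: Dict[str, int]) -> List[Tuple[str, float]]:
    """Tags sorted by cost, each with its share of total spend."""
    total = sum(cost_by_tag.values())
    if total == 0:
        return []
    ranked = sorted(cost_by_tag.items(), key=lambda item: item[1], reverse=True)
    return [(tag, cost / total) for tag, cost in ranked]


def top_culprits(cost_by_tag: Dict[str, int], target_share: float = 0.6) -> List[str]:
    """Smallest set of top tags that together reach target_share of spend."""
    total = sum(cost_by_tag.values())
    if total == 0:
        return []
    culprits: List[str] = []
    running = 0
    for tag, cost in sorted(cost_by_tag.items(), key=lambda item: item[1], reverse=True):
        culprits.append(tag)
        running += cost
        if running / total >= target_share:
            break
    return culprits


# Made-up sample data: monthly spend in dollars per business tag.
spend = {"report-export": 600, "chat": 200, "search": 100, "autocomplete": 100}
for tag, share in cost_ranking(spend):
    print(f"{tag}: {share:.0%}")
print(top_culprits(spend, target_share=0.6))
```

The tags returned by `top_culprits` are the first candidates for a cheaper routing rule.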

Projects in this article

Portkey AI Gateway

AgentGateway

OpenRouter Agents

Claude Code Router