LLM Agent Cost Control: Semantic Caching and Model Routing in Practice

The biggest hidden cost in agent production is not token pricing but redundant calls and model mismatches. A quantifiable cost-control framework covering caching strategies, fallback chains, and routing rules.

AgentList Team · 2026年6月29日
LLMOps成本优化语义缓存模型路由Langfuse

Most teams calculate agent cost by token price alone. But the real waste usually comes from two places: the same requests being called repeatedly, and simple tasks being handled by expensive models.

A typical customer service agent handles 100K requests per day, 30% of which are repeat questions ("how to refund", "where is my order"). If every call goes to GPT-4o, that alone costs thousands of dollars extra per month. Another common issue: classification tasks use Claude Sonnet, summarization uses GPT-4o, when Llama 3 70B is already sufficient for the same tasks.

This article skips the "token price list" and gives you an actionable cost-control framework: quantify where the waste is, then deploy caching, routing, and degradation strategies accordingly.

Cost Breakdown: Where the Money Actually Goes

Before optimizing, understand your agent cost structure. The total cost of an LLM call can be broken into three parts:

Direct cost: API call fees. This is the most visible number on the bill, but usually not the biggest waste source.

Indirect cost: Extra calls from retries, timeouts, and error handling. A failed request may trigger 2-3 retries, each billed at full price.

Opportunity cost: Using high-capability models for low-complexity tasks. Using GPT-4o for simple intent classification is like using a Ferrari for delivery -- it gets there, but the fuel cost is a problem.

from dataclasses import dataclass
from typing import Any


@dataclass
class CostBreakdown:
    total_requests: int = 0
    cache_hits: int = 0
    model_calls: dict[str, int] = None  # type: ignore

    def __post_init__(self):
        if self.model_calls is None:
            self.model_calls = {}

    @property
    def cache_hit_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.cache_hits / self.total_requests

    @property
    def cost_saved_by_cache(self) -> float:
        return self.cache_hits * 0.01

    def report(self) -> dict[str, Any]:
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{self.cache_hit_rate:.1%}",
            "cost_saved_monthly": f"${self.cost_saved_by_cache * 30:.2f}",
            "model_distribution": dict(self.model_calls),
        }

Run a week of data collection and you will be surprised: often 20% of request patterns contribute 60% of repeated calls.

Strategy 1: Semantic Caching -- Not Just Key-Value Matching

Traditional HTTP caching is based on exact match (same URL + parameters). But LLM inputs are natural language -- users rarely repeat questions with identical wording.

The core idea of semantic caching: if two requests are similar enough to produce the same response, the second request should return the cached result directly.

import hashlib
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CacheEntry:
    key: str
    embedding: list[float]
    response: str
    model: str
    token_count: int
    created_at: str
    hit_count: int = 0
    similarity_threshold: float = 0.92


class SemanticCache:
    def __init__(self, embedding_fn, similarity_threshold: float = 0.92):
        self.embedding_fn = embedding_fn
        self.similarity_threshold = similarity_threshold
        self._store: dict[str, CacheEntry] = {}

    def _compute_key(self, text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def get(self, query: str) -> CacheEntry | None:
        key = self._compute_key(query)
        if key in self._store:
            entry = self._store[key]
            entry.hit_count += 1
            return entry
        query_emb = self.embedding_fn(query)
        for entry in self._store.values():
            sim = self._cosine_similarity(query_emb, entry.embedding)
            if sim >= self.similarity_threshold:
                entry.hit_count += 1
                return entry
        return None

    def put(self, query: str, response: str, model: str, token_count: int):
        key = self._compute_key(query)
        self._store[key] = CacheEntry(
            key=key,
            embedding=self.embedding_fn(query),
            response=response,
            model=model,
            token_count=token_count,
            created_at=__import__('datetime').datetime.now().isoformat(),
        )

    def invalidate(self, model: str, prompt_version: str):
        to_remove = []
        for key, entry in self._store.items():
            if entry.model == model:
                to_remove.append(key)
        for key in to_remove:
            del self._store[key]

    @staticmethod
    def _cosine_similarity(a: list[float], b: list[float]) -> float:
        import math
        dot = sum(x * y for x, y in zip(a, b))
        mag_a = math.sqrt(sum(x * x for x in a))
        mag_b = math.sqrt(sum(x * x for x in b))
        return dot / (mag_a * mag_b) if mag_a and mag_b else 0.0

Key design decisions:

  • Dual query strategy: exact match first (zero latency), then semantic similarity (covers paraphrases)
  • Set similarity_threshold to around 0.92 -- lower returns wrong answers, higher reduces hit rate
  • invalidate ensures old caches do not pollute new responses when prompts or models change
  • Cache keys should include model name and prompt version to avoid cross-contamination

Best for: FAQs, common questions, standardized flows (refund queries, order status). In these scenarios, user questions are highly repetitive, achieving 30-50% cache hit rates.

Tool reference: Pezzo claims up to 90% LLM cost and latency savings with its built-in cache layer. Helicone provides proxy-level caching with transparent support for OpenAI-compatible APIs. Bifrost uses similarity-based semantic caching that handles natural language variants.

Strategy 2: Model Routing -- Assign Models by Task Complexity

Not every request needs GPT-4o. Classify tasks by complexity and allocate appropriate models to each tier:

from dataclasses import dataclass
from typing import Literal


@dataclass
class ModelTier:
    name: str
    model_id: str
    cost_per_1k_tokens: float
    max_context: int
    best_for: list[str]


MODEL_TIERS: dict[str, ModelTier] = {
    "simple": ModelTier(
        name="simple",
        model_id="gpt-4o-mini",
        cost_per_1k_tokens=0.00015,
        max_context=128000,
        best_for=["intent_classification", "simple_qa", "formatting"],
    ),
    "medium": ModelTier(
        name="medium",
        model_id="gpt-4o",
        cost_per_1k_tokens=0.0025,
        max_context=128000,
        best_for=["rag_generation", "tool_planning", "multi_step_reasoning"],
    ),
    "complex": ModelTier(
        name="complex",
        model_id="claude-sonnet-4-20250514",
        cost_per_1k_tokens=0.003,
        max_context=200000,
        best_for=["complex_reasoning", "code_generation", "long_document"],
    ),
}


class ModelRouter:
    def __init__(self, classifier_fn):
        self.classifier_fn = classifier_fn
        self.call_stats: dict[str, int] = {}

    def route(self, request: dict) -> ModelTier:
        task_type = self.classifier_fn(request)
        tier = MODEL_TIERS.get(task_type, MODEL_TIERS["medium"])
        self.call_stats[tier.name] = self.call_stats.get(tier.name, 0) + 1
        return tier

    def estimate_cost(self, request: dict, estimated_tokens: int) -> float:
        tier = self.route(request)
        return (estimated_tokens / 1000) * tier.cost_per_1k_tokens

    def report(self) -> dict[str, Any]:
        total = sum(self.call_stats.values())
        return {
            tier: {
                "calls": count,
                "percentage": f"{count / total:.1%}" if total else "0%",
            }
            for tier, count in self.call_stats.items()
        }

Key design decisions:

  • Routing decisions must be made before sending the request, not after the model responds
  • Classifiers can be implemented with a lightweight model or even a rule engine -- the cost should be far lower than the requests being routed
  • Record every routing decision and actual model used for subsequent classifier optimization

Best for: Mixed workloads (simple Q&A alongside complex reasoning), multi-model environments, cost-sensitive production systems.

Tool reference: Helicone's AI Gateway provides intelligent routing and automatic failover. Bifrost connects 23+ LLM providers with automatic load balancing. Langfuse analytics can help identify which requests should be routed to cheaper models.

Strategy 3: Fallback Chain -- Dynamic Balance Between Cost and Quality

Sometimes you do not know which model a request needs. Use a fallback chain: try the cheapest model first, upgrade if confidence is insufficient.

from dataclasses import dataclass
from typing import Any


@dataclass
class FallbackConfig:
    primary_model: str
    fallback_model: str
    confidence_threshold: float = 0.8
    max_fallback_depth: int = 2


class FallbackChain:
    def __init__(self, primary_fn, fallback_fn, confidence_fn):
        self.primary_fn = primary_fn
        self.fallback_fn = fallback_fn
        self.confidence_fn = confidence_fn
        self.stats = {"primary_success": 0, "fallback_used": 0, "total": 0}

    def execute(self, request: dict) -> dict[str, Any]:
        self.stats["total"] += 1
        response = self.primary_fn(request)
        confidence = self.confidence_fn(response)
        if confidence >= self.fallback_config.confidence_threshold:
            self.stats["primary_success"] += 1
            return response
        self.stats["fallback_used"] += 1
        return self.fallback_fn(request)

    def cost_analysis(self, primary_cost: float, fallback_cost: float) -> dict:
        primary_only = self.stats["total"] * primary_cost
        actual = (
            self.stats["primary_success"] * primary_cost
            + self.stats["fallback_used"] * fallback_cost
        )
        return {
            "hypothetical_primary_only": f"${primary_only:.2f}",
            "actual_with_fallback": f"${actual:.2f}",
            "savings": f"${primary_only - actual:.2f}",
            "fallback_rate": f"{self.stats['fallback_used'] / self.stats['total']:.1%}" if self.stats['total'] else "0%",
        }

Key design decisions:

  • confidence_fn is the core -- it must quickly and reliably judge whether the current response is "good enough." Use length, structural completeness, or a lightweight classifier
  • Cost benefit depends on the fallback rate. If 80% of requests succeed at the first tier, you only pay premium model costs for 20% of traffic
  • Avoid infinite fallback -- set max_fallback_depth to prevent cycling through multiple models

Best for: Scenarios with high quality requirements but cost sensitivity, progressive enhancement when model capability is uncertain.

Strategy 4: Prompt Caching -- The Overlooked Cost Killer

Many teams do not realize: repeated transmission of the system prompt is a hidden cost. If your system prompt is 2000 tokens and you serve 100K requests per day, that is 200M tokens of redundant transmission.

from dataclasses import dataclass, field
from typing import Any


@dataclass
class PromptCacheConfig:
    system_prompt: str
    cache_ttl_seconds: int = 300
    max_cache_size: int = 1000


class PromptCache:
    def __init__(self, config: PromptCacheConfig):
        self.config = config
        self._cache: dict[str, Any] = {}
        self._hits = 0
        self._misses = 0

    def get_system_prompt(self, version: str) -> str:
        if version in self._cache:
            self._hits += 1
            return self._cache[version]
        self._misses += 1
        self._cache[version] = self.config.system_prompt
        return self._cache[version]

    def invalidate(self, version: str):
        self._cache.pop(version, None)

    def stats(self) -> dict[str, Any]:
        total = self._hits + self._misses
        return {
            "hits": self._hits,
            "misses": self._misses,
            "hit_rate": f"{self._hits / total:.1%}" if total else "0%",
            "tokens_saved": self._hits * len(self.config.system_prompt.split()),
        }

Key design decisions:

  • System prompts are usually much longer than user inputs and change infrequently -- the most worth caching part
  • Cache keys should be prompt version hashes, not request hashes
  • TTL settings must balance cache hit rates and prompt update latency

Best for: High-traffic agent systems, long system prompts (>1000 tokens), standardized prompt version management.

Tool reference: Langfuse prompt versioning helps track prompt changes and coordinate cache invalidation. Pezzo cache layer is specifically optimized for repeated prompt transmission.

Cost Optimization Decision Matrix

Scenario Primary Strategy Expected Benefit Implementation Complexity
High repeat FAQ Semantic caching 30-50% call reduction Low
Mixed complexity workload Model routing 20-40% cost reduction Medium
High quality requirements but cost-sensitive Fallback chain 15-30% cost reduction Medium
High traffic + long system prompt Prompt caching 10-20% token savings Low
Multiple model vendors Gateway aggregation Ops simplification + failover Medium

Priority recommendation:

  1. Deploy semantic caching first -- fastest ROI, simplest to implement
  2. Configure model routing next -- needs one week of data collection to train the classifier
  3. Add fallback chain -- progressive enhancement on critical paths
  4. Optimize prompt caching last -- requires coordination with prompt version management

Three Common Mistakes

Mistake 1: Cache keys too broad or too narrow

A cache key based only on "user ID" mixes answers to different questions. A key based on "full request text" has almost no hits. The correct approach is "task type + core entity" -- e.g., "order_status:ORD-12345" instead of the entire user message.

Mistake 2: Overly complex routing classifiers

Many teams use LLMs for task classification out of the gate, ending up with classifier costs approaching the requests being routed. Start with rule engines (keywords, regex) for coarse classification, only calling lightweight models at the boundaries.

Mistake 3: Ignoring cache invalidation

The biggest risk of caching is not low hit rates -- it is returning stale answers. When prompts update, models switch, or business rules change, there must be a clear cache invalidation strategy. Recommend automatic cache cleanup triggered by version changes in the CI/CD pipeline.

Summary

  • The first step in agent cost optimization is always quantification. Use Langfuse or Helicone to track for a week and you will find that 20% of request patterns contribute 60% of waste
  • Semantic caching is the highest ROI strategy. Repeat questions in customer service and technical support scenarios can reach 30-50% of traffic; cache hits completely avoid API calls
  • Model routing is not "use whichever is cheapest" but "use whichever is just good enough". Use gpt-4o-mini for simple tasks, Claude Sonnet only for complex reasoning
  • Fallback chains put a floor on quality and a ceiling on cost. Try cheap models first, upgrade only when confidence is insufficient, avoiding over-provisioning
  • Prompt caching is free lunch. System prompts typically account for 40-60% of token consumption and change infrequently; caching them costs almost nothing
  • Bifrost and Pezzo provide out-of-the-box gateway and caching layers that can deploy these strategies within a week without building from scratch

For a complete agent cost-control stack, explore Langfuse (LLM observability and cost analytics), Helicone (proxy monitoring and caching), Pezzo (prompt management and caching layer), and Bifrost (LLM gateway and semantic caching).