LLM Agent Cost Control: Semantic Caching and Model Routing in Practice
The biggest hidden cost in agent production is not token pricing but redundant calls and model mismatches. This article lays out a quantifiable cost-control framework covering caching strategies, fallback chains, and routing rules.
Most teams calculate agent cost by token price alone. But the real waste usually comes from two places: the same requests being sent to the model repeatedly, and simple tasks being handled by expensive models.
Consider a typical customer service agent that handles 100K requests per day, 30% of which are repeat questions ("how to refund", "where is my order"). If every call goes to GPT-4o, the repeats alone cost thousands of dollars extra per month. Another common issue is model mismatch: classification runs on Claude Sonnet and summarization on GPT-4o when a cheaper model such as Llama 3 70B is already sufficient for both.
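To make the "thousands of dollars" figure concrete, here is a rough back-of-the-envelope sketch. The request volume and repeat share come from the example above; the token counts and per-token prices are illustrative assumptions, so substitute your own numbers.

```python
# All token counts and prices below are illustrative assumptions.
REQUESTS_PER_DAY = 100_000   # from the example above
REPEAT_SHARE = 0.30          # from the example above
TOKENS_IN_PER_CALL = 1_500   # assumed: system prompt plus user message
TOKENS_OUT_PER_CALL = 300    # assumed: average answer length
PRICE_IN_PER_1M = 2.50       # assumed USD per 1M input tokens
PRICE_OUT_PER_1M = 10.00     # assumed USD per 1M output tokens
DAYS_PER_MONTH = 30


def cost_per_call(tokens_in: int, tokens_out: int) -> float:
    """USD cost of one call at the assumed prices."""
    return (tokens_in * PRICE_IN_PER_1M + tokens_out * PRICE_OUT_PER_1M) / 1_000_000


def monthly_repeat_cost() -> float:
    """USD spent per month on questions that were already answered once."""
    repeat_calls = REQUESTS_PER_DAY * REPEAT_SHARE * DAYS_PER_MONTH
    return repeat_calls * cost_per_call(TOKENS_IN_PER_CALL, TOKENS_OUT_PER_CALL)


print(f"cost per call: ${cost_per_call(TOKENS_IN_PER_CALL, TOKENS_OUT_PER_CALL):.5f}")
print(f"repeat calls per month: ${monthly_repeat_cost():,.2f}")
```

Under these assumptions the repeated questions alone account for roughly six thousand dollars a month; the result scales linearly with each input.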
This article skips the "token price list" and gives you an actionable cost-control framework: quantify where the waste is, then deploy caching, routing, and degradation strategies accordingly.
Cost Breakdown: Where the Money Actually Goes
Before optimizing, understand your agent cost structure. The total cost of an LLM call can be broken into three parts:
Direct cost: API call fees. This is the most visible number on the bill, but usually not the biggest waste source.
Indirect cost: Extra calls from retries, timeouts, and error handling. A failed request may trigger 2-3 retries, each billed at full price.
Opportunity cost: Using high-capability models for low-complexity tasks. Using GPT-4o for simple intent classification is like using a Ferrari for delivery -- it gets there, but the fuel cost is a problem.
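The indirect cost of retries can be estimated with a short calculation. The sketch below assumes each attempt fails independently with the same probability and that every attempt is billed at full price, as described above; real failure patterns and provider billing for failed calls may differ.

```python
def expected_billed_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per request under independent failures.

    Attempt k + 1 only happens if the first k attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return sum(failure_rate ** k for k in range(max_retries + 1))


def effective_cost_per_request(base_cost: float, failure_rate: float, max_retries: int) -> float:
    """Average cost of one request once retries are included."""
    return base_cost * expected_billed_attempts(failure_rate, max_retries)


# Example: 10% failure rate with up to 2 retries
print(expected_billed_attempts(0.10, 2))
```

A 10% failure rate with two retries adds about 11% to the direct cost, which is easy to miss when reading only the per-token price.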
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CostBreakdown:
    """Counters for requests, cache hits, and per-model calls over one day."""

    total_requests: int = 0
    cache_hits: int = 0
    # default_factory gives every instance its own dict
    model_calls: dict[str, int] = field(default_factory=dict)
    # Assumed average cost (USD) of one avoided API call; replace with your own
    avg_cost_per_call: float = 0.01

    @property
    def cache_hit_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.cache_hits / self.total_requests

    @property
    def cost_saved_by_cache(self) -> float:
        # Each cache hit avoids one API call
        return self.cache_hits * self.avg_cost_per_call

    def report(self) -> dict[str, Any]:
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{self.cache_hit_rate:.1%}",
            # Assumes the counters cover one day; multiplying by 30 projects a month
            "cost_saved_monthly": f"${self.cost_saved_by_cache * 30:.2f}",
            "model_distribution": dict(self.model_calls),
        }
Run a week of data collection and you will be surprised: often 20% of request patterns contribute 60% of repeated calls.
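One way to check how concentrated your repeats are is to group logged requests into rough patterns and count them. The sketch below is a minimal version that assumes you have the raw request texts as a list of strings; `normalize` is a deliberately crude, hypothetical grouping rule, and a real pipeline would more likely cluster on embeddings.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Crude pattern key: lowercase, drop punctuation and digits, collapse spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def repeat_concentration(requests: list[str], top_share: float = 0.2) -> float:
    """Share of repeated calls that come from the top `top_share` of patterns.

    A pattern seen n times contributes n - 1 repeated calls, because the
    first occurrence is not a repeat.
    """
    counts = Counter(normalize(r) for r in requests)
    repeats = sorted((n - 1 for n in counts.values()), reverse=True)
    total_repeats = sum(repeats)
    if total_repeats == 0:
        return 0.0
    top_n = max(1, round(len(repeats) * top_share))
    return sum(repeats[:top_n]) / total_repeats


log = ["How to refund?"] * 7 + ["Where is my order"] * 2 + ["a", "b", "c"]
print(f"{repeat_concentration(log):.0%} of repeats come from the top 20% of patterns")
```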
Strategy 1: Semantic Caching -- Not Just Key-Value Matching
Traditional HTTP caching is based on exact match (same URL + parameters). But LLM inputs are natural language -- users rarely repeat questions with identical wording.
The core idea of semantic caching: if two requests are similar enough to produce the same response, the second request should return the cached result directly.
import hashlib
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CacheEntry:
    key: str
    embedding: list[float]
    response: str
    model: str
    token_count: int
    created_at: str
    prompt_version: str = ""
    hit_count: int = 0
    similarity_threshold: float = 0.92


class SemanticCache:
    def __init__(self, embedding_fn, similarity_threshold: float = 0.92):
        # embedding_fn maps a string to a list of floats
        self.embedding_fn = embedding_fn
        self.similarity_threshold = similarity_threshold
        self._store: dict[str, CacheEntry] = {}

    def _compute_key(self, text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def get(self, query: str) -> CacheEntry | None:
        # 1) Exact match: a hash lookup, no embedding call needed
        key = self._compute_key(query)
        if key in self._store:
            entry = self._store[key]
            entry.hit_count += 1
            return entry
        # 2) Semantic match: linear scan, first entry at or above the threshold
        query_emb = self.embedding_fn(query)
        for entry in self._store.values():
            sim = self._cosine_similarity(query_emb, entry.embedding)
            if sim >= self.similarity_threshold:
                entry.hit_count += 1
                return entry
        return None

    def put(self, query: str, response: str, model: str, token_count: int,
            prompt_version: str = ""):
        key = self._compute_key(query)
        self._store[key] = CacheEntry(
            key=key,
            embedding=self.embedding_fn(query),
            response=response,
            model=model,
            token_count=token_count,
            created_at=datetime.now().isoformat(),
            prompt_version=prompt_version,
        )

    def invalidate(self, model: str, prompt_version: str | None = None):
        # Drop entries for `model`. If prompt_version is given, only entries
        # stored under that (stale) version are dropped.
        to_remove = [
            key for key, entry in self._store.items()
            if entry.model == model
            and (prompt_version is None or entry.prompt_version == prompt_version)
        ]
        for key in to_remove:
            del self._store[key]

    @staticmethod
    def _cosine_similarity(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        mag_a = math.sqrt(sum(x * x for x in a))
        mag_b = math.sqrt(sum(x * x for x in b))
        # Zero-length vectors have no direction; treat them as dissimilar
        return dot / (mag_a * mag_b) if mag_a and mag_b else 0.0
Key design decisions:
- Dual query strategy: exact match first (zero latency), then semantic similarity (covers paraphrases)
- Set similarity_threshold to around 0.92 -- lower returns wrong answers, higher reduces hit rate
- invalidate() ensures old caches do not pollute new responses when prompts or models change
- Cache keys should include model name and prompt version to avoid cross-contamination
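The 0.92 threshold is a starting point, and the trade-off between wrong answers and hit rate can be measured on your own data. The sketch below assumes you have a small hand-labelled set of query pairs, each with a similarity score and a flag saying whether the cached answer would have been correct; the sample numbers are made up for illustration.

```python
def sweep_thresholds(
    pairs: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[dict[str, float]]:
    """For each threshold, report the hit rate and the share of hits that are wrong.

    `pairs` holds (similarity, same_answer) for labelled query pairs;
    same_answer is True when serving the cached response would be correct.
    """
    rows = []
    for t in thresholds:
        hits = [same for sim, same in pairs if sim >= t]
        wrong = sum(1 for same in hits if not same)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(pairs) if pairs else 0.0,
            "wrong_hit_rate": wrong / len(hits) if hits else 0.0,
        })
    return rows


# Made-up labelled pairs: (cosine similarity, cached answer would be correct)
labelled = [(0.98, True), (0.95, True), (0.93, True),
            (0.90, False), (0.88, True), (0.80, False)]
for row in sweep_thresholds(labelled, [0.85, 0.92]):
    print(row)
```

On this toy data, lowering the threshold from 0.92 to 0.85 raises the hit rate but lets a wrong answer through, which is the trade-off the bullet above describes.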
Best for: FAQs, common questions, standardized flows (refund queries, order status). In these scenarios, user questions are highly repetitive, achieving 30-50% cache hit rates.
Tool reference: Pezzo claims up to 90% LLM cost and latency savings with its built-in cache layer. Helicone provides proxy-level caching with transparent support for OpenAI-compatible APIs. Bifrost uses similarity-based semantic caching that handles natural language variants.
Strategy 2: Model Routing -- Assign Models by Task Complexity
Not every request needs GPT-4o. Classify tasks by complexity and allocate appropriate models to each tier:
from dataclasses import dataclass
from typing import Any


@dataclass
class ModelTier:
    name: str
    model_id: str
    cost_per_1k_tokens: float  # USD per 1K tokens; check current provider pricing
    max_context: int
    best_for: list[str]


MODEL_TIERS: dict[str, ModelTier] = {
    "simple": ModelTier(
        name="simple",
        model_id="gpt-4o-mini",
        cost_per_1k_tokens=0.00015,
        max_context=128000,
        best_for=["intent_classification", "simple_qa", "formatting"],
    ),
    "medium": ModelTier(
        name="medium",
        model_id="gpt-4o",
        cost_per_1k_tokens=0.0025,
        max_context=128000,
        best_for=["rag_generation", "tool_planning", "multi_step_reasoning"],
    ),
    "complex": ModelTier(
        name="complex",
        model_id="claude-sonnet-4-20250514",
        cost_per_1k_tokens=0.003,
        max_context=200000,
        best_for=["complex_reasoning", "code_generation", "long_document"],
    ),
}


class ModelRouter:
    def __init__(self, classifier_fn):
        # classifier_fn maps a request to a tier name: "simple", "medium", or "complex"
        self.classifier_fn = classifier_fn
        self.call_stats: dict[str, int] = {}

    def _select(self, request: dict) -> ModelTier:
        # Unknown labels fall back to the medium tier
        task_type = self.classifier_fn(request)
        return MODEL_TIERS.get(task_type, MODEL_TIERS["medium"])

    def route(self, request: dict) -> ModelTier:
        tier = self._select(request)
        self.call_stats[tier.name] = self.call_stats.get(tier.name, 0) + 1
        return tier

    def estimate_cost(self, request: dict, estimated_tokens: int) -> float:
        # Uses _select rather than route so estimates do not inflate call_stats
        tier = self._select(request)
        return (estimated_tokens / 1000) * tier.cost_per_1k_tokens

    def report(self) -> dict[str, Any]:
        total = sum(self.call_stats.values())
        return {
            tier: {
                "calls": count,
                "percentage": f"{count / total:.1%}" if total else "0%",
            }
            for tier, count in self.call_stats.items()
        }
Key design decisions:
- Routing decisions must be made before sending the request, not after the model responds
- Classifiers can be implemented with a lightweight model or even a rule engine -- the cost should be far lower than the requests being routed
- Record every routing decision and actual model used for subsequent classifier optimization
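To see what routing is worth before building it, you can compute a blended price from an expected traffic mix. The per-1K prices below are the ones used in the tier table above; the traffic mix is an assumption for illustration, and the calculation ignores the cost of the classifier itself.

```python
# Prices per 1K tokens, taken from the tier definitions above
TIER_PRICE_PER_1K = {"simple": 0.00015, "medium": 0.0025, "complex": 0.003}
# Assumed share of traffic per tier; measure your own before relying on this
TRAFFIC_MIX = {"simple": 0.40, "medium": 0.45, "complex": 0.15}


def blended_price(mix: dict[str, float], prices: dict[str, float]) -> float:
    """Average price per 1K tokens when traffic is split across tiers."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * prices[tier] for tier, share in mix.items())


def savings_vs_single_model(mix: dict[str, float], prices: dict[str, float],
                            baseline_tier: str) -> float:
    """Fractional saving compared with sending everything to one tier."""
    return 1.0 - blended_price(mix, prices) / prices[baseline_tier]


print(f"blended: ${blended_price(TRAFFIC_MIX, TIER_PRICE_PER_1K):.6f} per 1K tokens")
print(f"saving vs all-medium: {savings_vs_single_model(TRAFFIC_MIX, TIER_PRICE_PER_1K, 'medium'):.1%}")
```

With this assumed mix the blended price is about a third lower than sending every request to the medium tier.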
Best for: Mixed workloads (simple Q&A alongside complex reasoning), multi-model environments, cost-sensitive production systems.
Tool reference: Helicone's AI Gateway provides intelligent routing and automatic failover. Bifrost connects 23+ LLM providers with automatic load balancing. Langfuse analytics can help identify which requests should be routed to cheaper models.
Strategy 3: Fallback Chain -- Dynamic Balance Between Cost and Quality
Sometimes you do not know which model a request needs. Use a fallback chain: try the cheapest model first, upgrade if confidence is insufficient.
from dataclasses import dataclass
from typing import Any


@dataclass
class FallbackConfig:
    primary_model: str
    fallback_model: str
    confidence_threshold: float = 0.8
    max_fallback_depth: int = 2


class FallbackChain:
    def __init__(self, primary_fn, fallback_fn, confidence_fn, config: FallbackConfig):
        self.primary_fn = primary_fn        # cheap model call
        self.fallback_fn = fallback_fn      # more capable, more expensive model call
        self.confidence_fn = confidence_fn  # maps a response to a score in [0, 1]
        self.fallback_config = config
        self.stats = {"primary_success": 0, "fallback_used": 0, "total": 0}

    def execute(self, request: dict) -> dict[str, Any]:
        self.stats["total"] += 1
        response = self.primary_fn(request)
        confidence = self.confidence_fn(response)
        if confidence >= self.fallback_config.confidence_threshold:
            self.stats["primary_success"] += 1
            return response
        # Primary answer was not good enough: escalate once
        self.stats["fallback_used"] += 1
        return self.fallback_fn(request)

    def cost_analysis(self, primary_cost: float, fallback_cost: float) -> dict:
        total = self.stats["total"]
        # Baseline: every request goes straight to the expensive fallback model
        fallback_only = total * fallback_cost
        # Actual: every request pays for the primary attempt;
        # escalated requests pay for the fallback call as well
        actual = total * primary_cost + self.stats["fallback_used"] * fallback_cost
        return {
            "hypothetical_fallback_only": f"${fallback_only:.2f}",
            "actual_with_fallback": f"${actual:.2f}",
            "savings": f"${fallback_only - actual:.2f}",
            "fallback_rate": f"{self.stats['fallback_used'] / total:.1%}" if total else "0%",
        }
Key design decisions:
- confidence_fn is the core -- it must quickly and reliably judge whether the current response is "good enough." Use length, structural completeness, or a lightweight classifier
- Cost benefit depends on the fallback rate. If 80% of requests succeed at the first tier, you only pay premium model costs for 20% of traffic
- Avoid infinite fallback -- set max_fallback_depth to prevent cycling through multiple models
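A confidence function built only from length and structural checks might look like the sketch below. It assumes the response is a dict with a "text" field; the refusal phrases, the penalties, and the 40-character minimum are arbitrary illustrative choices that would need tuning against real responses.

```python
import json

# Assumed phrases that signal the cheap model gave up
REFUSAL_MARKERS = ("i'm not sure", "i cannot", "i don't know")


def heuristic_confidence(response: dict, min_chars: int = 40,
                         expect_json: bool = False) -> float:
    """Return a score in [0, 1]; higher means the response looks usable."""
    text = str(response.get("text", "")).strip()
    if not text:
        return 0.0
    if expect_json:
        # Structural completeness: the payload must at least parse
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return 0.0
    score = 1.0
    if len(text) < min_chars:
        score -= 0.4  # suspiciously short answer
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        score -= 0.5  # model signalled uncertainty
    return max(score, 0.0)


print(heuristic_confidence({"text": "I don't know"}))
```

Checks like these cost microseconds, which keeps the confidence step far cheaper than the call it is guarding.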
Best for: Scenarios with high quality requirements but cost sensitivity, progressive enhancement when model capability is uncertain.
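Whether a fallback chain saves money depends on how often it escalates, and the break-even point is easy to work out. The sketch below assumes every request pays for the cheap attempt and each escalated request additionally pays for the expensive one; the per-request costs in the example are made up.

```python
def chain_cost_per_request(primary_cost: float, fallback_cost: float,
                           fallback_rate: float) -> float:
    """Average cost per request: always the primary, sometimes the fallback too."""
    return primary_cost + fallback_rate * fallback_cost


def break_even_fallback_rate(primary_cost: float, fallback_cost: float) -> float:
    """Fallback rate above which the chain costs more than always using the fallback model.

    Solves primary_cost + r * fallback_cost = fallback_cost for r.
    """
    if fallback_cost <= 0:
        raise ValueError("fallback_cost must be positive")
    return max(0.0, 1.0 - primary_cost / fallback_cost)


# Made-up costs: cheap model $0.001 per request, premium model $0.01 per request
print(chain_cost_per_request(0.001, 0.01, 0.2))
print(break_even_fallback_rate(0.001, 0.01))
```

With a tenfold price gap the chain stays cheaper until roughly 90% of requests escalate, but the escalated requests also pay the latency of two calls.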
Strategy 4: Prompt Caching -- The Overlooked Cost Killer
Many teams do not realize that repeated transmission of the system prompt is a hidden cost. If your system prompt is 2000 tokens and you serve 100K requests per day, that is 200M tokens of redundant transmission every day.
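The dollar value of that volume depends on your input price and on how much of it a cache can absorb. The sketch below is a rough estimate; the price, the cached share, and the discount on cached tokens are all assumptions, since discounts for repeated prompt prefixes vary by provider.

```python
def daily_system_prompt_cost(
    prompt_tokens: int,
    requests_per_day: int,
    price_per_1m_input: float,
    cached_share: float = 0.0,
    cached_discount: float = 0.0,
) -> float:
    """USD per day spent sending the system prompt.

    cached_share: fraction of requests whose prompt is served from a cache.
    cached_discount: fractional price reduction on those cached tokens.
    """
    total_tokens = prompt_tokens * requests_per_day
    full_price = total_tokens / 1_000_000 * price_per_1m_input
    return full_price * (1.0 - cached_share * cached_discount)


# 2000-token prompt, 100K requests/day, assumed $2.50 per 1M input tokens
print(daily_system_prompt_cost(2000, 100_000, 2.50))
# Same traffic, assuming 90% of requests hit a cache priced at a 50% discount
print(daily_system_prompt_cost(2000, 100_000, 2.50, cached_share=0.9, cached_discount=0.5))
```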
from dataclasses import dataclass
from typing import Any


@dataclass
class PromptCacheConfig:
    system_prompt: str
    # Declared for tuning; this minimal class does not enforce TTL or size yet
    cache_ttl_seconds: int = 300
    max_cache_size: int = 1000


class PromptCache:
    def __init__(self, config: PromptCacheConfig):
        self.config = config
        self._cache: dict[str, Any] = {}
        self._hits = 0
        self._misses = 0

    def get_system_prompt(self, version: str) -> str:
        # Keyed by prompt version, not by request
        if version in self._cache:
            self._hits += 1
            return self._cache[version]
        self._misses += 1
        self._cache[version] = self.config.system_prompt
        return self._cache[version]

    def invalidate(self, version: str):
        self._cache.pop(version, None)

    def stats(self) -> dict[str, Any]:
        total = self._hits + self._misses
        return {
            "hits": self._hits,
            "misses": self._misses,
            "hit_rate": f"{self._hits / total:.1%}" if total else "0%",
            # Whitespace word count is only a rough stand-in for a real tokenizer
            "tokens_saved": self._hits * len(self.config.system_prompt.split()),
        }
Key design decisions:
- System prompts are usually much longer than user inputs and change infrequently, which makes them the part most worth caching
- Cache keys should be prompt version hashes, not request hashes
- TTL settings must balance cache hit rates and prompt update latency
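A version hash for the cache key can be derived directly from the prompt text, so any edit to the prompt produces a new key automatically. The sketch below is one possible approach; the function name, the normalization rules, and the 12-character length are illustrative choices.

```python
import hashlib


def prompt_version(system_prompt: str, length: int = 12) -> str:
    """Short, stable identifier derived from the prompt text itself."""
    # Normalize line endings and outer whitespace so cosmetic changes
    # from different editors do not create a new version
    normalized = system_prompt.strip().replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


v1 = prompt_version("You are a support agent.\nBe concise.")
v2 = prompt_version("You are a support agent.\nBe concise and polite.")
print(v1, v2, v1 != v2)
```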
Best for: High-traffic agent systems, long system prompts (>1000 tokens), standardized prompt version management.
Tool reference: Langfuse prompt versioning helps track prompt changes and coordinate cache invalidation. Pezzo cache layer is specifically optimized for repeated prompt transmission.
Cost Optimization Decision Matrix
| Scenario | Primary Strategy | Expected Benefit | Implementation Complexity |
|---|---|---|---|
| High repeat FAQ | Semantic caching | 30-50% call reduction | Low |
| Mixed complexity workload | Model routing | 20-40% cost reduction | Medium |
| High quality requirements but cost-sensitive | Fallback chain | 15-30% cost reduction | Medium |
| High traffic + long system prompt | Prompt caching | 10-20% token savings | Low |
| Multiple model vendors | Gateway aggregation | Ops simplification + failover | Medium |
Priority recommendation:
- Deploy semantic caching first -- fastest ROI, simplest to implement
- Configure model routing next -- needs one week of data collection to train the classifier
- Add fallback chain -- progressive enhancement on critical paths
- Optimize prompt caching last -- requires coordination with prompt version management
Three Common Mistakes
Mistake 1: Cache keys too broad or too narrow
A cache key based only on "user ID" mixes answers to different questions. A key based on "full request text" has almost no hits. The correct approach is "task type + core entity" -- e.g., "order_status:ORD-12345" instead of the entire user message.
Mistake 2: Overly complex routing classifiers
Many teams use LLMs for task classification out of the gate, ending up with classifier costs approaching the requests being routed. Start with rule engines (keywords, regex) for coarse classification, only calling lightweight models at the boundaries.
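A rule-first classifier along these lines could look like the sketch below. The patterns, the length cutoff, and the optional `boundary_fn` hook (where a lightweight model would be called for requests no rule matches) are all illustrative assumptions; the tier names match the "simple", "medium", and "complex" tiers used earlier.

```python
import re

# Hypothetical rules, checked in order; derive real ones from your traffic
RULES = [
    ("complex", re.compile(r"\b(refactor|debug|stack trace|write (a |the )?(function|script|code))\b", re.I)),
    ("simple", re.compile(r"\b(refund|order status|where is my order|opening hours|reset password)\b", re.I)),
]


def classify(request: dict, boundary_fn=None, long_input_chars: int = 4000) -> str:
    """Return a tier name using cheap rules; defer to boundary_fn only when unsure."""
    text = request.get("text", "")
    if len(text) > long_input_chars:
        return "complex"  # long documents go to the large-context tier
    for tier, pattern in RULES:
        if pattern.search(text):
            return tier
    # No rule matched: this is the boundary case where a lightweight
    # model could be consulted; default to the medium tier otherwise
    return boundary_fn(request) if boundary_fn else "medium"


print(classify({"text": "How do I get a refund?"}))
print(classify({"text": "Please debug this stack trace"}))
```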
Mistake 3: Ignoring cache invalidation
The biggest risk of caching is not low hit rates -- it is returning stale answers. When prompts update, models switch, or business rules change, there must be a clear cache invalidation strategy. We recommend having the CI/CD pipeline trigger automatic cache cleanup whenever a version changes.
Summary
- The first step in agent cost optimization is always quantification. Use Langfuse or Helicone to track for a week and you will find that 20% of request patterns contribute 60% of waste
- Semantic caching is the highest ROI strategy. Repeat questions in customer service and technical support scenarios can reach 30-50% of traffic; cache hits completely avoid API calls
- Model routing is not "use whichever is cheapest" but "use whichever is just good enough". Use gpt-4o-mini for simple tasks, Claude Sonnet only for complex reasoning
- Fallback chains put a floor on quality and a ceiling on cost. Try cheap models first, upgrade only when confidence is insufficient, avoiding over-provisioning
- Prompt caching is close to a free lunch. System prompts typically account for 40-60% of token consumption and change infrequently; caching them costs almost nothing
- Bifrost and Pezzo provide out-of-the-box gateway and caching layers that can deploy these strategies within a week without building from scratch
For a complete agent cost-control stack, explore Langfuse (LLM observability and cost analytics), Helicone (proxy monitoring and caching), Pezzo (prompt management and caching layer), and Bifrost (LLM gateway and semantic caching).
Projects in this article
Langfuse
30.2k ⭐ Open-source LLM engineering platform providing tracing, evaluations, prompt management, and dataset management with integrations for LangChain, OpenAI, Anthropic, and more.
Helicone
5.9k ⭐ Helicone is an open-source proxy and observability platform for LLM applications, offering request tracing, caching, and cost analytics.
Pezzo
3.2k ⭐ An open-source, developer-first LLMOps platform for streamlined prompt design, version management, real-time observability, monitoring, and team collaboration across LLM applications.
Bifrost
6.2k ⭐ An observability and gateway platform for LLM applications, providing request tracing, model routing, logging, and cost analysis for agent workflows.