Agent Observability in Practice: OpenTelemetry to Production Traces
Build a production-grade observability stack for multi-step agents using OpenTelemetry: OpenLLMetry semantic conventions, hierarchical span correlation, token cost attribution, retrieval quality metrics, and layered alerting.
Engineers running AI agents in production all hit the same wall: when a multi-step agent task fails, drifts, or spikes in latency, how do you identify the root cause within five minutes -- is it the model, the tool, or retrieval? That is the core value of agent observability: turning a black-box reasoning process into a queryable, correlated, alertable engineering system. This article takes a production-engineering perspective on agent observability, covering OpenTelemetry semantic conventions, trace correlation models, token cost attribution, and anomaly alerting patterns.
Why Agents Need Observability More Than Traditional Services
Observability for traditional microservices is mature -- metrics, logs, and traces underpin almost every distributed system. Agent systems run on the same technology stack but have three fundamental differences.
First, multi-step reasoning is nested and non-deterministic. A single user request may trigger 5 to 20 LLM calls, each with a different prompt template, retrieval result, and tool output. If you do not instrument call boundaries, the trace shows only a single "agent invocation" with no visibility into which step went wrong.
Second, cost is token-dimensional, not request-dimensional. The same agent task on two runs may consume wildly different token counts depending on retrieved context length and chain-of-thought depth. Without per-step attribution, you cannot answer "which user or task type is burning money."
Third, failure modes are gradual rather than sudden. Model upgrades, prompt tweaks, and index refreshes cause agents to drift slowly: success rates slide from 95% to 80% over weeks, with each individual day looking normal. Traditional "5xx error rate" alerts are useless; you need statistical metrics combined with offline evaluation.
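To make the gradual-drift point concrete, here is a minimal sketch that compares day-to-day movement with the week-over-week change. The daily success rates are made-up illustrative data, and the function names are hypothetical rather than part of any library.

```python
from statistics import mean


def max_daily_change(daily_rates: list) -> float:
    """Largest absolute change between two consecutive days."""
    return max(abs(b - a) for a, b in zip(daily_rates, daily_rates[1:]))


def week_over_week_drop(daily_rates: list, window: int = 7) -> float:
    """Mean of the previous window minus mean of the latest window."""
    if len(daily_rates) < 2 * window:
        raise ValueError("need at least two full windows of data")
    previous = mean(daily_rates[-2 * window:-window])
    latest = mean(daily_rates[-window:])
    return previous - latest


# Illustrative data: success rate slides by one point per day for two weeks.
rates = [0.95 - 0.01 * day for day in range(14)]

print(f"largest daily change: {max_daily_change(rates):.3f}")
print(f"week-over-week drop:  {week_over_week_drop(rates):.3f}")
```

In this data no single day moves by more than one point, yet the weekly mean falls by seven points, which is the pattern a per-day threshold alert tends to miss.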
OpenTelemetry Semantic Conventions: Putting Agents in a Standard Protocol
The core challenge of integrating agents with OpenTelemetry is not technical -- it is mapping LLM concepts to span semantics. We recommend following the OpenLLMetry community conventions:
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Register a provider that batches spans and ships them over OTLP/gRPC.
    # With no arguments, the exporter takes its endpoint from the standard OTEL_* environment variables.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("agent.runtime")

    # Standard GenAI span naming
    with tracer.start_as_current_span("openai.chat") as span:
        # Request parameters
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        span.set_attribute("gen_ai.request.max_tokens", 4096)
        span.set_attribute("gen_ai.request.temperature", 0.7)

        # openai_client is an already-constructed OpenAI client; call arguments are elided
        response = openai_client.chat.completions.create(...)

        # Usage and response metadata
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
        span.set_attribute("gen_ai.usage.total_tokens", response.usage.total_tokens)
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.response.finish_reasons", [choice.finish_reason for choice in response.choices])
Key design points:
- gen_ai.system, gen_ai.request.model, and gen_ai.usage.* are semantic attributes defined by OpenLLMetry, supported by all compatible backends (Langfuse, Phoenix, OpenInference, OpenLit)
- Span names follow the {provider}.{operation} pattern: openai.chat, anthropic.messages, cohere.rerank
- Every LLM call becomes an independent span nested under the agent's parent span, preserving call order
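One way to keep these conventions consistent across call sites is to build the span name and attribute set in a single helper. The sketch below is illustrative only: Usage is a hypothetical stand-in for a provider's usage object, and the helper names are not part of OpenLLMetry or OpenTelemetry.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for the usage object a provider SDK returns."""
    input_tokens: int
    output_tokens: int


def span_name(provider: str, operation: str) -> str:
    """Build a span name following the {provider}.{operation} pattern."""
    return f"{provider}.{operation}"


def genai_attributes(provider: str, model: str, usage: Usage) -> dict:
    """Collect the gen_ai.* attributes for one LLM call into a plain dict."""
    return {
        "gen_ai.system": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": usage.input_tokens,
        "gen_ai.usage.output_tokens": usage.output_tokens,
        "gen_ai.usage.total_tokens": usage.input_tokens + usage.output_tokens,
    }


name = span_name("anthropic", "messages")
attrs = genai_attributes("anthropic", "claude-sonnet-4", Usage(1200, 300))
print(name, attrs)
```

Inside a real span, the resulting dict can be applied in one call with span.set_attributes(attrs), so every call site emits the same attribute keys.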
Multi-Step Agent Trace Correlation Model
The most common observability mistake is treating the entire agent as a single span. That collapses timing data and eliminates all root-cause capability. The correct approach is to break agents into hierarchical spans:
    @tracer.start_as_current_span("agent.run")
    def run_agent(user_query: str, session_id: str):
        span = trace.get_current_span()
        span.set_attribute("agent.session_id", session_id)
        span.set_attribute("agent.user_query", user_query)

        with tracer.start_as_current_span("agent.plan") as plan_span:
            plan = llm_call_planner(user_query)
            plan_span.set_attribute("agent.plan.steps", len(plan.steps))

        results = []
        for i, step in enumerate(plan.steps):
            with tracer.start_as_current_span(f"agent.step[{i}]") as step_span:
                step_span.set_attribute("agent.step.tool", step.tool)
                step_span.set_attribute("agent.step.input", step.input)

                with tracer.start_as_current_span(f"tool.{step.tool}") as tool_span:
                    output = execute_tool(step)
                    tool_span.set_attribute("tool.output_size", len(str(output)))

                with tracer.start_as_current_span("openai.chat") as llm_span:
                    reasoning = llm_reason(step, output)
                    llm_span.set_attribute("gen_ai.usage.total_tokens", reasoning.usage.total_tokens)

                results.append(reasoning)

        span.set_attribute("agent.total_steps", len(plan.steps))
        return aggregate(results)
Span tree structure:
    agent.run
    |-- agent.plan
    |   `-- openai.chat (Planner LLM)
    |-- agent.step[0]
    |   |-- tool.search
    |   `-- openai.chat (Reasoning LLM)
    |-- agent.step[1]
    |   |-- tool.calculator
    |   `-- openai.chat (Reasoning LLM)
    `-- agent.step[2]
        `-- openai.chat (Final answer)
This structure pays off in three concrete ways in Langfuse or Phoenix:
- Slow request localization: the slowest step's LLM call is visible at a glance
- Cost attribution: per-step token consumption is recorded, aggregable by session_id or user_id
- Failure localization: exceptions bind to specific spans, separating tool timeouts from model timeouts
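The same questions can be answered programmatically once spans are exported. The following sketch assumes spans have been flattened into simple records; SpanRecord and its fields are hypothetical and do not match any backend's export schema. It reports, for each step, the total tokens in its subtree and its slowest direct child.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical flattened span: just enough fields to rebuild the tree."""
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def children_index(spans):
    """Map each parent span id to the list of its direct children."""
    index = defaultdict(list)
    for s in spans:
        index[s.parent_id].append(s)
    return index


def subtree_tokens(span, index):
    """Sum gen_ai.usage.total_tokens over a span and all of its descendants."""
    own = span.attributes.get("gen_ai.usage.total_tokens", 0)
    return own + sum(subtree_tokens(child, index) for child in index[span.span_id])


def summarize_steps(spans):
    """For every agent.step[i] span, report token total and slowest direct child."""
    index = children_index(spans)
    summary = {}
    for s in spans:
        if s.name.startswith("agent.step["):
            kids = index[s.span_id]
            slowest = max(kids, key=lambda c: c.duration_ms).name if kids else None
            summary[s.name] = {"tokens": subtree_tokens(s, index), "slowest_child": slowest}
    return summary


# Illustrative trace: step 0 is dominated by the LLM, step 1 by a slow tool.
trace_spans = [
    SpanRecord("1", None, "agent.run", 4200),
    SpanRecord("2", "1", "agent.step[0]", 1500),
    SpanRecord("3", "2", "tool.search", 300),
    SpanRecord("4", "2", "openai.chat", 1200, {"gen_ai.usage.total_tokens": 850}),
    SpanRecord("5", "1", "agent.step[1]", 2600),
    SpanRecord("6", "5", "tool.calculator", 2500),
    SpanRecord("7", "5", "openai.chat", 100, {"gen_ai.usage.total_tokens": 120}),
]
step_summary = summarize_steps(trace_spans)
for step, info in step_summary.items():
    print(step, info)
```

With a single agent.run span, none of this would be recoverable: the 2500 ms calculator call and the 850-token reasoning call would both disappear into one duration.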
Token Cost Attribution: Pinning Dollars to Spans
Without attribution, agent token costs surface only in after-the-fact accounting. Recording cost as OTel span attributes makes attribution available in real time:
    def record_llm_cost(span, model: str, input_tokens: int, output_tokens: int):
        # Reference pricing as of 2025 (update to actual), in USD per token
        pricing = {
            "gpt-4o": {"input": 2.5e-6, "output": 1e-5},
            "claude-sonnet-4": {"input": 3e-6, "output": 1.5e-5},
            "deepseek-chat": {"input": 1.4e-7, "output": 2.8e-7},
        }
        # Models missing from the table are recorded with zero cost
        p = pricing.get(model, {"input": 0, "output": 0})
        cost_usd = input_tokens * p["input"] + output_tokens * p["output"]
        span.set_attribute("gen_ai.usage.cost_usd", cost_usd)
        span.set_attribute("gen_ai.usage.input_cost_usd", input_tokens * p["input"])
        span.set_attribute("gen_ai.usage.output_cost_usd", output_tokens * p["output"])
Wrap this in a unified traced_llm_call() function that all LLM calls go through:
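    # Map model-name prefixes to provider names so span names follow {provider}.{operation}.
    # Extend this table for the models you actually use.
    PROVIDER_BY_PREFIX = {"gpt": "openai", "claude": "anthropic", "deepseek": "deepseek"}

    async def traced_llm_call(prompt: str, model: str = "gpt-4o", **kwargs):
        provider = PROVIDER_BY_PREFIX.get(model.split("-")[0], "unknown")
        with tracer.start_as_current_span(f"{provider}.chat") as span:
            span.set_attribute("gen_ai.system", provider)
            span.set_attribute("gen_ai.request.model", model)

            # client is assumed to be an async, OpenAI-compatible client created elsewhere
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                **kwargs
            )

            usage = response.usage
            span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
            span.set_attribute("gen_ai.usage.total_tokens", usage.total_tokens)
            record_llm_cost(span, model, usage.prompt_tokens, usage.completion_tokens)
            return response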
After aggregation in Langfuse or Phoenix by gen_ai.usage.cost_usd, you can see:
- Which users cost the most (filtered by session_id or user_id)
- Which task types are most expensive (filtered by agent.task_type)
- Which prompts waste the most tokens (by gen_ai.request.model plus length distribution)
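The aggregation itself is a group-by over span attributes. As a rough sketch of what a backend does for these views, the code below sums gen_ai.usage.cost_usd by an arbitrary dimension. It assumes each exported span is available as a plain attribute dict and that correlation attributes such as user_id and agent.task_type have already been copied onto the LLM spans; the sample values are invented.

```python
from collections import defaultdict


def cost_by(span_attrs: list, dimension: str) -> list:
    """Sum gen_ai.usage.cost_usd per value of `dimension`, most expensive first."""
    totals = defaultdict(float)
    for attrs in span_attrs:
        cost = attrs.get("gen_ai.usage.cost_usd")
        if cost is None:
            continue  # not an LLM span, or cost was never recorded
        totals[str(attrs.get(dimension, "unknown"))] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative span attributes
spans = [
    {"user_id": "u1", "agent.task_type": "research", "gen_ai.usage.cost_usd": 0.5},
    {"user_id": "u2", "agent.task_type": "summarize", "gen_ai.usage.cost_usd": 0.25},
    {"user_id": "u1", "agent.task_type": "summarize", "gen_ai.usage.cost_usd": 0.125},
    {"user_id": "u1", "tool.name": "search"},  # tool span without a cost
]

print(cost_by(spans, "user_id"))
print(cost_by(spans, "agent.task_type"))
```

Spans that lack the chosen dimension are grouped under "unknown", which makes missing correlation attributes visible instead of silently dropping their cost.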
Observability for the Retrieval Stage
The retrieval stage of an agent system (RAG) is frequently overlooked but is a leading source of quality fluctuation. Key metrics for retrieval observability:
    from statistics import stdev

    with tracer.start_as_current_span("retrieval.search") as span:
        span.set_attribute("retrieval.query", query)
        span.set_attribute("retrieval.top_k", top_k)
        span.set_attribute("retrieval.embedding_model", embedding_model)

        # Results are assumed to be sorted by score, best first
        results = vector_store.search(query, top_k=top_k)

        span.set_attribute("retrieval.results_count", len(results))
        span.set_attribute("retrieval.top_score", results[0].score if results else 0)
        span.set_attribute("retrieval.min_score", min(r.score for r in results) if results else 0)
        span.set_attribute("retrieval.score_stddev", stdev([r.score for r in results]) if len(results) > 1 else 0)
        span.set_attribute("retrieval.has_high_confidence", any(r.score > 0.8 for r in results))
Core metrics:
- retrieval.top_score: highest relevance, indicating retrieval quality
- retrieval.score_stddev: distribution of scores, distinguishing "all results equally relevant" (weak signal) from "one or two stand out" (strong signal)
- retrieval.has_high_confidence: whether a high-confidence hit exists; when false, the agent should fall back to web search or other strategies
With these metrics on spans, you can build a "low-confidence query ratio" alert in Phoenix or Langfuse. A sustained rise usually indicates vector index drift or document corpus changes.
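The alert condition can be sketched in a few lines. This example assumes retrieval spans are available as attribute dicts grouped into time windows; the 40% threshold is the one used in the alerting section of this article, while the three-window requirement for "sustained" is an arbitrary illustrative choice.

```python
def low_confidence_ratio(retrieval_spans: list) -> float:
    """Fraction of retrieval spans with no high-confidence hit."""
    if not retrieval_spans:
        return 0.0
    low = sum(
        1 for attrs in retrieval_spans
        if not attrs.get("retrieval.has_high_confidence", False)
    )
    return low / len(retrieval_spans)


def sustained_breach(ratios: list, threshold: float = 0.4, min_windows: int = 3) -> bool:
    """True only if the most recent `min_windows` ratios all exceed the threshold."""
    recent = ratios[-min_windows:]
    return len(recent) == min_windows and all(r > threshold for r in recent)


# One illustrative window of five retrieval spans, three without a confident hit
window = [
    {"retrieval.has_high_confidence": True},
    {"retrieval.has_high_confidence": False},
    {"retrieval.has_high_confidence": False},
    {"retrieval.has_high_confidence": True},
    {"retrieval.has_high_confidence": False},
]
hourly_ratios = [0.2, 0.45, 0.5, low_confidence_ratio(window)]
print(low_confidence_ratio(window), sustained_breach(hourly_ratios))
```

Requiring several consecutive windows keeps a single burst of unusual queries from paging anyone, at the cost of reacting a few windows later.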
Tool-Call Observability
Tool calls are the most likely component of an agent system to fail silently. Design principles:
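    import json
    import time

    from opentelemetry import metrics

    meter = metrics.get_meter("agent.runtime")
    tool_call_counter = meter.create_counter("tool_call_total")
    tool_latency = meter.create_histogram("tool_duration_ms", unit="ms")

    def traced_tool_call(name: str, timeout_seconds: float = 30.0, **kwargs):
        # The span name depends on the tool, so the span is opened inside the
        # function rather than by a decorator with a fixed name.
        # timeout_seconds is assumed to be the budget the tool enforces; it is only recorded here.
        with tracer.start_as_current_span(f"tool.{name}") as span:
            span.set_attribute("tool.name", name)
            # Truncated raw input helps debugging but may contain PII; prefer size and hash in production
            span.set_attribute("tool.input", json.dumps(kwargs, default=str)[:1000])
            start = time.perf_counter()
            status = "success"
            try:
                result = tool_registry[name](**kwargs)
                span.set_attribute("tool.output_size", len(str(result)))
                return result
            except TimeoutError:
                status = "timeout"
                span.set_attribute("error.type", "timeout")
                span.set_attribute("error.timeout_seconds", timeout_seconds)
                raise
            except Exception as e:
                status = "error"
                span.set_attribute("error.type", type(e).__name__)
                span.set_attribute("error.message", str(e)[:500])
                raise
            finally:
                # Runs on success and failure, so latency and status are always recorded
                elapsed_ms = (time.perf_counter() - start) * 1000
                span.set_attribute("tool.duration_ms", elapsed_ms)
                span.set_attribute("tool.status", status)
                tool_call_counter.add(1, {"tool": name, "status": status})
                tool_latency.record(elapsed_ms, {"tool": name, "status": status})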
Critical attributes:
- tool.duration_ms combined with tool.status powers P50/P95/P99 latency and error-rate dashboards
- error.type distinguishes timeout, rate_limit, auth_error, and validation_error -- each requires different alert thresholds
- Avoid stuffing raw input/output into span attributes (may contain PII); record only size and hash
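Two of these points lend themselves to small helpers. The sketch below shows one possible way to reduce a payload to its size and hash, and to map exceptions onto the error categories named above. The mapping rules are assumptions for illustration, in particular the status_code attribute, which stands in for whatever your HTTP client attaches to its errors.

```python
import hashlib
import json


def payload_fingerprint(payload) -> dict:
    """Describe a payload by size and SHA-256 digest instead of its raw content."""
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return {"size_bytes": len(encoded), "sha256": hashlib.sha256(encoded).hexdigest()}


def classify_error(exc: BaseException) -> str:
    """Map an exception to a coarse error.type category (illustrative rules)."""
    # Hypothetical attribute: many HTTP client errors expose a status code
    status = getattr(exc, "status_code", None)
    if status == 429:
        return "rate_limit"
    if status in (401, 403) or isinstance(exc, PermissionError):
        return "auth_error"
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, (ValueError, TypeError)):
        return "validation_error"
    return "error"


class FakeHttpError(Exception):
    """Stand-in for a client library error carrying an HTTP status code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


print(payload_fingerprint({"query": "quarterly revenue", "user": "alice@example.com"}))
print(classify_error(FakeHttpError(429)), classify_error(TimeoutError()))
```

Sorting keys before hashing means two calls with the same arguments in a different order produce the same digest, so repeated inputs can still be correlated without storing them.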
Anomaly Detection and Alerting Patterns
With complete span data, the next step is alerting. Agent system alerts should be layered:
Layer 1: hard error alerts (must be real-time)
- 5xx error rate above 1% over the last 5 minutes
- P95 latency exceeding 1.5x SLA
- Provider API 429/5xx ratio above 10%
Layer 2: quality drift alerts (hourly aggregation)
- Task success rate (LLM-as-judge evaluated) dropping more than 5% week-over-week
- Average step count (chain length) suddenly rising more than 30%
- Low-confidence retrieval ratio (retrieval.has_high_confidence == false) above 40%
Layer 3: cost alerts (daily aggregation)
- Daily token cost exceeding 80% of budget
- A single tenant or task type costing more than 5x the global average
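The two Layer 3 checks reduce to simple arithmetic over daily totals. The following is a minimal sketch with invented tenant costs; it reads "global average" as the mean daily cost per tenant, which is an assumption about the intended definition.

```python
from statistics import mean


def budget_breached(spent_usd: float, daily_budget_usd: float, fraction: float = 0.8) -> bool:
    """True once spend exceeds the given fraction of the daily budget."""
    return spent_usd > fraction * daily_budget_usd


def cost_outliers(cost_by_tenant: dict, multiple: float = 5.0) -> list:
    """Tenants whose daily cost exceeds `multiple` times the per-tenant mean."""
    if not cost_by_tenant:
        return []
    average = mean(cost_by_tenant.values())
    return sorted(t for t, cost in cost_by_tenant.items() if cost > multiple * average)


# Illustrative daily costs: nine ordinary tenants and one heavy one
daily_costs = {f"tenant-{i}": 1.0 for i in range(9)}
daily_costs["tenant-heavy"] = 100.0

print(cost_outliers(daily_costs))
print(budget_breached(sum(daily_costs.values()), daily_budget_usd=120.0))
```

Because the mean includes the heavy tenant itself, this rule cannot fire at all with five or fewer tenants; comparing against the median is one alternative for small tenant counts.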
Encode alert rules in code rather than dashboard configuration files for version control, rollback, and auditability:
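    # alerts.py
    ALERT_RULES = {
        "error_rate_spike": {
            # Sum both sides so the ratio is taken across all tools and statuses
            "query": "sum(rate(tool_call_total{status='error'}[5m])) / sum(rate(tool_call_total[5m]))",
            "threshold": 0.01,
            "window": "5m",
        },
        "cost_daily_budget": {
            "query": "sum(gen_ai_usage_cost_usd_total)",
            # Fraction of the daily budget: divide the query result by the budget in USD before comparing
            "threshold": 0.8,  # 80% of daily budget
            "window": "1d",
        },
    }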
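Rules kept as data still need something to evaluate them. The sketch below shows one possible evaluator: it takes a rules dict of the same shape and a run_query callable, which is a hypothetical hook you would implement against your metrics backend. Here it is faked with a lookup table so the example runs on its own, and the query strings are placeholders.

```python
from typing import Callable


def evaluate_rules(rules: dict, run_query: Callable[[str], float]) -> list:
    """Return the names of all rules whose query result exceeds its threshold."""
    fired = []
    for name, rule in rules.items():
        value = run_query(rule["query"])
        if value > rule["threshold"]:
            fired.append(name)
    return fired


# Example rules with placeholder query strings
example_rules = {
    "error_rate_spike": {"query": "error_rate_5m", "threshold": 0.01, "window": "5m"},
    "cost_daily_budget": {"query": "budget_fraction_1d", "threshold": 0.8, "window": "1d"},
}

# Fake backend: maps each query string to a canned result
canned_results = {"error_rate_5m": 0.03, "budget_fraction_1d": 0.42}

fired_rules = evaluate_rules(example_rules, canned_results.__getitem__)
print(fired_rules)
```

Keeping the evaluator separate from the rule definitions means the rules file stays reviewable as plain data, and the same rules can be replayed against historical metrics in tests.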
Backend Selection: Langfuse / Phoenix / OpenLit
| Backend | Deployment | Strength | Best For |
|---|---|---|---|
| Langfuse | SaaS or self-hosted | Prompt version management, user feedback collection | Mid-sized teams that need prompt iteration tracking |
| Phoenix (Arize) | Self-hosted or SaaS | Powerful span search, embedding visualization | Existing OTel infrastructure requiring deep debugging |
| OpenLit | Pure OTLP collector | Compatible with any OTel backend (Datadog, Grafana, Honeycomb) | Existing unified OTel infrastructure |
| Weave (W&B) | SaaS | Tight integration with W&B experiment tracking | Existing W&B ecosystem |
If your team is just starting out, Langfuse is the easiest entry point -- it has full prompt template management, user feedback labeling, and span search out of the box. If you already use Datadog, Grafana, or Honeycomb as a general APM, OpenLit plus OTLP is the more elegant path, avoiding observability data silos.
Implementation Path
Week 1: Adopt OpenLLMetry semantic conventions so all LLM calls produce standardized spans.
Week 2: Wrap every tool call in traced_tool_call, recording duration, status, and error.
Week 3: Establish trace correlation IDs, propagating session_id, user_id, and task_id to all child spans.
Week 4: Implement token cost attribution and build a cost dashboard.
Week 5: Integrate an offline evaluator (LLM-as-judge) so success rate becomes a computable metric.
Week 6: Route hard error alerts into PagerDuty or Feishu.
Week 7: Produce a weekly quality drift report to identify slowly-declining trends.
Summary
Agent observability is not just "add an APM." Its core value is making reasoning transparent, cost attributable to business dimensions, and quality drift alertable. Start with OpenLLMetry semantic conventions, put every LLM call, tool call, and retrieval query into a standard span, then ship spans over OTLP to Langfuse, Phoenix, or OpenLit. Finally, version-controlled alert rules protect SLAs and cost ceilings.
For agent systems already in production, observability is not optional -- it is the engineering step that turns an agent from "a talking demo" into "trusted infrastructure."
Reference tools: Langfuse (open-source LLM observability platform), Phoenix (Arize) (experiment and evaluation platform), OpenLit (OTel collector), OpenInference (OTel semantic conventions), and Weave (W&B) (experiment tracking) form a solid starting point for any agent observability stack.
Projects in this article
- Langfuse (30.2k ⭐): Open-source LLM engineering platform providing tracing, evaluations, prompt management, and dataset management with integrations for LangChain, OpenAI, Anthropic, and more.
- Phoenix (10.3k ⭐): Arize AI's open-source LLM eval and observability with notebook-first UX.
- OpenLIT (2.6k ⭐): OpenLIT is an open-source AI engineering platform providing OpenTelemetry-native LLM observability, GPU monitoring, guardrails, evaluations, prompt management, and playground, integrating with 50+ LLM providers and agent frameworks.
- OpenTelemetry Python (2.5k ⭐): CNCF-hosted official Python implementation of OpenTelemetry, providing standardized APIs and SDKs for telemetry data collection (traces, metrics, logs), the de facto standard for cloud-native application observability.
- Weave (1.1k ⭐): A toolkit by Weights & Biases for developing AI-powered applications, providing LLM call tracing, evaluation experiment management, and versioning from prototype to production.