Agent Observability in Practice: OpenTelemetry to Production Traces

Build a production-grade observability stack for multi-step agents using OpenTelemetry: OpenLLMetry semantic conventions, hierarchical span correlation, token cost attribution, retrieval quality metrics, and layered alerting.

AgentList · June 29, 2026
Tags: Observability, OpenTelemetry, LLMOps, trace, cost attribution

Engineers running AI agents in production all hit the same wall: when a multi-step agent task fails, drifts, or spikes in latency, how do you identify the root cause within five minutes -- is it the model, the tool, or retrieval? That is the core value of agent observability: turning a black-box reasoning process into a queryable, correlated, alertable engineering system. This article takes a production-engineering perspective on agent observability, covering OpenTelemetry semantic conventions, trace correlation models, token cost attribution, and anomaly alerting patterns.

Why Agents Need Observability More Than Traditional Services

Observability for traditional microservices is mature -- metrics, logs, and traces underpin almost every distributed system. Agent systems run on the same technology stack but have three fundamental differences.

First, multi-step reasoning is nested and non-deterministic. A single user request may trigger 5 to 20 LLM calls, each with a different prompt template, retrieval result, and tool output. If you do not instrument call boundaries, your traces will show a single "agent invocation" with no visibility into which step went wrong.

Second, cost scales with tokens, not with requests. Two runs of the same agent task may consume wildly different token counts depending on retrieved context length and chain-of-thought depth. Without per-step attribution, you cannot answer "which user or task type is burning money."

Third, failure modes are gradual rather than sudden. Model upgrades, prompt tweaks, and index refreshes cause agents to drift slowly: success rates slide from 95% to 80% over weeks, with each individual day looking normal. Traditional "5xx error rate" alerts do not catch this on their own; you also need statistical metrics combined with offline evaluation.

OpenTelemetry Semantic Conventions: Putting Agents in a Standard Protocol

The hard part of integrating agents with OpenTelemetry is not the plumbing -- it is mapping LLM concepts onto span semantics. We recommend following the gen_ai.* attribute conventions popularized by the OpenLLMetry community:

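from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register a provider that batches spans and ships them over OTLP/gRPC
# (the exporter targets localhost:4317 unless configured otherwise)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.runtime")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Standard GenAI span naming
with tracer.start_as_current_span("openai.chat") as span:
    # Request attributes are set before the call so they survive a failed request
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.max_tokens", 4096)
    span.set_attribute("gen_ai.request.temperature", 0.7)
    
    # messages: the chat history for this call, supplied by the caller
    response = openai_client.chat.completions.create(
        model="gpt-4o", max_tokens=4096, temperature=0.7, messages=messages
    )
    
    # Response attributes: token usage, resolved model, and finish reasons
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    span.set_attribute("gen_ai.usage.total_tokens", response.usage.total_tokens)
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.response.finish_reasons", [choice.finish_reason for choice in response.choices])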

Key design points:

  • gen_ai.system, gen_ai.request.model, and gen_ai.usage.* are semantic attributes from the OpenLLMetry conventions, understood by compatible backends and tooling (Langfuse, Phoenix, OpenInference, OpenLit)
  • Span names follow the {provider}.{operation} pattern: openai.chat, anthropic.messages, cohere.rerank
  • Every LLM call becomes an independent span nested under the agent's parent span, preserving call order

Multi-Step Agent Trace Correlation Model

The most common observability mistake is treating the entire agent as a single span. That collapses timing data and eliminates all root-cause capability. The correct approach is to break agents into hierarchical spans:

@tracer.start_as_current_span("agent.run")
def run_agent(user_query: str, session_id: str):
    span = trace.get_current_span()
    span.set_attribute("agent.session_id", session_id)
    span.set_attribute("agent.user_query", user_query)
    
    with tracer.start_as_current_span("agent.plan") as plan_span:
        plan = llm_call_planner(user_query)
        plan_span.set_attribute("agent.plan.steps", len(plan.steps))
    
    results = []
    for i, step in enumerate(plan.steps):
        with tracer.start_as_current_span(f"agent.step[{i}]") as step_span:
            step_span.set_attribute("agent.step.tool", step.tool)
            step_span.set_attribute("agent.step.input", step.input)
            
            with tracer.start_as_current_span(f"tool.{step.tool}") as tool_span:
                output = execute_tool(step)
                tool_span.set_attribute("tool.output_size", len(str(output)))
            
            with tracer.start_as_current_span("openai.chat") as llm_span:
                reasoning = llm_reason(step, output)
                llm_span.set_attribute("gen_ai.usage.total_tokens", reasoning.usage.total_tokens)
            
            results.append(reasoning)
    
    span.set_attribute("agent.total_steps", len(plan.steps))
    return aggregate(results)

Span tree structure:

agent.run
|-- agent.plan
|   `-- openai.chat (Planner LLM)
|-- agent.step[0]
|   |-- tool.search
|   `-- openai.chat (Reasoning LLM)
|-- agent.step[1]
|   |-- tool.calculator
|   `-- openai.chat (Reasoning LLM)
`-- agent.step[2]
    `-- openai.chat (Final answer)

This structure pays off in three concrete ways in Langfuse or Phoenix:

  1. Slow request localization: the slowest step's LLM call is visible at a glance
  2. Cost attribution: per-step token consumption is recorded, aggregable by session_id or user_id
  3. Failure localization: exceptions bind to specific spans, separating tool timeouts from model timeouts
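
The slow-request localization in point 1 can be sketched in a few lines. This is only an illustration, assuming spans have already been exported as flat records with hypothetical fields span_id, parent_id, name, and duration_ms (each backend exposes equivalents under its own names). It rebuilds the tree and follows the slowest child at every level:

```python
from collections import defaultdict

def slowest_path(spans):
    """Return span names from the root down, following the slowest child each time."""
    children = defaultdict(list)
    root = None
    for s in spans:
        if s["parent_id"] is None:
            root = s
        else:
            children[s["parent_id"]].append(s)

    path = []
    node = root
    while node is not None:
        path.append(node["name"])
        kids = children.get(node["span_id"], [])
        # Descend into whichever child took the longest; stop at a leaf.
        node = max(kids, key=lambda k: k["duration_ms"]) if kids else None
    return path

# Illustrative records only; the durations are made up.
spans = [
    {"span_id": "a", "parent_id": None, "name": "agent.run", "duration_ms": 4200},
    {"span_id": "b", "parent_id": "a", "name": "agent.plan", "duration_ms": 600},
    {"span_id": "c", "parent_id": "a", "name": "agent.step[0]", "duration_ms": 3500},
    {"span_id": "d", "parent_id": "c", "name": "tool.search", "duration_ms": 2900},
    {"span_id": "e", "parent_id": "c", "name": "openai.chat", "duration_ms": 550},
]
print(" > ".join(slowest_path(spans)))  # agent.run > agent.step[0] > tool.search
```

With these sample durations, the walk ends at the tool span rather than the LLM span, which is exactly the tool-versus-model distinction a single flat span cannot make.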

Token Cost Attribution: Pinning Dollars to Spans

Unattributed token costs only show up in post-hoc accounting. Recording cost as OTel span attributes lets you attribute spend in real time:

def record_llm_cost(span, model: str, input_tokens: int, output_tokens: int):
    # Reference pricing as of 2025 (update to actual)
    pricing = {
        "gpt-4o": {"input": 2.5e-6, "output": 1e-5},
        "claude-sonnet-4": {"input": 3e-6, "output": 1.5e-5},
        "deepseek-chat": {"input": 1.4e-7, "output": 2.8e-7},
    }
    # Unknown models fall back to zero cost; add them to the table so totals are not understated
    p = pricing.get(model, {"input": 0, "output": 0})
    cost_usd = input_tokens * p["input"] + output_tokens * p["output"]
    
    span.set_attribute("gen_ai.usage.cost_usd", cost_usd)
    span.set_attribute("gen_ai.usage.input_cost_usd", input_tokens * p["input"])
    span.set_attribute("gen_ai.usage.output_cost_usd", output_tokens * p["output"])

Wrap this in a unified traced_llm_call() function that all LLM calls go through:

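# Map model-name prefixes to provider names; extend for the models you use
MODEL_PROVIDERS = {"gpt": "openai", "claude": "anthropic", "deepseek": "deepseek"}

async def traced_llm_call(prompt: str, model: str = "gpt-4o", **kwargs):
    # model.split("-")[0] alone yields "gpt" or "claude", not a provider name,
    # so resolve it through the lookup table to keep {provider}.{operation} naming
    provider = MODEL_PROVIDERS.get(model.split("-")[0], "unknown")
    with tracer.start_as_current_span(f"{provider}.chat") as span:
        span.set_attribute("gen_ai.system", provider)
        span.set_attribute("gen_ai.request.model", model)
        
        # client is assumed to be an async, OpenAI-compatible client (e.g. AsyncOpenAI)
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        
        # Record token usage, then derive cost from the same numbers
        usage = response.usage
        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
        span.set_attribute("gen_ai.usage.total_tokens", usage.total_tokens)
        record_llm_cost(span, model, usage.prompt_tokens, usage.completion_tokens)
        return response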

Aggregating gen_ai.usage.cost_usd in Langfuse or Phoenix then shows:

  • Which users cost the most (grouped by session_id or user_id)
  • Which task types are most expensive (grouped by agent.task_type)
  • Which prompts waste the most tokens (input-token length distribution per gen_ai.request.model)

This only works if session_id, user_id, and agent.task_type are present on the LLM spans or on their ancestors.
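
The grouping itself is simple enough to sketch outside any backend. The example below is a minimal illustration, assuming each exported LLM span is available as a plain dict of attributes and that user_id and agent.task_type were attached to it; the sample values are invented:

```python
from collections import defaultdict

def cost_by(span_attrs, key):
    """Sum gen_ai.usage.cost_usd per value of `key`, highest spender first."""
    totals = defaultdict(float)
    for attrs in span_attrs:
        # Skip spans that carry no cost (tool spans, planner wrappers, ...).
        if "gen_ai.usage.cost_usd" in attrs:
            totals[attrs.get(key, "unknown")] += attrs["gen_ai.usage.cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Invented sample data: three LLM spans and one tool span without cost.
llm_spans = [
    {"user_id": "u1", "agent.task_type": "research", "gen_ai.usage.cost_usd": 0.5},
    {"user_id": "u1", "agent.task_type": "summarize", "gen_ai.usage.cost_usd": 0.25},
    {"user_id": "u2", "agent.task_type": "research", "gen_ai.usage.cost_usd": 0.125},
    {"user_id": "u2", "tool.name": "search"},
]

print(cost_by(llm_spans, "user_id"))
print(cost_by(llm_spans, "agent.task_type"))
```

Spans missing the grouping key land in an "unknown" bucket, which makes attribution gaps visible instead of silently dropping spend.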

Observability for the Retrieval Stage

The retrieval stage of an agent system (RAG) is frequently overlooked but is a leading source of quality fluctuation. Key metrics for retrieval observability:

from statistics import stdev

with tracer.start_as_current_span("retrieval.search") as span:
    # Request-side attributes: what was asked for and with which embedding model
    span.set_attribute("retrieval.query", query)
    span.set_attribute("retrieval.top_k", top_k)
    span.set_attribute("retrieval.embedding_model", embedding_model)
    
    results = vector_store.search(query, top_k=top_k)
    scores = [r.score for r in results]
    
    # Result-side attributes: score distribution, using 0.0 when nothing was returned
    span.set_attribute("retrieval.results_count", len(results))
    span.set_attribute("retrieval.top_score", max(scores) if scores else 0.0)
    span.set_attribute("retrieval.min_score", min(scores) if scores else 0.0)
    span.set_attribute("retrieval.score_stddev", stdev(scores) if len(scores) > 1 else 0.0)
    span.set_attribute("retrieval.has_high_confidence", any(s > 0.8 for s in scores))

Core metrics:

  • retrieval.top_score: highest relevance, indicating retrieval quality
  • retrieval.score_stddev: distribution of scores, distinguishing "all results equally relevant" (weak signal) from "one or two stand out" (strong signal)
  • retrieval.has_high_confidence: whether a high-confidence hit exists; when false, the agent should fall back to web search or other strategies

With these metrics on spans, you can build a "low-confidence query ratio" alert in Phoenix or Langfuse. A sustained rise usually indicates vector index drift or document corpus changes.
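
As a rough sketch of the "low-confidence query ratio", the class below keeps a rolling window of retrieval.has_high_confidence values and flags when the low-confidence share crosses a threshold. The class name, the window size, and the 40% default are assumptions for illustration, not part of Phoenix or Langfuse:

```python
from collections import deque

class LowConfidenceMonitor:
    """Rolling ratio of retrievals without a high-confidence hit (hypothetical helper)."""

    def __init__(self, window: int = 500, threshold: float = 0.4):
        # deque(maxlen=...) drops the oldest observation automatically.
        self.flags = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, has_high_confidence: bool) -> None:
        self.flags.append(not has_high_confidence)

    @property
    def ratio(self) -> float:
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def should_alert(self) -> bool:
        return self.ratio > self.threshold

monitor = LowConfidenceMonitor(window=4)
for hit in [True, False, False, True]:
    monitor.observe(hit)
print(monitor.ratio, monitor.should_alert())
```

A count-based window is the simplest choice; a time-based window would behave better under uneven traffic, at the cost of storing timestamps.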

Tool-Call Observability

Tool calls are the most likely component of an agent system to fail silently. Design principles:

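import hashlib
import json
import time
from opentelemetry import metrics

meter = metrics.get_meter("agent.runtime")
tool_call_counter = meter.create_counter("tool_call", description="Tool calls by tool and status")
tool_latency = meter.create_histogram("tool_latency", unit="ms", description="Tool call latency")

def traced_tool_call(name: str, timeout: float = 30.0, **kwargs):
    # The span name depends on the runtime tool name, so the span is opened inside
    # the function; a decorator argument like "tool.{name}" is never formatted.
    # `timeout` is only reported on the span; the tool is assumed to enforce it.
    with tracer.start_as_current_span(f"tool.{name}") as span:
        span.set_attribute("tool.name", name)
        # Record size and hash rather than the raw payload, which may contain PII
        payload = json.dumps(kwargs, default=str, sort_keys=True)
        span.set_attribute("tool.input_size", len(payload))
        span.set_attribute("tool.input_sha256", hashlib.sha256(payload.encode()).hexdigest())
        
        start = time.perf_counter()
        status = "success"
        try:
            result = tool_registry[name](**kwargs)
            span.set_attribute("tool.output_size", len(str(result)))
            return result
        except TimeoutError:
            status = "timeout"
            span.set_attribute("error.type", "timeout")
            span.set_attribute("error.timeout_seconds", timeout)
            raise
        except Exception as e:
            status = "error"
            span.set_attribute("error.type", type(e).__name__)
            span.set_attribute("error.message", str(e)[:500])
            raise
        finally:
            # Runs on every path, so spans and metrics always agree on duration and status
            elapsed_ms = (time.perf_counter() - start) * 1000
            span.set_attribute("tool.duration_ms", elapsed_ms)
            span.set_attribute("tool.status", status)
            tool_call_counter.add(1, {"tool": name, "status": status})
            tool_latency.record(elapsed_ms, {"tool": name, "status": status})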

Critical attributes:

  • tool.duration_ms combined with tool.status powers P50/P95/P99 latency and error-rate dashboards
  • error.type distinguishes timeout, rate_limit, auth_error, and validation_error -- each requires different alert thresholds
  • Avoid stuffing raw input/output into span attributes (may contain PII); record only size and hash
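
To make the dashboard math concrete, here is a small sketch that derives P50/P95/P99 and an error rate per tool from (tool name, tool.duration_ms, tool.status) tuples. The function name and the tuple layout are assumptions; a real dashboard would compute this in the metrics backend from the histogram instead:

```python
from collections import defaultdict
from statistics import quantiles

def tool_stats(records):
    """records: iterable of (tool_name, duration_ms, status) tuples."""
    by_tool = defaultdict(list)
    for name, duration_ms, status in records:
        by_tool[name].append((duration_ms, status))

    stats = {}
    for name, rows in by_tool.items():
        durations = [d for d, _ in rows]
        failures = sum(1 for _, s in rows if s != "success")
        if len(durations) >= 2:
            # n=100 yields 99 cut points; index i-1 is the i-th percentile.
            cuts = quantiles(durations, n=100, method="inclusive")
            p50, p95, p99 = cuts[49], cuts[94], cuts[98]
        else:
            # A single sample has no spread; report it for every percentile.
            p50 = p95 = p99 = durations[0]
        stats[name] = {
            "p50": p50, "p95": p95, "p99": p99,
            "error_rate": failures / len(rows),
        }
    return stats

# Invented sample: four search calls (one timeout) and one calculator call.
records = [
    ("search", 100, "success"),
    ("search", 200, "success"),
    ("search", 300, "timeout"),
    ("search", 400, "success"),
    ("calculator", 5, "success"),
]
for tool, s in tool_stats(records).items():
    print(tool, s)
```

Anything other than "success" counts as a failure here, so timeouts and errors share one rate; split them by status if they need different thresholds.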

Anomaly Detection and Alerting Patterns

With complete span data, the next step is alerting. Agent system alerts should be layered:

Layer 1: hard error alerts (must be real-time)

  • 5xx error rate above 1% over the last 5 minutes
  • P95 latency exceeding 1.5x SLA
  • Provider API 429/5xx ratio above 10%

Layer 2: quality drift alerts (hourly aggregation)

  • Task success rate (LLM-as-judge evaluated) dropping more than 5% week-over-week
  • Average step count (chain length) suddenly rising more than 30%
  • Low-confidence retrieval ratio (retrieval.has_high_confidence == false) above 40%
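
The three Layer 2 checks can be sketched as a comparison between two weekly aggregates. This is an illustration under stated assumptions: the dict keys are hypothetical, and "5%" and "30%" are read as relative changes against last week's value (if you mean percentage points, compare the raw difference instead):

```python
def relative_change(current: float, baseline: float) -> float:
    """Fractional change versus the baseline; 0.0 when the baseline is zero."""
    return (current - baseline) / baseline if baseline else 0.0

def drift_alerts(this_week: dict, last_week: dict) -> list:
    alerts = []
    # Success rate (e.g. from LLM-as-judge) fell by more than 5% relative.
    if relative_change(this_week["success_rate"], last_week["success_rate"]) < -0.05:
        alerts.append("success_rate_drop")
    # Average chain length grew by more than 30% relative.
    if relative_change(this_week["avg_steps"], last_week["avg_steps"]) > 0.30:
        alerts.append("step_count_rise")
    # Absolute threshold on the low-confidence retrieval ratio.
    if this_week["low_confidence_ratio"] > 0.40:
        alerts.append("low_confidence_retrieval")
    return alerts

# Invented aggregates for two consecutive weeks.
last_week = {"success_rate": 0.95, "avg_steps": 6.0, "low_confidence_ratio": 0.15}
this_week = {"success_rate": 0.85, "avg_steps": 8.0, "low_confidence_ratio": 0.20}
print(drift_alerts(this_week, last_week))
```

Comparing against the previous week rather than the previous hour is what catches slow drift: each hourly reading can look normal while the weekly baseline keeps sliding.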

Layer 3: cost alerts (daily aggregation)

  • Daily token cost exceeding 80% of budget
  • A single tenant or task type costing more than 5x the global average
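
Both Layer 3 checks reduce to arithmetic over a per-tenant daily cost table. The sketch below assumes such a table already exists (for example, produced by grouping gen_ai.usage.cost_usd by tenant); the function name, parameter names, and sample figures are hypothetical:

```python
def cost_alerts(cost_by_tenant: dict, daily_budget_usd: float,
                budget_fraction: float = 0.8, outlier_factor: float = 5.0) -> list:
    alerts = []
    total = sum(cost_by_tenant.values())
    # Budget check: total spend versus a fraction of the daily budget.
    if total > budget_fraction * daily_budget_usd:
        alerts.append(("budget", total))
    # Outlier check: any tenant far above the mean across all tenants.
    if cost_by_tenant:
        mean = total / len(cost_by_tenant)
        for tenant, cost in sorted(cost_by_tenant.items()):
            if cost > outlier_factor * mean:
                alerts.append(("outlier", tenant))
    return alerts

# Invented data: nine small tenants at $1 each and one at $91.
costs = {f"tenant-{i}": 1.0 for i in range(9)}
costs["tenant-big"] = 91.0
print(cost_alerts(costs, daily_budget_usd=200.0))
print(cost_alerts(costs, daily_budget_usd=100.0))
```

Note that the mean includes the outlier itself, so with only a handful of tenants the 5x rule can never fire; a median baseline is more robust in that case.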

Define alert rules in code rather than hand-editing them in a dashboard UI, so they are version-controlled, reviewable, and easy to roll back:

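# alerts.py
DAILY_BUDGET_USD = 500.0  # placeholder; set to your actual daily budget

ALERT_RULES = {
    "error_rate_spike": {
        # Aggregate with sum() on both sides; without it PromQL matches series
        # label-by-label and the ratio is not a true error rate.
        # status!='success' counts timeouts as well as errors.
        "query": "sum(rate(tool_call_total{status!='success'}[5m])) / sum(rate(tool_call_total[5m]))",
        "threshold": 0.01,
        "window": "5m",
    },
    "cost_daily_budget": {
        # Assumes cost is also exported as a counter named gen_ai_usage_cost_usd_total
        "query": "sum(increase(gen_ai_usage_cost_usd_total[1d]))",
        "threshold": 0.8 * DAILY_BUDGET_USD,  # 80% of daily budget, in USD
        "window": "1d",
    },
}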
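
One benefit of rules-as-code is that they can be evaluated and unit-tested like any other module. The sketch below is a minimal evaluator under stated assumptions: evaluate_rules and run_query are hypothetical names, and run_query stands in for whatever client your metrics backend provides. A stub backend keeps the example self-contained:

```python
def evaluate_rules(rules: dict, run_query) -> list:
    """Return the names of rules whose current value exceeds the threshold."""
    fired = []
    for name, rule in rules.items():
        value = run_query(rule["query"], rule["window"])
        # None means the backend returned no data; treat it as "not firing".
        if value is not None and value > rule["threshold"]:
            fired.append(name)
    return fired

# Simplified stand-in rules and a stub backend with canned values.
sample_rules = {
    "error_rate_spike": {"query": "error_rate", "threshold": 0.01, "window": "5m"},
    "cost_daily_budget": {"query": "daily_cost_usd", "threshold": 400.0, "window": "1d"},
    "no_data_rule": {"query": "missing_metric", "threshold": 1.0, "window": "5m"},
}
canned = {"error_rate": 0.03, "daily_cost_usd": 120.0}

def stub_query(query: str, window: str):
    return canned.get(query)

print(evaluate_rules(sample_rules, stub_query))  # ['error_rate_spike']
```

Treating missing data as "not firing" is a deliberate simplification; in production you may prefer a separate no-data alert so a broken exporter does not look like a healthy system.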

Backend Selection: Langfuse / Phoenix / OpenLit

Backend         | Deployment          | Strength                                                      | Best For
----------------+---------------------+---------------------------------------------------------------+----------------------------------------------------
Langfuse        | SaaS or self-hosted | Prompt version management, user feedback collection           | Mid-sized teams that need prompt iteration tracking
Phoenix (Arize) | Self-hosted or SaaS | Powerful span search, embedding visualization                 | Existing OTel infrastructure requiring deep debugging
OpenLit         | Pure OTLP collector | Compatible with any OTel backend (Datadog, Grafana, Honeycomb) | Existing unified OTel infrastructure
Weave (W&B)     | SaaS                | Tight integration with W&B experiment tracking                | Existing W&B ecosystem

If your team is just starting out, Langfuse is the easiest entry point -- it has full prompt template management, user feedback labeling, and span search out of the box. If you already use Datadog, Grafana, or Honeycomb as a general APM, OpenLit plus OTLP is the more elegant path, avoiding observability data silos.

Implementation Path

  • Week 1: Adopt OpenLLMetry semantic conventions so all LLM calls produce standardized spans.
  • Week 2: Wrap every tool call in traced_tool_call, recording duration, status, and error.
  • Week 3: Establish trace correlation IDs, propagating session_id, user_id, and task_id to all child spans.
  • Week 4: Implement token cost attribution and build a cost dashboard.
  • Week 5: Integrate an offline evaluator (LLM-as-judge) so success rate becomes a computable metric.
  • Week 6: Route hard error alerts into PagerDuty or Feishu.
  • Week 7: Produce a weekly quality drift report to identify slowly declining trends.

Summary

Agent observability is not just "add an APM." Its core value is making reasoning transparent, cost attributable to business dimensions, and quality drift alertable. Start with OpenLLMetry semantic conventions, put every LLM call, tool call, and retrieval query into a standard span, then ship spans over OTLP to Langfuse, Phoenix, or OpenLit. Finally, version-controlled alert rules protect SLAs and cost ceilings.

For agent systems already in production, observability is not optional -- it is the engineering step that turns an agent from "a talking demo" into "trusted infrastructure."

Reference tools: Langfuse (open-source LLM observability platform), Phoenix (Arize) (experiment and evaluation platform), OpenLit (OTel collector), OpenInference (OTel semantic conventions), and Weave (W&B) (experiment tracking) form a solid starting point for any agent observability stack.