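Agent Observability in Depth: From OpenTelemetry to a Production Trace System

Engineers who run AI agents in production all hit the same question: when a multi-step agent task fails, drifts off course, or sees its latency spike, how do you determine within five minutes whether the cause is the model, a tool, or retrieval? That is the core value of agent observability: it turns a black-box reasoning process into an engineering system that can be queried, correlated, and alerted on. This article takes a hands-on engineering view and breaks down the key design decisions: OpenTelemetry semantic conventions, the trace correlation model, token cost attribution, and anomaly alerting patterns.

Why agents need observability more than traditional services

Observability for traditional microservices is mature: the three pillars of metrics, logs, and traces support almost every distributed system. Agent systems run on the same stack, but they differ in three fundamental ways:

First, multi-step reasoning is nested and non-deterministic. A single user request may trigger 5 to 20 LLM calls, each with a different prompt template, different retrieval results, and different tool outputs. Without instrumentation at every call boundary, whoever reads the trace sees only "the agent was called once" and cannot tell which step went wrong.

Second, cost scales with tokens, not with requests. The same agent task can consume very different token counts on two runs, depending on how many passages retrieval returns and how long the chain of thought gets. Without attributing token cost per step, you cannot answer "which users and which task types are burning money".

Third, failures are gradual rather than abrupt. Model version upgrades, prompt tweaks, and retrieval index updates all cause agent behavior to drift slowly: the success rate slides from 95% to 80% while each individual reading still looks normal. Traditional 5xx error-rate alerting does not catch this, so you need statistical metrics and offline evaluation working together.

OpenTelemetry semantic conventions: fitting agents into a standard protocol

OpenTelemetry (OTel) is the CNCF observability standard. The hard part of connecting an agent system to OTel is not the plumbing but deciding how to map LLM concepts onto span semantics. We recommend following the conventions of the OpenLLMetry community: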

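from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register a provider that batches spans and exports them over OTLP/gRPC.
# Without this step the tracer below is a no-op and nothing is exported.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.runtime")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Standard GenAI span naming
with tracer.start_as_current_span("openai.chat") as span:
    # Request-side attributes are set before the call
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.max_tokens", 4096)
    span.set_attribute("gen_ai.request.temperature", 0.7)

    response = openai_client.chat.completions.create(...)  # request arguments elided

    # Usage and response attributes are set once the call returns
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    span.set_attribute("gen_ai.usage.total_tokens", response.usage.total_tokens)
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.response.finish_reasons", [choice.finish_reason for choice in response.choices])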

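Key design points:

gen_ai.system, gen_ai.request.model, and gen_ai.usage.* are semantic attributes from the OpenLLMetry conventions and are widely supported by compatible tooling (Langfuse, Phoenix, OpenInference, OpenLIT)
Span names follow the {provider}.{operation} pattern: openai.chat, anthropic.messages, cohere.rerank
Each LLM call is an independent span nested under the agent's parent span, which preserves call order

Trace correlation model for multi-step agents

The most common observability mistake is treating the whole agent as a single span. That still gives you total latency, but you lose all ability to localize a problem. The right approach is to break the agent into a hierarchy of spans: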

from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

# llm_call_planner, execute_tool, llm_reason, and aggregate are the agent's own
# functions and are defined elsewhere.
@tracer.start_as_current_span("agent.run")
def run_agent(user_query: str, session_id: str):
    span = trace.get_current_span()
    span.set_attribute("agent.session_id", session_id)
    span.set_attribute("agent.user_query", user_query)

    with tracer.start_as_current_span("agent.plan") as plan_span:
        plan = llm_call_planner(user_query)
        plan_span.set_attribute("agent.plan.steps", len(plan.steps))

    results = []
    for i, step in enumerate(plan.steps):
        with tracer.start_as_current_span(f"agent.step[{i}]") as step_span:
            step_span.set_attribute("agent.step.tool", step.tool)
            # Span attribute values must be primitives, so stringify the step input
            step_span.set_attribute("agent.step.input", str(step.input))

            # Nested child spans: one for the tool call, one for the LLM reasoning
            with tracer.start_as_current_span(f"tool.{step.tool}") as tool_span:
                output = execute_tool(step)
                tool_span.set_attribute("tool.output_size", len(str(output)))

            with tracer.start_as_current_span("openai.chat") as llm_span:
                reasoning = llm_reason(step, output)
                llm_span.set_attribute("gen_ai.usage.total_tokens", reasoning.usage.total_tokens)

            results.append(reasoning)

    span.set_attribute("agent.total_steps", len(plan.steps))
    return aggregate(results)

Span tree structure:

agent.run
├── agent.plan
│   └── openai.chat (Planner LLM)
├── agent.step[0]
│   ├── tool.search
│   └── openai.chat (Reasoning LLM)
├── agent.step[1]
│   ├── tool.calculator
│   └── openai.chat (Reasoning LLM)
└── agent.step[2]
    └── openai.chat (Final answer)

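In Langfuse or Phoenix this structure pays off in three direct ways:

Slow-request localization: you can see at a glance which LLM call or tool call makes a step slow
Cost attribution: token usage is recorded for every step and can be aggregated by session_id / user_id
Failure localization: exceptions are bound automatically to the specific span, so tool timeouts and model timeouts are clearly separated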
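
To make the first two benefits concrete, here is a minimal sketch that walks a span tree and answers "which leaf is slowest" and "how many tokens did each step use". The SpanRecord class and the sample durations and token counts are hypothetical stand-ins for exported span data, not an OpenTelemetry or Langfuse type; a real backend would run the equivalent query for you.

```python
from dataclasses import dataclass, field


@dataclass
class SpanRecord:
    """Hypothetical, simplified view of one exported span."""
    name: str
    duration_ms: float
    total_tokens: int = 0
    children: list = field(default_factory=list)


def slowest_leaf(span: SpanRecord) -> SpanRecord:
    """Return the leaf span with the longest duration in the subtree."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children), key=lambda s: s.duration_ms)


def subtree_tokens(span: SpanRecord) -> int:
    """Sum total_tokens over a span and all of its descendants."""
    return span.total_tokens + sum(subtree_tokens(c) for c in span.children)


def tokens_by_step(root: SpanRecord) -> dict:
    """Token usage per direct child of the root span (plan and each step)."""
    return {child.name: subtree_tokens(child) for child in root.children}


# Made-up numbers shaped like the span tree above
trace_tree = SpanRecord("agent.run", 5200, children=[
    SpanRecord("agent.plan", 900, children=[SpanRecord("openai.chat", 880, 650)]),
    SpanRecord("agent.step[0]", 3100, children=[
        SpanRecord("tool.search", 2400),
        SpanRecord("openai.chat", 650, 1200),
    ]),
    SpanRecord("agent.step[1]", 1100, children=[
        SpanRecord("tool.calculator", 40),
        SpanRecord("openai.chat", 1000, 900),
    ]),
])

print(slowest_leaf(trace_tree).name)
print(tokens_by_step(trace_tree))
```

With these sample numbers the slowest leaf is the search tool rather than any LLM call, which is exactly the distinction a single flat agent span would hide.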

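Token cost attribution: attaching dollars to spans

If an agent's token cost is not attributed to concrete business dimensions (user, tenant, task type), all you can do is total up the bill after the fact. OTel span attributes are a natural fit for cost attribution: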

def record_llm_cost(span, model: str, input_tokens: int, output_tokens: int):
    # Reference prices as of 2025, in USD per token (update to your actual rates)
    pricing = {
        "gpt-4o": {"input": 2.5e-6, "output": 1e-5},
        "claude-sonnet-4": {"input": 3e-6, "output": 1.5e-5},
        "deepseek-chat": {"input": 1.4e-7, "output": 2.8e-7},
    }
    p = pricing.get(model, {"input": 0, "output": 0})
    cost_usd = input_tokens * p["input"] + output_tokens * p["output"]
    
    span.set_attribute("gen_ai.usage.cost_usd", cost_usd)
    span.set_attribute("gen_ai.usage.input_cost_usd", input_tokens * p["input"])
    span.set_attribute("gen_ai.usage.output_cost_usd", output_tokens * p["output"])

Wrap this logic in a single traced_llm_call() function and route every LLM call through it:

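from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes every model listed is served by an OpenAI-compatible endpoint

# Map model-name prefixes to gen_ai.system values (illustrative; extend as needed)
MODEL_PROVIDERS = {"gpt": "openai", "claude": "anthropic", "deepseek": "deepseek"}

async def traced_llm_call(prompt: str, model: str = "gpt-4o", **kwargs):
    provider = MODEL_PROVIDERS.get(model.split("-")[0], "unknown")
    # Name the span {provider}.{operation}; the model goes in an attribute, not the name
    with tracer.start_as_current_span(f"{provider}.chat") as span:
        span.set_attribute("gen_ai.system", provider)
        span.set_attribute("gen_ai.request.model", model)

        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )

        usage = response.usage
        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
        span.set_attribute("gen_ai.usage.total_tokens", usage.total_tokens)
        record_llm_cost(span, model, usage.prompt_tokens, usage.completion_tokens)
        return response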

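After that, aggregate on gen_ai.usage.cost_usd in Langfuse or Phoenix to see:

Which users cost the most (filter by session_id / user_id)
Which task types cost the most (filter by agent.task_type)
Which kinds of prompts waste the most tokens (group by gen_ai.request.model and look at the length distribution)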
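
As a worked example of that aggregation, the sketch below prices a few LLM spans and groups the result by a business dimension. The span dictionaries and the user_id / agent.task_type values are hypothetical; the per-token prices are the gpt-4o figures from the pricing table above and should be checked against current rates. For one span with 1,200 input and 300 output tokens, the cost is 1200 × 2.5e-6 + 300 × 1e-5 = 0.003 + 0.003 = 0.006 USD.

```python
from collections import defaultdict

# USD per token, copied from the article's pricing table; verify before use.
PRICING = {"gpt-4o": {"input": 2.5e-6, "output": 1e-5}}


def span_cost_usd(span: dict) -> float:
    """Price one LLM span from its token-usage attributes."""
    price = PRICING[span["gen_ai.request.model"]]
    return (span["gen_ai.usage.input_tokens"] * price["input"]
            + span["gen_ai.usage.output_tokens"] * price["output"])


def cost_by(spans: list, key: str) -> dict:
    """Sum span cost grouped by one attribute, e.g. user_id or agent.task_type."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span_cost_usd(span)
    return dict(totals)


# Hypothetical exported spans; a real system would read these from the backend.
spans = [
    {"user_id": "u1", "agent.task_type": "research", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 300},
    {"user_id": "u1", "agent.task_type": "summarize", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 4000, "gen_ai.usage.output_tokens": 1000},
    {"user_id": "u2", "agent.task_type": "research", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200},
]

print(cost_by(spans, "user_id"))
print(cost_by(spans, "agent.task_type"))
```

Note that grouping by user_id only works if that attribute is present on the LLM spans themselves, not just on the root agent.run span.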

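Observability for the retrieval stage

The retrieval stage (RAG) of an agent system is often overlooked, yet it is one of the main sources of quality fluctuation. The key signals to record for retrieval: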

from statistics import stdev

# query, top_k, embedding_model, and vector_store come from the surrounding retrieval code
with tracer.start_as_current_span("retrieval.search") as span:
    span.set_attribute("retrieval.query", query)
    span.set_attribute("retrieval.top_k", top_k)
    span.set_attribute("retrieval.embedding_model", embedding_model)

    results = vector_store.search(query, top_k=top_k)
    scores = [r.score for r in results]

    span.set_attribute("retrieval.results_count", len(results))
    # Assumes results are sorted by descending score, so the first is the best match
    span.set_attribute("retrieval.top_score", scores[0] if scores else 0)
    span.set_attribute("retrieval.min_score", min(scores) if scores else 0)
    # The sample standard deviation needs at least two scores
    span.set_attribute("retrieval.score_stddev", stdev(scores) if len(scores) > 1 else 0)
    span.set_attribute("retrieval.has_high_confidence", any(s > 0.8 for s in scores))

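Core metrics:

retrieval.top_score: the highest score, which reflects retrieval quality
retrieval.score_stddev: the spread of the scores, which distinguishes "all results look about the same" (weak signal) from "one or two results stand out as relevant" (strong signal)
retrieval.has_high_confidence: whether any high-confidence hit exists; if not, the agent should trigger web search or another fallback

Once these metrics are on the span, you can build a "low-confidence query ratio" alert in Phoenix or Langfuse. A sustained rise in that number usually means something is wrong with the vector index or the document set has changed.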
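
The ratio itself is simple to compute. The sketch below does it over a window of retrieval span attributes; the attribute dictionaries are made-up sample data, and the 0.40 threshold is an assumption taken from the quality-drift tier later in this article, not a recommended default.

```python
def low_confidence_ratio(span_attrs: list) -> float:
    """Fraction of retrieval spans in the window with no high-confidence hit."""
    searches = [a for a in span_attrs if "retrieval.has_high_confidence" in a]
    if not searches:
        return 0.0
    low = sum(1 for a in searches if not a["retrieval.has_high_confidence"])
    return low / len(searches)


# Hypothetical attributes from five retrieval.search spans in one window
window = [
    {"retrieval.top_score": 0.91, "retrieval.has_high_confidence": True},
    {"retrieval.top_score": 0.62, "retrieval.has_high_confidence": False},
    {"retrieval.top_score": 0.55, "retrieval.has_high_confidence": False},
    {"retrieval.top_score": 0.84, "retrieval.has_high_confidence": True},
    {"retrieval.top_score": 0.48, "retrieval.has_high_confidence": False},
]

ratio = low_confidence_ratio(window)
should_alert = ratio > 0.40  # assumed threshold
print(ratio, should_alert)
```

Spans without the attribute are skipped rather than counted as low confidence, so uninstrumented code paths do not inflate the ratio.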

Observability for tool calls

Tool calls are the part of an agent system most prone to silent failure. Design principles:

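import json
import time

from opentelemetry import metrics

meter = metrics.get_meter("agent.runtime")
tool_call_counter = meter.create_counter("tool_call")
tool_latency = meter.create_histogram("tool_duration_ms")

# tool_registry maps tool names to callables and is defined elsewhere.
# timeout_seconds is only recorded here; the tool is assumed to enforce it itself.
def traced_tool_call(name: str, timeout_seconds=None, **kwargs):
    # A decorator cannot interpolate the tool name, so open the span inside the function
    with tracer.start_as_current_span(f"tool.{name}") as span:
        span.set_attribute("tool.name", name)
        # Truncated for size; if inputs may contain PII, record only a size and hash instead
        span.set_attribute("tool.input", json.dumps(kwargs, default=str)[:1000])

        start = time.perf_counter()
        status = "success"
        try:
            result = tool_registry[name](**kwargs)
            span.set_attribute("tool.output_size", len(str(result)))
            return result
        except TimeoutError:
            status = "timeout"
            span.set_attribute("error.type", "timeout")
            if timeout_seconds is not None:
                span.set_attribute("error.timeout_seconds", timeout_seconds)
            raise
        except Exception as e:
            status = "error"
            span.set_attribute("error.type", type(e).__name__)
            span.set_attribute("error.message", str(e)[:500])
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            span.set_attribute("tool.duration_ms", elapsed_ms)
            span.set_attribute("tool.status", status)
            # Key step: mirror the status into metrics so dashboards and alerts can use it
            tool_call_counter.add(1, {"tool": name, "status": status})
            tool_latency.record(elapsed_ms, {"tool": name, "status": status})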

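Key attributes:

The combination of tool.duration_ms and tool.status is enough to derive P50/P95/P99 latency and error-rate dashboards
error.type distinguishes timeout / rate_limit / auth_error / validation_error, and different error types should have different alert thresholds
Do not put raw inputs and outputs into span attributes in full (they may contain PII); recording only their size and a hash is enough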
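
The size-and-hash advice can be sketched with the standard library alone. The summarize_payload helper below is a hypothetical name, not part of any OTel SDK; it serializes the payload deterministically so that identical inputs always produce the same hash, which lets you spot repeated calls without storing the content.

```python
import hashlib
import json


def summarize_payload(payload) -> dict:
    """Return the size and SHA-256 digest of a payload instead of its content."""
    # sort_keys makes the serialization independent of dict ordering
    text = json.dumps(payload, sort_keys=True, default=str, ensure_ascii=False)
    data = text.encode("utf-8")
    return {"size_bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}


# Hypothetical tool input containing PII
tool_input = {"email": "alice@example.com", "query": "refund status"}
summary = summarize_payload(tool_input)

# These two values are what would go onto the span, e.g. as
# tool.input_size and tool.input_sha256 (attribute names are an assumption).
print(summary["size_bytes"], summary["sha256"][:12])
```

A plain hash of low-entropy input such as an email address can still be reversed by guessing, so treat it as a correlation aid rather than as anonymization.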

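Anomaly detection and alerting patterns

With complete span data in place, the next step is alerting. Alerts for an agent system should be split into three tiers:

Tier 1: hard-error alerts (must be real time)

5xx error rate > 1% (last 5 min)
P95 latency > SLA * 1.5
Share of provider API responses with status 429/5xx > 10%

Tier 2: quality-drift alerts (hourly aggregation)

Task success rate (as labeled by an offline evaluator) drops > 5% week over week
Average step count (chain-of-thought length) jumps by > 30%
Low-confidence retrieval ratio (retrieval.has_high_confidence == false) > 40%

Tier 3: cost alerts (daily aggregation)

Daily token cost exceeds 80% of budget
Cost for a single tenant or a single task type > 5x the global average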
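
The tier 2 and tier 3 conditions reduce to a few comparisons once the aggregates exist. The sketch below is one possible reading of them: it assumes "drops > 5%" and "jumps by > 30%" are relative changes against the previous week, and all input numbers are made up for illustration.

```python
def relative_change(previous: float, current: float) -> float:
    """Signed relative change from previous to current."""
    return (current - previous) / previous


def quality_drift_alerts(previous: dict, current: dict) -> list:
    """Tier 2 checks, comparing this week's aggregates with last week's."""
    alerts = []
    if relative_change(previous["success_rate"], current["success_rate"]) < -0.05:
        alerts.append("success_rate_drop")
    if relative_change(previous["avg_steps"], current["avg_steps"]) > 0.30:
        alerts.append("step_count_increase")
    if current["low_confidence_ratio"] > 0.40:
        alerts.append("low_confidence_retrieval")
    return alerts


def cost_outliers(cost_by_tenant: dict, factor: float = 5.0) -> list:
    """Tier 3 check: tenants whose cost exceeds factor times the average."""
    average = sum(cost_by_tenant.values()) / len(cost_by_tenant)
    return sorted(t for t, cost in cost_by_tenant.items() if cost > factor * average)


# Hypothetical weekly aggregates and daily per-tenant cost in USD
last_week = {"success_rate": 0.95, "avg_steps": 6.0, "low_confidence_ratio": 0.30}
this_week = {"success_rate": 0.88, "avg_steps": 8.1, "low_confidence_ratio": 0.35}
tenant_cost = {"a": 10, "b": 12, "c": 8, "d": 10, "e": 11, "f": 9, "g": 200}

print(quality_drift_alerts(last_week, this_week))
print(cost_outliers(tenant_cost))
```

One caveat with the 5x rule: a large outlier raises the average it is compared against, so with only a handful of tenants it may never fire. Comparing against the median is a common alternative.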

Write alert rules in code rather than in dashboard configuration files, so that they are versioned, can be rolled back, and can be audited:

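# alerts.py
ALERT_RULES = {
    "error_rate_spike": {
        # sum() both sides; otherwise the division matches series label by label
        # and each error series is only divided by itself
        "query": "sum(rate(tool_call_total{status='error'}[5m])) / sum(rate(tool_call_total[5m]))",
        "threshold": 0.01,
        "window": "5m",
    },
    "cost_daily_budget": {
        # Assumes cost is also exported as a counter metric; the evaluator must
        # divide the result by the daily budget before comparing with the threshold
        "query": "sum(increase(gen_ai_usage_cost_usd_total[1d]))",
        "threshold": 0.8,  # 80% of daily budget
        "window": "1d",
    },
}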
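
A rule table like this needs something to evaluate it. The sketch below shows the comparison step only: it assumes the queries have already been run and that each measured value is normalized to the same unit as its threshold (an error ratio, and cost as a fraction of the daily budget). The firing_alerts function and the RULES subset are hypothetical, and the measured numbers are made up.

```python
# Thresholds only; a trimmed copy of the rule table for a self-contained example
RULES = {
    "error_rate_spike": {"threshold": 0.01, "window": "5m"},
    "cost_daily_budget": {"threshold": 0.8, "window": "1d"},
}


def firing_alerts(rules: dict, measured: dict) -> list:
    """Return the names of rules whose measured value exceeds the threshold.

    Rules with no measurement are skipped rather than treated as firing.
    """
    return sorted(
        name for name, rule in rules.items()
        if name in measured and measured[name] > rule["threshold"]
    )


# Hypothetical query results: 3% tool error rate, 50% of daily budget spent
measured = {"error_rate_spike": 0.03, "cost_daily_budget": 0.5}
print(firing_alerts(RULES, measured))
```

Because the rules are plain data, a change to a threshold shows up as a one-line diff in code review, which is the point of keeping them out of dashboard configuration.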

Tool selection: Langfuse / Phoenix / OpenLIT

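Backend	Deployment	Strengths	Best for
Langfuse	SaaS / self-hosted	Prompt version management, user feedback collection	Mid-sized teams that need to track prompt iterations
Phoenix (Arize)	Self-hosted / SaaS	Powerful span search, embedding visualization	Teams with existing OTel infrastructure that need deep debugging
OpenLIT	Pure OTLP collector	Works with any OTel backend (Datadog/Grafana/Honeycomb)	Teams with a unified OTel infrastructure already in place
Weave (W&B)	SaaS	Integrates with W&B experiment tracking	Teams already in the W&B ecosystem

For a team that is just getting started, Langfuse is the easiest to pick up: it covers prompt template management, user feedback labeling, and span search. If you already use a general-purpose APM such as Datadog, Grafana, or Honeycomb, OpenLIT plus OTLP is the cleaner option because it avoids creating an observability data silo.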

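Implementation path

Week 1: Adopt the OpenLLMetry semantic conventions so that every LLM call produces a standardized span.
Week 2: Wrap every tool call in traced_tool_call and record duration, status, and error.
Week 3: Set up a trace correlation ID mechanism that propagates session_id / user_id / task_id to every child span.
Week 4: Implement token cost attribution and build a cost dashboard.
Week 5: Plug in an offline evaluator (LLM-as-judge) so that "success rate" becomes a computable metric.
Week 6: Route hard-error alerts to PagerDuty or Feishu.
Week 7: Start a weekly quality-drift report to catch "slowly getting worse" trends.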
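
The week 3 step, propagating correlation IDs to every child span, is the least obvious one, so here is a minimal sketch of the mechanism using only the standard library. It is a stand-in, not the OpenTelemetry API: in an OTel setup the same effect is typically achieved by carrying the IDs in the active context and copying them onto each span when it starts. The names correlation_scope and new_span_attributes are hypothetical.

```python
import contextvars
from contextlib import contextmanager

# Holds the correlation IDs for the current logical request
_correlation = contextvars.ContextVar("correlation", default={})


@contextmanager
def correlation_scope(**ids):
    """Make the given IDs visible to every span created inside the block."""
    token = _correlation.set({**_correlation.get(), **ids})
    try:
        yield
    finally:
        _correlation.reset(token)


def new_span_attributes(name: str, **attrs) -> dict:
    """Build attributes for a new span, stamped with the current correlation IDs."""
    return {"span.name": name, **_correlation.get(), **attrs}


with correlation_scope(session_id="s-42", user_id="u-7"):
    with correlation_scope(task_id="t-1"):
        llm_attrs = new_span_attributes("openai.chat", **{"gen_ai.request.model": "gpt-4o"})
    tool_attrs = new_span_attributes("tool.search")

outside_attrs = new_span_attributes("agent.run")
print(llm_attrs)
```

Context variables follow the current thread and asyncio task, so concurrent agent runs do not see each other's IDs, and inner scopes can add IDs such as task_id without losing the outer ones.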

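Summary

Agent observability is not as simple as "plugging in an APM". Its core value lies in making the reasoning process transparent, attributing cost to business dimensions, and turning quality drift into a metric you can alert on. Start from the OpenLLMetry semantic conventions, put every LLM call, tool call, and retrieval query into a standard span, and send them over OTLP to a backend such as Langfuse, Phoenix, or OpenLIT. Then use versioned alert rules to hold the line on SLA and cost.

For an agent system that is already in production, observability is not optional: it is the key engineering step that turns an agent from a "demo that can talk" into dependable infrastructure.

Reference tools: Langfuse (open-source LLM observability platform), Phoenix (Arize) (experimentation and evaluation platform), OpenLIT (OTel collector), OpenInference (OTel semantic conventions), and Weave (W&B) (experiment tracking) can serve as starting points for an observability stack.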
