Best Observability Top 20

The 20 most popular open-source observability projects, ranked by GitHub stars.

Kong

The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.

observability, api, agent, lua

Prompt Optimizer

31.6k Stars

An AI prompt optimizer that helps users write better prompts and achieve improved AI results.

prompt-engineering, evaluation, llm, typescript

Langfuse

30.2k Stars

Open-source LLM engineering platform providing tracing, evaluations, prompt management, and dataset management with integrations for LangChain, OpenAI, Anthropic, and more.

observability, tracing, llm-evaluation, prompt-management

Langfuse

30.2k Stars

Open-source LLM observability: tracing, evals, prompt management.

langfuse, observability, tracing, evals

Langfuse

30.2k Stars

Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.

observability, tracing, llm, analytics

MLflow

26.8k Stars

MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.

mlflow, llmops, evaluation, observability

12 Factor Agents

23.9k Stars

What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?

agent, framework, evaluation, observability

Promptfoo

22.8k Stars

CLI tool that combines LLM prompt testing with red-teaming.

promptfoo, testing, red-team, cli

Promptfoo

22.8k Stars

Test and evaluate LLM prompts, agents, and RAG pipelines. Built-in red teaming and security evaluation for reliable AI applications.

testing, evaluation, red-teaming, prompt-testing

Promptfoo

22.8k Stars

Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.

evaluation, testing, prompts, typescript

Agents Towards Production

20.9k Stars

End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.

agent, framework, evaluation, observability

Opik

20.2k Stars

Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.

observability, llm-evaluation, tracing, prompt-testing

OpenObserve

19.6k Stars

OpenObserve is a high-performance observability platform for logs, metrics, and traces, well suited for monitoring AI agent runtimes and tool calls.

observability, logs, metrics, tracing

OpenAI Evals

18.8k Stars

OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.

llm-evaluation, benchmark, evals, red-teaming

ccusage

16.7k Stars

Analyze token usage and costs of coding-agent CLIs from local data.

token-usage, cost-analysis, cli, rust

DeepEval

16.6k Stars

DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.

llm, evaluation, testing, rag

RagaAI Catalyst

16.1k Stars

RagaAI Catalyst is an observability, monitoring, and evaluation framework for Agent AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.

observability, tracing, evaluation, agent-monitoring

Ragas

14.6k Stars

Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.

rag, evaluation, llm, testing

OpenMetadata

14.4k Stars

OpenMetadata is a unified metadata platform for data and AI, providing data asset discovery, lineage, governance, and agent context retrieval capabilities.

observability, metadata, data-governance, lineage

LM Evaluation Harness

13.1k Stars

EleutherAI's framework for few-shot evaluation of language models. It provides standardized evaluation pipelines covering hundreds of benchmark tasks and is widely adopted in the community as a core LLM evaluation tool.

llm-evaluation, benchmark, evaluation-framework, language-model

Related Articles

Agent Evaluation, LLM Evaluation, Automated Testing

Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines

Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.

RAG, hallucination-detection, agent-evaluation

Agent Hallucination Defense: Practical Mitigation Patterns Beyond Guardrails

Why do LLM agents hallucinate? This article traces root causes and systematically reviews practical mitigation patterns: retrieval augmentation, confidence scoring, multi-agent cross-validation, forced citation backtracking, and observability with UpTrain, Giskard, RagaAI Catalyst, Comet Opik, and NVIDIA Garak.

Observability, OpenTelemetry, LLMOps

Agent Observability in Practice: OpenTelemetry to Production Traces

Build a production-grade observability stack for multi-step agents using OpenTelemetry: OpenLLMetry semantic conventions, hierarchical span correlation, token cost attribution, retrieval quality metrics, and layered alerting.
