Best Observability Top 20
Top 20 most popular open-source Observability projects, ranked by GitHub Stars.
Kong
43.7k Stars
The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.
Prompt Optimizer
31.6k Stars
An AI prompt optimizer that helps users write better prompts and achieve improved AI results.
Langfuse
30.2k Stars
Open-source LLM engineering and observability platform providing tracing, evaluations, prompt management and versioning, dataset management, and cost analytics, with integrations for LangChain, OpenAI, Anthropic, and more.
MLflow
26.8k Stars
MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.
12 Factor Agents
23.9k Stars
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
Promptfoo
22.8k Stars
CLI tool for testing and evaluating LLM prompts, agents, and RAG pipelines, with built-in red teaming and security evaluation. It also supports regression testing by comparing prompts, tool-call results, and model outputs over time.
Agents Towards Production
20.9k Stars
End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.
Opik
20.2k Stars
Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
OpenObserve
19.6k Stars
OpenObserve is a high-performance observability platform for logs, metrics, and traces, well suited for monitoring AI agent runtimes and tool calls.
OpenAI Evals
18.8k Stars
OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
ccusage
16.7k Stars
Analyze the token usage and costs of coding-agent CLIs from local data.
DeepEval
16.6k Stars
DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
RagaAI Catalyst
16.1k Stars
RagaAI Catalyst is an observability, monitoring, and evaluation framework for Agent AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.
Ragas
14.6k Stars
Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.
OpenMetadata
14.4k Stars
OpenMetadata is a unified metadata platform for data and AI, providing data asset discovery, lineage, governance, and agent context retrieval capabilities.
LM Evaluation Harness
13.1k Stars
EleutherAI's framework for few-shot evaluation of language models. It provides standardized evaluation pipelines supporting hundreds of benchmark tasks and is widely adopted in the community as a core LLM evaluation tool.
Related Articles
Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
Agent Hallucination Defense: Practical Mitigation Patterns Beyond Guardrails
Why do LLM agents hallucinate? This article traces root causes and systematically reviews practical mitigation patterns: retrieval augmentation, confidence scoring, multi-agent cross-validation, forced citation backtracking, and observability with UpTrain, Giskard, RagaAI Catalyst, Comet Opik, and NVIDIA Garak.
Agent Observability in Practice: OpenTelemetry to Production Traces
Build a production-grade observability stack for multi-step agents using OpenTelemetry: OpenLLMetry semantic conventions, hierarchical span correlation, token cost attribution, retrieval quality metrics, and layered alerting.