Best Observability Top 20
Top 20 most popular open-source Observability projects, ranked by GitHub Stars.
Kong
43.4k Stars · The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.
Prompt Optimizer
28.6k Stars · An AI prompt optimizer that helps users write better prompts and achieve improved AI results.
Langfuse
27.0k Stars · Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.
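A minimal sketch of what Langfuse tracing can look like with the Python SDK's `observe` decorator; the import path varies by SDK version, credentials come from the standard LANGFUSE_* environment variables, and the traced function here is a placeholder rather than a real LLM call.

```python
# Minimal Langfuse tracing sketch (assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY /
# LANGFUSE_HOST are set; older SDK versions import from `langfuse.decorators` instead).
from langfuse import observe


@observe()  # records this function call, its input, and its output as a trace in Langfuse
def answer(question: str) -> str:
    # placeholder for a real LLM call
    return f"echo: {question}"


if __name__ == "__main__":
    print(answer("What does Langfuse trace?"))
```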
MLflow
25.9k Stars · MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.
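A minimal sketch of MLflow's tracking API applied to an LLM evaluation run; the experiment name, parameters, and metric values are illustrative.

```python
# Minimal MLflow tracking sketch: log parameters and metrics for one evaluation run.
# Experiment name, parameters, and metric values are illustrative.
import mlflow

mlflow.set_experiment("agent-eval-demo")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model", "gpt-4o-mini")    # which model variant was evaluated
    mlflow.log_param("prompt_version", "v3")    # prompt revision under test
    mlflow.log_metric("answer_accuracy", 0.87)  # aggregate score from an eval set
    mlflow.log_metric("avg_latency_ms", 420)
```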
Promptfoo
21.2k Stars · Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.
12 Factor Agents
19.8k Stars · What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
Opik
19.3k Stars · Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
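A minimal sketch assuming the Opik Python SDK's `track` decorator for tracing function calls; the summarization function is a placeholder, and credentials are expected to be configured beforehand (for example via `opik configure`).

```python
# Minimal Opik tracing sketch, assuming the SDK's `track` decorator and
# credentials configured ahead of time (e.g. via `opik configure`).
from opik import track


@track  # logs this call, its inputs, and its output as a trace in Opik
def summarize(text: str) -> str:
    # placeholder for a real LLM call
    return text[:100]


print(summarize("Opik records agent and LLM calls for later inspection."))
```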
Agents Towards Production
19.1k Stars · End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.
OpenAI Evals
18.4k Stars · OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
RagaAI Catalyst
16.2k Stars · RagaAI Catalyst is an observability, monitoring, and evaluation framework for agentic AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.
DeepEval
15.3k Stars · DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
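A minimal sketch of a DeepEval unit test: a pytest-style test case scored by an LLM-based metric. It assumes an evaluation model is configured (for example via OPENAI_API_KEY), and the input/output strings are illustrative.

```python
# Minimal DeepEval sketch: a pytest-style test scored by an LLM-based metric.
# Assumes an evaluation model is configured (e.g. OPENAI_API_KEY is set).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does DeepEval do?",
        actual_output="DeepEval scores LLM outputs with evaluation metrics.",
    )
    # fails the test if measured relevancy falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```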
Ragas
13.9k Stars · Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics such as faithfulness, answer relevancy, and context precision, helping developers optimize RAG application performance.
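A minimal sketch using Ragas' classic metrics API (newer releases move toward class-based metrics and an EvaluationDataset); it assumes an LLM provider key is available for the judge model, and the sample row is illustrative.

```python
# Minimal Ragas sketch using the classic metrics API; newer versions use
# class-based metrics and an EvaluationDataset. Requires a judge-model API key.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

data = Dataset.from_dict({
    "question": ["What is Ragas used for?"],
    "answer": ["Ragas evaluates RAG pipelines."],
    "contexts": [["Ragas is a framework for evaluating RAG systems."]],
    "ground_truth": ["Ragas is used to evaluate RAG systems."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores for the dataset
```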
LM Evaluation Harness
12.5k Stars · A framework from EleutherAI for few-shot evaluation of language models, providing standardized evaluation pipelines for hundreds of benchmark tasks; it is widely adopted in the community as a core LLM evaluation tool.
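A minimal sketch of the harness's Python entry point; it is more commonly driven from the CLI (`lm_eval --model hf --tasks ...`). The model and task names are illustrative, and running this downloads model weights and benchmark data.

```python
# Minimal lm-evaluation-harness sketch via its Python API; the CLI (`lm_eval`)
# is the more common entry point. Model and task names are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    limit=10,                                      # small sample for a smoke test
)
print(results["results"]["hellaswag"])             # per-task metric scores
```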
Kubeshark
11.9k Stars · eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, queryable by AI agents via MCP and by humans via the dashboard.
TensorZero
11.4k Stars · TensorZero is an open-source inference gateway and optimization platform for LLM apps and agent systems, focused on high-performance serving, experimentation, routing, and production observability.
Crucix
9.7k Stars · Crucix is a personal intelligence agent that watches the world from multiple data sources and pings you when something changes, helping you stay on top of information in real time.
Arize Phoenix
9.6k Stars · Phoenix is an open-source observability and evaluation tool for LLM and agent applications, supporting online tracing and offline diagnosis.
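A minimal sketch that launches the local Phoenix server and UI in-process; application traces are then sent to it via OpenTelemetry/OpenInference instrumentation, which is not shown here.

```python
# Minimal Phoenix sketch: launch the local Phoenix app; traces arrive via
# OpenTelemetry / OpenInference instrumentation configured separately.
import phoenix as px

session = px.launch_app()   # starts the local Phoenix server and UI
print(session.url)          # open this URL to browse traces and evaluations
```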
BAML
8.2k Stars · BAML is an AI framework that adds engineering rigor to prompt engineering, offering type-safe prompt definitions, automatic testing, version management, and multi-model support across Python, TypeScript, Ruby, Java, C#, Rust, and Go.
Garak
7.8k Stars · NVIDIA's open-source LLM vulnerability scanner that automatically detects security issues in language models, including safety vulnerabilities, hallucination tendencies, jailbreak risks, and prompt injection attacks.
Evidently
7.5k Stars · Evidently is an open-source ML and LLM observability framework with 100+ metrics for evaluating, testing, and monitoring any AI-powered system or data pipeline.
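A minimal sketch using Evidently's classic Report API (the API changed in newer releases); the two tiny DataFrames stand in for reference and production data, and the drift preset writes a standalone HTML dashboard.

```python
# Minimal Evidently sketch using the classic Report API (newer releases changed
# the API). The DataFrames stand in for reference and production data.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.DataFrame({"latency_ms": [120, 130, 125], "score": [0.9, 0.8, 0.85]})
current = pd.DataFrame({"latency_ms": [200, 210, 190], "score": [0.6, 0.7, 0.65]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")   # standalone HTML drift dashboard
```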
Related Articles
Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
Building Agent Observability: From Distributed Tracing to Automated Evaluation
A systematic guide to the three pillars of agent observability — distributed tracing, metrics monitoring, and automated evaluation — for building production-grade agent monitoring.
Agent Observability Playbook: End-to-End Tracing with Langfuse
Based on real production experience, this guide explains how to build a closed loop of tracing, evaluation, and cost analytics for AI agents with Langfuse.