Best Observability Top 20
Top 20 most popular open-source Observability projects, ranked by GitHub Stars.
Kong
43.4k Stars · The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.
Prompt Optimizer
28.6k Stars · An AI prompt optimizer that helps users write better prompts and achieve improved AI results.
Langfuse
27.0k Stars · Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.
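A minimal sketch of what Langfuse tracing can look like with the Python SDK's `observe` decorator; the import path varies by SDK version, credentials come from the standard LANGFUSE_* environment variables, and the traced function here is a placeholder rather than a real LLM call.

```python
# Minimal Langfuse tracing sketch (assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY /
# LANGFUSE_HOST are set; older SDK versions import from `langfuse.decorators` instead).
from langfuse import observe


@observe()  # records this function call, its input, and its output as a trace in Langfuse
def answer(question: str) -> str:
    # placeholder for a real LLM call
    return f"echo: {question}"


if __name__ == "__main__":
    print(answer("What does Langfuse trace?"))
```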
MLflow
25.9k Stars · MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.
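A minimal sketch of MLflow's tracking API applied to an LLM evaluation run; the experiment name, parameters, and metric values are illustrative.

```python
# Minimal MLflow tracking sketch: log parameters and metrics for one evaluation run.
# Experiment name, parameters, and metric values are illustrative.
import mlflow

mlflow.set_experiment("agent-eval-demo")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model", "gpt-4o-mini")    # which model variant was evaluated
    mlflow.log_param("prompt_version", "v3")    # prompt revision under test
    mlflow.log_metric("answer_accuracy", 0.87)  # aggregate score from an eval set
    mlflow.log_metric("avg_latency_ms", 420)
```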
Promptfoo
21.2k Stars · Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.
12 Factor Agents
19.8k Stars · What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
Opik
19.3k Stars · Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
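A minimal sketch assuming the Opik Python SDK's `track` decorator for tracing function calls; the summarization function is a placeholder, and credentials are expected to be configured beforehand (for example via `opik configure`).

```python
# Minimal Opik tracing sketch, assuming the SDK's `track` decorator and
# credentials configured ahead of time (e.g. via `opik configure`).
from opik import track


@track  # logs this call, its inputs, and its output as a trace in Opik
def summarize(text: str) -> str:
    # placeholder for a real LLM call
    return text[:100]


print(summarize("Opik records agent and LLM calls for later inspection."))
```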
Agents Towards Production
19.1k Stars · End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.
OpenAI Evals
18.4k Stars · OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
RagaAI Catalyst
16.2k Stars · RagaAI Catalyst is an observability, monitoring, and evaluation framework for agentic AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.
DeepEval
15.3k Stars · DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
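A minimal sketch of a DeepEval unit test: a pytest-style test case scored by an LLM-based metric. It assumes an evaluation model is configured (for example via OPENAI_API_KEY), and the input/output strings are illustrative.

```python
# Minimal DeepEval sketch: a pytest-style test scored by an LLM-based metric.
# Assumes an evaluation model is configured (e.g. OPENAI_API_KEY is set).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does DeepEval do?",
        actual_output="DeepEval scores LLM outputs with evaluation metrics.",
    )
    # fails the test if measured relevancy falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```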
Ragas
13.9k Stars · Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics such as faithfulness, answer relevancy, and context precision, helping developers optimize RAG application performance.
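A minimal sketch using Ragas' classic metrics API (newer releases move toward class-based metrics and an EvaluationDataset); it assumes an LLM provider key is available for the judge model, and the sample row is illustrative.

```python
# Minimal Ragas sketch using the classic metrics API; newer versions use
# class-based metrics and an EvaluationDataset. Requires a judge-model API key.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

data = Dataset.from_dict({
    "question": ["What is Ragas used for?"],
    "answer": ["Ragas evaluates RAG pipelines."],
    "contexts": [["Ragas is a framework for evaluating RAG systems."]],
    "ground_truth": ["Ragas is used to evaluate RAG systems."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores for the dataset
```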
LM Evaluation Harness
12.5k Stars · A framework from EleutherAI for few-shot evaluation of language models, providing standardized evaluation pipelines for hundreds of benchmark tasks; it is widely adopted in the community as a core LLM evaluation tool.
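A minimal sketch of the harness's Python entry point; it is more commonly driven from the CLI (`lm_eval --model hf --tasks ...`). The model and task names are illustrative, and running this downloads model weights and benchmark data.

```python
# Minimal lm-evaluation-harness sketch via its Python API; the CLI (`lm_eval`)
# is the more common entry point. Model and task names are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    limit=10,                                      # small sample for a smoke test
)
print(results["results"]["hellaswag"])             # per-task metric scores
```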
Kubeshark
11.9k Stars · eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, queryable by AI agents via MCP and by humans via the dashboard.
TensorZero
11.4k Stars · TensorZero is an open-source inference gateway and optimization platform for LLM apps and agent systems, focused on high-performance serving, experimentation, routing, and production observability.
Crucix
9.7k Stars · Crucix is a personal intelligence agent that watches the world from multiple data sources and pings you when something changes, helping you stay on top of information in real time.
Arize Phoenix
9.6k Stars · Phoenix is an open-source observability and evaluation tool for LLM and agent applications, supporting online tracing and offline diagnosis.
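A minimal sketch that launches the local Phoenix server and UI in-process; application traces are then sent to it via OpenTelemetry/OpenInference instrumentation, which is not shown here.

```python
# Minimal Phoenix sketch: launch the local Phoenix app; traces arrive via
# OpenTelemetry / OpenInference instrumentation configured separately.
import phoenix as px

session = px.launch_app()   # starts the local Phoenix server and UI
print(session.url)          # open this URL to browse traces and evaluations
```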
BAML
8.2k Stars · BAML is an AI framework that adds engineering rigor to prompt engineering, offering type-safe prompt definitions, automatic testing, version management, and multi-model support across Python, TypeScript, Ruby, Java, C#, Rust, and Go.
Garak
7.8k Stars · NVIDIA's open-source LLM vulnerability scanner that automatically detects security issues in language models, including safety vulnerabilities, hallucination tendencies, jailbreak risks, and prompt injection attacks.
Evidently
7.5k Stars · Evidently is an open-source ML and LLM observability framework with 100+ metrics for evaluating, testing, and monitoring any AI-powered system or data pipeline.
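A minimal sketch using Evidently's classic Report API (the API changed in newer releases); the two tiny DataFrames stand in for reference and production data, and the drift preset writes a standalone HTML dashboard.

```python
# Minimal Evidently sketch using the classic Report API (newer releases changed
# the API). The DataFrames stand in for reference and production data.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.DataFrame({"latency_ms": [120, 130, 125], "score": [0.9, 0.8, 0.85]})
current = pd.DataFrame({"latency_ms": [200, 210, 190], "score": [0.6, 0.7, 0.65]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")   # standalone HTML drift dashboard
```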
Related Articles
Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
Building Agent Observability: From Distributed Tracing to Automated Evaluation
A systematic guide to the three pillars of agent observability — distributed tracing, metrics monitoring, and automated evaluation — for building production-grade agent monitoring.
Agent Observability Playbook: End-to-End Tracing with Langfuse
Based on real production experience, this guide explains how to build a closed loop of tracing, evaluation, and cost analytics for AI agents with Langfuse.