
Observability

Monitoring and debugging tools for agent apps

108 projects

Kong

The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.

observability · api · agent +2

Prompt Optimizer

31.6k · TypeScript

Active

An AI prompt optimizer that helps users write better prompts and achieve improved AI results.

prompt-engineering · evaluation · llm +2

Langfuse

30.2k · TypeScript

Active

Open-source LLM engineering platform providing tracing, evaluations, prompt management, and dataset management with integrations for LangChain, OpenAI, Anthropic, and more.

observability · tracing · llm-evaluation +2

Langfuse

30.2k · TypeScript

Active

Open-source LLM observability: tracing, evals, prompt management.

langfuse · observability · tracing +1

Langfuse

30.2k · TypeScript

Active

Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.

observability · tracing · llm +1

MLflow

26.8k · Python

Active

MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.

mlflow · llmops · evaluation +2

12 Factor Agents

23.9k · TypeScript

Stale

What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?

agent · framework · evaluation +2

Promptfoo

22.8k · TypeScript

Active

CLI tool that combines LLM prompt testing with red-teaming.

promptfoo · testing · red-team +1

Promptfoo

22.8k · TypeScript

Active

Test and evaluate LLM prompts, agents, and RAG pipelines. Built-in red teaming and security evaluation for reliable AI applications.

testing · evaluation · red-teaming +2

Promptfoo

22.8k · TypeScript

Active

Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.

evaluation · testing · prompts +1

Agents Towards Production

20.9k · Jupyter Notebook

Active

End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.

agent · framework · evaluation +2

Opik

20.2k · Python

Active

Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.

observability · llm-evaluation · tracing +2

openobserve

19.6k · TypeScript

Active

OpenObserve is a high-performance observability platform for logs, metrics, and traces, well suited for monitoring AI agent runtimes and tool calls.

observability · logs · metrics +2

OpenAI Evals

18.8k · Python

Normal

OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.

llm-evaluation · benchmark · evals +2

ccusage

16.7k · Rust

Active

Analyze token usage and costs of coding-agent CLIs from local data.

token-usage · cost-analysis · cli +2

DeepEval

16.6k · Python

Active

DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.

llm · evaluation · testing +1

RagaAI Catalyst

16.1k · Python

Stale

RagaAI Catalyst is an observability, monitoring, and evaluation framework for Agent AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.

observability · tracing · evaluation +2

Ragas

14.6k · Python

Stale

Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.

rag · evaluation · llm +1

OpenMetadata

14.4k · TypeScript

Active

OpenMetadata is a unified metadata platform for data and AI, providing data asset discovery, lineage, governance, and agent context retrieval capabilities.

observability · metadata · data-governance +2

LM Evaluation Harness

13.1k · Python

Active

EleutherAI's framework for few-shot evaluation of language models. It provides standardized evaluation pipelines covering hundreds of benchmark tasks and is widely adopted in the community as a core LLM evaluation tool.

llm-evaluation · benchmark · evaluation-framework +2

Kubeshark

12.0k · Go

Active

eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, queryable by AI agents via MCP and humans via dashboard.

observability · devops · mcp +2

TensorZero

11.7k · Rust

Active

TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and A/B testing, designed for production agents.

observability · llm · llm-gateway +2

TensorZero

11.7k · Rust

Active

TensorZero is an open-source inference gateway and optimization platform for LLM apps and agent systems, focused on high-performance serving, experimentation, routing, and production observability.

gateway · inference · evaluation +1

Weights & Biases

11.2k · Python

Active

Weights & Biases is an experiment tracking, visualization, and collaboration platform for ML and LLM applications, covering agent training evaluation, hyperparameter management, and model registry workflows.

observability · experiment-tracking · mlops +2

(24 / 108)

Related Articles

Agent Evaluation · LLM Evaluation · Automated Testing

Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines

Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.

RAG · hallucination-detection · agent-evaluation

Agent Hallucination Defense: Practical Mitigation Patterns Beyond Guardrails

Why do LLM agents hallucinate? This article traces root causes and systematically reviews practical mitigation patterns: retrieval augmentation, confidence scoring, multi-agent cross-validation, forced citation backtracking, and observability with UpTrain, Giskard, RagaAI Catalyst, Comet Opik, and NVIDIA Garak.

Observability · OpenTelemetry · LLMOps

Agent Observability in Practice: OpenTelemetry to Production Traces

Build a production-grade observability stack for multi-step agents using OpenTelemetry: OpenLLMetry semantic conventions, hierarchical span correlation, token cost attribution, retrieval quality metrics, and layered alerting.

AI Agent · Observability · Distributed Tracing

Building Agent Observability: From Distributed Tracing to Automated Evaluation

A systematic guide to the three pillars of agent observability — distributed tracing, metrics monitoring, and automated evaluation — for building production-grade agent monitoring.

Security · Prompt Injection · OWASP

Agent Prompt Injection Defense: OWASP LLM01 in Practice

Drawing on engineering practice around the OWASP LLM Top 10, this article explains seven layers of defense-in-depth against agent prompt injection: input sanitization, instruction isolation, least privilege, output auditing, guardrails frameworks, continuous red-teaming, and kill switches, with actionable code and toolchains.

security-guardrails · red-team · prompt-injection

AI Agent Guardrails and Red Teaming in Practice: From Rule Engines to Adversarial Evaluation

Five-layer defense plus red-team loop, built on five open-source projects you can copy.
