Agenta
Agenta is an open-source LLMOps platform providing prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Monitoring and debugging tools for agent apps
BAML is an AI framework that adds engineering rigor to prompt engineering, offering type-safe prompt definitions, automatic testing, version management, and multi-model support across Python, TypeScript, Ruby, Java, C#, Rust, and Go.
A framework for few-shot evaluation of language models from EleutherAI. It provides standardized evaluation pipelines covering hundreds of benchmark tasks and is widely adopted as a core LLM evaluation tool in the community.
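For a sense of the workflow, here is a minimal sketch of the Python entry point (the model name and task are placeholders; arguments and result layout can differ by version):

```python
import lm_eval

# Evaluate a Hugging Face model on one benchmark task (0-shot).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```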
An open-source evaluation and testing library for LLM agents providing automated model scanning, bias detection, performance benchmarking, and compliance checks.
A prompt management and debugging platform for LLMs, providing prompt logging, request tracking, replay capabilities, and debugging tools to help teams systematically manage LLM interactions and optimize prompts.
NVIDIA NeMo Guardrails is an open-source toolkit for adding programmable guardrails to LLM-based conversational systems, supporting topic control, safety enforcement, and dialog guidance.
OpenShell is the safe, private runtime for autonomous AI agents, developed by NVIDIA. Provides controlled execution environments and resource management.
NVIDIA's open-source LLM vulnerability scanner that automatically detects security issues in language models including safety vulnerabilities, hallucination tendencies, jailbreak risks, and prompt injection attacks.
End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.
An automatic prompt optimization framework by Salesforce AI Research that leverages LLMs to search for and refine prompts for improved model performance.
An open-source AI training tracking and visualization tool with a modern design. Supports PyTorch, Transformers, and more, letting you monitor and evaluate AI agent training runs.
A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024), covering operating systems, databases, knowledge graphs, digital card games and more.
Tencent's full-stack AI red teaming platform integrating OpenClaw security scanning, agent scanning, skills scanning, MCP scanning, AI infrastructure scanning, and LLM jailbreak evaluation.
A framework for large language model evaluations developed by the UK AI Safety Institute (AISI), providing comprehensive model capability assessment tools with support for safety and alignment testing.
Interactive sandboxes for AI agent evaluations and reinforcement learning on third-party APIs like Slack, LinkedIn, and more.
AgentLabs is a toolkit for agent development and testing, focused on experimentation, replay, and workflow support to improve iteration speed.
AgentOps is an observability platform for AI agents, providing monitoring, debugging, and evaluation to help developers optimize agent performance.
Automated harness engineering for AI agents. Auto-generates test harnesses to evaluate agent safety and reliability across different scenarios.
Open-source EDR for AI agents to monitor processes, files, network, and behavior of autonomous AI agents.
Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.
OpenTelemetry instrumentation for AI observability, providing standardized tracing, metrics collection, and span definitions for LLM inference processes to help developers monitor and debug AI agent systems.
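The flavor of this kind of instrumentation, sketched with the standard OpenTelemetry Python SDK (the `gen_ai.*` attribute names follow the OpenTelemetry GenAI semantic conventions; the console exporter is just for illustration):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer provider that prints spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")
with tracer.start_as_current_span("llm.chat") as span:
    # Attribute names follow the OpenTelemetry GenAI semantic conventions.
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 128)
```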
Phoenix is an open-source observability and evaluation tool for LLM and agent applications, supporting online tracing and offline diagnosis.
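A minimal way to try it locally, assuming a recent `arize-phoenix` release (the register helper points an OpenTelemetry tracer provider at the local UI):

```python
import phoenix as px
from phoenix.otel import register

# Start the local Phoenix UI and wire an OTel tracer provider to it.
session = px.launch_app()
tracer_provider = register(project_name="my-agent")
print(session.url)
```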
An open-source evaluation tool for generative AI applications, helping teams build test suites, compare model outputs, and track quality changes over time.
Amazon's AI agent evaluation tool for automated quality assessment of Bedrock Agents and other LLM agents with multi-dimensional metrics and benchmarks.
Blaxel AI SDK is a production-focused toolkit for agent systems, emphasizing tool definitions, execution control, tracing, and service integrations for enterprise apps.
An observability platform for AI agents that tracks model calls, tool executions, task trajectories, and runtime costs.
An operational layer for coding agents with memory, validation, and feedback loops that compound across sessions.
Crucix is a personal intelligence agent that watches the world from multiple data sources and pings you when something changes, helping you stay on top of information in real time.
Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
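A minimal pytest-style sketch (DeepEval's LLM-judged metrics assume a judge model is configured, e.g. via `OPENAI_API_KEY`):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # One test case: the agent's actual output for a given input.
    case = LLMTestCase(
        input="What is your return policy?",
        actual_output="Items can be returned within 30 days of purchase.",
    )
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```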
Coval is an evaluation tool for voice and conversational agents, helping teams test response quality, interaction stability, and real dialog behavior.
A next-generation AI agent optimization platform offering full-lifecycle management, from development and debugging to evaluation and monitoring, with prompt management, agent evaluation, and LLM observability.
CVS Health's open-source uncertainty quantification library for language models, providing UQ-based hallucination detection with confidence scoring and mitigation tools to identify and reduce unreliable LLM outputs.
A real-time observability toolkit for Claude Code agents that tracks hook events to monitor multi-agent coding workflows.
CLI that hooks into your Git workflow to capture AI agent sessions as you work — sessions are indexed alongside commits, creating a searchable record of how code was written in your repo.
Zero-code LLM security and observability proxy with real-time prompt injection detection, PII scanning, and security monitoring.
Evidently is an open-source ML and LLM observability framework with 100+ metrics for evaluating, testing, and monitoring any AI-powered system or data pipeline.
Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.
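A minimal sketch of the classic Ragas API (pre-0.2 style; newer releases restructure the metrics and dataset schema):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluated RAG sample: question, generated answer, retrieved contexts, reference answer.
data = Dataset.from_dict({
    "question": ["Who wrote Hamlet?"],
    "answer": ["Hamlet was written by William Shakespeare."],
    "contexts": [["Hamlet is a tragedy written by William Shakespeare around 1600."]],
    "ground_truth": ["William Shakespeare"],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision]))
```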
An observability platform for workflows, pipelines, and AI agents, providing metrics, logs, and traces for automation systems.
Official Grafana MCP server enabling AI agents to query dashboards, manage alerts, and analyze monitoring data for intelligent ops.
Guardrails AI adds programmable guardrails to large language models, ensuring reliability and safety through input/output validation, structured data extraction, and custom validators.
Framework for running agent evaluations and creating RL environments to measure and improve agent performance.
Helicone is an open-source proxy and observability platform for LLM applications, offering request tracing, caching, and cost analytics.
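Because it works as a proxy, integration is typically just a base-URL and header change on an existing OpenAI client, roughly like this sketch (assumes `HELICONE_API_KEY` is set):

```python
import os
from openai import OpenAI

# Route OpenAI traffic through the Helicone proxy so requests are logged and cached.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```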
A CNCF Sandbox SRE Agent that automatically analyzes infrastructure logs and metrics to assist with incident diagnosis and system operations.
An enterprise-ready Spring AI platform integrating RAG, tool calling, asynchronous ingestion, JWT/RBAC security, and observability.
A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.
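The basic pattern is load-then-compute, as in this minimal sketch:

```python
import evaluate

# Load a metric by name from the Hugging Face Hub and score some predictions.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))
# -> {'accuracy': 0.75}
```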
HuggingFace's all-in-one toolkit for evaluating LLMs across multiple backends, deeply integrated with the HuggingFace ecosystem and providing flexible evaluation metrics and benchmark configuration.
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
A production-focused Agentic RAG course teaching how to build scalable, reliable RAG agent systems with indexing strategies, retrieval optimization, and monitoring.
An evaluation framework for LLM applications providing test set management, metric computation, and output quality assessment for agent development teams.
An AI-native proxy and data plane for agentic apps with built-in orchestration, safety, observability, and smart LLM routing so developers can focus on agent core logic.
The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.
eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, queryable by AI agents via MCP and humans via dashboard.
LangSmith SDK is LangChain's observability toolkit for LLM apps and agents, covering tracing, evaluation, dataset management, and debugging for production workflows.
LangDB is a data and operations tool for LLM and agent applications, helping teams manage prompts, traces, and experiment versions as a lightweight operational layer.
Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.
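A minimal sketch using the v2-style Python SDK (credentials come from the `LANGFUSE_*` environment variables; the newer v3 SDK exposes a different, OTel-based API):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Record one trace with a single LLM generation inside it.
trace = langfuse.trace(name="agent-run", user_id="user-123")
generation = trace.generation(name="llm-call", model="gpt-4o-mini", input="Hello")
generation.end(output="Hi there!")
langfuse.flush()
```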
LangEvals aggregates various language model evaluators into a single platform, providing a standardized LLM evaluation interface with safety checks.
A lightweight open-source observability component for LLM applications, providing tracing, evaluation, and debugging capabilities.
Platform for LLM evaluations and AI agent testing, providing comprehensive tracing, evaluation, and quality monitoring to help teams build reliable AI applications.
An AI prompt optimizer that helps users write better prompts and get better results from AI models.
RouteLLM is a framework for serving and evaluating LLM routers, enabling cost reduction without compromising quality through intelligent request routing across multiple model tiers.
LMNR is an open-source observability platform for LLM and agent applications, focused on tracing, quality analysis, and production diagnostics.
An open-source LLM observability platform providing logging, tracing, feedback, evaluation, and prompt management for chatbots and agent applications.
An observability and gateway platform for LLM applications, providing request tracing, model routing, logging, and cost analysis for agent workflows.
An open-source tool for analyzing and optimizing LLM context, helping developers observe how prompts, memory fragments, and retrieved content affect output.
Meta's set of tools to assess and improve LLM security, including safety benchmarks, prompt injection detection, and output auditing to help evaluate and enhance the safety of large language models.
An open-source tool from Meta for LLM prompt optimization. Automates the process of continuously improving and refining LLM prompts.
A task-aware agent-driven prompt optimization framework from Microsoft Research that iteratively refines prompts for better LLM performance.
Microsoft's AI Agent Governance Toolkit providing policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers all 10 risks in the OWASP Agentic Top 10.
An autonomous web browser QA agent that evaluates performance, functionality, and user experience through GUI or CLI workflows.
MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.
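For the tracing side, a minimal sketch using MLflow Tracing (available in MLflow 2.14+; the traced function and its nested LLM and tool calls appear as spans in the UI):

```python
import mlflow

mlflow.set_experiment("agent-debugging")

@mlflow.trace
def answer(question: str) -> str:
    # Nested LLM and tool calls made here show up as child spans of this trace.
    return "42"

answer("What is the meaning of life?")
```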
A production-ready AI agent framework with tool calling, persistent memory, intelligent concurrency, and event-driven observability.
Monte Carlo’s official toolkit for AI coding agents, bringing data observability, triage, troubleshooting, and health checks into Claude Code, Cursor, and similar tools.
An open-source LLM vulnerability scanner and AI red teaming kit for automated security fuzzing of LLM applications, detecting jailbreaks, prompt injection, and adversarial attacks.
An LLM playground you can run on your laptop. Compare models side-by-side for prompt testing and model evaluation in a local environment.
A toolkit for making AI agents and workflows measurably reliable, with epistemic measurement, Noetic RAG, sentinel gating, and grounded calibration.
OpenCompass is a comprehensive LLM evaluation platform supporting a wide range of models including Llama, Mistral, GPT-4, Qwen, GLM, and Claude across 100+ benchmark datasets.
OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
OpenLIT is an open-source AI engineering platform providing OpenTelemetry-native LLM observability, GPU monitoring, guardrails, evaluations, prompt management, and playground, integrating with 50+ LLM providers and agent frameworks.
OpenPipe Artifacts is a data and artifact management tool for agent and LLM applications, helping teams track prompts, outputs, experiments, and evaluation records.
An open-source, developer-first LLMOps platform for streamlined prompt design, version management, real-time observability, monitoring, and team collaboration across LLM applications.
Demystify AI agents by building them yourself. Local LLMs, no black boxes, real understanding of function calling, memory, and ReAct patterns.
Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.
The security toolkit for LLM interactions, providing prompt injection detection, PII anonymization, content safety auditing, and more to secure production LLM deployments.
An LLM prompt injection detector that combines heuristics, vector similarity, and language model-based detection to identify and block malicious prompt injection attacks.
AI observability platform for production LLM and agent systems by the Pydantic team. Provides real-time monitoring, tracing, and debugging capabilities.
AI Agent Evaluator and Red Team Platform. Provides systematic security evaluation and adversarial testing tools to discover and fix vulnerabilities in agent systems.
An open-source AI monitoring platform supporting observation of model performance, data drift, and production quality metrics for LLM and agent applications.
RagaAI Catalyst is an observability, monitoring, and evaluation framework for Agent AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.
Open source AI Agent evaluation framework for web tasks to measure and compare AI agent performance on web operations.
Langtrace is an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations, and metrics for popular LLMs, agent frameworks, and vector databases.
A security scanner for LLM agentic workflows. Automatically detects security vulnerabilities, prompt injection risks, and permission violations in agent pipelines before deployment.
HELM (Holistic Evaluation of Language Models) is Stanford CRFM's open-source framework for holistic, reproducible, and transparent evaluation of foundation models including LLMs and multimodal models.
Secure, local, cross-platform and programmable sandboxes for AI agents. Provides strict resource isolation using microVM technology.
TensorZero is an open-source inference gateway and optimization platform for LLM apps and agent systems, focused on high-performance serving, experimentation, routing, and production observability.
OpenLLMetry is an open-source observability tool for LLM applications based on OpenTelemetry, providing tracing, metrics, and monitoring capabilities.
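Setup is typically a one-line init plus optional workflow decorators, roughly like this sketch (the app name is a placeholder):

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="my-agent", disable_batch=True)

@workflow(name="answer_question")
def answer(question: str) -> str:
    # LLM calls made inside this function are auto-instrumented via OpenTelemetry.
    return "stub answer"
```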
A tracing and debugging platform for LLM and agent applications, recording prompts, model responses, tool calls, and chain latency for observability.
LLMTracer is a tracing tool for agent and LLM applications, helping developers capture call paths, tool execution, and state transitions for debugging and incident analysis.
TruLens is an open-source tool for evaluating and tracking LLM apps. It provides specialized evaluation for RAG applications including context relevance, groundedness, and answer relevance.
htop for AI Agents to monitor token usage, costs, and tool calls across Claude Code and Codex in real time.
An open-source AISI toolkit for sandboxing agentic evaluations, helping researchers isolate models, tools, and execution environments safely.
An evaluation and monitoring tool for LLM applications that checks response quality, context relevance, factuality, and user feedback for agent systems.
A toolkit by Weights & Biases for developing AI-powered applications, providing LLM call tracing, evaluation experiment management, and versioning from prototype to production.
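A minimal sketch of the core pattern (the project name is a placeholder; decorated calls are logged with inputs, outputs, and latency):

```python
import weave

weave.init("my-agent-project")

@weave.op()
def generate_answer(question: str) -> str:
    # Inputs, outputs, and latency of every call are recorded in Weave.
    return "stub answer"

generate_answer("What does Weave trace?")
```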
An open-source toolkit for monitoring Large Language Models, extracting signals from prompts and responses for quality and safety evaluation.
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
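As an illustration of the "judge that pushes back" idea (not the article's own code), a minimal LLM-as-judge sketch using the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a strict evaluator. Score the answer from 1 to 5 for how well it is "
    "grounded in the provided context. Reply with only the integer score."
)

def judge_groundedness(question: str, context: str, answer: str) -> int:
    # A deterministic judge call; layer it with cheaper rule-based checks in practice.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```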
A systematic guide to the three pillars of agent observability — distributed tracing, metrics monitoring, and automated evaluation — for building production-grade agent monitoring.
Based on real production experience, this guide explains how to build a closed loop of tracing, evaluation, and cost analytics for AI agents with Langfuse.
Learn how to evaluate RAG systems using Ragas and DeepEval, including measuring key metrics like faithfulness, answer relevance, and context precision.