Agenta
Agenta is an open-source LLMOps platform providing prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Monitoring and debugging tools for agent apps
BAML is an AI framework that adds engineering rigor to prompt engineering, offering type-safe prompt definitions, automatic testing, version management, and multi-model support across Python, TypeScript, Ruby, Java, C#, Rust, and Go.
A framework for few-shot evaluation of language models from EleutherAI. It provides standardized evaluation pipelines covering hundreds of benchmark tasks and is widely adopted as a core LLM evaluation tool in the community.
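For a sense of the workflow, here is a minimal sketch of the Python entry point (the model name and task are placeholders; arguments and result layout can differ by version):

```python
import lm_eval

# Evaluate a Hugging Face model on one benchmark task (0-shot).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```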
An open-source evaluation and testing library for LLM agents providing automated model scanning, bias detection, performance benchmarking, and compliance checks.
A prompt management and debugging platform for LLMs, providing prompt logging, request tracking, replay capabilities, and debugging tools to help teams systematically manage LLM interactions and optimize prompts.
NVIDIA NeMo Guardrails is an open-source toolkit for adding programmable guardrails to LLM-based conversational systems, supporting topic control, safety enforcement, and dialog guidance.
OpenShell is the safe, private runtime for autonomous AI agents, developed by NVIDIA. Provides controlled execution environments and resource management.
NVIDIA's open-source LLM vulnerability scanner that automatically detects security issues in language models including safety vulnerabilities, hallucination tendencies, jailbreak risks, and prompt injection attacks.
End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.
An automatic prompt optimization framework by Salesforce AI Research that leverages LLMs to search for and refine prompts for improved model performance.
An open-source AI training tracking and visualization tool with a modern design. Supports PyTorch, Transformers, and more, letting you monitor and evaluate AI agent training runs.
A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024), covering operating systems, databases, knowledge graphs, digital card games and more.
Tencent's full-stack AI red teaming platform integrating OpenClaw security scanning, agent scanning, skills scanning, MCP scanning, AI infrastructure scanning, and LLM jailbreak evaluation.
A framework for large language model evaluations developed by the UK AI Safety Institute (AISI), providing comprehensive model capability assessment tools with support for safety and alignment testing.
Interactive sandboxes for AI agent evaluations and reinforcement learning on third-party APIs like Slack, LinkedIn, and more.
AgentLabs is a toolkit for agent development and testing, focused on experimentation, replay, and workflow support to improve iteration speed.
AgentOps is an observability platform for AI agents, providing monitoring, debugging, and evaluation to help developers optimize agent performance.
Automated harness engineering for AI agents. Auto-generates test harnesses to evaluate agent safety and reliability across different scenarios.
Open-source EDR for AI agents to monitor processes, files, network, and behavior of autonomous AI agents.
Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.
OpenTelemetry instrumentation for AI observability, providing standardized tracing, metrics collection, and span definitions for LLM inference processes to help developers monitor and debug AI agent systems.
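The flavor of this kind of instrumentation, sketched with the standard OpenTelemetry Python SDK (the `gen_ai.*` attribute names follow the OpenTelemetry GenAI semantic conventions; the console exporter is just for illustration):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer provider that prints spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")
with tracer.start_as_current_span("llm.chat") as span:
    # Attribute names follow the OpenTelemetry GenAI semantic conventions.
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 128)
```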
Phoenix is an open-source observability and evaluation tool for LLM and agent applications, supporting online tracing and offline diagnosis.
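A minimal way to try it locally, assuming a recent `arize-phoenix` release (the register helper points an OpenTelemetry tracer provider at the local UI):

```python
import phoenix as px
from phoenix.otel import register

# Start the local Phoenix UI and wire an OTel tracer provider to it.
session = px.launch_app()
tracer_provider = register(project_name="my-agent")
print(session.url)
```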
An open-source evaluation tool for generative AI applications, helping teams build test suites, compare model outputs, and track quality changes over time.
Amazon's AI agent evaluation tool for automated quality assessment of Bedrock Agents and other LLM agents with multi-dimensional metrics and benchmarks.
Blaxel AI SDK is a production-focused toolkit for agent systems, emphasizing tool definitions, execution control, tracing, and service integrations for enterprise apps.
An observability platform for AI agents that tracks model calls, tool executions, task trajectories, and runtime costs.
An operational layer for coding agents with memory, validation, and feedback loops that compound across sessions.
Crucix is a personal intelligence agent that watches the world from multiple data sources and pings you when something changes, helping you stay on top of information in real time.
Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
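A minimal pytest-style sketch (DeepEval's LLM-judged metrics assume a judge model is configured, e.g. via `OPENAI_API_KEY`):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # One test case: the agent's actual output for a given input.
    case = LLMTestCase(
        input="What is your return policy?",
        actual_output="Items can be returned within 30 days of purchase.",
    )
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```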
Coval is an evaluation tool for voice and conversational agents, helping teams test response quality, interaction stability, and real dialog behavior.
A next-generation AI agent optimization platform offering full-lifecycle management, from development and debugging to evaluation and monitoring, with prompt management, agent evaluation, and LLM observability.
CVS Health's open-source uncertainty quantification library for language models, providing UQ-based hallucination detection with confidence scoring and mitigation tools to identify and reduce unreliable LLM outputs.
A real-time observability toolkit for Claude Code agents that tracks hook events to monitor multi-agent coding workflows.
CLI that hooks into your Git workflow to capture AI agent sessions as you work — sessions are indexed alongside commits, creating a searchable record of how code was written in your repo.
Zero-code LLM security and observability proxy with real-time prompt injection detection, PII scanning, and security monitoring.
Evidently is an open-source ML and LLM observability framework with 100+ metrics for evaluating, testing, and monitoring any AI-powered system or data pipeline.
Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.
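A minimal sketch of the classic Ragas API (pre-0.2 style; newer releases restructure the metrics and dataset schema):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluated RAG sample: question, generated answer, retrieved contexts, reference answer.
data = Dataset.from_dict({
    "question": ["Who wrote Hamlet?"],
    "answer": ["Hamlet was written by William Shakespeare."],
    "contexts": [["Hamlet is a tragedy written by William Shakespeare around 1600."]],
    "ground_truth": ["William Shakespeare"],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision]))
```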
An observability platform for workflows, pipelines, and AI agents, providing metrics, logs, and traces for automation systems.
Official Grafana MCP server enabling AI agents to query dashboards, manage alerts, and analyze monitoring data for intelligent ops.
Guardrails AI adds programmable guardrails to large language models, ensuring reliability and safety through input/output validation, structured data extraction, and custom validators.
Framework for running agent evaluations and creating RL environments to measure and improve agent performance.
Helicone is an open-source proxy and observability platform for LLM applications, offering request tracing, caching, and cost analytics.
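Because it works as a proxy, integration is typically just a base-URL and header change on an existing OpenAI client, roughly like this sketch (assumes `HELICONE_API_KEY` is set):

```python
import os
from openai import OpenAI

# Route OpenAI traffic through the Helicone proxy so requests are logged and cached.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```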
A CNCF Sandbox SRE Agent that automatically analyzes infrastructure logs and metrics to assist with incident diagnosis and system operations.
An enterprise-ready Spring AI platform integrating RAG, tool calling, asynchronous ingestion, JWT/RBAC security, and observability.
A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.
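The basic pattern is load-then-compute, as in this minimal sketch:

```python
import evaluate

# Load a metric by name from the Hugging Face Hub and score some predictions.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))
# -> {'accuracy': 0.75}
```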
HuggingFace's all-in-one toolkit for evaluating LLMs across multiple backends, deeply integrated with the HuggingFace ecosystem and providing flexible evaluation metrics and benchmark configuration.
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
A production-focused Agentic RAG course teaching how to build scalable, reliable RAG agent systems with indexing strategies, retrieval optimization, and monitoring.
An evaluation framework for LLM applications providing test set management, metric computation, and output quality assessment for agent development teams.
An AI-native proxy and data plane for agentic apps with built-in orchestration, safety, observability, and smart LLM routing so developers can focus on agent core logic.
The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.
eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, queryable by AI agents via MCP and humans via dashboard.
LangSmith SDK is LangChain's observability toolkit for LLM apps and agents, covering tracing, evaluation, dataset management, and debugging for production workflows.
LangDB is a data and operations tool for LLM and agent applications, helping teams manage prompts, traces, and experiment versions as a lightweight operational layer.
Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.
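A minimal sketch using the v2-style Python SDK (credentials come from the `LANGFUSE_*` environment variables; the newer v3 SDK exposes a different, OTel-based API):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Record one trace with a single LLM generation inside it.
trace = langfuse.trace(name="agent-run", user_id="user-123")
generation = trace.generation(name="llm-call", model="gpt-4o-mini", input="Hello")
generation.end(output="Hi there!")
langfuse.flush()
```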
LangEvals aggregates various language model evaluators into a single platform, providing a standardized LLM evaluation interface with safety checks.
A lightweight open-source observability component for LLM applications, providing tracing, evaluation, and debugging capabilities.
Platform for LLM evaluations and AI agent testing, providing comprehensive tracing, evaluation, and quality monitoring to help teams build reliable AI applications.
An AI prompt optimizer that helps users write better prompts and get better results from AI models.
RouteLLM is a framework for serving and evaluating LLM routers, enabling cost reduction without compromising quality through intelligent request routing across multiple model tiers.
LMNR is an open-source observability platform for LLM and agent applications, focused on tracing, quality analysis, and production diagnostics.
An open-source LLM observability platform providing logging, tracing, feedback, evaluation, and prompt management for chatbots and agent applications.
An observability and gateway platform for LLM applications, providing request tracing, model routing, logging, and cost analysis for agent workflows.
An open-source tool for analyzing and optimizing LLM context, helping developers observe how prompts, memory fragments, and retrieved content affect output.
Meta's set of tools to assess and improve LLM security, including safety benchmarks, prompt injection detection, and output auditing to help evaluate and enhance the safety of large language models.
An open-source tool from Meta for LLM prompt optimization. Automates the process of continuously improving and refining LLM prompts.
A task-aware agent-driven prompt optimization framework from Microsoft Research that iteratively refines prompts for better LLM performance.
Microsoft's AI Agent Governance Toolkit providing policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers all 10 risks in the OWASP Agentic Top 10.
An autonomous web browser QA agent that evaluates performance, functionality, and user experience through GUI or CLI workflows.
MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.
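For the tracing side, a minimal sketch using MLflow Tracing (available in MLflow 2.14+; the traced function and its nested LLM and tool calls appear as spans in the UI):

```python
import mlflow

mlflow.set_experiment("agent-debugging")

@mlflow.trace
def answer(question: str) -> str:
    # Nested LLM and tool calls made here show up as child spans of this trace.
    return "42"

answer("What is the meaning of life?")
```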
A production-ready AI agent framework with tool calling, persistent memory, intelligent concurrency, and event-driven observability.
Monte Carlo’s official toolkit for AI coding agents, bringing data observability, triage, troubleshooting, and health checks into Claude Code, Cursor, and similar tools.
An open-source LLM vulnerability scanner and AI red teaming kit for automated security fuzzing of LLM applications, detecting jailbreaks, prompt injection, and adversarial attacks.
An LLM playground you can run on your laptop. Compare models side-by-side for prompt testing and model evaluation in a local environment.
A toolkit for making AI agents and workflows measurably reliable, with epistemic measurement, Noetic RAG, sentinel gating, and grounded calibration.
OpenCompass is a comprehensive LLM evaluation platform supporting a wide range of models including Llama, Mistral, GPT-4, Qwen, GLM, and Claude across 100+ benchmark datasets.
OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
OpenLIT is an open-source AI engineering platform providing OpenTelemetry-native LLM observability, GPU monitoring, guardrails, evaluations, prompt management, and playground, integrating with 50+ LLM providers and agent frameworks.
OpenPipe Artifacts is a data and artifact management tool for agent and LLM applications, helping teams track prompts, outputs, experiments, and evaluation records.
An open-source, developer-first LLMOps platform for streamlined prompt design, version management, real-time observability, monitoring, and team collaboration across LLM applications.
Demystify AI agents by building them yourself. Local LLMs, no black boxes, real understanding of function calling, memory, and ReAct patterns.
Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.
The security toolkit for LLM interactions, providing prompt injection detection, PII anonymization, content safety auditing, and more to secure production LLM deployments.
An LLM prompt injection detector that combines heuristics, vector similarity, and language model-based detection to identify and block malicious prompt injection attacks.
AI observability platform for production LLM and agent systems by the Pydantic team. Provides real-time monitoring, tracing, and debugging capabilities.
AI Agent Evaluator and Red Team Platform. Provides systematic security evaluation and adversarial testing tools to discover and fix vulnerabilities in agent systems.
An open-source AI monitoring platform supporting observation of model performance, data drift, and production quality metrics for LLM and agent applications.
RagaAI Catalyst is an observability, monitoring, and evaluation framework for Agent AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.
Open source AI Agent evaluation framework for web tasks to measure and compare AI agent performance on web operations.
Langtrace is an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations, and metrics for popular LLMs, agent frameworks, and vector databases.
A security scanner for LLM agentic workflows. Automatically detects security vulnerabilities, prompt injection risks, and permission violations in agent pipelines before deployment.
HELM (Holistic Evaluation of Language Models) is Stanford CRFM's open-source framework for holistic, reproducible, and transparent evaluation of foundation models including LLMs and multimodal models.
Secure, local, cross-platform and programmable sandboxes for AI agents. Provides strict resource isolation using microVM technology.
TensorZero is an open-source inference gateway and optimization platform for LLM apps and agent systems, focused on high-performance serving, experimentation, routing, and production observability.
OpenLLMetry is an open-source observability tool for LLM applications based on OpenTelemetry, providing tracing, metrics, and monitoring capabilities.
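Setup is typically a one-line init plus optional workflow decorators, roughly like this sketch (the app name is a placeholder):

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="my-agent", disable_batch=True)

@workflow(name="answer_question")
def answer(question: str) -> str:
    # LLM calls made inside this function are auto-instrumented via OpenTelemetry.
    return "stub answer"
```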
A tracing and debugging platform for LLM and agent applications, recording prompts, model responses, tool calls, and chain latency for observability.
LLMTracer is a tracing tool for agent and LLM applications, helping developers capture call paths, tool execution, and state transitions for debugging and incident analysis.
TruLens is an open-source tool for evaluating and tracking LLM apps. It provides specialized evaluation for RAG applications including context relevance, groundedness, and answer relevance.
htop for AI Agents to monitor token usage, costs, and tool calls across Claude Code and Codex in real time.
An open-source AISI toolkit for sandboxing agentic evaluations, helping researchers isolate models, tools, and execution environments safely.
An evaluation and monitoring tool for LLM applications that checks response quality, context relevance, factuality, and user feedback for agent systems.
A toolkit by Weights & Biases for developing AI-powered applications, providing LLM call tracing, evaluation experiment management, and versioning from prototype to production.
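A minimal sketch of the core pattern (the project name is a placeholder; decorated calls are logged with inputs, outputs, and latency):

```python
import weave

weave.init("my-agent-project")

@weave.op()
def generate_answer(question: str) -> str:
    # Inputs, outputs, and latency of every call are recorded in Weave.
    return "stub answer"

generate_answer("What does Weave trace?")
```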
An open-source toolkit for monitoring Large Language Models, extracting signals from prompts and responses for quality and safety evaluation.
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
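As an illustration of the "judge that pushes back" idea (not the article's own code), a minimal LLM-as-judge sketch using the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a strict evaluator. Score the answer from 1 to 5 for how well it is "
    "grounded in the provided context. Reply with only the integer score."
)

def judge_groundedness(question: str, context: str, answer: str) -> int:
    # A deterministic judge call; layer it with cheaper rule-based checks in practice.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```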
A systematic guide to the three pillars of agent observability — distributed tracing, metrics monitoring, and automated evaluation — for building production-grade agent monitoring.
Based on real production experience, this guide explains how to build a closed loop of tracing, evaluation, and cost analytics for AI agents with Langfuse.
Learn how to evaluate RAG systems using Ragas and DeepEval, including measuring key metrics like faithfulness, answer relevance, and context precision.