Kong
The cloud-native API and AI Gateway providing LLM request routing, rate limiting, load balancing and observability for AI agent applications.
Monitoring and debugging tools for agent apps
An AI prompt optimizer that helps users write better prompts and achieve improved AI results.
Open-source LLM engineering platform providing tracing, evaluations, prompt management, and dataset management with integrations for LangChain, OpenAI, Anthropic, and more.
Open-source LLM observability: tracing, evals, prompt management.
Langfuse is an open-source observability platform for LLM applications, supporting tracing, evaluation, prompt versioning, and cost analytics.
MLflow is the open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing AI agents and LLM applications, with model and data access management.
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
CLI tool that combines LLM prompt testing with red-teaming.
Test and evaluate LLM prompts, agents, and RAG pipelines. Built-in red teaming and security evaluation for reliable AI applications.
Promptfoo is an evaluation and regression testing tool for LLM apps and agents, useful for comparing prompts, tool-call results, and model outputs over time.
End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.
Opik is an open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
OpenObserve is a high-performance observability platform for logs, metrics, and traces, well suited for monitoring AI agent runtimes and tool calls.
OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
Analyzes token usage and costs of coding-agent CLIs from local data.
DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
RagaAI Catalyst is an observability, monitoring, and evaluation framework for Agent AI, supporting agent/LLM/tool tracing, multi-agent debugging, and self-hosted dashboard analytics.
Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.
OpenMetadata is a unified metadata platform for data and AI, providing data asset discovery, lineage, governance, and agent context retrieval capabilities.
EleutherAI's framework for few-shot evaluation of language models. It provides standardized evaluation pipelines covering hundreds of benchmark tasks and is widely adopted in the community as a core LLM evaluation tool.
eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, queryable by AI agents via MCP and humans via dashboard.
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and A/B testing, designed for production agents.
TensorZero is an open-source inference gateway and optimization platform for LLM apps and agent systems, focused on high-performance serving, experimentation, routing, and production observability.
Weights & Biases is an experiment tracking, visualization, and collaboration platform for ML and LLM applications, covering agent training evaluation, hyperparameter management, and model registry workflows.
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
Why do LLM agents hallucinate? This article traces root causes and systematically reviews practical mitigation patterns: retrieval augmentation, confidence scoring, multi-agent cross-validation, forced citation backtracking, and observability with UpTrain, Giskard, RagaAI Catalyst, Comet Opik, and NVIDIA Garak.
Build a production-grade observability stack for multi-step agents using OpenTelemetry: OpenLLMetry semantic conventions, hierarchical span correlation, token cost attribution, retrieval quality metrics, and layered alerting.
A systematic guide to the three pillars of agent observability — distributed tracing, metrics monitoring, and automated evaluation — for building production-grade agent monitoring.
Based on OWASP LLM Top 10 engineering practice, this article systematically explains the seven layers of defense-in-depth for agent prompt injection: input sanitization, instruction isolation, least-privilege, output auditing, guardrails frameworks, continuous red-teaming, and kill switches -- with actionable code and toolchains.
Five-layer defense plus red-team loop, built on five open-source projects you can copy.