Deep Research Bench
Active
Description
A comprehensive benchmark for deep research agents that provides a systematic framework for evaluating their performance.
Amazon's AI agent evaluation tool for automated quality assessment of Bedrock Agents and other LLM agents with multi-dimensional metrics and benchmarks.
An open-source evaluation and testing library for LLM agents that provides automated model scanning, bias detection, performance benchmarking, and compliance checks.
AgentLabs is a toolkit for agent development and testing, focused on experimentation, replay, and workflow support to improve iteration speed.
DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.