Deep Research Bench

Active

GitHub Python Apache-2.0

Description

A comprehensive benchmark for deep research agents that provides a systematic framework for evaluating their performance.

Related Projects

AWS Agent Evaluation

364 · Python

Stale

Amazon's AI agent evaluation tool for automated quality assessment of Bedrock Agents and other LLM agents with multi-dimensional metrics and benchmarks.

aws, evaluation, benchmark +2

Giskard

5.4k · Python

Active

An open-source evaluation and testing library for LLM agents that provides automated model scanning, bias detection, performance benchmarking, and compliance checks.

evaluation, testing, llm-safety +3

AgentLabs

550 · TypeScript

Stale

AgentLabs is a toolkit for agent development and testing, focused on experimentation, replay, and workflow support to improve iteration speed.

testing, developer-tools, evaluation +1

DeepEval

15.9k · Python

Active

DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.

llm, evaluation, testing +1
