AgentLab
An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
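To make the workflow concrete, here is a minimal sketch of what running an agent through a benchmark task looks like in such a framework; the `WebAgent` and `run_episode` names and the task-dict layout are illustrative assumptions for this example, not AgentLab's actual API.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    task_id: str
    success: bool
    steps: int


class WebAgent:
    """Hypothetical LLM-backed agent: maps an observation to an action."""

    def act(self, observation: str) -> str:
        # A real agent would prompt an LLM with the page state here.
        return "click('submit')"


def run_episode(agent: WebAgent, task: dict, max_steps: int = 10) -> EpisodeResult:
    """Roll the agent through one task and record the outcome."""
    observation, done, steps = task["initial_observation"], False, 0
    while not done and steps < max_steps:
        action = agent.act(observation)
        # A real framework would execute the action in a browser and
        # return the next page state plus a success signal.
        observation, done = task["oracle"](action)
        steps += 1
    return EpisodeResult(task["id"], done, steps)


task = {
    "id": "demo-form",
    "initial_observation": "<form><button>submit</button></form>",
    # Toy oracle: succeeds as soon as the agent clicks submit.
    "oracle": lambda action: ("done", "submit" in action),
}
print(run_episode(WebAgent(), task))
```

Running many such episodes across tasks, with results logged per episode, is what makes the benchmarking scalable and reproducible.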
The first LLM-based agent and benchmark for generalist web agents, providing datasets, evaluation frameworks, and baseline methods for building agents that operate on real websites.
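As a rough illustration of how such an evaluation framework can score an agent against a dataset, the sketch below compares predicted (element, operation) steps with reference steps; the `Step` layout and the `step_accuracy` metric are assumptions made for this example, not the project's real schema.

```python
from typing import NamedTuple


class Step(NamedTuple):
    element: str     # target DOM element, e.g. a CSS selector
    operation: str   # e.g. "CLICK", "TYPE"


def step_accuracy(predicted: list[Step], gold: list[Step]) -> float:
    """Fraction of reference steps the agent reproduced exactly, in order."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold) if gold else 0.0


gold = [Step("button#search", "CLICK"), Step("input#query", "TYPE")]
pred = [Step("button#search", "CLICK"), Step("input#q", "TYPE")]
print(step_accuracy(pred, gold))  # 0.5: one of two steps matched
```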
WebArena is a realistic benchmark environment for evaluating autonomous web agents. It provides Gym-like interactive website simulations covering e-commerce, forums, content management, and more, enabling end-to-end task evaluation and serving as a standard testbed for web agent research.
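The Gym-like interface implies the familiar reset/step control loop. The sketch below shows that loop against a stubbed environment, since the real one requires the Dockerized WebArena sites; `FakeWebEnv` and its observation strings are stand-ins, not WebArena's actual classes.

```python
class FakeWebEnv:
    """Stub standing in for the real browser environment, which needs
    the Dockerized WebArena sites to run."""

    def reset(self):
        return "accessibility tree of the start page", {}

    def step(self, action: str):
        obs = f"page state after executing: {action}"
        # One-step toy episode: reward 1.0 and terminate immediately.
        return obs, 1.0, True, False, {}


env = FakeWebEnv()
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = "click [42]"  # a real agent would choose this from obs
    obs, reward, terminated, truncated, info = env.step(action)
print("episode reward:", reward)
```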
A research project exploring how models understand web interfaces, decompose tasks into action steps, and complete complex online workflows through browser-agent capabilities.
Amazon's AI agent evaluation tool, which automates quality assessment of Bedrock Agents and other LLM agents using multi-dimensional metrics and benchmarks.
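As a sketch of what multi-dimensional quality assessment can look like, the example below scores a single agent transcript along several dimensions and averages them; the check names and the `evaluate` helper are hypothetical illustrations, not the tool's actual metrics or API.

```python
from typing import Callable


def evaluate(transcript: str, checks: dict[str, Callable[[str], bool]]) -> dict[str, float]:
    """Score one agent transcript on several quality dimensions (0.0 or 1.0)."""
    return {name: float(check(transcript)) for name, check in checks.items()}


# The dimensions below are illustrative, not the tool's built-in metrics.
checks = {
    "task_completion": lambda t: "order confirmed" in t,
    "groundedness": lambda t: "source:" in t,      # cites its evidence
    "safety": lambda t: "password" not in t,       # leaks no secrets
}

transcript = "Looked up the item (source: catalog) ... order confirmed."
scores = evaluate(transcript, checks)
overall = sum(scores.values()) / len(scores)
print(scores, "overall:", overall)
```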