AgentList

AgentBench

Normal
GitHub · Python · Apache-2.0

Description

A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024), covering operating systems, databases, knowledge graphs, digital card games and more.

Tags

evaluation · python · agent · framework

Categories

📊 Observability
Visit GitHub

Project Metrics

Stars 3.3k
Forks 0
Watchers 0
Issues 0
Created July 28, 2023
Last commit February 8, 2026

Deployment

Local

Related Projects

Agents Towards Production

18.8k · Jupyter Notebook
Active

End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.

agent · framework · evaluation +2

Argilla

4.9k · Python
Active

Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.

evaluation · data-processing · llm +2

Hugging Face Evaluate

2.4k · Python
Active

A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.

evaluation · llm · python +2

12 Factor Agents

19.4k · TypeScript
Stale

What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?

agent · framework · evaluation +2
AgentList

Curated directory of open-source AI agent projects

Quick Links

  • Project List
  • Featured Articles
  • Browse Categories

Contact

  • About
  • Privacy Policy
  • Contact Us

© 2026 AgentList. All rights reserved.

Made with ♥ for the open source community