AgentList

AgentBench

Stale
GitHub · Python · Apache-2.0

Description

A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024), covering operating systems, databases, knowledge graphs, digital card games and more.

Tags

evaluation python agent framework

Categories

📊 Observability

Project Metrics

Stars 3.5k
Forks 260
Watchers 3.5k
Issues 72
Created July 28, 2023
Last commit February 8, 2026

Deployment

Local

Related Projects

Agents Towards Production

20.6k · Jupyter Notebook
Active

End-to-end, code-first tutorials for building production-grade GenAI agents. From prototype to enterprise deployment.

agent framework evaluation +2

Argilla

5.0k · Python
Active

Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.

evaluation data-processing llm +2

Hugging Face Evaluate

2.5k · Python
Active

A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.

evaluation llm python +2

12 Factor Agents

22.9k · TypeScript
Stale

What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?

agent framework evaluation +2

© 2026 AgentList. All rights reserved.