AgentBench
A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024), covering operating systems, databases, knowledge graphs, digital card games and more.
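To give a feel for what "evaluating LLMs as agents" means in practice, here is a purely hypothetical sketch of a multi-turn agent-environment loop; the environment, `call_llm` stub, and scoring below are invented for illustration and are not AgentBench's actual harness or API.

```python
# Hypothetical sketch of an LLM-as-agent evaluation loop in the spirit of
# AgentBench. All names (FakeShellEnv, call_llm, run_episode) are invented.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g., an API or local model client)."""
    return "echo done"  # placeholder action

class FakeShellEnv:
    """Toy 'operating system' task: the agent must issue a target command."""
    def __init__(self, target: str = "echo done", max_turns: int = 5):
        self.target = target
        self.max_turns = max_turns

    def observe(self) -> str:
        return f"Task: make the shell print 'done'. You have {self.max_turns} turns."

    def step(self, action: str) -> tuple[str, bool]:
        done = action.strip() == self.target
        feedback = "ok" if done else "command had no effect"
        return feedback, done

def run_episode(env: FakeShellEnv) -> bool:
    prompt = env.observe()
    for _ in range(env.max_turns):
        action = call_llm(prompt)
        feedback, done = env.step(action)
        if done:
            return True
        # Feed the environment's response back so the agent can retry.
        prompt += f"\nAction: {action}\nResult: {feedback}"
    return False

success = run_episode(FakeShellEnv())
print(f"task success: {success}")  # a benchmark score is mean success over tasks
```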
End-to-end, code-first tutorials for building production-grade GenAI agents, from prototype to enterprise deployment.
Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.
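As a sketch of the feedback-collection workflow, the snippet below pushes model outputs into Argilla for domain experts to review in the UI. It assumes the Argilla 2.x Python SDK and a locally running server; the dataset name, field, and question are illustrative, and exact signatures may differ across versions.

```python
# Sketch: send records to Argilla for human labeling (assumes argilla>=2.0
# and a server at localhost:6900; check the docs for your installed version).
import argilla as rg

client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Define what annotators see (a text field) and what they answer (a label).
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
)

dataset = rg.Dataset(name="support-tickets", settings=settings, client=client)
dataset.create()

# Log records; domain experts then review and label them in the Argilla UI.
dataset.records.log([
    {"text": "The app crashes when I open settings."},
    {"text": "Love the new dashboard, great work!"},
])
```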
A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.
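For instance, loading and computing a metric with `evaluate` takes only a few lines (the predictions and references below are made-up toy data):

```python
# Minimal example with Hugging Face's `evaluate` library (pip install evaluate).
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(result)  # {'accuracy': 0.75} — 3 of 4 predictions match
```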
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?