AgentBench
A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024), covering operating systems, databases, knowledge graphs, digital card games and more.
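To give a feel for what "evaluating LLMs as agents" means in practice, here is a purely hypothetical sketch of a multi-turn agent-environment loop; the environment, `call_llm` stub, and scoring below are invented for illustration and are not AgentBench's actual harness or API.

```python
# Hypothetical sketch of an LLM-as-agent evaluation loop in the spirit of
# AgentBench. All names (FakeShellEnv, call_llm, run_episode) are invented.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g., an API or local model client)."""
    return "echo done"  # placeholder action

class FakeShellEnv:
    """Toy 'operating system' task: the agent must issue a target command."""
    def __init__(self, target: str = "echo done", max_turns: int = 5):
        self.target = target
        self.max_turns = max_turns

    def observe(self) -> str:
        return f"Task: make the shell print 'done'. You have {self.max_turns} turns."

    def step(self, action: str) -> tuple[str, bool]:
        done = action.strip() == self.target
        feedback = "ok" if done else "command had no effect"
        return feedback, done

def run_episode(env: FakeShellEnv) -> bool:
    prompt = env.observe()
    for _ in range(env.max_turns):
        action = call_llm(prompt)
        feedback, done = env.step(action)
        if done:
            return True
        # Feed the environment's response back so the agent can retry.
        prompt += f"\nAction: {action}\nResult: {feedback}"
    return False

success = run_episode(FakeShellEnv())
print(f"task success: {success}")  # a benchmark score is mean success over tasks
```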
End-to-end, code-first tutorials for building production-grade GenAI agents, from prototype to enterprise deployment.
Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.
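As a sketch of the feedback-collection workflow, the snippet below pushes model outputs into Argilla for domain experts to review in the UI. It assumes the Argilla 2.x Python SDK and a locally running server; the dataset name, field, and question are illustrative, and exact signatures may differ across versions.

```python
# Sketch: send records to Argilla for human labeling (assumes argilla>=2.0
# and a server at localhost:6900; check the docs for your installed version).
import argilla as rg

client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Define what annotators see (a text field) and what they answer (a label).
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
)

dataset = rg.Dataset(name="support-tickets", settings=settings, client=client)
dataset.create()

# Log records; domain experts then review and label them in the Argilla UI.
dataset.records.log([
    {"text": "The app crashes when I open settings."},
    {"text": "Love the new dashboard, great work!"},
])
```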
A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.
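For instance, loading and computing a metric with `evaluate` takes only a few lines (the predictions and references below are made-up toy data):

```python
# Minimal example with Hugging Face's `evaluate` library (pip install evaluate).
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(result)  # {'accuracy': 0.75} — 3 of 4 predictions match
```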
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?