Lighteval: Hugging Face's all-in-one toolkit for evaluating LLMs across multiple backends, deeply integrated with the Hugging Face ecosystem and providing flexible evaluation metrics and benchmark configuration.
lm-evaluation-harness: EleutherAI's framework for few-shot evaluation of language models, providing standardized evaluation pipelines and support for hundreds of benchmark tasks; it is widely adopted as a core LLM evaluation tool in the community.
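For illustration, a minimal sketch of running a benchmark through the harness's Python entry point (`lm_eval.simple_evaluate`); the model name, task choice, and batch size below are placeholder assumptions, not recommendations:

```python
# Minimal sketch (assumes the `lm_eval` package is installed and a HF model is available).
import lm_eval

# Evaluate a Hugging Face model on a single benchmark task with zero-shot prompting.
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],                             # placeholder task
    num_fewshot=0,
    batch_size=8,
)

print(results["results"])  # per-task metric dictionary
```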
Inspect: A framework for large language model evaluations developed by the UK AI Safety Institute (AISI), providing comprehensive model capability assessment tools with support for safety and alignment testing.
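A hedged sketch of what a minimal Inspect task might look like with the `inspect_ai` package; the sample data, solver, scorer, and model identifier are illustrative assumptions rather than a canonical recipe:

```python
# Minimal sketch (assumes the `inspect_ai` package is installed and a model API key is configured).
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def capital_cities():
    # Tiny illustrative dataset: one question with its expected answer.
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),  # ask the model to generate an answer
        scorer=match(),     # score by matching the target string
    )

# Run the task against a placeholder model identifier.
eval(capital_cities(), model="openai/gpt-4o-mini")
```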
Opik: An open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
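As an illustration of the tracing side, a minimal sketch using Opik's `track` decorator to log a function call as a trace; the function body here is a placeholder standing in for a real LLM or agent call:

```python
# Minimal sketch (assumes the `opik` package is installed and configured, e.g. via `opik configure`).
from opik import track

@track  # logs inputs, outputs, and timing of each call as a trace in Opik
def answer_question(question: str) -> str:
    # Placeholder logic standing in for an LLM or agent call.
    return f"Echo: {question}"

answer_question("What does Opik trace?")
```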
Framework for running agent evaluations and creating RL environments to measure and improve agent performance.