Inspect AI
A framework for large language model evaluations developed by the UK AI Safety Institute (AISI), providing a comprehensive toolkit for assessing model capabilities, with support for safety and alignment testing.
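A minimal sketch of an Inspect eval following its Task/solver/scorer pattern; the toy sample, task name, and model id here are illustrative:

```python
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def arithmetic_check():
    # toy one-sample dataset; real evals typically load datasets from files or hubs
    return Task(
        dataset=[Sample(input="What is 2 + 2? Answer with the number only.", target="4")],
        solver=[generate()],  # the solver chain that produces model output
        scorer=exact(),       # grades the output against the sample target
    )

# programmatic run; the CLI equivalent is `inspect eval <file.py> --model <provider/model>`
eval(arithmetic_check(), model="openai/gpt-4o-mini")
```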
lm-evaluation-harness
A framework for few-shot evaluation of language models from EleutherAI, providing a standardized evaluation pipeline that covers hundreds of benchmark tasks; it is widely adopted as a core LLM evaluation tool in the community.
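A hedged sketch using the harness's Python entry point, `lm_eval.simple_evaluate`; the model id and task name are illustrative, and the `lm_eval` CLI is the more common path:

```python
import lm_eval

# evaluate a small Hugging Face model on one bundled benchmark task
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal-LM id
    tasks=["hellaswag"],                             # one of the hundreds of bundled tasks
    num_fewshot=0,                                   # zero-shot here; few-shot is a parameter
)
print(results["results"])
```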
LightEval
Hugging Face's all-in-one toolkit for evaluating LLMs across multiple backends, deeply integrated with the Hugging Face ecosystem and providing flexible evaluation metrics and benchmark configuration.
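LightEval is usually driven from its CLI; below is a hedged sketch of its documented task-spec format ("suite|task|num_fewshot|truncate_fewshots"), shelled out from Python. The model-args key has varied across versions (`pretrained=` vs. `model_name=`), so treat the exact arguments as assumptions:

```python
import subprocess

# "suite|task|few-shot count|truncate-few-shots flag"
task_spec = "leaderboard|truthfulqa:mc|0|0"

# run the accelerate backend on a small HF model; arg names may differ by version
subprocess.run(
    ["lighteval", "accelerate", "model_name=openai-community/gpt2", task_spec],
    check=True,
)
```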
Opik
An open-source LLM observability platform providing agent tracing, evaluation testing, and prompt experiment management to help developers monitor and optimize AI agent systems.
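A minimal tracing sketch with Opik's `@track` decorator; the function body is a stand-in for a real LLM call:

```python
import opik
from opik import track

opik.configure()  # point the SDK at a Comet-hosted or self-hosted Opik instance

@track  # records inputs, outputs, and timing for this call as a trace
def answer(question: str) -> str:
    # stand-in for a real LLM call (e.g., via an OpenAI or LiteLLM client)
    return f"echo: {question}"

answer("What does Opik trace?")
```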
UQLM
CVS Health's open-source uncertainty quantification (UQ) library for language models, providing UQ-based hallucination detection with confidence scoring and mitigation tools to identify and reduce unreliable LLM outputs.
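A hedged sketch of UQLM's black-box scorer, which samples several responses per prompt and scores their consistency as a confidence signal; the LangChain model and scorer name follow the uqlm README, but exact parameters may vary by version:

```python
import asyncio

from langchain_openai import ChatOpenAI
from uqlm import BlackBoxUQ

async def main():
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)  # sampling needs temperature > 0
    uq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"])
    results = await uq.generate_and_score(
        prompts=["Who wrote 'The Hobbit'?"],
        num_responses=5,  # responses sampled per prompt for consistency scoring
    )
    print(results.to_df())  # per-prompt responses with confidence scores

asyncio.run(main())
```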