LM Evaluation Harness
Description
A framework by EleutherAI for few-shot evaluation of language models. It provides standardized evaluation pipelines supporting hundreds of benchmark tasks and is widely adopted in the community as a core tool for LLM evaluation.