HELM
HELM (Holistic Evaluation of Language Models) is Stanford CRFM's open-source framework for holistic, reproducible, and transparent evaluation of foundation models including LLMs and multimodal models.
OpenCompass is a comprehensive LLM evaluation platform supporting a wide range of models, including Llama, Mistral, GPT-4, Qwen, GLM, and Claude, across 100+ benchmark datasets.
Evals is OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
uqlm is CVS Health's open-source uncertainty quantification (UQ) library for language models, providing UQ-based hallucination detection with confidence scoring and mitigation tools to identify and reduce unreliable LLM outputs.
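One common UQ signal for hallucination detection is self-consistency: sample several answers to the same prompt and score confidence by how strongly they agree. The sketch below illustrates that idea in plain Python; it is a conceptual example, not the library's actual API, and the sample answers are hypothetical.

```python
from collections import Counter

def consistency_confidence(samples: list[str]) -> float:
    """Score confidence as the fraction of sampled answers that agree
    with the most common answer (a simple self-consistency UQ signal).
    Low agreement across samples suggests a possible hallucination."""
    if not samples:
        return 0.0
    top_count = Counter(samples).most_common(1)[0][1]
    return top_count / len(samples)

# Hypothetical answers sampled from an LLM for the same question.
score = consistency_confidence(["Paris", "Paris", "Paris", "Lyon"])
print(score)  # 0.75 — three of four samples agree
```

A downstream system might flag any answer whose score falls below a chosen threshold for review or regeneration.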
Guardrails AI adds programmable guardrails to large language models, ensuring reliability and safety through input/output validation, structured data extraction, and custom validators.
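The output-validation pattern Guardrails AI implements can be sketched in plain Python: a validator inspects model output and reports pass/fail with a reason, so the caller can block, fix, or regenerate the response. This is a conceptual illustration, not the Guardrails AI API; the validator and regex are hypothetical.

```python
import re

def email_free(text: str) -> tuple[bool, str]:
    """Toy output validator: fail if the text leaks an email address.
    Returns (passed, reason) so the caller can decide how to react."""
    if re.search(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text):
        return False, "output contains an email address"
    return True, "ok"

passed, reason = email_free("Contact me at alice@example.com for details.")
print(passed, reason)  # False output contains an email address
```

In the real library, such validators are registered on a guard object that wraps the LLM call and applies them to every input and output automatically.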