HELM
HELM (Holistic Evaluation of Language Models) is Stanford CRFM's open-source framework for holistic, reproducible, and transparent evaluation of foundation models including LLMs and multimodal models.
OpenCompass is a comprehensive LLM evaluation platform supporting a wide range of models, including Llama, Mistral, GPT-4, Qwen, GLM, and Claude, across 100+ benchmark datasets.
Evals is OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
uqlm is CVS Health's open-source uncertainty quantification (UQ) library for language models, providing UQ-based hallucination detection with confidence scoring and mitigation tools to identify and reduce unreliable LLM outputs.
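One common UQ signal for hallucination detection is self-consistency: sample several answers to the same prompt and score confidence by how strongly they agree. The sketch below illustrates that idea in plain Python; it is a conceptual example, not the library's actual API, and the sample answers are hypothetical.

```python
from collections import Counter

def consistency_confidence(samples: list[str]) -> float:
    """Score confidence as the fraction of sampled answers that agree
    with the most common answer (a simple self-consistency UQ signal).
    Low agreement across samples suggests a possible hallucination."""
    if not samples:
        return 0.0
    top_count = Counter(samples).most_common(1)[0][1]
    return top_count / len(samples)

# Hypothetical answers sampled from an LLM for the same question.
score = consistency_confidence(["Paris", "Paris", "Paris", "Lyon"])
print(score)  # 0.75 — three of four samples agree
```

A downstream system might flag any answer whose score falls below a chosen threshold for review or regeneration.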
Guardrails AI adds programmable guardrails to large language models, ensuring reliability and safety through input/output validation, structured data extraction, and custom validators.
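The output-validation pattern Guardrails AI implements can be sketched in plain Python: a validator inspects model output and reports pass/fail with a reason, so the caller can block, fix, or regenerate the response. This is a conceptual illustration, not the Guardrails AI API; the validator and regex are hypothetical.

```python
import re

def email_free(text: str) -> tuple[bool, str]:
    """Toy output validator: fail if the text leaks an email address.
    Returns (passed, reason) so the caller can decide how to react."""
    if re.search(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text):
        return False, "output contains an email address"
    return True, "ok"

passed, reason = email_free("Contact me at alice@example.com for details.")
print(passed, reason)  # False output contains an email address
```

In the real library, such validators are registered on a guard object that wraps the LLM call and applies them to every input and output automatically.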