OpenAI Evals
NormalDescription
OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
OpenAI's framework for evaluating LLMs and LLM systems, providing an open-source registry of benchmarks and tools for systematic model assessment.
CVS Health's open-source uncertainty quantification library for language models, providing UQ-based hallucination detection with confidence scoring and mitigation tools to identify and reduce unreliable LLM outputs.
Guardrails AI adds programmable guardrails to large language models, ensuring reliability and safety through input/output validation, structured data extraction, and custom validators.
NVIDIA's open-source LLM vulnerability scanner that automatically detects security issues in language models including safety vulnerabilities, hallucination tendencies, jailbreak risks, and prompt injection attacks.
OpenCompass is a comprehensive LLM evaluation platform supporting a wide range of models including Llama, Mistral, GPT-4, Qwen, GLM, and Claude across 100+ benchmark datasets.