DeepEval

Status: Active | Language: Python | License: Apache-2.0

Description

DeepEval is an open-source evaluation framework for LLM applications. It provides ready-made evaluation metrics and tooling for unit and integration testing, helping developers build reliable LLM applications.

Key Features

Pytest-compatible LLM evaluation framework with ready-to-use metrics for agents, RAG, and chatbots
Agentic metrics including Task Completion, Tool Correctness, Step Efficiency, and Plan Adherence
RAG metrics covering Answer Relevancy, Faithfulness, Contextual Recall/Precision/Relevancy, and RAGAS
Multi-turn metrics for Knowledge Retention, Conversation Completeness, and Turn Relevancy
MCP metrics for evaluating Model Context Protocol agent task completion and tool usage
G-Eval and DAG metrics for custom criteria evaluation using LLM-as-a-judge with human-like accuracy

Use Cases

💡 Unit testing LLM applications before deployment to catch quality regressions

💡 Evaluating RAG pipeline accuracy with retrieval and answer quality metrics

💡 Benchmarking different models, prompts, and architectures for optimal LLM selection

💡 Regression testing chatbots and multi-turn conversational agents

💡 Continuous evaluation in CI/CD pipelines for LLM-powered production systems

Quick Start

Install with `pip install deepeval`. Write test cases using metrics such as `AnswerRelevancyMetric` and `FaithfulnessMetric`, then run them with `deepeval test run`, which works like pytest. Results can be viewed in the terminal or on the Confident AI platform.

Related Projects

Ragas

TensorZero

PromptTools

TruLens

Related Articles

RAG System Evaluation in Practice: Building High-Quality RAG Apps with Ragas and DeepEval