DeepEval

Active
GitHub Python Apache-2.0

Description

DeepEval is an open-source framework for evaluating LLM applications. It provides ready-made evaluation metrics and supports both unit and integration testing, helping developers build reliable LLM applications.

Key Features

  • Pytest-compatible LLM evaluation framework with ready-to-use metrics for agents, RAG, and chatbots
  • Agentic metrics including Task Completion, Tool Correctness, Step Efficiency, and Plan Adherence
  • RAG metrics covering Answer Relevancy, Faithfulness, Contextual Recall/Precision/Relevancy, and RAGAS
  • Multi-turn metrics for Knowledge Retention, Conversation Completeness, and Turn Relevancy
  • MCP metrics for evaluating Model Context Protocol agent task completion and tool usage
  • G-Eval and DAG metrics for evaluating custom criteria with LLM-as-a-judge, which the project describes as reaching human-like accuracy

Use Cases

💡 Unit testing LLM applications before deployment to catch quality regressions
💡 Evaluating RAG pipeline accuracy with retrieval and answer quality metrics
💡 Benchmarking models, prompts, and architectures against each other to choose the best-performing configuration
💡 Regression testing chatbots and multi-turn conversational agents
💡 Continuous evaluation in CI/CD pipelines for LLM-powered production systems

Quick Start

Install with `pip install deepeval`, write test cases using metrics such as `AnswerRelevancyMetric` and `FaithfulnessMetric`, and run them with `deepeval test run`, which works like pytest. Results appear in the terminal or on the Confident AI platform.
