RAG System Evaluation in Practice: Building High-Quality RAG Apps with Ragas and DeepEval

Learn how to evaluate RAG systems using Ragas and DeepEval, including measuring key metrics like faithfulness, answer relevance, and context precision.

AgentList Team · February 25, 2025
Tags: RAG Evaluation · Ragas · DeepEval · LLM Applications

Building high-quality RAG applications requires systematic evaluation methods. This article introduces how to use Ragas and DeepEval for RAG system evaluation.

Why Evaluate RAG?

RAG system quality depends on multiple factors:

  • Relevance of retrieved documents
  • Accuracy of generated answers
  • Faithfulness to context
  • Completeness and usefulness of responses

Key Evaluation Metrics

1. Context Precision

Measures how much of the retrieved context is actually relevant to the question, and whether the relevant chunks are ranked near the top.

2. Faithfulness

Measures whether the claims in the generated answer are supported by the retrieved context.

3. Answer Relevance

Measures how relevant the answer is to the question.

4. Context Recall

Measures whether retrieval captured all the information needed to answer the question, usually judged against a ground-truth answer.
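To make these definitions concrete, here is a simplified sketch of the arithmetic behind them. It assumes the relevance and support labels are supplied by hand; Ragas and DeepEval obtain such judgments from a judge LLM, and their exact formulas may differ. The function names are illustrative and do not come from either library.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Rank-weighted precision over retrieved chunks.

    relevance[i] is True if the chunk at rank i + 1 is relevant.
    The score averages precision@k over the ranks k that hold a
    relevant chunk, so relevant chunks near the top score higher.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def supported_fraction(supported: Sequence[bool]) -> float:
    """Fraction of items labeled True.

    Faithfulness-style use: one label per claim in the answer
    (True if the retrieved context supports it).
    Recall-style use: one label per ground-truth statement
    (True if the retrieved context contains it).
    """
    return sum(supported) / len(supported) if supported else 0.0


if __name__ == "__main__":
    # Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
    print(round(context_precision([True, False, True]), 3))
    # Three of four answer claims are supported by the context
    print(supported_fraction([True, True, False, True]))
```

Moving the second relevant chunk from rank 3 to rank 2 raises the precision score to 1.0, which is why this metric is sensitive to ranking and not only to what was retrieved.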

Evaluating with Ragas

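from datasets import Dataset  # Hugging Face `datasets`; Ragas expects a Dataset, not a plain dict
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Prepare evaluation data: one entry per sample; "contexts" holds a list of strings per sample.
# Column names follow the Ragas 0.1-style schema and may differ in newer releases.
data = {
    "question": ["Question 1", "Question 2"],
    "answer": ["Answer 1", "Answer 2"],
    "contexts": [["Context 1"], ["Context 2"]],
    "ground_truth": ["Ground Truth 1", "Ground Truth 2"],
}
dataset = Dataset.from_dict(data)

# Run evaluation (this calls a judge LLM, so credentials for the configured provider must be set)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)
print(results)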

Evaluating with DeepEval

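from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Faithfulness is scored by a judge LLM, so a model and its credentials must be configured
metric = FaithfulnessMetric()

# One test case: the user input, the RAG system's answer, and the retrieved chunks
test_case = LLMTestCase(
    input="Question",
    actual_output="Actual Answer",
    retrieval_context=["Context"],
)

evaluate(test_cases=[test_case], metrics=[metric])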

Best Practices for Evaluation Process

  1. Establish a Baseline: Run a fixed, standard dataset to record baseline scores
  2. Continuous Monitoring: Run evaluations regularly and track how scores change
  3. Iterative Optimization: Adjust system parameters based on evaluation results
  4. A/B Testing: Compare different configurations on the same eval set
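A baseline is most useful when every later run is compared against it automatically. The sketch below assumes metric scores are stored as plain name-to-score dictionaries; the function name and the 0.02 tolerance are illustrative choices, not part of Ragas or DeepEval.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`.

    Keys are metric names, values are the size of the drop.
    Metrics missing from `current` are skipped.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue
        drop = base_score - current[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


if __name__ == "__main__":
    baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
    candidate = {"faithfulness": 0.84, "answer_relevancy": 0.86}
    # Only faithfulness fell by more than the tolerance
    print(find_regressions(baseline, candidate))
```

The same comparison works for A/B testing: treat configuration A as the baseline and configuration B as the current run.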

Common Issues and Optimization

Issue: Low Faithfulness

  • Optimize prompt design
  • Apply hallucination-reduction strategies
  • Add context constraints

Issue: Low Relevance

  • Improve retrieval strategy
  • Adjust embedding model
  • Optimize query rewriting

Summary

Systematic evaluation is key to building high-quality RAG applications. With Ragas and DeepEval, we can quantify evaluation results and continuously optimize system performance.

Metric Selection: Don't Run Everything

Ragas and DeepEval offer 20+ metrics, but running them all is slow and expensive. Priority guidance:

Must run first (explains 80% of quality issues):

  • Faithfulness: whether answers stay grounded in context (hallucination detection)
  • Answer Relevance: whether answers address the question
  • Context Precision: whether retrieved top-K documents are relevant

Add in second optimization round:

  • Context Recall: whether key documents are missed
  • Answer Correctness: right vs wrong relative to ground truth
  • Answer Similarity: semantic similarity to reference answer

Add for specific issue diagnosis:

  • Hallucination metric: isolate hallucinations
  • Toxicity metric: content safety
  • Bias metric: fairness

Building the Eval Set: Start from Online Logs

The most useful evaluation sets come from production logs. A workable process:

  1. Collect last 30 days of online queries and answers
  2. Human-label 200-500 as baseline
  3. Stratify by business type (FAQ, technical Q&A, policy questions, etc.)
  4. Add 20-30 new cases monthly so the set doesn't go stale

Don't pursue a "perfect 5000-case evaluation set" — business changes every 3 months and labels age quickly. Continuously updated small sets beat one-shot large sets.
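The collect-and-stratify steps above can be sketched as a small sampling routine. This assumes each log record is a dictionary with a `business_type` field; that field name and the record layout are hypothetical and should be adapted to your own logging schema.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    records: List[Dict],
    key: str,
    per_stratum: int,
    seed: int = 0,
) -> List[Dict]:
    """Pick up to `per_stratum` records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for stratum in sorted(groups):
        items = groups[stratum]
        k = min(per_stratum, len(items))  # small strata contribute all they have
        sample.extend(rng.sample(items, k))
    return sample


if __name__ == "__main__":
    # Hypothetical log records; real ones would also carry the answer and contexts
    logs = [{"query": f"faq {i}", "business_type": "faq"} for i in range(10)]
    logs += [{"query": f"policy {i}", "business_type": "policy"} for i in range(3)]
    to_label = stratified_sample(logs, key="business_type", per_stratum=5)
    print(len(to_label))  # 5 FAQ records plus all 3 policy records
```

Sampling per stratum keeps low-volume business types represented, which a uniform random sample of the logs would tend to miss.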

Real Differences Between Ragas and DeepEval

Differences in positioning:

Dimension | Ragas | DeepEval
Metric system | Rich RAG-specific metrics | General LLM + RAG metrics
Custom metrics | Supported | Strong (Pytest-style)
CI/CD integration | Average | Strong
Chinese support | Average | Better
Performance | Medium | Fast
Learning curve | Medium | Low

Recommendations:

  • Heavy RAG-metric exploration → Ragas
  • Engineering focus (CI/CD, unit-test style) → DeepEval
  • The two tools can coexist, each covering different scenarios, but don't measure the same metric with both

Combining Offline Eval and Online Monitoring

Evaluation can't be offline-only; it must also connect to live traffic:

  • Offline eval: nightly run on historical query samples, produce trend reports
  • Online A/B testing: when shipping a new version, compare 5% of traffic, watch metric changes
  • Online real-time monitoring: monitor rolling averages of Faithfulness and Answer Relevance; alert on anomaly

The problem with pure offline eval: "evaluation set drifts from online distribution" — you optimize for past problems while the live traffic has new ones.
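The real-time monitoring idea above (a rolling average with an alert) can be sketched in a few lines. This assumes per-request metric scores arrive one at a time from some scoring step; the class name, window size, and threshold are illustrative and not part of either library.

```python
from collections import deque


class RollingMetricMonitor:
    """Track a rolling mean of a metric and flag when it falls below a threshold."""

    def __init__(self, window: int, threshold: float):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically
        self.threshold = threshold

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def add(self, score: float) -> bool:
        """Record a score; return True if an alert should fire.

        No alert fires until the window is full, so a few early
        low scores do not trigger noise.
        """
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        return self.mean() < self.threshold


if __name__ == "__main__":
    monitor = RollingMetricMonitor(window=3, threshold=0.8)
    for score in [0.9, 0.9, 0.9, 0.5]:
        if monitor.add(score):
            print(f"ALERT: rolling faithfulness fell to {monitor.mean():.2f}")
```

A fixed threshold is the simplest anomaly rule; comparing the rolling mean to a longer-term baseline is a common refinement when traffic patterns shift over time.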

Controlling Evaluation Cost

Running a full eval can cost more than expected:

  • A single Ragas Faithfulness score needs 2-3 LLM calls
  • 500-case eval set × 3 metrics × up to 3 LLM calls each ≈ 4,500 LLM calls
  • Using GPT-4 to evaluate: $20-50 per run
  • Using local models: slow but cheap
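The arithmetic above is easy to turn into a budget check before launching a run. In this sketch the tokens-per-call figure and the per-token price are hypothetical placeholders, not real pricing for any model; substitute your own measurements.

```python
def estimate_llm_calls(num_cases: int, num_metrics: int, calls_per_metric: int) -> int:
    """Total judge-LLM calls for one full evaluation run."""
    return num_cases * num_metrics * calls_per_metric


def estimate_cost_usd(total_calls: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Rough cost of a run, assuming every call uses about the same number of tokens."""
    return total_calls * tokens_per_call / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    calls = estimate_llm_calls(num_cases=500, num_metrics=3, calls_per_metric=3)
    print(calls)  # 500 * 3 * 3 = 4500

    # Placeholder assumptions: 1,000 tokens per call at $0.01 per 1k tokens
    print(estimate_cost_usd(calls, tokens_per_call=1000, usd_per_1k_tokens=0.01))
```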

Cost-reduction strategies:

  1. Use small models for most metrics (qwen2.5-7b, llama3.1-8b)
  2. Stratified sampling on the eval set (no need to run all 500 every time)
  3. Incremental eval (only run new cases)
  4. Metric caching (don't re-evaluate the same query)
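Strategies 3 and 4 both come down to skipping work that has already been done. Below is a minimal in-memory cache sketch; the class and the `scorer` callback signature are assumptions made for illustration, and a real setup would likely persist the store to disk or a database.

```python
import hashlib
import json
from typing import Callable, Dict, Sequence


class MetricCache:
    """Cache metric scores keyed by metric name and the exact evaluated content."""

    def __init__(self) -> None:
        self._store: Dict[str, float] = {}

    @staticmethod
    def _key(metric: str, question: str, answer: str, contexts: Sequence[str]) -> str:
        # The key covers answer and contexts too, so a changed answer
        # to the same question is scored again instead of reusing a stale score.
        payload = json.dumps([metric, question, answer, list(contexts)], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        metric: str,
        question: str,
        answer: str,
        contexts: Sequence[str],
        scorer: Callable[[str, str, Sequence[str]], float],
    ) -> float:
        """Return the cached score, calling `scorer` only on a cache miss."""
        key = self._key(metric, question, answer, contexts)
        if key not in self._store:
            self._store[key] = scorer(question, answer, contexts)
        return self._store[key]


if __name__ == "__main__":
    calls = []

    def fake_scorer(question, answer, contexts):
        # Stand-in for an expensive judge-LLM call
        calls.append(question)
        return 0.9

    cache = MetricCache()
    cache.get_or_compute("faithfulness", "Q1", "A1", ["C1"], fake_scorer)
    cache.get_or_compute("faithfulness", "Q1", "A1", ["C1"], fake_scorer)
    print(len(calls))  # the scorer ran once; the second lookup hit the cache
```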

"Traps" in Evaluation Metrics

Several common misreadings:

  • High Faithfulness ≠ correct answer: might be faithful to the wrong context
  • High Answer Relevance ≠ complete answer: might be on-topic but shallow
  • High Context Precision ≠ good retrieval: could be because the candidate set is large
  • Metric improvement ≠ user-experience improvement: metrics are proxies; user feedback is the ultimate judge

Look at multiple metrics together; focus on "correlation between metrics and human ratings"; don't optimize for a single metric.

Combining with Human Evaluation

No automated evaluation fully replaces humans; judgments at critical decision points must be made by people:

  • Sample 50 cases monthly for human rating
  • Compute correlation between auto and human ratings
  • If correlation < 0.6, revisit the metric definition
  • Cases from human eval enter the prompt-improvement backlog
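The correlation check above can be sketched as follows. The text does not say which correlation coefficient to use; this example assumes Pearson correlation, and a rank-based measure such as Spearman may suit ordinal human ratings better. The function names are illustrative.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def metric_needs_review(auto: Sequence[float], human: Sequence[float], min_corr: float = 0.6) -> bool:
    """True if the automated metric tracks human ratings too weakly."""
    return pearson(auto, human) < min_corr


if __name__ == "__main__":
    auto_scores = [0.9, 0.4, 0.7, 0.2, 0.8]  # metric scores on the sampled cases
    human_scores = [5, 2, 4, 1, 4]           # human ratings on a 1-5 scale
    print(round(pearson(auto_scores, human_scores), 2))
    print(metric_needs_review(auto_scores, human_scores))
```

With only 50 cases per month the estimate is noisy, so it helps to look at the trend over several months before redefining a metric.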

Tools are means, not ends. Ragas/DeepEval help you quantify; humans help define "good".

Selection Decision Table

Your scenario | Recommended tool | Reason
Pure RAG evaluation | Ragas | Deeper RAG-specific metrics
Overall LLM application evaluation | DeepEval | Broader coverage
Team already writes Python tests | DeepEval | Pytest-style
Multilingual / Chinese scenarios | DeepEval | More stable for Chinese
Need to customize metrics quickly | DeepEval | More flexible API

Don't get stuck on "which tool is better". Get one running first, establish the measurement practice, then optimize the metric system.