RAG System Evaluation in Practice: Building High-Quality RAG Apps with Ragas and DeepEval

Building high-quality RAG applications requires systematic evaluation. This article shows how to evaluate a RAG system with Ragas and DeepEval.

Why Evaluate RAG?

RAG system quality depends on multiple factors:

Relevance of retrieved documents
Accuracy of generated answers
Faithfulness to context
Completeness and usefulness of responses

Key Evaluation Metrics

1. Context Precision

Measures how much of the retrieved context is actually relevant to the question.

2. Faithfulness

Measures whether the claims in the generated answer are supported by the retrieved context.

3. Answer Relevance

Measures how relevant the answer is to the question.

4. Context Recall

Measures how much of the information needed to answer the question was actually retrieved.
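
To build intuition for these metrics, here is a deliberately simplified sketch. It assumes relevance and support labels are already available as booleans or sets, which is not how Ragas or DeepEval work internally: both tools derive such judgments with an LLM, and their exact formulas may differ. All function names are illustrative.

```python
def context_precision(retrieved_is_relevant: list[bool]) -> float:
    """Share of retrieved chunks that are relevant to the question."""
    if not retrieved_is_relevant:
        return 0.0
    return sum(retrieved_is_relevant) / len(retrieved_is_relevant)


def context_recall(needed_facts: set[str], retrieved_facts: set[str]) -> float:
    """Share of the facts needed for the answer that were retrieved."""
    if not needed_facts:
        return 1.0
    return len(needed_facts & retrieved_facts) / len(needed_facts)


def faithfulness(claim_is_supported: list[bool]) -> float:
    """Share of claims in the answer that the context supports."""
    if not claim_is_supported:
        return 0.0
    return sum(claim_is_supported) / len(claim_is_supported)


# Toy labels (hypothetical): 3 of 4 retrieved chunks are relevant,
# 1 of 2 needed facts was retrieved, 2 of 2 answer claims are supported.
precision = context_precision([True, False, True, True])
recall = context_recall({"fact_a", "fact_b"}, {"fact_a", "fact_c"})
faith = faithfulness([True, True])
```

In this toy case the answer is fully faithful even though retrieval missed half of the needed facts, which is why the metrics are read together rather than individually.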

Evaluating with Ragas

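from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Prepare evaluation data. Ragas expects a Hugging Face Dataset rather than a plain dict.
# Column names follow the Ragas 0.1.x schema; other releases may use different names.
dataset = Dataset.from_dict({
    "question": ["Question 1", "Question 2"],
    "answer": ["Answer 1", "Answer 2"],
    "contexts": [["Context 1"], ["Context 2"]],
    "ground_truth": ["Ground Truth 1", "Ground Truth 2"]
})

# Run evaluation (the metrics call an LLM, so a model backend such as an OpenAI API key must be configured)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)
print(results)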

Evaluating with DeepEval

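from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Faithfulness is judged by an LLM, so a model backend must be configured
metric = FaithfulnessMetric()

# One test case = the question, the system's answer, and the context it retrieved
test_case = LLMTestCase(
    input="Question",
    actual_output="Actual Answer",
    retrieval_context=["Context"]
)

evaluate([test_case], [metric])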

Best Practices for Evaluation Process

Establish Baseline: Use a standard dataset to establish an evaluation baseline
Continuous Monitoring: Run evaluations regularly and track performance changes
Iterative Optimization: Adjust system parameters based on evaluation results
A/B Testing: Compare different configurations on the same evaluation set
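
As a minimal sketch of the baseline and comparison steps, the snippet below checks a candidate configuration against stored baseline scores. The function name, the score dictionaries, and the 0.02 tolerance are all assumptions made for illustration; real scores would come from a Ragas or DeepEval run.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose candidate score fell more than `tolerance` below baseline."""
    regressions = {}
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            continue  # metric not measured for the candidate, nothing to compare
        drop = base_score - new_score
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical scores for two configurations evaluated on the same set
baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
candidate = {"faithfulness": 0.84, "answer_relevancy": 0.86}

regressions = find_regressions(baseline, candidate)
print(regressions)  # {'faithfulness': 0.06}
```

A check like this could run after each evaluation so that a drop in one metric is not hidden by a gain in another.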

Common Issues and Optimization

Issue: Low Faithfulness

Optimize prompt design
Apply hallucination-reduction strategies
Constrain the answer to the provided context

Issue: Low Relevance

Improve retrieval strategy
Adjust embedding model
Optimize query rewriting

Summary

Systematic evaluation is key to building high-quality RAG applications. With Ragas and DeepEval, we can quantify evaluation results and continuously optimize system performance.

Metric Selection: Don't Run Everything

Ragas and DeepEval offer 20+ metrics, but running all of them is slow and expensive. A suggested priority order:

Run these first (they explain roughly 80% of quality issues):

Faithfulness: whether answers stay grounded in context (hallucination detection)
Answer Relevance: whether answers address the question
Context Precision: whether retrieved top-K documents are relevant

Add in the second optimization round:

Context Recall: whether key documents are missed
Answer Correctness: right vs wrong relative to ground truth
Answer Similarity: semantic similarity to reference answer

Add when diagnosing specific issues:

Hallucination metric: isolate hallucinations
Toxicity metric: content safety
Bias metric: fairness

Building the Eval Set: Start from Online Logs

The most useful evaluation sets come from production logs. A workable process:

Collect the last 30 days of online queries and answers
Human-label 200-500 of them as the baseline
Stratify by business type (FAQ, technical Q&A, policy questions, etc.)
Add 20-30 new cases every month so the set doesn't go stale

Don't pursue a "perfect 5000-case evaluation set": the business changes roughly every 3 months and labels age quickly. A small set that is continuously updated beats a large one-shot set.
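
The stratification step can be sketched as follows. This is only an illustration under assumptions: the log records are plain dictionaries, the `category` field holding the business type is hypothetical, and a fixed seed is used so the sample is reproducible.

```python
import random
from collections import defaultdict


def stratified_sample(logs: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_category` records from each business category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for record in logs:
        by_category[record["category"]].append(record)

    sample = []
    for category in sorted(by_category):
        records = by_category[category]
        k = min(per_category, len(records))  # small categories contribute all they have
        sample.extend(rng.sample(records, k))
    return sample


# Hypothetical log records: 10 FAQ queries and 3 policy queries
logs = [{"query": f"faq {i}", "category": "faq"} for i in range(10)]
logs += [{"query": f"policy {i}", "category": "policy"} for i in range(3)]

eval_set = stratified_sample(logs, per_category=5)
print(len(eval_set))  # 8 (5 FAQ + 3 policy)
```

Sampling per category keeps rare but important query types in the set, where a uniform random sample would mostly pick the dominant category.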

Real Differences Between Ragas and DeepEval

Differences in positioning:

Dimension	Ragas	DeepEval
Metric system	Rich RAG-specific metrics	General LLM + RAG metrics
Custom metrics	Supported	Strong (Pytest-style)
CI/CD integration	Average	Strong
Chinese support	Average	Better
Performance	Medium	Fast
Learning curve	Medium	Low

Recommendations:

Heavy RAG-metric exploration → Ragas
Engineering focus (CI/CD, unit-test style) → DeepEval
Both tools can coexist in one team, each covering different scenarios, but don't measure the same metric with both

Combining Offline Eval and Online Monitoring

Evaluation can't be offline-only; it must be connected to online traffic:

Offline eval: run nightly on samples of historical queries and produce trend reports
Online A/B testing: when shipping a new version, route 5% of traffic to it and watch how the metrics change
Online real-time monitoring: track rolling averages of Faithfulness and Answer Relevance; alert on anomalies

The problem with purely offline eval is that the evaluation set drifts away from the online distribution: you optimize for past problems while live traffic has new ones.
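
A rolling-average monitor for online scores might look like the sketch below. The class name, window size, and threshold are assumptions chosen for illustration; in practice the scores would come from sampled production traffic scored by a metric such as Faithfulness.

```python
from collections import deque


class RollingMetricMonitor:
    """Track the rolling average of a metric and flag drops below a threshold."""

    def __init__(self, window: int, threshold: float):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def add(self, score: float) -> bool:
        """Record a score; return True when the full-window average is below the threshold."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        average = sum(self.scores) / len(self.scores)
        return average < self.threshold


# Hypothetical stream of Faithfulness scores
monitor = RollingMetricMonitor(window=3, threshold=0.8)
alerts = [monitor.add(score) for score in [0.9, 0.9, 0.9, 0.5]]
print(alerts)  # [False, False, False, True]
```

Waiting for a full window before alerting avoids firing on the first few noisy scores after a deploy.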

Controlling Evaluation Cost

Running a full eval can cost more than expected:

A single Ragas Faithfulness evaluation needs 2-3 LLM calls
500-case eval set × 3 metrics × up to 3 LLM calls each = up to 4500 LLM calls
Using GPT-4 as the evaluator: $20-50 per run
Using local models: slow but cheap
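
The arithmetic behind these figures can be written out as a small estimate. The sketch assumes the upper bound of 3 LLM calls per metric for every metric, and derives an implied per-call price from the $20-50 range quoted above rather than from any published pricing.

```python
def estimate_llm_calls(cases: int, metrics: int, calls_per_metric: int) -> int:
    """Total LLM calls for one full evaluation run."""
    return cases * metrics * calls_per_metric


total_calls = estimate_llm_calls(cases=500, metrics=3, calls_per_metric=3)
print(total_calls)  # 4500

# Implied cost per call if a full run costs between $20 and $50
low_per_call = 20 / total_calls
high_per_call = 50 / total_calls

# Running only a 100-case sample cuts the call count proportionally
sampled_calls = estimate_llm_calls(cases=100, metrics=3, calls_per_metric=3)
print(sampled_calls)  # 900
```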

Cost-reduction strategies:

Use small models for most metrics (qwen2.5-7b, llama3.1-8b)
Stratified sampling on the eval set (no need to run all 500 every time)
Incremental eval (only run new cases)
Metric caching (don't re-evaluate the same query)
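
Metric caching can be sketched with a content hash as the cache key. The `MetricCache` class and `fake_scorer` function are hypothetical stand-ins; a real scorer would call Ragas or DeepEval, and the cache would usually live on disk or in a database rather than in memory.

```python
import hashlib
import json


class MetricCache:
    """Cache metric scores keyed by the exact (metric, question, answer, contexts) input."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(metric: str, question: str, answer: str, contexts: list[str]) -> str:
        payload = json.dumps([metric, question, answer, contexts], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, metric, question, answer, contexts, score_fn) -> float:
        key = self._key(metric, question, answer, contexts)
        if key not in self._store:
            # Only pay for the LLM-backed metric when this input has not been seen
            self._store[key] = score_fn(question, answer, contexts)
        return self._store[key]


calls = {"count": 0}


def fake_scorer(question, answer, contexts):
    """Stand-in for an expensive LLM-based metric."""
    calls["count"] += 1
    return 0.9


cache = MetricCache()
first = cache.get_or_compute("faithfulness", "Q1", "A1", ["C1"], fake_scorer)
second = cache.get_or_compute("faithfulness", "Q1", "A1", ["C1"], fake_scorer)
print(calls["count"])  # 1
```

Including the answer and contexts in the key means a changed prompt or retriever produces a new key, so stale scores are not reused for new system output.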

"Traps" in Evaluation Metrics

Several common misreadings:

High Faithfulness ≠ correct answer: the answer might be faithful to the wrong context
High Answer Relevance ≠ complete answer: the answer might be on-topic but shallow
High Context Precision ≠ good retrieval: the score might simply reflect a large candidate set
Metric improvement ≠ user-experience improvement: metrics are proxies; user feedback is the ultimate judge

Look at multiple metrics together, focus on how well each metric correlates with human ratings, and don't optimize for a single metric.

Combining with Human Evaluation

No automatic evaluation fully replaces humans; judgment at critical decision points must stay with people:

Sample 50 cases monthly for human rating
Compute the correlation between automatic and human ratings
If the correlation is below 0.6, revisit the metric definition
Feed problem cases found in human evaluation into the prompt-improvement backlog

Tools are means, not ends. Ragas/DeepEval help you quantify; humans help define "good".

Selection Decision Table

Your scenario	Recommended tool	Reason
Pure RAG evaluation	Ragas	Deeper RAG-specific metrics
Overall LLM application evaluation	DeepEval	Broader coverage
Team already writes Python tests	DeepEval	Pytest-style
Multilingual / Chinese scenarios	DeepEval	More stable for Chinese
Need to customize metrics quickly	DeepEval	More flexible API

Don't get stuck on "which tool is better". Get one running first, establish the measurement practice, then optimize the metric system.