RAG System Evaluation in Practice: Building High-Quality RAG Apps with Ragas and DeepEval
Learn how to evaluate RAG systems using Ragas and DeepEval, including measuring key metrics like faithfulness, answer relevance, and context precision.
Building high-quality RAG applications requires systematic evaluation methods. This article introduces how to use Ragas and DeepEval for RAG system evaluation.
Why Evaluate RAG?
RAG system quality depends on multiple factors:
- Relevance of retrieved documents
- Accuracy of generated answers
- Faithfulness to context
- Completeness and usefulness of responses
Key Evaluation Metrics
1. Context Precision
Measures whether the retrieved context is relevant to the question, and whether the most relevant chunks are ranked first.
2. Faithfulness
Measures whether the claims in the generated answer are supported by the retrieved context.
3. Answer Relevance
Measures how relevant the answer is to the question.
4. Context Recall
Measures how much of the information needed to answer the question (judged against the ground truth) was actually retrieved.
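To make the two retrieval metrics (Context Precision and Context Recall) concrete, here is a minimal sketch that computes them from hand-assigned boolean labels. Ragas and DeepEval derive such labels with an LLM judge and their exact formulas may differ; the function names and the label-based inputs are assumptions for illustration only.

```python
from typing import List


def context_precision(relevance_by_rank: List[bool]) -> float:
    """Rank-aware precision: average of precision@k over ranks holding a relevant chunk.

    relevance_by_rank[i] says whether the chunk retrieved at rank i+1 is relevant.
    Relevant chunks ranked near the top raise the score.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance_by_rank, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision@rank, counted only at relevant ranks
    return total / hits if hits else 0.0


def context_recall(statement_supported: List[bool]) -> float:
    """Fraction of ground-truth statements that the retrieved context supports."""
    if not statement_supported:
        return 0.0
    return sum(statement_supported) / len(statement_supported)


# Same two relevant chunks, different ranking: the better ranking scores higher.
good_ranking = context_precision([True, True, False])
poor_ranking = context_precision([False, True, True])
```

The ranking comparison shows why precision alone says nothing about missing documents: recall needs ground-truth statements to check against.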
Evaluating with Ragas
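from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Prepare evaluation data. evaluate() expects a Hugging Face Dataset, not a plain dict.
# Column names follow the Ragas 0.1-style schema; newer releases may use different names.
dataset = Dataset.from_dict({
    "question": ["Question 1", "Question 2"],
    "answer": ["Answer 1", "Answer 2"],
    "contexts": [["Context 1"], ["Context 2"]],
    "ground_truth": ["Ground Truth 1", "Ground Truth 2"],
})

# Run evaluation (needs a configured LLM and embeddings backend, e.g. an OpenAI API key)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)
print(results)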
Evaluating with DeepEval
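from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Faithfulness is scored by an LLM judge, so a model backend must be configured
metric = FaithfulnessMetric()

# One test case: the question, the RAG system's answer, and the retrieved chunks
test_case = LLMTestCase(
    input="Question",
    actual_output="Actual Answer",
    retrieval_context=["Context"],
)
evaluate(test_cases=[test_case], metrics=[metric])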
Best Practices for Evaluation Process
- Establish Baseline: Use a standard dataset to establish an evaluation baseline
- Continuous Monitoring: Run evaluations regularly and track performance changes
- Iterative Optimization: Adjust system parameters based on evaluation results
- A/B Testing: Compare different configurations
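A baseline is only useful if new runs are checked against it. The following is a minimal sketch of such a check, assuming metric scores are stored as plain name-to-score dictionaries; the function name, the tolerance value, and the example scores are all hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the metrics whose current score fell more than `tolerance` below baseline.

    A metric missing from `current` is treated as a score of 0.0, so it is flagged.
    """
    return sorted(
        name
        for name, base_score in baseline.items()
        if current.get(name, 0.0) < base_score - tolerance
    )


# Hypothetical scores for illustration
baseline_scores = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_precision": 0.80}
candidate_scores = {"faithfulness": 0.91, "answer_relevancy": 0.80, "context_precision": 0.79}
regressed = find_regressions(baseline_scores, candidate_scores)
```

The same comparison works for A/B testing: treat configuration A as the baseline and configuration B as the candidate.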
Common Issues and Optimization
Issue: Low Faithfulness
- Optimize prompt design
- Apply hallucination-reduction strategies
- Add context constraints
Issue: Low Relevance
- Improve retrieval strategy
- Adjust embedding model
- Optimize query rewriting
Summary
Systematic evaluation is key to building high-quality RAG applications. With Ragas and DeepEval, we can quantify evaluation results and continuously optimize system performance.
Metric Selection: Don't Run Everything
Ragas and DeepEval together offer 20+ metrics, but running all of them is slow and expensive. Prioritize as follows:
Run first (as a rule of thumb, these explain about 80% of quality issues):
- Faithfulness: whether answers stay grounded in context (hallucination detection)
- Answer Relevance: whether answers address the question
- Context Precision: whether retrieved top-K documents are relevant
Add in second optimization round:
- Context Recall: whether key documents are missed
- Answer Correctness: right vs wrong relative to ground truth
- Answer Similarity: semantic similarity to reference answer
Add for specific issue diagnosis:
- Hallucination metric: isolate hallucinations
- Toxicity metric: content safety
- Bias metric: fairness
Building the Eval Set: Start from Online Logs
Most useful evaluation set sources:
- Collect last 30 days of online queries and answers
- Human-label 200-500 as baseline
- Stratify by business type (FAQ, technical Q&A, policy questions, etc.)
- Add 20-30 new cases monthly so the set doesn't go stale
Don't chase a "perfect 5000-case evaluation set": the business changes every 3 months or so, and labels age quickly. A small set that is continuously updated beats a large set built once.
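The stratification step can be sketched as below: group logged queries by business type, then draw up to a fixed number from each group for human labeling. The record layout (a `business_type` field on each log entry) and the function name are assumptions, not a schema from any particular logging system.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    logs: List[dict],
    per_stratum: int,
    key: str = "business_type",
    seed: int = 0,
) -> List[dict]:
    """Draw up to `per_stratum` records from each stratum, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the labeling batch reproducible
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for record in logs:
        buckets[record[key]].append(record)

    sample: List[dict] = []
    for stratum in sorted(buckets):  # sorted for a deterministic order
        records = buckets[stratum]
        sample.extend(rng.sample(records, min(per_stratum, len(records))))
    return sample
```

Sampling per stratum keeps rare but important categories (for example policy questions) from being drowned out by high-volume FAQ traffic.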
Real Differences Between Ragas and DeepEval
Differences in positioning:
| Dimension | Ragas | DeepEval |
|---|---|---|
| Metric system | Rich RAG-specific metrics | General LLM + RAG metrics |
| Custom metrics | Supported | Strong (Pytest-style) |
| CI/CD integration | Average | Strong |
| Chinese support | Average | Better |
| Performance | Medium | Fast |
| Learning curve | Medium | Low |
Recommendations:
- Heavy RAG-metric exploration → Ragas
- Engineering focus (CI/CD, unit-test style) → DeepEval
- A team can use both tools for different scenarios, but shouldn't measure the same metric in both
Combining Offline Eval and Online Monitoring
Evaluation can't be offline-only; it must also connect to online traffic:
- Offline eval: nightly run on historical query samples, produce trend reports
- Online A/B testing: when shipping a new version, route 5% of traffic to it and compare its metrics against the current version
- Online real-time monitoring: monitor rolling averages of Faithfulness and Answer Relevance; alert on anomaly
The problem with purely offline eval is that the evaluation set drifts away from the online distribution: you keep optimizing for past problems while live traffic has new ones.
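The real-time monitoring idea can be sketched as a rolling-average check over per-request metric scores. This is a minimal illustration, assuming scores arrive one at a time from an online evaluator; the class name, window size, and threshold are hypothetical, and a simple threshold stands in for whatever anomaly rule a production system would use.

```python
from collections import deque


class RollingMetricMonitor:
    """Tracks the rolling mean of a metric and flags it when it drops below a threshold."""

    def __init__(self, window: int, threshold: float) -> None:
        self.scores = deque(maxlen=window)  # oldest score is dropped automatically
        self.threshold = threshold

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def add(self, score: float) -> bool:
        """Record a score; return True when a full window's mean is below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.threshold


# Hypothetical usage: alert when the mean faithfulness of the last 3 requests is under 0.8
monitor = RollingMetricMonitor(window=3, threshold=0.8)
alerts = [monitor.add(score) for score in [0.9, 0.9, 0.9, 0.5]]
```

Waiting for a full window before alerting avoids false alarms from the first few requests after a restart.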
Controlling Evaluation Cost
Running a full eval can cost more than expected:
- One Ragas Faithfulness evaluation needs 2-3 LLM calls
- 500-case eval set × 3 metrics × up to 3 LLM calls each = up to 4500 LLM calls
- Using GPT-4 to evaluate: $20-50 per run
- Using local models: slow but cheap
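The arithmetic above is worth keeping as a small helper so the budget can be re-estimated whenever the eval set or metric list changes. This sketch assumes every metric needs the same number of LLM calls per case, which is a simplification; the function names are hypothetical.

```python
def estimate_llm_calls(num_cases: int, num_metrics: int, calls_per_metric: int) -> int:
    """Total judge calls for one eval run, assuming a uniform call count per metric."""
    return num_cases * num_metrics * calls_per_metric


def implied_cost_per_call(run_cost_usd: float, total_calls: int) -> float:
    """Average cost of one judge call, derived from the cost of a whole run."""
    return run_cost_usd / total_calls


full_run = estimate_llm_calls(500, 3, 3)     # 4500 calls
sampled_run = estimate_llm_calls(100, 3, 3)  # 900 calls when only 100 cases are sampled
```

At $20-50 per 4500-call run, each judge call costs roughly half a cent to one cent, which is why sampling and caching pay off quickly.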
Cost-reduction strategies:
- Use small models for most metrics (qwen2.5-7b, llama3.1-8b)
- Stratified sampling on the eval set (no need to run all 500 every time)
- Incremental eval (only run new cases)
- Metric caching (don't re-evaluate the same query)
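Metric caching and incremental eval can share one mechanism: key each score by the exact inputs it was computed from, and call the judge only on a cache miss. The sketch below keeps the cache in memory and takes the scoring function as a parameter; the class name and the `scorer` signature are assumptions, and a real setup would likely persist the cache to disk or a database.

```python
import hashlib
import json
from typing import Callable, Dict, List


class MetricCache:
    """Caches metric scores keyed by (metric name, question, answer, contexts)."""

    def __init__(self, scorer: Callable[[str, str, List[str]], float]) -> None:
        self._scorer = scorer  # hypothetical judge call: (question, answer, contexts) -> score
        self._scores: Dict[str, float] = {}
        self.misses = 0

    @staticmethod
    def _key(metric: str, question: str, answer: str, contexts: List[str]) -> str:
        payload = json.dumps([metric, question, answer, contexts], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, metric: str, question: str, answer: str, contexts: List[str]) -> float:
        key = self._key(metric, question, answer, contexts)
        if key not in self._scores:
            self.misses += 1  # only cache misses cost an LLM call
            self._scores[key] = self._scorer(question, answer, contexts)
        return self._scores[key]
```

Because the answer and contexts are part of the key, a case is re-scored automatically when the RAG system produces a different answer for the same query.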
"Traps" in Evaluation Metrics
Several common misreadings:
- High Faithfulness ≠ correct answer: might be faithful to the wrong context
- High Answer Relevance ≠ complete answer: might be on-topic but shallow
- High Context Precision ≠ good retrieval: could be because the candidate set is large
- Metric improvement ≠ user-experience improvement: metrics are proxies; user feedback is the ultimate judge
Look at multiple metrics together; focus on "correlation between metrics and human ratings"; don't optimize for a single metric.
Combining with Human Evaluation
No automated evaluation fully replaces humans; judgment at critical decision points must remain human:
- Sample 50 cases monthly for human rating
- Compute correlation between auto and human ratings
- If correlation < 0.6, revisit the metric definition
- Cases from human eval enter the prompt-improvement backlog
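The correlation check can be sketched with a plain Pearson coefficient over paired scores. The text does not say which correlation to use, so Pearson is an assumption here (a rank correlation such as Spearman may suit ordinal human ratings better); the function names and sample scores are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)


def metric_needs_review(
    auto_scores: Sequence[float],
    human_scores: Sequence[float],
    min_correlation: float = 0.6,
) -> bool:
    """True when the automatic metric tracks human ratings too weakly to trust."""
    return pearson(auto_scores, human_scores) < min_correlation


# Hypothetical monthly sample: automatic faithfulness scores vs. 1-5 human ratings
auto = [0.9, 0.8, 0.4, 0.3]
human = [5, 4, 2, 1]
needs_review = metric_needs_review(auto, human)
```

With only 50 cases a month, the coefficient is noisy, so a single reading just below 0.6 is a prompt to look closer rather than proof that the metric is broken.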
Tools are means, not ends. Ragas/DeepEval help you quantify; humans help define "good".
Selection Decision Table
| Your scenario | Recommended tool | Reason |
|---|---|---|
| Pure RAG evaluation | Ragas | Specialized RAG metrics |
| Overall LLM application evaluation | DeepEval | Broader coverage |
| Team already writes Python tests | DeepEval | Pytest-style |
| Multilingual / Chinese scenarios | DeepEval | More stable for Chinese |
| Need to customize metrics quickly | DeepEval | More flexible API |
Don't get stuck on "which tool is better". Get one running first, establish the measurement practice, then optimize the metric system.
Projects in this article
Ragas (14.6k ⭐)
Ragas is a framework for evaluating RAG (Retrieval-Augmented Generation) systems. It provides evaluation metrics including faithfulness, answer relevance, and context precision, helping developers optimize RAG application performance.
DeepEval (16.6k ⭐)
DeepEval is an open-source evaluation framework for LLM applications. It provides rich evaluation metrics and tools, supporting unit testing and integration testing to help developers build reliable LLM applications.
TruLens (3.4k ⭐)
TruLens is an open-source tool for evaluating and tracking LLM apps. It provides specialized evaluation for RAG applications, including context relevance, groundedness, and answer relevance.