Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines
Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
Why "It Looks Right" Is Not an Evaluation Strategy
Agent systems share one property that causes every team pain: nondeterminism. Run the same prompt twice, and you may get completely different execution paths. A coding agent that successfully fixes a bug on one run might delete the test file on the next.
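A quick way to make that nondeterminism visible is to run the same prompt N times and track not just the pass rate but the chance that several sampled runs all pass. A minimal sketch (the simple pass^k estimate below assumes runs are independent; the helper name and k=3 are illustrative):

```python
def consistency_metrics(outcomes: list[bool]) -> dict:
    """Summarize repeated runs of the same task.

    `outcomes` holds pass/fail results from N independent runs of one
    prompt against whatever agent function you are testing.
    """
    n = len(outcomes)
    pass_rate = sum(outcomes) / n
    # Probability that k independently sampled runs ALL pass ("pass^k"),
    # a stricter reliability measure than the plain pass rate.
    k = 3
    return {
        "runs": n,
        "pass_rate": pass_rate,
        "pass^3": round(pass_rate ** k, 3),
    }

# Example: 8 runs of the same prompt, 6 succeeded
print(consistency_metrics([True] * 6 + [False] * 2))
# -> {'runs': 8, 'pass_rate': 0.75, 'pass^3': 0.422}
```

A 75% pass rate sounds acceptable until you notice that three consecutive successes happen less than half the time.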
Most teams evaluate agents the same way in the early days: run a few cases manually, eyeball the output, and decide if it "seems correct." This works during the demo phase but breaks down fast once you enter a regular iteration cycle. Three problems emerge:
Silent regressions go unnoticed. You tweak the planning prompt. Tool selection accuracy drops from 92% to 78%. But two test cases happen to still pass, so you ship it. By the time users report problems, you have pushed three more versions.
Improvement has no direction. "The agent got worse" is a useless statement. Worse at what? Wrong tool selection? Too many planning steps? Incorrect output format? Without layered metrics, improvement is guesswork.
Trade-off decisions are blind. Switch to a cheaper but weaker model -- how much does success rate drop? Add a tool-selection validation step -- how much latency does it add? Without quantitative data, these decisions are pure intuition.
The Three-Layer Evaluation Framework
A complete agent evaluation system covers three layers, each answering a different question:
Layer 1: Component-Level Evaluation
Question: Is each sub-module working correctly on its own?
Components to test and their corresponding metrics:
| Component | Key Metrics | Test Method |
|---|---|---|
| Tool Selection | Accuracy, F1 | Given a task description, verify the correct tool and parameters are chosen |
| Planning | Step relevance, redundancy | Given a goal, assess whether the plan includes necessary steps without extras |
| Output Formatting | Compliance rate | Verify output strictly conforms to JSON Schema / function signatures |
| Retrieval | Recall, Precision | Standard IR evaluation, applicable to RAG-based agents |
The core value of component-level evaluation is problem localization. When end-to-end tests fail, component metrics tell you which layer broke.
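The first row of the table, tool selection, can be scored without any LLM in the loop. A minimal sketch, assuming one gold tool per case (the function name and the example data are illustrative):

```python
def tool_selection_metrics(expected: list[str], predicted: list[str]) -> dict:
    """Per-case tool-selection accuracy plus macro-averaged F1.

    `expected[i]` is the gold tool for case i; `predicted[i]` is the tool
    the agent actually picked (one tool per case for simplicity).
    """
    assert len(expected) == len(predicted)
    accuracy = sum(e == p for e, p in zip(expected, predicted)) / len(expected)
    # Per-tool F1, macro-averaged over the tools present in the gold labels
    f1s = []
    for tool in set(expected):
        tp = sum(e == tool and p == tool for e, p in zip(expected, predicted))
        fp = sum(e != tool and p == tool for e, p in zip(expected, predicted))
        fn = sum(e == tool and p != tool for e, p in zip(expected, predicted))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return {
        "accuracy": round(accuracy, 3),
        "macro_f1": round(sum(f1s) / len(f1s), 3),
    }

expected = ["search", "search", "calc", "calc", "browse"]
predicted = ["search", "calc", "calc", "calc", "browse"]
print(tool_selection_metrics(expected, predicted))
# -> {'accuracy': 0.8, 'macro_f1': 0.822}
```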
Layer 2: Interaction-Level (Trajectory) Evaluation
Question: Is the agent's execution path reasonable?
Interaction-level evaluation focuses on the trajectory -- the complete path the agent takes from start to finish. Key metrics:
- Trajectory correctness: Did the agent follow the optimal path, or did it take unnecessary detours?
- Step efficiency: How many steps taken vs. minimum required
- Error recovery rate: When the agent makes a mistake, can it self-correct and return to the right path?
- Tool call efficiency: Were unnecessary tools invoked? Were parameters accurate?
This layer is the hardest to automate because it requires defining "what a correct trajectory looks like." In practice, LLM-as-judge comparing actual trajectories against reference trajectories is the standard approach.
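Before reaching for an LLM judge, a deterministic baseline can catch gross trajectory problems cheaply. The sketch below (illustrative, and it assumes trajectories reduce to sequences of tool names) scores in-order overlap with a reference path via longest common subsequence:

```python
def trajectory_match(reference: list[str], actual: list[str]) -> dict:
    """Compare an agent's tool-call sequence against a reference trajectory.

    Longest common subsequence (LCS) captures how much of the reference
    path was followed in order; step efficiency penalizes detours.
    """
    m, n = len(reference), len(actual)
    # Classic LCS dynamic program
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if reference[i] == actual[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    return {
        # Fraction of reference steps hit, in order
        "path_recall": round(lcs / m, 2),
        # < 1.0 means the agent took extra steps (detours)
        "step_efficiency": round(m / n, 2) if n else 0.0,
    }

ref = ["search_code", "read_file", "edit_file", "run_tests"]
act = ["search_code", "read_file", "search_code", "edit_file", "run_tests"]
print(trajectory_match(ref, act))
# -> {'path_recall': 1.0, 'step_efficiency': 0.8}
```

A trajectory that scores well here can still be wrong in subtler ways, which is where the LLM judge earns its keep.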
Layer 3: Outcome-Level Evaluation
Question: Was the task actually completed?
This is the most important layer -- the one that ultimately matters:
- Task completion rate: Was the task genuinely completed (not "produced text that looks like an answer")
- User satisfaction: Human evaluation or implicit feedback (did the user follow up? did they accept the result?)
- Cost per task: Tokens consumed / API calls / latency to complete one task
- Cost-quality Pareto: Which configuration achieves the best quality within a given budget?
AWS's Agent Evaluation framework starts from the outcome level, using predefined task sets and judging criteria to measure agent performance across different scenarios.
Production Pattern 1: Building a Gold-Standard Eval Dataset
The first step in evaluation is not picking metrics -- it is having a reliable dataset. Here is a complete workflow for building one, including an LLM-as-judge implementation.
```python
import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class EvalCase:
    task_id: str
    task_description: str
    expected_tools: list[str]
    expected_steps: list[str]
    expected_outcome: str
    difficulty: str  # easy / medium / hard
    category: str  # e.g. "code_search", "data_analysis"


@dataclass
class AgentResult:
    task_id: str
    trajectory: list[dict]  # list of {action, tool, input, output}
    final_output: str
    total_tokens: int
    total_steps: int


JUDGE_PROMPT = """You are an expert agent evaluator. Score the agent's execution based on:
Task: {task_description}
Expected outcome: {expected_outcome}
Agent's final output: {actual_output}
Agent trajectory:
{trajectory}
Rate each dimension (1-5):
1. task_completion: Did the agent achieve the core objective?
2. process_quality: Was the execution efficient without redundant steps?
3. output_quality: Is the final output accurate, complete, and correctly formatted?
Return JSON:
{{
  "task_completion": <1-5>,
  "process_quality": <1-5>,
  "output_quality": <1-5>,
  "reasoning": "<brief explanation for any deductions>"
}}"""


def judge_result(case: EvalCase, result: AgentResult) -> dict:
    trajectory_str = json.dumps(result.trajectory, ensure_ascii=False, indent=2)
    prompt = JUDGE_PROMPT.format(
        task_description=case.task_description,
        expected_outcome=case.expected_outcome,
        actual_output=result.final_output,
        trajectory=trajectory_str,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    scores = json.loads(response.choices[0].message.content)
    # Weighted composite score
    weighted_score = (
        scores["task_completion"] * 0.5
        + scores["process_quality"] * 0.3
        + scores["output_quality"] * 0.2
    )
    scores["weighted_total"] = round(weighted_score, 2)
    return scores


def run_eval_dataset(
    cases: list[EvalCase],
    agent_fn,  # your agent function: (task_description) -> AgentResult
) -> list[dict]:
    results = []
    for case in cases:
        agent_result = agent_fn(case.task_description)
        judge_scores = judge_result(case, agent_result)
        results.append({
            "task_id": case.task_id,
            "difficulty": case.difficulty,
            "tokens_used": agent_result.total_tokens,
            "steps_taken": agent_result.total_steps,
            **judge_scores,
        })
    avg_score = sum(r["weighted_total"] for r in results) / len(results)
    pass_rate = sum(1 for r in results if r["task_completion"] >= 4) / len(results)
    print(f"Eval Summary: avg_score={avg_score:.2f}, pass_rate={pass_rate:.1%}")
    return results


if __name__ == "__main__":
    cases = [
        EvalCase(
            task_id="search_001",
            task_description="Find the function handling user authentication in the codebase and list its parameters and return type",
            expected_tools=["code_search"],
            expected_steps=["Search for authentication code", "Locate function definition", "Extract signature info"],
            expected_outcome="Structured info containing function name, parameter list, and return type",
            difficulty="easy",
            category="code_search",
        ),
    ]
    # results = run_eval_dataset(cases, my_agent_fn)
```
A few recommendations for the dataset itself:
- Stratified sampling: Distribute across difficulty levels and categories. Do not fill it with only easy cases.
- Include edge cases: Ambiguous task descriptions, unavailable tools, multi-step reasoning requirements.
- Version control: Bind the dataset to agent versions to avoid running new agents against stale expectations.
- Refresh regularly: Audit the dataset quarterly and retire outdated cases.
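The stratified-sampling recommendation is easy to check mechanically. A small sketch (it assumes cases carry the `difficulty` and `category` fields from `EvalCase` above; the minimum-per-bucket threshold is illustrative):

```python
from collections import Counter

def coverage_report(cases: list[dict], min_per_bucket: int = 3) -> list[str]:
    """Flag thin spots in the dataset's difficulty x category grid.

    Each case is a dict with "difficulty" and "category" keys; any
    bucket below `min_per_bucket` produces a warning line.
    """
    buckets = Counter((c["difficulty"], c["category"]) for c in cases)
    warnings = []
    for (difficulty, category), count in sorted(buckets.items()):
        if count < min_per_bucket:
            warnings.append(f"only {count} case(s) for {difficulty}/{category}")
    return warnings

cases = [
    {"difficulty": "easy", "category": "code_search"},
    {"difficulty": "easy", "category": "code_search"},
    {"difficulty": "easy", "category": "code_search"},
    {"difficulty": "hard", "category": "data_analysis"},
]
print(coverage_report(cases))
# -> ['only 1 case(s) for hard/data_analysis']
```

Running this on every dataset change keeps coverage gaps from accumulating silently.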
Production Pattern 2: Regression Testing Pipeline
Once you have a dataset, wire it into CI so every prompt or model change automatically triggers an evaluation run.
```python
import json
import os
import sys
from datetime import datetime, timezone

REGRESSION_THRESHOLD = 0.3  # flag if any single score drops more than this
GLOBAL_PASS_LINE = 3.5  # fail if the global average falls below this


def load_baseline(baseline_path: str) -> dict:
    with open(baseline_path) as f:
        return json.load(f)


def compare_with_baseline(
    current_results: list[dict],
    baseline: dict,
) -> dict:
    # Skip metadata entries (e.g. "_meta") so they don't break the score math
    baseline_scores = {k: v for k, v in baseline.items() if not k.startswith("_")}
    regressions = []
    improvements = []
    for result in current_results:
        task_id = result["task_id"]
        if task_id not in baseline_scores:
            continue
        prev = baseline_scores[task_id]
        score_diff = result["weighted_total"] - prev["weighted_total"]
        if score_diff < -REGRESSION_THRESHOLD:
            regressions.append({
                "task_id": task_id,
                "previous": prev["weighted_total"],
                "current": result["weighted_total"],
                "delta": round(score_diff, 2),
                "reasoning": result.get("reasoning", ""),
            })
        elif score_diff > REGRESSION_THRESHOLD:
            improvements.append({
                "task_id": task_id,
                "previous": prev["weighted_total"],
                "current": result["weighted_total"],
                "delta": round(score_diff, 2),
            })
    avg_current = sum(r["weighted_total"] for r in current_results) / len(current_results)
    avg_baseline = sum(v["weighted_total"] for v in baseline_scores.values()) / len(baseline_scores)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "avg_current": round(avg_current, 2),
        "avg_baseline": round(avg_baseline, 2),
        "avg_delta": round(avg_current - avg_baseline, 2),
        "regressions": regressions,
        "improvements": improvements,
        "passed": avg_current >= GLOBAL_PASS_LINE and len(regressions) == 0,
    }


def save_as_baseline(results: list[dict], path: str):
    baseline = {r["task_id"]: r for r in results}
    baseline["_meta"] = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "num_cases": len(results),
    }
    with open(path, "w") as f:
        json.dump(baseline, f, ensure_ascii=False, indent=2)


# CI integration example
if __name__ == "__main__":
    # baseline_path = os.environ.get("EVAL_BASELINE_PATH", "eval_baselines/latest.json")
    # current = run_eval_dataset(cases, agent_fn)
    # baseline = load_baseline(baseline_path)
    # report = compare_with_baseline(current, baseline)
    # print(json.dumps(report, ensure_ascii=False, indent=2))
    # if not report["passed"]:
    #     sys.exit(1)
    print("Regression pipeline ready. Integrate into CI with EVAL_BASELINE_PATH env var.")
```
Weave offers a managed version of this functionality, automatically tracking evaluation results and visualizing trends. If your team is already in the W&B ecosystem, Weave is the lowest-effort option.
Production Pattern 3: Cost-Quality Pareto Analysis
Agent cost does not scale linearly with quality. Switching from GPT-4o to GPT-4o-mini might cut cost by 10x while success rate drops only 5%. Pareto analysis helps you find the highest-value configuration.
```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    name: str
    model_id: str
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float


@dataclass
class EvalRun:
    config: ModelConfig
    avg_score: float
    pass_rate: float
    avg_input_tokens: int
    avg_output_tokens: int
    avg_latency_ms: int


def calculate_cost_per_task(run: EvalRun) -> float:
    input_cost = (run.avg_input_tokens / 1000) * run.config.cost_per_1k_input_tokens
    output_cost = (run.avg_output_tokens / 1000) * run.config.cost_per_1k_output_tokens
    return input_cost + output_cost


def pareto_analysis(runs: list[EvalRun]) -> list[dict]:
    analyzed = []
    for run in runs:
        cost = calculate_cost_per_task(run)
        analyzed.append({
            "model": run.config.name,
            "avg_score": run.avg_score,
            "pass_rate": run.pass_rate,
            "cost_per_task_usd": round(cost, 4),
            "avg_latency_ms": run.avg_latency_ms,
            "efficiency": round(run.avg_score / cost, 2) if cost > 0 else float("inf"),
        })
    # Sort by cost and find the Pareto frontier (best score at each price point)
    analyzed.sort(key=lambda x: x["cost_per_task_usd"])
    pareto_frontier = []
    best_score = 0
    for item in analyzed:
        if item["avg_score"] > best_score:
            best_score = item["avg_score"]
            pareto_frontier.append(item)
    print("=== Cost-Quality Pareto Analysis ===")
    print(f"{'Model':<20} {'Score':>8} {'Pass%':>8} {'Cost/Task':>12} {'Efficiency':>12}")
    print("-" * 64)
    for item in analyzed:
        marker = " <-- Pareto" if item in pareto_frontier else ""
        print(
            f"{item['model']:<20} {item['avg_score']:>8.2f} "
            f"{item['pass_rate']:>7.1%} ${item['cost_per_task_usd']:>10.4f} "
            f"{item['efficiency']:>11.2f}{marker}"
        )
    return analyzed


if __name__ == "__main__":
    configs = [
        ModelConfig("gpt-4o-mini", "gpt-4o-mini", 0.00015, 0.0006),
        ModelConfig("gpt-4o", "gpt-4o", 0.0025, 0.01),
        ModelConfig("claude-sonnet", "claude-sonnet-4-20250514", 0.003, 0.015),
    ]
    # Simulated eval results (replace with actual run_eval_dataset output)
    mock_runs = [
        EvalRun(configs[0], avg_score=3.6, pass_rate=0.72, avg_input_tokens=800, avg_output_tokens=300, avg_latency_ms=1200),
        EvalRun(configs[1], avg_score=4.2, pass_rate=0.88, avg_input_tokens=850, avg_output_tokens=350, avg_latency_ms=2500),
        EvalRun(configs[2], avg_score=4.3, pass_rate=0.90, avg_input_tokens=900, avg_output_tokens=320, avg_latency_ms=2800),
    ]
    pareto_analysis(mock_runs)
```
The typical finding from Pareto analysis: a mid-tier model with good prompt engineering often outperforms a top-tier model with sloppy prompts. Run a cost-quality comparison before committing to a model choice -- you will frequently discover savings.
Decision Framework: Which Tool When
| Scenario | Recommended Tool | Why |
|---|---|---|
| Rapid prototyping with visualization | Weave | Out-of-the-box tracking and comparison UI |
| Compliance and safety audits | Giskard | Built-in bias, hallucination, and safety scanning |
| Large-scale benchmarking | OpenHarness | Standardized benchmark framework |
| Agent behavior diff tracking | AgentDiff | Precise trajectory comparison between runs |
| End-to-end production evaluation | AWS Agent Eval | Production-grade pipeline, AWS ecosystem integration |
| Full-stack observability + eval | LMNR | Unified trace + eval platform |
| Strong customization needs | Build your own | Use the code patterns above |
The selection criterion is not "which has the most features" but "which fits your current pain point best." Start with Weave for quick visibility, build your own regression pipeline as you mature, and bring in Giskard when compliance requirements demand it.
Three Common Pitfalls
Pitfall 1: Judge-Model Alignment Bias
Using GPT-4o as a judge to evaluate GPT-4o outputs produces systematically inflated scores. The academic term is self-preference bias.
Solutions:
- Use a judge model from a different provider than the model being evaluated (e.g., Claude to judge GPT outputs).
- Make the evaluation prompt specify concrete scoring criteria to reduce subjective space.
- Periodically calibrate judge scores against human evaluations.
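That last point, calibration, can be as simple as tracking bias and error against a small human-scored sample of cases. A sketch (the function name and example scores are illustrative):

```python
def judge_calibration(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Quantify drift between an LLM judge and human raters on shared cases.

    Both lists hold 1-5 scores for the same cases. A consistent positive
    bias or a rising mean absolute error signals the judge prompt needs
    re-calibration.
    """
    assert len(judge_scores) == len(human_scores)
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        "mean_bias": round(sum(diffs) / n, 2),  # > 0 means the judge scores high
        "mean_abs_error": round(sum(abs(d) for d in diffs) / n, 2),
        "within_one": round(sum(abs(d) <= 1 for d in diffs) / n, 2),
    }

judge = [4, 5, 3, 4, 5, 2]
human = [4, 4, 3, 3, 4, 2]
print(judge_calibration(judge, human))
# -> {'mean_bias': 0.5, 'mean_abs_error': 0.5, 'within_one': 1.0}
```

A positive mean bias on same-provider judge/model pairs is exactly the self-preference signature to watch for.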
Pitfall 2: Dataset Rot
Eval datasets expire just like code. When your agent gains new tools or capabilities, old test cases may no longer cover critical paths.
Solutions:
- Add new eval cases every time you add agent capabilities.
- Set a dataset shelf life -- cases older than 3 months need re-review.
- Use AgentDiff to track behavioral changes and identify dataset blind spots.
Pitfall 3: Overfitting to Golden Examples
If your eval dataset has only 20-30 cases and you iteratively tune prompts against them, you are effectively test-prepping -- the prompt will look perfect on those 20 cases but may collapse on novel scenarios.
Solutions:
- Maintain at least 100 cases spanning different difficulties and categories.
- Split into train/eval sets so tuning cases differ from final evaluation cases.
- Hold out a "secret set" that never participates in routine tuning and is only used before releases.
Summary
- Agent evaluation must operate on three layers: component-level for problem localization, interaction-level for path assessment, and outcome-level for final effectiveness. Combined, they transform "it feels worse" into "tool selection accuracy dropped 14%."
- Build the dataset before choosing metrics. Dataset quality and coverage matter more than the choice of evaluation algorithm.
- Keep the judge model and the evaluated model from different providers. Self-preference bias will otherwise render your evaluation meaningless.
- Regression testing belongs in CI. Not "run it when we have time" but "run automatically on every change," treated the same as unit tests.
- Pareto analysis saves money. Run one cost-quality comparison and you will likely find that a cheaper model with good prompting is already sufficient.
Projects in this article
AWS Agent Evaluation
360 ⭐ Amazon's AI agent evaluation tool for automated quality assessment of Bedrock Agents and other LLM agents with multi-dimensional metrics and benchmarks.
Weave
1.1k ⭐ A toolkit by Weights & Biases for developing AI-powered applications, providing LLM call tracing, evaluation experiment management, and versioning from prototype to production.
Giskard
5.3k ⭐ An open-source evaluation and testing library for LLM agents providing automated model scanning, bias detection, performance benchmarking, and compliance checks.
OpenHarness
12.4k ⭐ OpenHarness is an open agent harness platform with a built-in personal agent called Ohmo, providing an integrated solution for agent development, testing, and deployment.
AgentDiff
32 ⭐ Interactive sandboxes for AI agent evaluations and reinforcement learning on third-party APIs like Slack, LinkedIn, and more.
LMNR
2.9k ⭐ LMNR is an open-source observability platform for LLM and agent applications, focused on tracing, quality analysis, and production diagnostics.