Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines

Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.

AgentList Team · April 28, 2026
Agent Evaluation · LLM Evaluation · Automated Testing · Quality Assurance · Eval


Why "It Looks Right" Is Not an Evaluation Strategy

Agent systems share one property that causes every team pain: nondeterminism. Run the same prompt twice, and you may get completely different execution paths. A coding agent that successfully fixes a bug on one run might delete the test file on the next.

Most teams evaluate agents the same way in the early days: run a few cases manually, eyeball the output, and decide if it "seems correct." This works during the demo phase but breaks down fast once you enter a regular iteration cycle. Three problems emerge:

Silent regressions go unnoticed. You tweak the planning prompt. Tool selection accuracy drops from 92% to 78%. But two test cases happen to still pass, so you ship it. By the time users report problems, you have pushed three more versions.

Improvement has no direction. "The agent got worse" is a useless statement. Worse at what? Wrong tool selection? Too many planning steps? Incorrect output format? Without layered metrics, improvement is guesswork.

Trade-off decisions are blind. Switch to a cheaper but weaker model -- how much does success rate drop? Add a tool-selection validation step -- how much latency does it add? Without quantitative data, these decisions are pure intuition.

The Three-Layer Evaluation Framework

A complete agent evaluation system covers three layers, each answering a different question:

Layer 1: Component-Level Evaluation

Question: Is each sub-module working correctly on its own?

Components to test and their corresponding metrics:

Component | Key Metrics | Test Method
Tool Selection | Accuracy, F1 | Given a task description, verify the correct tool and parameters are chosen
Planning | Step relevance, redundancy | Given a goal, assess whether the plan includes necessary steps without extras
Output Formatting | Compliance rate | Verify output strictly conforms to JSON Schema / function signatures
Retrieval | Recall, Precision | Standard IR evaluation, applicable to RAG-based agents

The core value of component-level evaluation is problem localization. When end-to-end tests fail, component metrics tell you which layer broke.
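
Metrics like tool-selection accuracy are cheap to compute deterministically. A minimal sketch, assuming you log which tools the agent actually called and annotate each case with the tools a reference solution uses -- the function and field names here are illustrative, not part of the patterns below:

from collections import Counter

def tool_selection_metrics(expected_tools: list[str], selected_tools: list[str]) -> dict:
    # Micro precision/recall/F1 over tool names for a single case.
    # expected_tools: tools a reference solution uses; selected_tools: tools the agent called.
    expected, selected = Counter(expected_tools), Counter(selected_tools)
    true_positives = sum((expected & selected).values())
    precision = true_positives / max(sum(selected.values()), 1)
    recall = true_positives / max(sum(expected.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": round(precision, 3), "recall": round(recall, 3), "f1": round(f1, 3)}

# Example: the agent searched the web twice and never used code_search
print(tool_selection_metrics(["code_search"], ["web_search", "web_search"]))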

Layer 2: Interaction-Level (Trajectory) Evaluation

Question: Is the agent's execution path reasonable?

Interaction-level evaluation focuses on the trajectory -- the complete path the agent takes from start to finish. Key metrics:

  • Trajectory correctness: Did the agent follow the optimal path, or did it take unnecessary detours?
  • Step efficiency: How many steps taken vs. minimum required
  • Error recovery rate: When the agent makes a mistake, can it self-correct and return to the right path?
  • Tool call efficiency: Were unnecessary tools invoked? Were parameters accurate?

This layer is the hardest to automate because it requires defining what a correct trajectory looks like. In practice, the standard approach is an LLM judge that compares the agent's actual trajectory against a reference trajectory.
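
Two of the metrics above can still be computed without a judge. A minimal sketch, assuming each trajectory step is an {action, tool, input, output} dict (the same shape AgentResult uses in Pattern 1 below) and that each case is annotated with the minimal sequence of tool calls a reference solution needs:

def trajectory_metrics(trajectory: list[dict], reference_tools: list[str]) -> dict:
    # Step efficiency: minimum required steps vs. steps actually taken (1.0 = no detours).
    # Tool call efficiency: how many calls hit tools the reference solution never needs.
    tool_calls = [step["tool"] for step in trajectory if step.get("tool")]
    step_efficiency = min(len(reference_tools) / max(len(trajectory), 1), 1.0)
    unnecessary_calls = sum(1 for t in tool_calls if t not in reference_tools)
    return {
        "step_efficiency": round(step_efficiency, 2),
        "unnecessary_tool_calls": unnecessary_calls,
    }

Trajectory correctness and error recovery still need reference trajectories and a judge, which is what the LLM-as-judge pattern below handles.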

Layer 3: Outcome-Level Evaluation

Question: Was the task actually completed?

This is the most important layer -- the one that ultimately matters:

  • Task completion rate: Was the task genuinely completed, not just answered with text that looks like a result?
  • User satisfaction: Human evaluation or implicit feedback (did the user follow up? did they accept the result?)
  • Cost per task: Tokens consumed / API calls / latency to complete one task
  • Cost-quality Pareto: Which configuration achieves the best quality within a given budget?

AWS's Agent Evaluation framework starts from the outcome level, using predefined task sets and judging criteria to measure agent performance across different scenarios.
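
Task completion deserves the strongest verification you can afford: where a programmatic check of the real-world outcome exists (tests pass, a file was created, a record changed), prefer it over judging the agent's answer text. A minimal sketch, assuming a hypothetical per-task registry of such checks (the test path is made up for illustration):

import subprocess
from typing import Callable

# Hypothetical registry: each task_id maps to a check of the actual outcome,
# not of the agent's answer text.
OUTCOME_CHECKS: dict[str, Callable[[], bool]] = {
    "fix_bug_001": lambda: subprocess.run(
        ["pytest", "tests/test_auth.py", "-q"], capture_output=True
    ).returncode == 0,
}

def verify_outcome(task_id: str) -> bool | None:
    # Returns True/False if a programmatic check exists; None means only an
    # LLM judge (Pattern 1 below) can assess the outcome.
    check = OUTCOME_CHECKS.get(task_id)
    return check() if check else None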

Production Pattern 1: Building a Gold-Standard Eval Dataset

The first step in evaluation is not picking metrics -- it is having a reliable dataset. Here is a complete workflow for building one, including an LLM-as-judge implementation.

import json
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class EvalCase:
    task_id: str
    task_description: str
    expected_tools: list[str]
    expected_steps: list[str]
    expected_outcome: str
    difficulty: str  # easy / medium / hard
    category: str    # e.g. "code_search", "data_analysis"

@dataclass
class AgentResult:
    task_id: str
    trajectory: list[dict]  # List of {action, tool, input, output}
    final_output: str
    total_tokens: int
    total_steps: int

JUDGE_PROMPT = """You are an expert agent evaluator. Score the agent's execution based on:

Task: {task_description}
Expected outcome: {expected_outcome}
Agent's final output: {actual_output}

Agent trajectory:
{trajectory}

Rate each dimension (1-5):
1. task_completion: Did the agent achieve the core objective?
2. process_quality: Was the execution efficient without redundant steps?
3. output_quality: Is the final output accurate, complete, and correctly formatted?

Return JSON:
{{
    "task_completion": <1-5>,
    "process_quality": <1-5>,
    "output_quality": <1-5>,
    "reasoning": "<brief explanation for any deductions>"
}}"""

def judge_result(case: EvalCase, result: AgentResult) -> dict:
    trajectory_str = json.dumps(result.trajectory, ensure_ascii=False, indent=2)
    prompt = JUDGE_PROMPT.format(
        task_description=case.task_description,
        expected_outcome=case.expected_outcome,
        actual_output=result.final_output,
        trajectory=trajectory_str,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    scores = json.loads(response.choices[0].message.content)

    # Weighted composite score
    weighted_score = (
        scores["task_completion"] * 0.5
        + scores["process_quality"] * 0.3
        + scores["output_quality"] * 0.2
    )
    scores["weighted_total"] = round(weighted_score, 2)
    return scores

def run_eval_dataset(
    cases: list[EvalCase],
    agent_fn,  # Your agent function: (task_description) -> AgentResult
) -> list[dict]:
    results = []
    for case in cases:
        agent_result = agent_fn(case.task_description)
        judge_scores = judge_result(case, agent_result)
        results.append({
            "task_id": case.task_id,
            "difficulty": case.difficulty,
            "tokens_used": agent_result.total_tokens,
            "steps_taken": agent_result.total_steps,
            **judge_scores,
        })

    avg_score = sum(r["weighted_total"] for r in results) / len(results)
    pass_rate = sum(1 for r in results if r["task_completion"] >= 4) / len(results)
    print(f"Eval Summary: avg_score={avg_score:.2f}, pass_rate={pass_rate:.1%}")
    return results

if __name__ == "__main__":
    cases = [
        EvalCase(
            task_id="search_001",
            task_description="Find the function handling user authentication in the codebase and list its parameters and return type",
            expected_tools=["code_search"],
            expected_steps=["Search for authentication code", "Locate function definition", "Extract signature info"],
            expected_outcome="Structured info containing function name, parameter list, and return type",
            difficulty="easy",
            category="code_search",
        ),
    ]
    # results = run_eval_dataset(cases, my_agent_fn)

A few recommendations for the dataset itself (a coverage-check sketch follows the list):

  • Stratified sampling: Distribute across difficulty levels and categories. Do not fill it with only easy cases.
  • Include edge cases: Ambiguous task descriptions, unavailable tools, multi-step reasoning requirements.
  • Version control: Bind the dataset to agent versions to avoid running new agents against stale expectations.
  • Refresh regularly: Audit the dataset quarterly and retire outdated cases.
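
To keep stratified sampling and the quarterly refresh honest, a quick coverage report over the case list is enough to surface gaps. A minimal sketch built on the EvalCase dataclass above; the threshold is an arbitrary starting point:

from collections import Counter

def coverage_report(cases: list[EvalCase], min_per_bucket: int = 5) -> dict:
    # Flags difficulty/category buckets with too few cases.
    by_difficulty = Counter(c.difficulty for c in cases)
    by_category = Counter(c.category for c in cases)
    gaps = [
        f"{dim}={key}: only {count} case(s)"
        for dim, counter in (("difficulty", by_difficulty), ("category", by_category))
        for key, count in counter.items()
        if count < min_per_bucket
    ]
    return {"by_difficulty": dict(by_difficulty), "by_category": dict(by_category), "gaps": gaps}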

Production Pattern 2: Regression Testing Pipeline

Once you have a dataset, wire it into CI so every prompt or model change automatically triggers an evaluation run.

import json
import os
import sys
from datetime import datetime, timezone

REGRESSION_THRESHOLD = 0.3  # Flag if any single score drops more than this
GLOBAL_PASS_LINE = 3.5      # Fail if the global average falls below this

def load_baseline(baseline_path: str) -> dict:
    with open(baseline_path) as f:
        return json.load(f)

def compare_with_baseline(
    current_results: list[dict],
    baseline: dict,
) -> dict:
    regressions = []
    improvements = []

    for result in current_results:
        task_id = result["task_id"]
        if task_id not in baseline:
            continue

        prev = baseline[task_id]
        score_diff = result["weighted_total"] - prev["weighted_total"]

        if score_diff < -REGRESSION_THRESHOLD:
            regressions.append({
                "task_id": task_id,
                "previous": prev["weighted_total"],
                "current": result["weighted_total"],
                "delta": round(score_diff, 2),
                "reasoning": result.get("reasoning", ""),
            })
        elif score_diff > REGRESSION_THRESHOLD:
            improvements.append({
                "task_id": task_id,
                "previous": prev["weighted_total"],
                "current": result["weighted_total"],
                "delta": round(score_diff, 2),
            })

    avg_current = sum(r["weighted_total"] for r in current_results) / len(current_results)
    # Exclude the "_meta" entry that save_as_baseline writes alongside the cases
    baseline_scores = [v["weighted_total"] for k, v in baseline.items() if not k.startswith("_")]
    avg_baseline = sum(baseline_scores) / len(baseline_scores)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "avg_current": round(avg_current, 2),
        "avg_baseline": round(avg_baseline, 2),
        "avg_delta": round(avg_current - avg_baseline, 2),
        "regressions": regressions,
        "improvements": improvements,
        "passed": avg_current >= GLOBAL_PASS_LINE and len(regressions) == 0,
    }

def save_as_baseline(results: list[dict], path: str):
    baseline = {r["task_id"]: r for r in results}
    baseline["_meta"] = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "num_cases": len(results),
    }
    with open(path, "w") as f:
        json.dump(baseline, f, ensure_ascii=False, indent=2)

# CI integration example
if __name__ == "__main__":
    # baseline_path = os.environ.get("EVAL_BASELINE_PATH", "eval_baselines/latest.json")
    # current = run_eval_dataset(cases, agent_fn)
    # baseline = load_baseline(baseline_path)
    # report = compare_with_baseline(current, baseline)
    # print(json.dumps(report, ensure_ascii=False, indent=2))
    # if not report["passed"]:
    #     sys.exit(1)
    print("Regression pipeline ready. Integrate into CI with EVAL_BASELINE_PATH env var.")

Weave offers a managed version of this functionality, automatically tracking evaluation results and visualizing trends. If your team is already in the W&B ecosystem, Weave is the lowest-effort option.

Production Pattern 3: Cost-Quality Pareto Analysis

Agent cost is not linear. Switching from GPT-4o to GPT-4o-mini might cut cost by 10x while success rate drops only 5%. Pareto analysis helps you find the highest-value configuration.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    model_id: str
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float

@dataclass
class EvalRun:
    config: ModelConfig
    avg_score: float
    pass_rate: float
    avg_input_tokens: int
    avg_output_tokens: int
    avg_latency_ms: int

def calculate_cost_per_task(run: EvalRun) -> float:
    input_cost = (run.avg_input_tokens / 1000) * run.config.cost_per_1k_input_tokens
    output_cost = (run.avg_output_tokens / 1000) * run.config.cost_per_1k_output_tokens
    return input_cost + output_cost

def pareto_analysis(runs: list[EvalRun]) -> list[dict]:
    analyzed = []
    for run in runs:
        cost = calculate_cost_per_task(run)
        analyzed.append({
            "model": run.config.name,
            "avg_score": run.avg_score,
            "pass_rate": run.pass_rate,
            "cost_per_task_usd": round(cost, 4),
            "avg_latency_ms": run.avg_latency_ms,
            "efficiency": round(run.avg_score / cost, 2) if cost > 0 else float("inf"),
        })

    # Sort by cost and find Pareto frontier
    analyzed.sort(key=lambda x: x["cost_per_task_usd"])

    pareto_frontier = []
    best_score = 0
    for item in analyzed:
        if item["avg_score"] > best_score:
            best_score = item["avg_score"]
            pareto_frontier.append(item)

    print("=== Cost-Quality Pareto Analysis ===")
    print(f"{'Model':<20} {'Score':>8} {'Pass%':>8} {'Cost/Task':>12} {'Efficiency':>12}")
    print("-" * 64)
    for item in analyzed:
        marker = " <-- Pareto" if item in pareto_frontier else ""
        print(
            f"{item['model']:<20} {item['avg_score']:>8.2f} "
            f"{item['pass_rate']:>7.1%} ${item['cost_per_task_usd']:>10.4f} "
            f"{item['efficiency']:>11.2f}{marker}"
        )

    return analyzed

if __name__ == "__main__":
    configs = [
        ModelConfig("gpt-4o-mini", "gpt-4o-mini", 0.00015, 0.0006),
        ModelConfig("gpt-4o", "gpt-4o", 0.0025, 0.01),
        ModelConfig("claude-sonnet", "claude-sonnet-4-20250514", 0.003, 0.015),
    ]
    # Simulated eval results (replace with actual run_eval_dataset output)
    mock_runs = [
        EvalRun(configs[0], avg_score=3.6, pass_rate=0.72, avg_input_tokens=800, avg_output_tokens=300, avg_latency_ms=1200),
        EvalRun(configs[1], avg_score=4.2, pass_rate=0.88, avg_input_tokens=850, avg_output_tokens=350, avg_latency_ms=2500),
        EvalRun(configs[2], avg_score=4.3, pass_rate=0.90, avg_input_tokens=900, avg_output_tokens=320, avg_latency_ms=2800),
    ]
    pareto_analysis(mock_runs)

The typical finding from Pareto analysis: a mid-tier model with good prompt engineering often outperforms a top-tier model with sloppy prompts. Run a cost-quality comparison before committing to a model choice -- you will frequently discover savings.

Decision Framework: Which Tool When

Scenario | Recommended Tool | Why
Rapid prototyping with visualization | Weave | Out-of-the-box tracking and comparison UI
Compliance and safety audits | Giskard | Built-in bias, hallucination, and safety scanning
Large-scale benchmarking | OpenHarness | Standardized benchmark framework
Agent behavior diff tracking | Agent Diff | Precise trajectory comparison between runs
End-to-end production evaluation | AWS Agent Eval | Production-grade pipeline, AWS ecosystem integration
Full-stack observability + eval | LMNR | Unified trace + eval platform
Strong customization needs | Build your own | Use the code patterns above

The selection criterion is not "which has the most features" but "which fits your current pain point best." Start with Weave for quick visibility, build your own regression pipeline as you mature, and bring in Giskard when compliance requirements demand it.

Three Common Pitfalls

Pitfall 1: Judge-Model Alignment Bias

Using GPT-4o as a judge to evaluate GPT-4o outputs produces systematically inflated scores. The academic term is self-preference bias.

Solutions:

  • Use a judge model from a different provider than the model being evaluated (e.g., Claude to judge GPT outputs).
  • Make the evaluation prompt specify concrete scoring criteria to reduce subjective space.
  • Periodically calibrate judge scores against human evaluations.
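
For the last point, calibration can be as simple as scoring a small human-labeled subset with the judge and checking agreement. A minimal sketch using only the standard library (statistics.correlation requires Python 3.10+; the score lists are hypothetical):

from statistics import correlation, mean

def judge_calibration(judge_scores: list[float], human_scores: list[float]) -> dict:
    # Both lists cover the same cases in the same order.
    mae = mean(abs(j - h) for j, h in zip(judge_scores, human_scores))
    r = correlation(judge_scores, human_scores)  # Pearson r
    return {"mean_abs_error": round(mae, 2), "pearson_r": round(r, 2)}

# A high Pearson r with a constant offset means the judge ranks cases correctly
# but is systematically generous -- adjust the pass threshold rather than the prompt.
print(judge_calibration([4.5, 3.5, 5.0, 2.5], [4.0, 3.0, 4.5, 2.0]))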

Pitfall 2: Dataset Rot

Eval datasets expire just like code. When your agent gains new tools or capabilities, old test cases may no longer cover critical paths.

Solutions:

  • Add new eval cases every time you add agent capabilities.
  • Set a dataset shelf life -- cases older than 3 months need re-review.
  • Use Agent Diff to track behavioral changes and identify dataset blind spots.
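
The shelf-life rule is easy to automate if you track when each case was written. A minimal sketch, assuming a hypothetical task_id-to-creation-date mapping (the EvalCase dataclass above has no timestamp field):

from datetime import datetime, timedelta, timezone

SHELF_LIFE = timedelta(days=90)

def stale_case_ids(case_created_at: dict[str, datetime]) -> list[str]:
    # Returns the task_ids due for re-review.
    cutoff = datetime.now(timezone.utc) - SHELF_LIFE
    return [task_id for task_id, created in case_created_at.items() if created < cutoff]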

Pitfall 3: Overfitting to Golden Examples

If your eval dataset has only 20-30 cases and you iteratively tune prompts against them, you are effectively test-prepping -- the prompt will look perfect on those 20 cases but may collapse on novel scenarios.

Solutions:

  • Maintain at least 100 cases spanning different difficulties and categories.
  • Split into train/eval sets so tuning cases differ from final evaluation cases.
  • Hold out a "secret set" that never participates in routine tuning and is only used before releases.
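
One way to enforce the split is to assign each case deterministically from a hash of its task_id, so cases never silently migrate between sets as the dataset grows. A sketch with arbitrary 70/20/10 ratios:

import hashlib

def assign_split(task_id: str) -> str:
    # Hash-based assignment keeps a case in the same split across runs.
    bucket = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 100
    if bucket < 70:
        return "tune"     # used while iterating on prompts
    if bucket < 90:
        return "eval"     # used for routine regression runs
    return "holdout"      # the "secret set", only used before releases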

Summary

  • Agent evaluation must operate on three layers: component-level for problem localization, interaction-level for path assessment, and outcome-level for final effectiveness. Combined, they transform "it feels worse" into "tool selection accuracy dropped 14%."
  • Build the dataset before choosing metrics. Dataset quality and coverage matter more than the choice of evaluation algorithm.
  • Use a judge model from a different provider than the model being evaluated. Self-preference bias will otherwise render your evaluation meaningless.
  • Regression testing belongs in CI. Not "run it when we have time" but "run automatically on every change," treated the same as unit tests.
  • Pareto analysis saves money. Run one cost-quality comparison and you will likely find that a cheaper model with good prompting is already sufficient.