
Agent End-to-End Testing Specification


Agent end-to-end testing combines scenario suites, golden trace baselines, LLM-as-judge evaluators, and deterministic replay to gate non-deterministic agents in CI. Treat traces — not just final outputs — as the unit of evaluation.

TL;DR

Agent E2E testing runs full scenarios through the live tool graph, scores outputs and trajectories with rubric-based or LLM-as-judge evaluators, and records non-determinism (LLM and tool I/O) so the same scenario can be replayed to produce the same trace twice. CI gates fail builds when scores or trajectory diffs cross declared thresholds.

Definition

Agent end-to-end (E2E) testing is the practice of executing an AI agent against a fixed suite of scenarios — each defining inputs, environment state, and expected outcomes — and scoring the resulting traces and final outputs against deterministic assertions, rubric metrics, or LLM-as-judge evaluators. The unit under test is the full agent run (planner → tools → memory → output), not a single LLM call.

E2E tests for agents differ from traditional E2E tests in three ways:

  1. The system is non-deterministic. Repeating the same scenario can produce different traces because of model sampling, tool latency, and time-dependent state.
  2. Outputs are open-ended. Final answers rarely match a single golden string, so evaluation needs rubric or judge-based scoring.
  3. The trace matters as much as the output. A right answer reached via the wrong tool sequence is still a regression.

Why it matters

Agents fail silently. A prompt edit, a model upgrade, or a tool schema change can flip behavior in production without any code-level error. Without an E2E suite gating CI, regressions are detected by users, not engineers — which means cost spikes, malformed JSON, broken tool calls, and degraded answer quality reach production before anyone notices.

E2E testing closes this gap by:

  • Providing a frozen, versioned scenario set that prompt and model changes must clear before merge.
  • Generating labeled trace data that powers offline replay, fine-tuning datasets, and judge calibration.
  • Producing a regression signal that is robust to LLM nondeterminism — usually a score distribution rather than a single pass/fail.
  • Enabling deterministic post-mortem replay when production incidents require root-cause analysis.

How it works

A complete agent E2E test pipeline has six layers, all sharing the same trace schema so production incidents can be promoted into the suite without reformatting.

flowchart LR
    A["Scenario Suite"] --> B["Test Runner"]
    B --> C["Agent Under Test"]
    C --> D["Trace Recorder"]
    D --> E["Evaluators
(rule + judge)"]
    E --> F["CI Gate"]
    D --> G["Replay Store"]
    G --> C

| Layer | Responsibility | Example tooling |
| --- | --- | --- |
| Scenario suite | Versioned inputs and expected outcomes | YAML/JSON datasets, LangSmith datasets, promptfoo configs |
| Test runner | Spawns the agent for each scenario | promptfoo, pytest harness, custom CLI |
| Agent under test | The exact build that ships to prod | LangGraph, Claude Agent SDK, OpenAI Agents |
| Trace recorder | Captures spans, tool I/O, model calls | OpenTelemetry, LangSmith, Arize, Langfuse |
| Evaluators | Score outputs and trajectories | Schema validators, exact match, LLM-as-judge |
| CI gate | Pass/fail thresholds applied to scores | GitHub Actions, GitLab CI, promptfoo CI |

The replay store is the bridge between testing and debugging: a recorded trace can be re-fed into the agent with the LLM and tool calls stubbed by their recorded responses, producing a deterministic re-run that engineers can step through.

Key concepts

Scenario taxonomy

Group scenarios by intent so coverage gaps are visible:

  • Happy-path scenarios — canonical user goals, full tool execution.
  • Edge-case scenarios — empty results, ambiguous inputs, multi-turn clarifications.
  • Adversarial scenarios — prompt injection, jailbreaks, tool misuse.
  • Regression scenarios — bugs caught in production, frozen as permanent fixtures.
  • Cost and latency scenarios — long-context inputs that exercise budget guardrails.

Each scenario declares: id, inputs, initial_state, expected_outcomes (loose, rubric, or strict), and tags for selective execution.
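
A minimal sketch of one such record, written as a Python dict that mirrors a YAML/JSON scenario file. The id, tool name, and expected phrases are hypothetical placeholders, not part of any specific framework's schema.

```python
# One scenario record using the keys above. All concrete values
# (ids, tool names, expected phrases) are illustrative placeholders.
scenario = {
    "id": "returns-policy-happy-path",
    "inputs": {"user_message": "Can I return a laptop after 40 days?"},
    "initial_state": {"locale": "en-US", "order_history": []},
    "expected_outcomes": {
        "mode": "rubric",                      # loose | rubric | strict
        "must_call_tools": ["search_docs"],    # trajectory anchor (see Golden traces)
        "answer_must_mention": ["30 days", "store credit"],
    },
    "tags": ["happy-path", "smoke"],
}
```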

Golden traces

A golden trace is a recorded, human-approved trace for a scenario. It captures the canonical sequence of spans (planner step, tool call, model call, memory write) and the final output. Two evaluation patterns use it:

  • Trajectory diff — compare the new run's span sequence to the golden trace and fail if order, tool selection, or argument shape diverges.
  • Anchor assertions — pin specific spans (for example, must call search_docs before summarize) rather than the entire trajectory, which is more resilient to harmless reordering.
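
A sketch of the anchor-assertion pattern, assuming the trace recorder exposes the run as an ordered list of span dicts with a "tool" field; the field name and the search_docs/summarize example mirror the bullet above, so adapt it to your trace schema.

```python
def assert_tool_order(trace_spans, before="search_docs", after="summarize"):
    """Anchor assertion: `before` must be called and must precede `after`.

    `trace_spans` is assumed to be the recorded trace as an ordered list of
    span dicts; only spans carrying a "tool" key count as tool calls.
    """
    tools = [span["tool"] for span in trace_spans if "tool" in span]
    assert before in tools, f"expected at least one {before} call"
    assert after in tools, f"expected at least one {after} call"
    assert tools.index(before) < tools.index(after), \
        f"{before} must run before {after}, got order {tools}"
```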

LLM-as-judge evaluation

LLM-as-judge uses a separate LLM with a rubric to score outputs on dimensions like correctness, faithfulness, helpfulness, and tool-use appropriateness. Practitioner guidance documented across LangSmith, Confident AI, and Microsoft's Azure AI Foundry studies converges on:

  • Run judges at temperature = 0 to reduce variance, but expect residual variance from probability ties.
  • Calibrate every judge prompt against a small human-labeled set; track agreement (Cohen's kappa or accuracy on agreement-only samples) before trusting it as a CI gate.
  • Prefer pairwise judging (A vs B) over single-output scoring when comparing two agent versions — it is more stable and easier to calibrate.

LLM-as-judge is appropriate for open-ended quality dimensions; it is a poor fit as the sole gate for tool-call correctness, where deterministic schema and trajectory checks are more reliable.
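
As a sketch, a rubric-based judge can be a thin function around whichever provider SDK you use. Here `judge_llm` is an assumed callable that sends the prompt to the judge model at temperature 0 and returns its text completion; the rubric dimensions follow the list above.

```python
import json

RUBRIC = """You are grading an AI agent's answer.
Return only JSON: {{"correctness": 0-1, "faithfulness": 0-1, "helpfulness": 0-1}}.

Question: {question}
Reference notes: {reference}
Agent answer: {answer}"""

def judge_scores(question, reference, answer, judge_llm):
    """Score one output with an LLM judge.

    `judge_llm` wraps your provider call (run it at temperature=0).
    A judge that breaks the JSON schema raises here, which should fail
    the scenario rather than be silently ignored.
    """
    raw = judge_llm(RUBRIC.format(question=question, reference=reference, answer=answer))
    return json.loads(raw)
```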

Deterministic replay

Deterministic replay means substituting every non-deterministic dependency — LLM calls, tool calls, clocks, RNG — with stubs that return previously recorded outputs. The replay engine looks up each call by a content-addressed key (often a hash of the request) and returns the recorded response. Replay is used for:

  • Reproducing a production failure trace in development.
  • Re-evaluating a frozen trace under a new judge prompt without re-spending tokens.
  • Stress-testing prompt edits by replaying historical traffic against the new prompt.

VCR-style libraries implement this pattern for Python and TypeScript agent stacks, recording every external call on first run and replaying it on subsequent runs.
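
A minimal sketch of the record/replay pattern using only the standard library; the on-disk cassette layout and the `mode` flag are assumptions, not any particular library's API.

```python
import hashlib
import json
from pathlib import Path

STORE = Path("replay_store")  # hypothetical cassette directory, versioned or cached in CI

def request_key(request: dict) -> str:
    """Content-addressed key: hash of the canonicalised request payload."""
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

def call_with_replay(request: dict, live_call, mode: str = "replay"):
    """Return the recorded response for this request if one exists.

    In "record" mode, a miss falls through to `live_call` (the real LLM or
    tool client) and the response is persisted; in "replay" mode, a miss is
    an error so CI can never silently hit the network.
    """
    path = STORE / f"{request_key(request)}.json"
    if path.exists():
        return json.loads(path.read_text())
    if mode == "replay":
        raise LookupError(f"no recording for request {path.name}; run in record mode first")
    response = live_call(request)
    STORE.mkdir(exist_ok=True)
    path.write_text(json.dumps(response))
    return response
```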

CI gating thresholds

A gate is a boolean derived from one or more score distributions. Common patterns:

  • Hard floor — every scenario must pass schema and tool-call assertions; one failure blocks merge.
  • Aggregate threshold — judge score mean ≥ 0.85 across the suite, with no individual scenario below 0.7.
  • Regression delta — new build's score must not drop more than 2 points vs. the previous baseline on any scenario.
  • Cost and latency budget — p95 tokens and p95 wall-clock must stay within declared budgets.

Promptfoo and LangSmith both expose these gates as CI-native check outputs that GitHub Actions and GitLab CI can read directly.
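
A sketch of how these gates can be combined into a single check. The result and baseline shapes are assumptions, and the default thresholds mirror the patterns above on a 0-1 judge scale (with the "2 points" delta expressed as 0.02).

```python
from statistics import mean

# Assumed per-scenario result shape; adapt to your runner's output.
# results / baseline: {scenario_id: {"judge": float (0-1), "schema_pass": bool,
#                                    "anchors_pass": bool, "p95_tokens": int}}
def ci_gate(results, baseline, judge_mean_floor=0.85, judge_scenario_floor=0.70,
            regression_delta=0.02, token_budget=20_000):
    failures = []
    for sid, r in results.items():
        if not (r["schema_pass"] and r["anchors_pass"]):
            failures.append(f"{sid}: hard-floor (schema/anchor) failure")
        if r["judge"] < judge_scenario_floor:
            failures.append(f"{sid}: judge score {r['judge']:.2f} below per-scenario floor")
        if sid in baseline and baseline[sid]["judge"] - r["judge"] > regression_delta:
            failures.append(f"{sid}: regressed vs. last green baseline")
        if r["p95_tokens"] > token_budget:
            failures.append(f"{sid}: token budget exceeded")
    if mean(r["judge"] for r in results.values()) < judge_mean_floor:
        failures.append("suite: mean judge score below aggregate threshold")
    return failures  # a non-empty list blocks the merge
```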

| Layer | Unit under test | Determinism | Typical signal |
| --- | --- | --- | --- |
| Unit test | Single function or prompt | Fully deterministic | Pass/fail on string match or schema |
| Component eval | One model call or one tool | Mocked dependencies | Score distribution per metric |
| Agent E2E test | Full agent run | Recorded → replayed | Trajectory + judge + budget gates |
| Production observability | Live user traffic | Non-deterministic | Online evals, SLO breaches |

E2E tests sit between component evals and production observability. They use real tools and real models, but inside a controlled scenario suite, and they should reuse the same trace schema as observability so a production incident can be turned into an E2E fixture without reformatting.

Common misconceptions

  • "LLM-as-judge can replace deterministic assertions." Judges drift between model versions and have known biases. Use them for quality dimensions, not for tool-call correctness.
  • "Snapshot testing the final answer is enough." Final answers can match while the trajectory regresses (wrong tool, extra steps, higher cost). Score the trajectory too.
  • "Higher temperature judges produce more diverse evaluations." It mostly produces noisier evaluations. Use temperature = 0 and prompt diversity instead.
  • "Replay means rerunning the agent." Replay means substituting recorded LLM and tool outputs so the run is bit-for-bit reproducible.
  • "Golden traces should be rewritten on every prompt change." Update them deliberately, with code review, so regressions cannot be silently rebaselined.

How to apply this specification

  1. Define the scenario schema. Adopt the keys above (id, inputs, initial_state, expected_outcomes, tags) and check the suite into the agent repo.
  2. Instrument the agent for tracing. Use OpenTelemetry or a vendor SDK that emits spans for planner, tool, and model calls (a minimal sketch follows this list); see the Agent Tracing and Spans Specification.
  3. Record golden traces. Run the suite, review traces with humans, and freeze the approved ones.
  4. Wire evaluators. Combine deterministic checks (schema, tool-call sequence) with one or two calibrated LLM-as-judge metrics.
  5. Add the CI gate. Run the suite on every PR; fail on hard-floor violations and regression deltas.
  6. Stand up replay. Persist traces in a replay store keyed by request hash so any failure can be re-run deterministically.
  7. Curate from production. Promote real failures into the regression bucket of the suite; calibrate judges quarterly.
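
For step 2, a minimal tracing sketch using the OpenTelemetry Python SDK. The span and attribute names here are illustrative placeholders, not the official gen_ai semantic conventions (see the Agent Tracing and Spans Specification for those), and `agent.respond` is a hypothetical agent API.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout for local runs; swap the exporter for your backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.e2e")

def run_scenario(scenario, agent):
    # One root span per scenario run, with child spans per planner/tool/model step.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("scenario.id", scenario["id"])
        with tracer.start_as_current_span("tool.search_docs") as tool_span:
            tool_span.set_attribute("tool.name", "search_docs")
            # ... call the tool and record its input/output on the span ...
        return agent.respond(scenario["inputs"])  # hypothetical agent entrypoint
```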

Set explicit thresholds in code, not in dashboards. A typical starter contract: 100% schema pass, 100% tool-trajectory anchor pass, ≥ 0.85 mean judge score, ≤ 2-point regression delta vs. main, and p95 tokens within budget.
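
One way to keep that contract in code is a parametrized pytest module that loads the versioned suite and applies the floors per scenario; the file path is a placeholder, and `run_agent`, `validate_schema`, and `judge_score` are assumed functions from your own harness.

```python
import pathlib

import pytest
import yaml

from agent_harness import run_agent, validate_schema, judge_score  # hypothetical harness module

SUITE = yaml.safe_load(pathlib.Path("scenarios/suite.yaml").read_text())  # versioned with the agent

@pytest.mark.parametrize("scenario", SUITE, ids=lambda s: s["id"])
def test_agent_scenario(scenario):
    trace, output = run_agent(scenario["inputs"], scenario["initial_state"])
    assert validate_schema(output), "schema hard floor"                        # deterministic check
    assert judge_score(scenario, output) >= 0.70, "per-scenario judge floor"   # calibrated judge metric
```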

FAQ

Q: How is agent E2E testing different from prompt evaluation?

Prompt evaluation scores a single model call against a static input. Agent E2E testing scores the entire run — planner decisions, tool calls, memory writes, and final output — against a scenario. A prompt eval tells you the prompt is good; an E2E test tells you the agent built around the prompt is good.

Q: Should I use LLM-as-judge as my only evaluator?

No. LLM-as-judge is appropriate for open-ended quality dimensions (helpfulness, faithfulness, tone) but is unreliable as the sole gate for tool-call correctness. Combine schema validators and trajectory anchors with one or two calibrated judge metrics.

Q: How do I make non-deterministic agents reproducible in tests?

Record every LLM and tool call with content-addressed keys, then replay by substituting stubs that return the recorded response for each key. This converts the run into a deterministic, debuggable replay.

Q: How many scenarios should an E2E suite contain?

Start with 20-50 scenarios that cover the happy path, top edge cases, and any production regressions. Grow toward 200-500 once you have automated curation from production traces. Coverage matters more than count; tag scenarios so you can run a smoke subset on every PR and the full suite nightly.

Q: Where should the suite live?

In the agent repo, version-controlled alongside prompts and tool definitions. Treat scenarios, golden traces, and judge prompts as code: changes need pull-request review.

Q: How do I gate CI without flaky failures?

Use the regression-delta pattern: compare the new build's score distribution to a stable baseline (last green main) instead of an absolute threshold. Combine that with hard-floor schema and trajectory anchor checks, which are deterministic and never flaky.

Q: Can I run this on a free CI tier?

Mostly yes. Cache LLM responses for replay-mode runs, batch judge calls, and tag scenarios so PRs run a small smoke subset. Reserve the full suite for nightly or pre-release runs to control token spend.

Sources

  • Sakura Sky — Trustworthy AI Agents: Deterministic Replay — https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
  • agentcheck (open source) — VCR-style snapshot/replay testing for AI agents — https://github.com/hvardhan878/agentcheck
  • Tian Pan — Deterministic Replay: Debugging AI Agents That Never Run the Same Way Twice (2026) — https://tianpan.co/blog/2026-04-12-deterministic-replay-debugging-non-deterministic-ai-agents
  • LangChain — The Agent Improvement Loop Starts with a Trace — https://www.langchain.com/blog/traces-start-agent-improvement-loop
  • Confident AI — AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows — https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide
  • Microsoft Azure AI Foundry — Evaluating AI Agents: Can LLM-as-a-Judge Evaluators Be Trusted? — https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/evaluating-ai-agents-can-llm-as-a-judge-evaluators-be-trusted/4480110
  • r/AI_Agents — practitioner discussion on LLM-as-judge limitations as a CI gate — https://www.reddit.com/r/AI_Agents/comments/1swsqgt/llmasjudge_is_the_wrong_default_heres_what_works/
  • Promptfoo — CI/CD Integration for LLM Evaluation — https://www.promptfoo.dev/docs/integrations/ci-cd/
  • LangSmith — LLM and AI Agent Evals Platform — https://www.langchain.com/langsmith/evaluation

