Agent End-to-End Testing Specification
Agent end-to-end testing combines scenario suites, golden trace baselines, LLM-as-judge evaluators, and deterministic replay to gate non-deterministic agents in CI. Treat traces — not just final outputs — as the unit of evaluation.
TL;DR
Agent E2E testing runs full scenarios through the live tool graph, scores outputs and trajectories with rubric-based or LLM-as-judge evaluators, and records non-determinism (LLM and tool I/O) so runs can be replayed deterministically: the same scenario produces the same trace twice. CI gates fail builds when scores or trajectory diffs cross declared thresholds.
Definition
Agent end-to-end (E2E) testing is the practice of executing an AI agent against a fixed suite of scenarios — each defining inputs, environment state, and expected outcomes — and scoring the resulting traces and final outputs against deterministic assertions, rubric metrics, or LLM-as-judge evaluators. The unit under test is the full agent run (planner → tools → memory → output), not a single LLM call.
E2E tests for agents differ from traditional E2E tests in three ways:
- The system is non-deterministic. Repeating the same scenario can produce different traces because of model sampling, tool latency, and time-dependent state.
- Outputs are open-ended. Final answers rarely match a single golden string, so evaluation needs rubric or judge-based scoring.
- The trace matters as much as the output. A right answer reached via the wrong tool sequence is still a regression.
Why it matters
Agents fail silently. A prompt edit, a model upgrade, or a tool schema change can flip behavior in production without any code-level error. Without an E2E suite gating CI, regressions are detected by users, not engineers — which means cost spikes, malformed JSON, broken tool calls, and degraded answer quality reach production before anyone notices.
E2E testing closes this gap by:
- Providing a frozen, versioned scenario set that prompt and model changes must clear before merge.
- Generating labeled trace data that powers offline replay, fine-tuning datasets, and judge calibration.
- Producing a regression signal that is robust to LLM nondeterminism — usually a score distribution rather than a single pass/fail.
- Enabling deterministic post-mortem replay when production incidents require root-cause analysis.
How it works
A complete agent E2E test pipeline has six layers, all sharing the same trace schema so production incidents can be promoted into the suite without reformatting.
```mermaid
flowchart LR
    A["Scenario Suite"] --> B["Test Runner"]
    B --> C["Agent Under Test"]
    C --> D["Trace Recorder"]
    D --> E["Evaluators (rule + judge)"]
    E --> F["CI Gate"]
    D --> G["Replay Store"]
    G --> C
```

| Layer | Responsibility | Example tooling |
|---|---|---|
| Scenario suite | Versioned inputs and expected outcomes | YAML/JSON datasets, LangSmith datasets, promptfoo configs |
| Test runner | Spawns the agent for each scenario | promptfoo, pytest harness, custom CLI |
| Agent under test | The exact build that ships to prod | LangGraph, Claude Agent SDK, OpenAI Agents |
| Trace recorder | Captures spans, tool I/O, model calls | OpenTelemetry, LangSmith, Arize, Langfuse |
| Evaluators | Score outputs and trajectories | Schema validators, exact match, LLM-as-judge |
| CI gate | Pass/fail thresholds applied to scores | GitHub Actions, GitLab CI, promptfoo CI |
The replay store is the bridge between testing and debugging: a recorded trace can be re-fed into the agent with the LLM and tool calls stubbed by their recorded responses, producing a deterministic re-run that engineers can step through.
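To make the shared-schema idea concrete, here is a minimal sketch of a trace record, assuming only the span kinds listed above; the class and field names are illustrative, not the OpenTelemetry or any vendor schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent run: a planner decision, tool call, model call, or memory write."""
    span_id: str
    kind: str                                   # "planner" | "tool" | "model" | "memory"
    name: str                                   # e.g. "search_docs" or the model name
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    tokens: int = 0
    latency_ms: float = 0.0

@dataclass
class Trace:
    """A full agent run for one scenario; the unit that evaluators and the replay store consume."""
    trace_id: str
    scenario_id: str
    spans: list = field(default_factory=list)   # ordered list of Span
    final_output: str = ""
```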
Key concepts
Scenario taxonomy
Group scenarios by intent so coverage gaps are visible:
- Happy-path scenarios — canonical user goals, full tool execution.
- Edge-case scenarios — empty results, ambiguous inputs, multi-turn clarifications.
- Adversarial scenarios — prompt injection, jailbreaks, tool misuse.
- Regression scenarios — bugs caught in production, frozen as permanent fixtures.
- Cost and latency scenarios — long-context inputs that exercise budget guardrails.
Each scenario declares: id, inputs, initial_state, expected_outcomes (loose, rubric, or strict), and tags for selective execution.
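For illustration, a single happy-path scenario expressed with those keys (the domain, tool names, and values are hypothetical; teams often keep these entries in YAML or JSON files rather than Python):

```python
scenario = {
    "id": "refund-happy-path-001",
    "tags": ["happy-path", "smoke", "billing"],
    "inputs": {"user_message": "I was double-charged for my March invoice."},
    "initial_state": {"invoices": [{"id": "2024-03", "charges": 2}]},
    "expected_outcomes": {
        # Strict: deterministic assertions checked against the trace.
        "must_call_tools": ["lookup_invoice", "issue_refund"],
        "output_schema": "refund_confirmation.schema.json",
        # Rubric: open-ended dimensions scored by a calibrated LLM judge.
        "rubric": {"correctness": 0.8, "helpfulness": 0.7},
    },
}
```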
Golden traces
A golden trace is a recorded, human-approved trace for a scenario. It captures the canonical sequence of spans (planner step, tool call, model call, memory write) and the final output. Two evaluation patterns use it:
- Trajectory diff — compare the new run's span sequence to the golden trace and fail if order, tool selection, or argument shape diverge.
- Anchor assertions — pin specific spans (for example, must call search_docs before summarize) rather than the entire trajectory, which is more resilient to harmless reordering; see the sketch below.
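A minimal sketch of one anchor assertion, reusing the illustrative trace schema from above. It pins only the relative order of two tool calls rather than the whole trajectory:

```python
def assert_anchor_order(trace, before: str, after: str) -> None:
    """Fail unless the tool named `before` is called at some point before `after`."""
    tool_calls = [s.name for s in trace.spans if s.kind == "tool"]
    assert before in tool_calls, f"expected a call to {before!r}, got {tool_calls}"
    assert after in tool_calls, f"expected a call to {after!r}, got {tool_calls}"
    assert tool_calls.index(before) < tool_calls.index(after), (
        f"{before!r} must run before {after!r}; observed order: {tool_calls}"
    )

# Example anchor: the agent must search documentation before summarizing.
# assert_anchor_order(trace, before="search_docs", after="summarize")
```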
LLM-as-judge evaluation
LLM-as-judge uses a separate LLM with a rubric to score outputs on dimensions like correctness, faithfulness, helpfulness, and tool-use appropriateness. Practitioner guidance from LangSmith, Confident AI, and Microsoft's Azure AI Foundry converges on:
- Run judges at temperature = 0 to reduce variance, but expect residual variance from probability ties.
- Calibrate every judge prompt against a small human-labeled set; track agreement (Cohen's kappa or accuracy on agreement-only samples) before trusting it as a CI gate.
- Prefer pairwise judging (A vs B) over single-output scoring when comparing two agent versions — it is more stable and easier to calibrate.
LLM-as-judge is appropriate for open-ended quality dimensions; it is a poor fit as the sole gate for tool-call correctness, where deterministic schema and trajectory checks are more reliable.
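A sketch of a rubric-based judge under those constraints. The `complete` callable stands in for whatever chat-completion client the team already uses; the rubric text and dimension names are placeholders, not a vendor API:

```python
import json

JUDGE_INSTRUCTIONS = (
    "You are grading an AI agent's answer against the reference notes. "
    "Score correctness, faithfulness, and helpfulness from 0.0 to 1.0 and reply "
    'with JSON only, e.g. {"correctness": 0.9, "faithfulness": 1.0, "helpfulness": 0.8}.'
)

def judge(question: str, answer: str, reference: str, complete) -> dict:
    """Score one output with an LLM judge at temperature 0 and parse its JSON scores."""
    prompt = (
        f"{JUDGE_INSTRUCTIONS}\n\n"
        f"Question: {question}\n"
        f"Agent answer: {answer}\n"
        f"Reference notes: {reference}"
    )
    raw = complete(prompt, temperature=0)   # judge model call, temperature pinned to 0
    return json.loads(raw)
```

Calibration then means running this judge over a small human-labeled sample and checking agreement before its scores are allowed to gate a merge.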
Deterministic replay
Deterministic replay means substituting every non-deterministic dependency — LLM calls, tool calls, clocks, RNG — with stubs that return previously recorded outputs. The replay engine looks up each call by a content-addressed key (often a hash of the request) and returns the recorded response. Replay is used for:
- Reproducing a production failure trace in development.
- Re-evaluating a frozen trace under a new judge prompt without re-spending tokens.
- Stress-testing prompt edits by replaying historical traffic against the new prompt.
VCR-style libraries implement this pattern for Python and TypeScript agent stacks, recording every external call on first run and replaying it on subsequent runs.
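A minimal sketch of the content-addressed record-and-replay pattern, assuming recorded responses live in a plain dict that is persisted between runs; real VCR-style libraries add cassette files, matching rules, and explicit record/replay modes:

```python
import hashlib
import json

def request_key(provider: str, request: dict) -> str:
    """Content-addressed key: a stable hash of the normalized request."""
    canonical = json.dumps({"provider": provider, "request": request}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class ReplayStore:
    """Record every external call on the first run, replay it on subsequent runs."""

    def __init__(self, recordings=None):
        self.recordings = recordings or {}    # key -> recorded response

    def call(self, provider: str, request: dict, live_call):
        key = request_key(provider, request)
        if key in self.recordings:            # replay mode: never touch the network
            return self.recordings[key]
        response = live_call(request)         # record mode: capture the live response
        self.recordings[key] = response
        return response
```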
CI gating thresholds
A gate is a boolean derived from one or more score distributions. Common patterns:
- Hard floor — every scenario must pass schema and tool-call assertions; one failure blocks merge.
- Aggregate threshold — judge score mean ≥ 0.85 across the suite, with no individual scenario below 0.7.
- Regression delta — new build's score must not drop by more than 2 points (for example, 0.02 on a 0-1 judge scale) vs. the previous baseline on any scenario.
- Cost and latency budget — p95 tokens and p95 wall-clock must stay within declared budgets.
Promptfoo and LangSmith both expose these gates as CI-native check outputs that GitHub Actions and GitLab CI can read directly.
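A sketch of how those four gate patterns collapse into one boolean, assuming each scenario result carries `schema_ok`, `trajectory_ok`, `judge_score`, and `tokens` fields; the field names and default thresholds are illustrative:

```python
from statistics import mean, quantiles

def ci_gate(results: dict, baseline: dict, *, min_mean=0.85, min_each=0.7,
            max_regression=0.02, p95_token_budget=20_000) -> bool:
    """Apply hard-floor, aggregate, regression-delta, and budget gates.
    `results` and `baseline` map scenario id -> per-scenario result dict."""
    # Hard floor: deterministic schema and trajectory checks must pass everywhere.
    if not all(r["schema_ok"] and r["trajectory_ok"] for r in results.values()):
        return False
    # Aggregate threshold on judge scores (suite mean and per-scenario minimum).
    scores = [r["judge_score"] for r in results.values()]
    if mean(scores) < min_mean or min(scores) < min_each:
        return False
    # Regression delta vs. the last green baseline, checked per scenario.
    for sid, r in results.items():
        prior = baseline.get(sid)
        if prior and r["judge_score"] < prior["judge_score"] - max_regression:
            return False
    # Cost budget: p95 token usage across the suite must stay within budget.
    p95_tokens = quantiles([r["tokens"] for r in results.values()], n=20)[-1]
    return p95_tokens <= p95_token_budget
```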
Comparison vs related testing layers
| Layer | Unit under test | Determinism | Typical signal |
|---|---|---|---|
| Unit test | Single function or prompt | Fully deterministic | Pass/fail on string match or schema |
| Component eval | One model call or one tool | Mocked dependencies | Score distribution per metric |
| Agent E2E test | Full agent run | Recorded → replayed | Trajectory + judge + budget gates |
| Production observability | Live user traffic | Non-deterministic | Online evals, SLO breaches |
E2E tests sit between component evals and production observability. They use real tools and real models, but inside a controlled scenario suite, and they should reuse the same trace schema as observability so a production incident can be turned into an E2E fixture without reformatting.
Common misconceptions
- "LLM-as-judge can replace deterministic assertions." Judges drift between model versions and have known biases. Use them for quality dimensions, not for tool-call correctness.
- "Snapshot testing the final answer is enough." Final answers can match while the trajectory regresses (wrong tool, extra steps, higher cost). Score the trajectory too.
- "Higher temperature judges produce more diverse evaluations." It mostly produces noisier evaluations. Use temperature = 0 and prompt diversity instead.
- "Replay means rerunning the agent." Replay means substituting recorded LLM and tool outputs so the run is bit-for-bit reproducible.
- "Golden traces should be rewritten on every prompt change." Update them deliberately, with code review, so regressions cannot be silently rebaselined.
How to apply this specification
- Define the scenario schema. Adopt the keys above (id, inputs, initial_state, expected_outcomes, tags) and check the suite into the agent repo.
- Instrument the agent for tracing. Use OpenTelemetry or a vendor SDK that emits spans for planner, tool, and model calls; see the Agent Tracing and Spans Specification.
- Record golden traces. Run the suite, review traces with humans, and freeze the approved ones.
- Wire evaluators. Combine deterministic checks (schema, tool-call sequence) with one or two calibrated LLM-as-judge metrics.
- Add the CI gate. Run the suite on every PR; fail on hard-floor violations and regression deltas.
- Stand up replay. Persist traces in a replay store keyed by request hash so any failure can be re-run deterministically.
- Curate from production. Promote real failures into the regression bucket of the suite; calibrate judges quarterly.
Set explicit thresholds in code, not in dashboards. A typical starter contract: 100% schema pass, 100% tool-trajectory anchor pass, ≥ 0.85 mean judge score, ≤ 2-point regression delta vs. main, and p95 tokens within budget.
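One way to keep that contract in code is a small, versioned config object that the gate function reads; the values below mirror the starter contract and the names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalContract:
    """Starter CI contract, checked into the agent repo next to prompts and tools."""
    schema_pass_rate: float = 1.00        # 100% of scenarios must pass schema checks
    anchor_pass_rate: float = 1.00        # 100% must pass tool-trajectory anchors
    min_mean_judge_score: float = 0.85    # aggregate judge-score threshold
    max_regression_delta: float = 0.02    # max drop vs. last green main, per scenario
    p95_token_budget: int = 20_000        # hypothetical cost budget for the suite

CONTRACT = EvalContract()
```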
FAQ
Q: How is agent E2E testing different from prompt evaluation?
Prompt evaluation scores a single model call against a static input. Agent E2E testing scores the entire run — planner decisions, tool calls, memory writes, and final output — against a scenario. A prompt eval tells you the prompt is good; an E2E test tells you the agent built around the prompt is good.
Q: Should I use LLM-as-judge as my only evaluator?
No. LLM-as-judge is appropriate for open-ended quality dimensions (helpfulness, faithfulness, tone) but is unreliable as the sole gate for tool-call correctness. Combine schema validators and trajectory anchors with one or two calibrated judge metrics.
Q: How do I make non-deterministic agents reproducible in tests?
Record every LLM and tool call with content-addressed keys, then replay by substituting stubs that return the recorded response for each key. This converts the run into a deterministic, debuggable replay.
Q: How many scenarios should an E2E suite contain?
Start with 20-50 scenarios that cover the happy path, top edge cases, and any production regressions. Grow toward 200-500 once you have automated curation from production traces. Coverage matters more than count; tag scenarios so you can run a smoke subset on every PR and the full suite nightly.
Q: Where should the suite live?
In the agent repo, version-controlled alongside prompts and tool definitions. Treat scenarios, golden traces, and judge prompts as code: changes need pull-request review.
Q: How do I gate CI without flaky failures?
Use the regression-delta pattern: compare the new build's score distribution to a stable baseline (last green main) instead of an absolute threshold. Combine that with hard-floor schema and trajectory anchor checks, which are deterministic and never flaky.
Q: Can I run this on a free CI tier?
Mostly yes. Cache LLM responses for replay-mode runs, batch judge calls, and tag scenarios so PRs run a small smoke subset. Reserve the full suite for nightly or pre-release runs to control token spend.
Sources
- Sakura Sky — Trustworthy AI Agents: Deterministic Replay — https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- agentcheck (open source) — VCR-style snapshot/replay/test pattern for AI agent workflows — https://github.com/hvardhan878/agentcheck
- Tian Pan — Deterministic Replay: Debugging AI Agents That Never Run the Same Way Twice (2026) — https://tianpan.co/blog/2026-04-12-deterministic-replay-debugging-non-deterministic-ai-agents
- LangChain — The Agent Improvement Loop Starts with a Trace — https://www.langchain.com/blog/traces-start-agent-improvement-loop
- Confident AI — AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows — https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide
- Microsoft Azure AI Foundry — Evaluating AI Agents: Can LLM-as-a-Judge Evaluators Be Trusted? — https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/evaluating-ai-agents-can-llm-as-a-judge-evaluators-be-trusted/4480110
- r/AI_Agents — practitioner discussion on LLM-as-judge limitations as a CI gate — https://www.reddit.com/r/AI_Agents/comments/1swsqgt/llmasjudge_is_the_wrong_default_heres_what_works/
- Promptfoo — CI/CD Integration for LLM Evaluation — https://www.promptfoo.dev/docs/integrations/ci-cd/
- LangSmith — LLM and AI Agent Evals Platform — https://www.langchain.com/langsmith/evaluation