Agent End-to-End Testing Specification
Agent end-to-end testing combines scenario suites, golden trace baselines, LLM-as-judge evaluators, and deterministic replay to gate non-deterministic agents in CI. Treat traces — not just final outputs — as the unit of evaluation.
TL;DR
Agent E2E testing runs full scenarios through the live tool graph, scores outputs and trajectories with rubric-based or LLM-as-judge evaluators, and records non-determinism (LLM and tool I/O) so runs can be replayed deterministically: the same scenario produces the same trace twice. CI gates fail builds when scores or trajectory diffs cross declared thresholds.
Definition
Agent end-to-end (E2E) testing is the practice of executing an AI agent against a fixed suite of scenarios — each defining inputs, environment state, and expected outcomes — and scoring the resulting traces and final outputs against deterministic assertions, rubric metrics, or LLM-as-judge evaluators. The unit under test is the full agent run (planner → tools → memory → output), not a single LLM call.
E2E tests for agents differ from traditional E2E tests in three ways:
- The system is non-deterministic. Repeating the same scenario can produce different traces because of model sampling, tool latency, and time-dependent state.
- Outputs are open-ended. Final answers rarely match a single golden string, so evaluation needs rubric or judge-based scoring.
- The trace matters as much as the output. A right answer reached via the wrong tool sequence is still a regression.
Why it matters
Agents fail silently. A prompt edit, a model upgrade, or a tool schema change can flip behavior in production without any code-level error. Without an E2E suite gating CI, regressions are detected by users, not engineers — which means cost spikes, malformed JSON, broken tool calls, and degraded answer quality reach production before anyone notices.
E2E testing closes this gap by:
- Providing a frozen, versioned scenario set that prompt and model changes must clear before merge.
- Generating labeled trace data that powers offline replay, fine-tuning datasets, and judge calibration.
- Producing a regression signal that is robust to LLM nondeterminism — usually a score distribution rather than a single pass/fail.
- Enabling deterministic post-mortem replay when production incidents require root-cause analysis.
How it works
A complete agent E2E test pipeline has six layers, all sharing the same trace schema so production incidents can be promoted into the suite without reformatting.
```mermaid
flowchart LR
    A["Scenario Suite"] --> B["Test Runner"]
    B --> C["Agent Under Test"]
    C --> D["Trace Recorder"]
    D --> E["Evaluators (rule + judge)"]
    E --> F["CI Gate"]
    D --> G["Replay Store"]
    G --> C
```

| Layer | Responsibility | Example tooling |
|---|---|---|
| Scenario suite | Versioned inputs and expected outcomes | YAML/JSON datasets, LangSmith datasets, promptfoo configs |
| Test runner | Spawns the agent for each scenario | promptfoo, pytest harness, custom CLI |
| Agent under test | The exact build that ships to prod | LangGraph, Claude Agent SDK, OpenAI Agents |
| Trace recorder | Captures spans, tool I/O, model calls | OpenTelemetry, LangSmith, Arize, Langfuse |
| Evaluators | Score outputs and trajectories | Schema validators, exact match, LLM-as-judge |
| CI gate | Pass/fail thresholds applied to scores | GitHub Actions, GitLab CI, promptfoo CI |
The replay store is the bridge between testing and debugging: a recorded trace can be re-fed into the agent with the LLM and tool calls stubbed by their recorded responses, producing a deterministic re-run that engineers can step through.
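To make the shared-schema idea concrete, here is a minimal sketch of a trace record, assuming only the span kinds listed above; the class and field names are illustrative, not the OpenTelemetry or any vendor schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent run: a planner decision, tool call, model call, or memory write."""
    span_id: str
    kind: str                                   # "planner" | "tool" | "model" | "memory"
    name: str                                   # e.g. "search_docs" or the model name
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    tokens: int = 0
    latency_ms: float = 0.0

@dataclass
class Trace:
    """A full agent run for one scenario; the unit that evaluators and the replay store consume."""
    trace_id: str
    scenario_id: str
    spans: list = field(default_factory=list)   # ordered list of Span
    final_output: str = ""
```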
Key concepts
Scenario taxonomy
Group scenarios by intent so coverage gaps are visible:
- Happy-path scenarios — canonical user goals, full tool execution.
- Edge-case scenarios — empty results, ambiguous inputs, multi-turn clarifications.
- Adversarial scenarios — prompt injection, jailbreaks, tool misuse.
- Regression scenarios — bugs caught in production, frozen as permanent fixtures.
- Cost and latency scenarios — long-context inputs that exercise budget guardrails.
Each scenario declares: id, inputs, initial_state, expected_outcomes (loose, rubric, or strict), and tags for selective execution.
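For illustration, a single happy-path scenario expressed with those keys (the domain, tool names, and values are hypothetical; teams often keep these entries in YAML or JSON files rather than Python):

```python
scenario = {
    "id": "refund-happy-path-001",
    "tags": ["happy-path", "smoke", "billing"],
    "inputs": {"user_message": "I was double-charged for my March invoice."},
    "initial_state": {"invoices": [{"id": "2024-03", "charges": 2}]},
    "expected_outcomes": {
        # Strict: deterministic assertions checked against the trace.
        "must_call_tools": ["lookup_invoice", "issue_refund"],
        "output_schema": "refund_confirmation.schema.json",
        # Rubric: open-ended dimensions scored by a calibrated LLM judge.
        "rubric": {"correctness": 0.8, "helpfulness": 0.7},
    },
}
```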
Golden traces
A golden trace is a recorded, human-approved trace for a scenario. It captures the canonical sequence of spans (planner step, tool call, model call, memory write) and the final output. Two evaluation patterns use it:
- Trajectory diff — compare the new run's span sequence to the golden trace and fail if order, tool selection, or argument shape diverge.
- Anchor assertions — pin specific spans (for example, must call search_docs before summarize) rather than the entire trajectory, which is more resilient to harmless reordering; see the sketch below.
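A minimal sketch of one anchor assertion, reusing the illustrative trace schema from above. It pins only the relative order of two tool calls rather than the whole trajectory:

```python
def assert_anchor_order(trace, before: str, after: str) -> None:
    """Fail unless the tool named `before` is called at some point before `after`."""
    tool_calls = [s.name for s in trace.spans if s.kind == "tool"]
    assert before in tool_calls, f"expected a call to {before!r}, got {tool_calls}"
    assert after in tool_calls, f"expected a call to {after!r}, got {tool_calls}"
    assert tool_calls.index(before) < tool_calls.index(after), (
        f"{before!r} must run before {after!r}; observed order: {tool_calls}"
    )

# Example anchor: the agent must search documentation before summarizing.
# assert_anchor_order(trace, before="search_docs", after="summarize")
```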
LLM-as-judge evaluation
LLM-as-judge uses a separate LLM with a rubric to score outputs on dimensions like correctness, faithfulness, helpfulness, and tool-use appropriateness. Practitioner guidance from LangSmith, Confident AI, and Microsoft's Azure AI Foundry converges on:
- Run judges at temperature = 0 to reduce variance, but expect residual variance from probability ties.
- Calibrate every judge prompt against a small human-labeled set; track agreement (Cohen's kappa or accuracy on agreement-only samples) before trusting it as a CI gate.
- Prefer pairwise judging (A vs B) over single-output scoring when comparing two agent versions — it is more stable and easier to calibrate.
LLM-as-judge is appropriate for open-ended quality dimensions; it is a poor fit as the sole gate for tool-call correctness, where deterministic schema and trajectory checks are more reliable.
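A sketch of a rubric-based judge under those constraints. The `complete` callable stands in for whatever chat-completion client the team already uses; the rubric text and dimension names are placeholders, not a vendor API:

```python
import json

JUDGE_INSTRUCTIONS = (
    "You are grading an AI agent's answer against the reference notes. "
    "Score correctness, faithfulness, and helpfulness from 0.0 to 1.0 and reply "
    'with JSON only, e.g. {"correctness": 0.9, "faithfulness": 1.0, "helpfulness": 0.8}.'
)

def judge(question: str, answer: str, reference: str, complete) -> dict:
    """Score one output with an LLM judge at temperature 0 and parse its JSON scores."""
    prompt = (
        f"{JUDGE_INSTRUCTIONS}\n\n"
        f"Question: {question}\n"
        f"Agent answer: {answer}\n"
        f"Reference notes: {reference}"
    )
    raw = complete(prompt, temperature=0)   # judge model call, temperature pinned to 0
    return json.loads(raw)
```

Calibration then means running this judge over a small human-labeled sample and checking agreement before its scores are allowed to gate a merge.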
Deterministic replay
Deterministic replay means substituting every non-deterministic dependency — LLM calls, tool calls, clocks, RNG — with stubs that return previously recorded outputs. The replay engine looks up each call by a content-addressed key (often a hash of the request) and returns the recorded response. Replay is used for:
- Reproducing a production failure trace in development.
- Re-evaluating a frozen trace under a new judge prompt without re-spending tokens.
- Stress-testing prompt edits by replaying historical traffic against the new prompt.
VCR-style libraries implement this pattern for Python and TypeScript agent stacks, recording every external call on first run and replaying it on subsequent runs.
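A minimal sketch of the content-addressed record-and-replay pattern, assuming recorded responses live in a plain dict that is persisted between runs; real VCR-style libraries add cassette files, matching rules, and explicit record/replay modes:

```python
import hashlib
import json

def request_key(provider: str, request: dict) -> str:
    """Content-addressed key: a stable hash of the normalized request."""
    canonical = json.dumps({"provider": provider, "request": request}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class ReplayStore:
    """Record every external call on the first run, replay it on subsequent runs."""

    def __init__(self, recordings=None):
        self.recordings = recordings or {}    # key -> recorded response

    def call(self, provider: str, request: dict, live_call):
        key = request_key(provider, request)
        if key in self.recordings:            # replay mode: never touch the network
            return self.recordings[key]
        response = live_call(request)         # record mode: capture the live response
        self.recordings[key] = response
        return response
```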
CI gating thresholds
A gate is a boolean derived from one or more score distributions. Common patterns:
- Hard floor — every scenario must pass schema and tool-call assertions; one failure blocks merge.
- Aggregate threshold — judge score mean ≥ 0.85 across the suite, with no individual scenario below 0.7.
- Regression delta — new build's score must not drop by more than 2 points (for example, 0.02 on a 0-1 judge scale) vs. the previous baseline on any scenario.
- Cost and latency budget — p95 tokens and p95 wall-clock must stay within declared budgets.
Promptfoo and LangSmith both expose these gates as CI-native check outputs that GitHub Actions and GitLab CI can read directly.
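A sketch of how those four gate patterns collapse into one boolean, assuming each scenario result carries `schema_ok`, `trajectory_ok`, `judge_score`, and `tokens` fields; the field names and default thresholds are illustrative:

```python
from statistics import mean, quantiles

def ci_gate(results: dict, baseline: dict, *, min_mean=0.85, min_each=0.7,
            max_regression=0.02, p95_token_budget=20_000) -> bool:
    """Apply hard-floor, aggregate, regression-delta, and budget gates.
    `results` and `baseline` map scenario id -> per-scenario result dict."""
    # Hard floor: deterministic schema and trajectory checks must pass everywhere.
    if not all(r["schema_ok"] and r["trajectory_ok"] for r in results.values()):
        return False
    # Aggregate threshold on judge scores (suite mean and per-scenario minimum).
    scores = [r["judge_score"] for r in results.values()]
    if mean(scores) < min_mean or min(scores) < min_each:
        return False
    # Regression delta vs. the last green baseline, checked per scenario.
    for sid, r in results.items():
        prior = baseline.get(sid)
        if prior and r["judge_score"] < prior["judge_score"] - max_regression:
            return False
    # Cost budget: p95 token usage across the suite must stay within budget.
    p95_tokens = quantiles([r["tokens"] for r in results.values()], n=20)[-1]
    return p95_tokens <= p95_token_budget
```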
Comparison vs related testing layers
| Layer | Unit under test | Determinism | Typical signal |
|---|---|---|---|
| Unit test | Single function or prompt | Fully deterministic | Pass/fail on string match or schema |
| Component eval | One model call or one tool | Mocked dependencies | Score distribution per metric |
| Agent E2E test | Full agent run | Recorded → replayed | Trajectory + judge + budget gates |
| Production observability | Live user traffic | Non-deterministic | Online evals, SLO breaches |
E2E tests sit between component evals and production observability. They use real tools and real models, but inside a controlled scenario suite, and they should reuse the same trace schema as observability so a production incident can be turned into an E2E fixture without reformatting.
Common misconceptions
- "LLM-as-judge can replace deterministic assertions." Judges drift between model versions and have known biases. Use them for quality dimensions, not for tool-call correctness.
- "Snapshot testing the final answer is enough." Final answers can match while the trajectory regresses (wrong tool, extra steps, higher cost). Score the trajectory too.
- "Higher temperature judges produce more diverse evaluations." It mostly produces noisier evaluations. Use temperature = 0 and prompt diversity instead.
- "Replay means rerunning the agent." Replay means substituting recorded LLM and tool outputs so the run is bit-for-bit reproducible.
- "Golden traces should be rewritten on every prompt change." Update them deliberately, with code review, so regressions cannot be silently rebaselined.
How to apply this specification
- Define the scenario schema. Adopt the keys above (id, inputs, initial_state, expected_outcomes, tags) and check the suite into the agent repo.
- Instrument the agent for tracing. Use OpenTelemetry or a vendor SDK that emits spans for planner, tool, and model calls; see the Agent Tracing and Spans Specification.
- Record golden traces. Run the suite, review traces with humans, and freeze the approved ones.
- Wire evaluators. Combine deterministic checks (schema, tool-call sequence) with one or two calibrated LLM-as-judge metrics.
- Add the CI gate. Run the suite on every PR; fail on hard-floor violations and regression deltas.
- Stand up replay. Persist traces in a replay store keyed by request hash so any failure can be re-run deterministically.
- Curate from production. Promote real failures into the regression bucket of the suite; calibrate judges quarterly.
Set explicit thresholds in code, not in dashboards. A typical starter contract: 100% schema pass, 100% tool-trajectory anchor pass, ≥ 0.85 mean judge score, ≤ 2-point regression delta vs. main, and p95 tokens within budget.
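One way to keep that contract in code is a small, versioned config object that the gate function reads; the values below mirror the starter contract and the names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalContract:
    """Starter CI contract, checked into the agent repo next to prompts and tools."""
    schema_pass_rate: float = 1.00        # 100% of scenarios must pass schema checks
    anchor_pass_rate: float = 1.00        # 100% must pass tool-trajectory anchors
    min_mean_judge_score: float = 0.85    # aggregate judge-score threshold
    max_regression_delta: float = 0.02    # max drop vs. last green main, per scenario
    p95_token_budget: int = 20_000        # hypothetical cost budget for the suite

CONTRACT = EvalContract()
```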
FAQ
Q: How is agent E2E testing different from prompt evaluation?
Prompt evaluation scores a single model call against a static input. Agent E2E testing scores the entire run — planner decisions, tool calls, memory writes, and final output — against a scenario. A prompt eval tells you the prompt is good; an E2E test tells you the agent built around the prompt is good.
Q: Should I use LLM-as-judge as my only evaluator?
No. LLM-as-judge is appropriate for open-ended quality dimensions (helpfulness, faithfulness, tone) but is unreliable as the sole gate for tool-call correctness. Combine schema validators and trajectory anchors with one or two calibrated judge metrics.
Q: How do I make non-deterministic agents reproducible in tests?
Record every LLM and tool call with content-addressed keys, then replay by substituting stubs that return the recorded response for each key. This converts the run into a deterministic, debuggable replay.
Q: How many scenarios should an E2E suite contain?
Start with 20-50 scenarios that cover the happy path, top edge cases, and any production regressions. Grow toward 200-500 once you have automated curation from production traces. Coverage matters more than count; tag scenarios so you can run a smoke subset on every PR and the full suite nightly.
Q: Where should the suite live?
In the agent repo, version-controlled alongside prompts and tool definitions. Treat scenarios, golden traces, and judge prompts as code: changes need pull-request review.
Q: How do I gate CI without flaky failures?
Use the regression-delta pattern: compare the new build's score distribution to a stable baseline (last green main) instead of an absolute threshold. Combine that with hard-floor schema and trajectory anchor checks, which are deterministic and never flaky.
Q: Can I run this on a free CI tier?
Mostly yes. Cache LLM responses for replay-mode runs, batch judge calls, and tag scenarios so PRs run a small smoke subset. Reserve the full suite for nightly or pre-release runs to control token spend.
Sources
- Sakura Sky — Trustworthy AI Agents: Deterministic Replay — https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- agentcheck (open source) — VCR-style snapshot/replay/test pattern for AI agent workflows — https://github.com/hvardhan878/agentcheck
- Tian Pan — Deterministic Replay: Debugging AI Agents That Never Run the Same Way Twice (2026) — https://tianpan.co/blog/2026-04-12-deterministic-replay-debugging-non-deterministic-ai-agents
- LangChain — The Agent Improvement Loop Starts with a Trace — https://www.langchain.com/blog/traces-start-agent-improvement-loop
- Confident AI — AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows — https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide
- Microsoft Azure AI Foundry — Evaluating AI Agents: Can LLM-as-a-Judge Evaluators Be Trusted? — https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/evaluating-ai-agents-can-llm-as-a-judge-evaluators-be-trusted/4480110
- r/AI_Agents — practitioner discussion on LLM-as-judge limitations as a CI gate — https://www.reddit.com/r/AI_Agents/comments/1swsqgt/llmasjudge_is_the_wrong_default_heres_what_works/
- Promptfoo — CI/CD Integration for LLM Evaluation — https://www.promptfoo.dev/docs/integrations/ci-cd/
- LangSmith — LLM and AI Agent Evals Platform — https://www.langchain.com/langsmith/evaluation