
Agent Evaluation Harness Documentation: How to Spec an Eval Suite for AI Agents



An agent evaluation harness documentation spec is a standardized way to describe an eval suite — its scorers, datasets, trajectory rubrics, and regression gates — so engineers, auditors, and documentation agents can read, run, and cite it without spelunking through code.

TL;DR

A production-grade agent evaluation harness only matters if its setup is legible. This specification defines the frontmatter, sections, and required artifacts you must publish for every eval suite so humans, CI runners, and documentation agents (LLMs that read your docs) can all answer the same questions: what does this suite measure, how does it grade, and when does it block a release? Use it as a contract between agent teams, platform owners, and quality reviewers.

Why this spec exists

Most public guides treat evaluation harnesses as code: how to write a scorer, how to wire a dataset, how to run trials concurrently. That framing leaves a documentation gap. Anthropic explicitly distinguishes the evaluation harness — the infrastructure that runs evals end-to-end, providing instructions and tools, running tasks, recording steps, grading outputs, and aggregating results — from the agent harness it tests.

The harness only delivers value when other people can interpret what it produces. If a regression eval drops from 98% to 91%, the on-call engineer needs to know which scorer fired, against which dataset slice, with what severity, and which rubric defines a pass. That information must live in documentation, not Slack threads.

This specification standardizes that documentation. It is intentionally LLM-readable — so agents that consult your docs (RAG agents, doc bots, codegen agents) can resolve "what does the policy-compliance suite check?" without crawling source.

Scope

In scope:

  • Static documentation pages describing one eval suite per article.
  • Frontmatter fields and required body sections.
  • Cross-references to scorers, datasets, and rubrics.
  • Versioning, review cycles, and ownership metadata.

Out of scope:

  • Runner implementation in any specific language.
  • Storage format for trace artifacts.
  • CI/CD wiring (handled by your release-engineering docs).
  • Per-model performance benchmarks (those belong in capability cards).

Terminology

  • Evaluation harness — the system that runs evals end-to-end: provisions environment, executes the agent, captures trajectory, applies graders, aggregates results.
  • Agent harness — the scaffold around the model that turns it into an agent (system prompt, tools, orchestration loop, memory). LangChain summarizes it as everything that is not the model itself.
  • Eval suite — a coherent collection of tasks targeting a capability or behavior, e.g. refunds-and-cancellations.
  • Scorer (grader) — a function or rubric that assigns a pass/fail verdict or a numeric score. Three families: code-based, model-based (LLM-as-judge), and human.
  • Trajectory — the ordered record of observations, decisions, tool calls, and outputs the agent emitted while solving a task.
  • Capability eval vs regression eval — the first asks "what can this agent do?"; the second asks "did we break what already worked?". They have different baselines and gating semantics.

Required documentation artifacts

Every eval suite MUST publish three artifacts on your docs surface:

  1. Suite spec page — the document this specification governs. Describes intent, scorers, datasets, gating.
  2. Scorer reference page(s) — one per scorer; defines inputs, outputs, rubric, calibration notes.
  3. Dataset card — describes provenance, licensing, slice definitions, refresh cadence.

The suite spec page MUST link to each scorer reference and dataset card by canonical URL. Consumers (humans and agents) should be able to traverse from the suite to its scorers and datasets in one click.

Frontmatter contract for suite specs

Every suite spec page MUST include the standard documentation frontmatter (identity, canonical layer, taxonomy, SEO, AI readability, lifecycle, relations, i18n, authorship) PLUS the following suite-specific fields under an eval_suite: block:

eval_suite:
  suite_id: "refunds-and-cancellations"
  suite_version: "2.3.0"
  agent_under_test:
    - "customer-support-agent"
  eval_type:
    - "capability"
    - "regression"
  trajectory_required: true
  environment:
    type: "sandboxed-sql"
    fixtures: "fixtures/refunds-v2.sql"
  scorers:
    - id: "refund-recorded"
      kind: "code"
      doc: "/ai-agents/scorers/refund-recorded"
    - id: "tone-empathy"
      kind: "llm-judge"
      doc: "/ai-agents/scorers/tone-empathy"
  datasets:
    - id: "refunds-golden-2026-q1"
      doc: "/ai-agents/datasets/refunds-golden-2026-q1"
      slices: ["happy-path", "edge-card-decline", "policy-violation"]
  gating:
    blocks_release: true
    capability_baseline: 0.55
    regression_floor: 0.97
    severity_on_fail: "high"
  owner: "agent-quality@example.com"
  review_cycle_days: 30

Field semantics (a minimal lint sketch follows this list):

  • suite_id — kebab-case stable identifier; never reused after retirement.
  • eval_type — array; a suite can serve as both capability and regression depending on the slice.
  • trajectory_required — true if any scorer reads the trajectory, not just the final output.
  • environment.type — one of stateless, sandboxed-sql, containerized, browser, multi-agent-arena, production-shadow.
  • gating.blocks_release — true if a CI gate fails the build on regression-floor breach.
  • gating.capability_baseline — minimum pass rate to claim the capability.
  • gating.regression_floor — minimum pass rate before the build is rejected.
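
As an illustration of how these semantics can be enforced mechanically, here is a minimal lint sketch, assuming the frontmatter is parsed with PyYAML; the rules and error messages are illustrative, not part of the spec:

```python
# Minimal frontmatter lint sketch for the eval_suite block.
# Assumes PyYAML; the checks mirror the field semantics above and are illustrative.
import re
import yaml

ALLOWED_ENV_TYPES = {
    "stateless", "sandboxed-sql", "containerized",
    "browser", "multi-agent-arena", "production-shadow",
}

def lint_eval_suite(frontmatter_text: str) -> list[str]:
    """Return a list of human-readable violations (empty list = clean)."""
    doc = yaml.safe_load(frontmatter_text)
    suite = doc.get("eval_suite", {})
    problems = []

    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", suite.get("suite_id", "")):
        problems.append("suite_id must be a kebab-case identifier")
    if suite.get("environment", {}).get("type") not in ALLOWED_ENV_TYPES:
        problems.append("environment.type is not a recognized value")
    if suite.get("trajectory_required") is None:
        problems.append("trajectory_required must be set explicitly")

    gating = suite.get("gating", {})
    floor, baseline = gating.get("regression_floor"), gating.get("capability_baseline")
    if floor is not None and baseline is not None and floor < baseline:
        problems.append("regression_floor is usually >= capability_baseline; double-check")

    for scorer in suite.get("scorers", []):
        if "doc" not in scorer:
            problems.append(f"scorer {scorer.get('id', '?')} has no reference-card link")

    return problems
```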

Required body sections

Suite spec pages MUST include the following H2 sections, in order.

1. Purpose

State the single capability or risk the suite addresses, in two to four sentences. Begin with the capability ("This suite measures whether the customer-support agent issues correct refunds…"). End with the consequence of failure ("…a regression here means real customers receive wrong refund amounts.").

2. Agent under test

List every agent harness and version this suite targets. If the suite is harness-agnostic, say so explicitly and list the contract the agent must satisfy (e.g. must expose a submit_refund tool with schema X).

3. Environment and fixtures

Describe the sandbox: database fixtures, mocked APIs, network policies, time fixtures, seed values. Document determinism guarantees — Anthropic emphasizes that a stable environment is non-negotiable for an eval harness. If the environment is non-deterministic (for example, a live model-as-judge), document the seed, temperature, and model-version contract.
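
One way to make those guarantees checkable is to record an environment fingerprint with every run; a minimal sketch, assuming the fixtures live in a file on disk (the field names are illustrative, not part of the spec):

```python
# Sketch: fingerprint the environment so drift between runs is detectable.
# File paths and field names are illustrative.
import hashlib
import json
from pathlib import Path

def environment_fingerprint(fixture_path: str, judge_model: str,
                            temperature: float, seed: int) -> str:
    """Hash the fixture bytes plus the judge contract; store it with each run."""
    fixture_hash = hashlib.sha256(Path(fixture_path).read_bytes()).hexdigest()
    contract = {
        "fixture_sha256": fixture_hash,
        "judge_model": judge_model,
        "temperature": temperature,
        "seed": seed,
    }
    return hashlib.sha256(json.dumps(contract, sort_keys=True).encode()).hexdigest()

# Two runs that report different fingerprints did not test the same environment.
```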

4. Datasets

For each dataset, link to its dataset card and describe:

  • Slice composition (counts per slice).
  • Labeling methodology (ground-truth source).
  • Known biases or coverage gaps.
  • Refresh cadence and last refresh date.

5. Scorers

For each scorer, link to its reference page and document:

  • Kind — code, llm-judge, or human.
  • Inputs — final output, full trajectory, environment state, or all three.
  • Output — boolean, ordinal, or numeric in [0, 1].
  • Rubric — for llm-judge, paste the rubric verbatim or link to the canonical rubric file.
  • Calibration — agreement rate with human labels on a holdout slice plus refresh date.

Anthropic's combine-all-three pattern (code for verifiable outcomes, LLM-as-judge for nuance, humans for calibration) is the recommended default. Document how each kind is used so reviewers know which scorers are ground truth and which require periodic recalibration.
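
For the code-based family, a scorer is often just a function over the sandbox state. Below is a minimal sketch of a refund-recorded style check, assuming a SQLite sandbox with a hypothetical refunds table; adapt the schema to your fixtures:

```python
# Sketch of a code-based scorer: checks a verifiable outcome in the sandbox DB.
# Table and column names are hypothetical.
import sqlite3

def score_refund_recorded(db_path: str, ticket_id: str,
                          expected_amount_cents: int) -> bool:
    """Pass iff exactly one refund row exists for the ticket with the right amount."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT amount_cents FROM refunds WHERE ticket_id = ?",
            (ticket_id,),
        ).fetchall()
    return len(rows) == 1 and rows[0][0] == expected_amount_cents
```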

6. Trajectory rubric (required when trajectory_required: true)

Trajectory grading is the layer most teams skip, yet it is essential for multi-step agents. Document, in plain language:

  • Required steps — ordered or unordered tool calls that MUST appear.
  • Forbidden actions — tools or sequences that fail the trajectory regardless of final output.
  • Loop detection — repetition thresholds (e.g. same tool with same args > 3 times = fail).
  • Efficiency band — soft expectation on step count or token cost; not gating but tracked.

Where applicable, include both deterministic (exact-match) and LLM-as-judge variants and explain when each applies.
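
A minimal sketch of the deterministic variant, assuming the harness exposes the trajectory as a list of tool-call records (the record shape and thresholds are illustrative):

```python
# Deterministic trajectory check sketch: required steps, forbidden actions,
# and loop detection. Tool names and the record shape are illustrative.
from collections import Counter

def check_trajectory(tool_calls: list[dict],
                     required: set[str],
                     forbidden: set[str],
                     max_repeats: int = 3) -> tuple[bool, list[str]]:
    """tool_calls: [{"tool": "lookup_policy", "args": {...}}, ...]"""
    reasons = []
    seen = {call["tool"] for call in tool_calls}

    if missing := required - seen:
        reasons.append(f"missing required steps: {sorted(missing)}")
    if hit := forbidden & seen:
        reasons.append(f"forbidden actions used: {sorted(hit)}")

    # Loop detection: same tool with identical args more than max_repeats times.
    repeats = Counter((c["tool"], str(sorted(c["args"].items()))) for c in tool_calls)
    for (tool, _), count in repeats.items():
        if count > max_repeats:
            reasons.append(f"{tool} repeated {count} times with identical args")

    return (not reasons, reasons)
```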

7. Gating logic

Translate the gating frontmatter into prose so a release manager reading on a phone understands the consequences (a minimal gate-check sketch follows the list). State:

  • Which CI job runs the suite.
  • What the capability baseline means to product (e.g. below 55% we do not ship the feature).
  • What the regression floor blocks (e.g. below 97% pass rate we block deploy and page on-call).
  • Override path (who can grant a waiver, on what evidence).
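
A minimal sketch of how a CI job might apply these gates, assuming per-slice pass counts are available and that regression slices are labeled as such (names and shapes are illustrative):

```python
# Sketch: apply the gating frontmatter to per-slice results in CI.
# Threshold names mirror the eval_suite.gating block; the results shape is illustrative.
def apply_gates(results: dict[str, dict[str, int]],
                capability_baseline: float = 0.55,
                regression_floor: float = 0.97,
                regression_slices: frozenset[str] = frozenset({"happy-path"})) -> dict:
    """results: {"happy-path": {"passed": 214, "total": 220}, ...}"""
    report = {"block_release": False, "slices": {}}
    for slice_name, counts in results.items():
        rate = counts["passed"] / counts["total"]
        is_regression = slice_name in regression_slices
        gate = regression_floor if is_regression else capability_baseline
        ok = rate >= gate
        report["slices"][slice_name] = {"pass_rate": round(rate, 3), "gate": gate, "ok": ok}
        if is_regression and not ok:
            report["block_release"] = True  # regression-floor breach fails the build
    return report
```

Capability slices that miss the baseline are flagged but do not block; only a regression-floor breach flips block_release, matching the blocks_release semantics above.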

8. Reporting and trace artifacts

Describe what the harness writes for each run: trace-store URL pattern, retention period, PII policies, dashboards. Provide one example run URL — linkable, redacted if needed — so readers can preview the artifact shape.

9. Change log

Append-only list of suite-version bumps with date, author, and one-line rationale. Use semver: MAJOR for backward-incompatible scorer or schema changes, MINOR for added scorers, PATCH for clarifications.

10. FAQ

Three to five answer-first Q&As anticipating reviewer and agent questions (see template at the bottom of this article).

Scorer reference card schema

A scorer reference is its own page. It MUST include:

scorer:
  id: "tone-empathy"
  kind: "llm-judge"
  scale: "ordinal-1-5"
  inputs: ["final_output"]
  judge_model: "claude-sonnet-4.5"
  judge_prompt_path: "prompts/tone-empathy.v3.md"
  calibration:
    holdout: "tone-empathy-human-2026-03"
    agreement: 0.86
    last_refreshed: "2026-03-12"
  failure_modes:
    - "over-apologizes on policy violations"
    - "scores high on refusals if polite"

The body MUST include the rubric, two passing examples, two failing examples, and known failure modes. Recalibrate at least quarterly and document the agreement rate against the human holdout.
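
Calibration itself is simple to recompute; a minimal sketch of the agreement calculation against the human holdout, assuming item-keyed labels (the shapes are illustrative):

```python
# Sketch: recompute judge/human agreement on the holdout slice.
# Label values and the record shape are illustrative.
def agreement_rate(judge_labels: dict[str, int], human_labels: dict[str, int]) -> float:
    """Fraction of holdout items where the LLM judge matches the human label."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping items between judge and human labels")
    matches = sum(judge_labels[i] == human_labels[i] for i in shared)
    return matches / len(shared)

# Publish the result (e.g. 0.86) in the scorer card's calibration block,
# together with the holdout id and refresh date.
```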

Dataset card schema

Datasets are first-class artifacts. A dataset card MUST cover provenance (e.g. "generated from synthetic personas" or "extracted from ticket logs with PII scrubbing rule X"), licensing, slice definitions, refresh cadence, and a row count per slice. Track ground-truth labelers and inter-rater reliability when applicable.

dataset:
  id: "refunds-golden-2026-q1"
  version: "1.4.0"
  size: 412
  slices:
    happy-path: 220
    edge-card-decline: 84
    policy-violation: 108
  provenance: "ticket-export-2026-01 with synthetic augmentation"
  pii_policy: "scrubbed-v3"
  labelers: ["support-quality-team"]
  inter_rater_agreement: 0.91
  refresh_cadence_days: 90
  last_refreshed: "2026-03-30"
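
A few consistency checks over a card like the one above keep the numbers honest; a minimal sketch, assuming the card is available as a parsed dict with the fields shown:

```python
# Sketch: consistency checks for a dataset card, using the fields shown above.
from datetime import date, timedelta

def check_dataset_card(card: dict, today: date | None = None) -> list[str]:
    """Return a list of violations (empty list = clean)."""
    today = today or date.today()
    problems = []
    if sum(card["slices"].values()) != card["size"]:
        problems.append("slice counts do not sum to size")
    last = date.fromisoformat(card["last_refreshed"])
    if today - last > timedelta(days=card["refresh_cadence_days"]):
        problems.append("dataset is past its refresh cadence")
    if not 0.0 <= card.get("inter_rater_agreement", 1.0) <= 1.0:
        problems.append("inter_rater_agreement must be in [0, 1]")
    return problems
```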

Capability vs regression eval discipline

Capability evals start with low pass rates and document a climb; regression evals must stay near 100% and document stability. The two have different gating logic and different review cadences. Suite spec pages SHOULD declare which mode they serve in eval_type and call out per-slice gating where the same dataset is used for both purposes.

Examples

A minimal-but-complete suite spec opens like this:

```mdx
---
title: "Refunds and Cancellations Eval Suite"
slug: "refunds-and-cancellations-eval-suite"
section: "ai-agents"
content_type: "specification"
# …full Geodocs frontmatter…
eval_suite:
  suite_id: "refunds-and-cancellations"
  suite_version: "2.3.0"
  eval_type: ["capability", "regression"]
  trajectory_required: true
  # …
---

# Refunds and Cancellations Eval Suite

Measures whether the customer-support agent issues correct refunds and respects cancellation policy across 412 ticketed scenarios.

## Purpose

This suite verifies that the customer-support agent records the correct refund amount in the billing database and never bypasses the policy-violation block. Failure means real customers receive wrong refunds or policy is silently violated.

## Agent under test

- customer-support-agent v3.x harness, with submit_refund, cancel_subscription, and lookup_policy tools.

…remaining required sections…
```

Anti-patterns

  • Outcome-only grading on multi-step agents. If your suite never inspects the trajectory, document why (e.g. the task is single-tool). Otherwise you will miss right-answer-wrong-path failures.
  • Undocumented LLM-judge prompts. A judge prompt that lives only in code is unreviewable. Link to the canonical prompt file with a stable path and version.
  • Mixing capability and regression slices without labels. Reviewers cannot interpret a 78% pass rate without knowing which slices count toward which gate.
  • Stale calibration. An LLM-judge whose human-agreement rate has not been recomputed in six months is not trustworthy.
  • Hidden environment state. "We always reset the DB" is not a spec — link to the fixture file and document what "reset" means.

Implementation checklist

Before publishing a suite spec, confirm:

  • [ ] Frontmatter includes the eval_suite block with all required fields.
  • [ ] Every scorer is linked to its reference card.
  • [ ] Every dataset is linked to its dataset card.
  • [ ] Trajectory rubric is present if trajectory_required: true.
  • [ ] Gating thresholds match the CI configuration in your release-engineering docs.
  • [ ] FAQ contains at least three answer-first Q&As.
  • [ ] Change log records the current version and date.
  • [ ] Owner email and review cadence are set.

FAQ

Q: Is this spec only for production agents, or also for prototypes?

Both. Prototypes typically publish capability-only suite specs with looser gating and shorter change logs. The frontmatter shape is identical so the docs surface stays uniform as a prototype graduates to production.

Q: How do I document a suite that mixes code-based and LLM-as-judge scorers?

List each scorer with its kind in the scorers array and explain in the Scorers section which scorer is the source of truth for each task. Anthropic's pattern of code for verifiable outcomes, LLM-judge for nuance, and humans for calibration is the recommended default; mirror that hierarchy in prose.

Q: What goes in the suite spec versus the scorer reference?

The suite spec describes intent, environment, gating, and which scorers it composes. The scorer reference describes the rubric, calibration, judge prompt, and failure modes for one scorer. Avoid duplication — link, do not paste.

Q: How often must trajectory rubrics be reviewed?

Re-review whenever the agent's tool surface changes or whenever the rubric's required and forbidden lists no longer match production behavior. Default cadence is the suite's review_cycle_days; tighten it for safety-critical suites.

Q: Can a single page document multiple suites?

No. One suite per page. Bundle related suites in a series via the series and series_order frontmatter fields and surface them under a hub page so readers and docs agents can list them.
