Agent Hallucination Detection Spec
Agent hallucination detection is a layered runtime check that catches fabricated claims inside an AI-agent run before the agent emits a final answer. It combines tool-call schema verification, groundedness scoring against retrieved evidence, self-consistency probes on critical claims, and a gating threshold that blocks or revises low-trust outputs. It is distinct from search-side hallucination detection because the unit of analysis is a multi-step agent run, not a single completion.
TL;DR
This specification defines a four-stage pipeline for detecting hallucinations during an agent run: (1) tool-call verification rejects calls whose arguments do not match the tool's declared schema, (2) groundedness scoring checks whether each load-bearing claim in the final answer is supported by retrieved evidence, (3) self-consistency probing re-asks the model with a paraphrased prompt and compares answers on critical claims, and (4) gating blocks, revises, or annotates the output based on the combined score. The spec is implementation-agnostic; it works with OpenAI, Anthropic, Google, and Azure agent runtimes. The output of a compliant detector is a HallucinationReport with per-claim scores, evidence citations, and a recommended action (emit, revise, block).
Definition
Agent hallucination detection is the runtime discipline of identifying claims in an AI-agent output that are not supported by the agent's available evidence. Unlike a single-shot LLM hallucination detector that scans one completion, an agent detector must reason about an entire run: the user request, the system prompt, every tool call and result, every intermediate model turn, and the final answer.
The scope of this spec is the agent runtime. Search-side hallucination detection (used by AI Overviews and answer engines) is a related but separate problem because the evidence corpus there is the public web; here the evidence is the set of tool results returned during the run. The two disciplines share groundedness scoring techniques but differ in evidence scope, latency budget, and acceptable false-positive rates.
A hallucination, for the purposes of this spec, is any factual claim in the agent's output that meets at least one of these conditions:
- It contradicts a tool result returned during the run.
- It introduces a specific entity, quantity, date, citation, or identifier that does not appear in any tool result.
- It restates a tool result with a meaning-changing modification (negation, magnitude change, temporal shift).
- It cites a source that was not actually retrieved or that does not exist.
Stylistic embellishment, summarization, and reasoned inference are not hallucinations under this spec, provided the inference is consistent with the evidence.
Why this matters
Agents fail differently from chatbots. A chatbot hallucination produces one wrong sentence; an agent hallucination can trigger an action — sending an email to the wrong address, writing the wrong number into a database, calling a tool with fabricated arguments. The cost of a missed hallucination scales with the agent's tool surface, which is why production agent stacks need a detection layer that runs at agent-output time, before the answer or action is committed, not after the fact.
The second reason is auditability. Regulated deployments (finance, healthcare, legal, government) increasingly require a trail showing that each load-bearing claim in an agent answer was checked against evidence. A spec-compliant detector produces a HallucinationReport that is exactly this audit artifact.
The third reason is product trust. Users tolerate AI mistakes when the product clearly knows it might be wrong; they do not tolerate confident wrong answers. Detection enables the product to either revise the answer, hedge with calibrated uncertainty, or refuse to answer, and to do so consistently.
Specification
Stage 1 — Tool-call verification
For every tool call emitted by the model during the run, the detector MUST:
- Parse the call against the tool's declared input schema (JSON Schema, OpenAPI, or equivalent).
- Reject calls whose arguments fail validation. The agent runtime MUST surface the validation error back to the model and allow a retry.
- Verify that each entity referenced in the arguments (URLs, IDs, user handles, file paths) was either supplied by the user, returned by a prior tool call, or is in an allowlist for the tool.
- Record the call, the validation outcome, and any rejected attempts in the run log.
Fabricated tool arguments are the highest-frequency hallucination class in production agents. Stage 1 catches the most mechanical cases — schema mismatches, invented IDs, hallucinated URLs — and prevents them from reaching downstream tools.
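A minimal sketch of the Stage 1 check in Python, assuming a JSON Schema tool contract and the jsonschema package; the function name, the provenance heuristic, and the record shape are illustrative, not normative:

```python
# Illustrative Stage 1 sketch: validate a proposed tool call against its
# declared JSON Schema and check that string-valued arguments are grounded
# in the run context (user input, prior tool results) or an allowlist.
from jsonschema import Draft202012Validator


def validate_tool_call(call_args: dict, tool_schema: dict,
                       known_entities: set[str], allowlist: set[str]) -> dict:
    """Return a validation record for the run log (shape is illustrative)."""
    errors = [e.message for e in Draft202012Validator(tool_schema).iter_errors(call_args)]

    # Provenance check: IDs, URLs, handles, and paths must come from the user,
    # a prior tool result, or the tool's allowlist, never from thin air.
    ungrounded = [v for v in call_args.values()
                  if isinstance(v, str) and v not in known_entities and v not in allowlist]
    if ungrounded:
        errors.append(f"arguments not grounded in run context: {ungrounded}")

    return {"args": call_args,
            "status": "ok" if not errors else "rejected",
            "errors": errors}
```

On rejection, the runtime surfaces the errors list back to the model and allows a corrected call, as Stage 1 requires.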
Stage 2 — Groundedness scoring
When the agent produces a candidate final answer, the detector MUST:
- Decompose the answer into atomic claims (one factual proposition per unit). Decomposition can use a smaller LLM, a rule-based extractor, or sentence-level segmentation; the spec is agnostic.
- For each claim, retrieve the supporting evidence span(s) from the run's tool results. A claim is grounded if its meaning is entailed by at least one retrieved span.
- Compute a per-claim groundedness score in [0, 1]. Recommended methods include natural-language inference (NLI) entailment, embedding similarity with a calibrated threshold, or LLM-as-judge with a strict rubric.
- Compute an overall groundedness score as the minimum (worst-claim) or weighted mean across claims, depending on configuration. The spec recommends minimum for high-stakes deployments.
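A minimal sketch of the per-claim scoring and worst-claim aggregation, with the entailment method left as a pluggable callable (NLI entailment, calibrated embedding similarity, or LLM-as-judge); the names and return shape are illustrative:

```python
# Illustrative Stage 2 sketch: score each atomic claim against the run's
# evidence spans and aggregate with the worst-claim (minimum) rule.
from typing import Callable


def score_groundedness(claims: list[str], evidence_spans: list[str],
                       entails: Callable[[str, str], float]) -> dict:
    """`entails(premise, hypothesis)` returns a score in [0, 1]; the spec
    leaves the method (NLI, embeddings, LLM-as-judge) to the deployment."""
    per_claim = []
    for claim in claims:
        # A claim is as grounded as its best-supporting evidence span.
        score = max((entails(span, claim) for span in evidence_spans), default=0.0)
        per_claim.append({"text": claim, "score": score})

    # Worst-claim aggregation, as recommended for high-stakes deployments.
    overall = min((c["score"] for c in per_claim), default=0.0)
    return {"claims": per_claim, "overall_score": overall}
```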
Stage 3 — Self-consistency probing
For the subset of claims marked critical (numbers, dates, citations, named entities, action commitments), the detector SHOULD:
- Re-ask the model the same underlying question with a paraphrased prompt and the same evidence.
- Compare the new answer to the original on the critical claim. A divergence flags the claim for revision.
- Optionally run more than one probe and require majority agreement.
Self-consistency probing catches a class of hallucinations that pass groundedness scoring because the evidence is ambiguous but the model picked an unsupported reading.
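A minimal sketch of a probe on one critical claim, assuming the deployment supplies its own model call and claim-comparison function (exact match for numbers and dates, entailment for prose); all names are illustrative:

```python
# Illustrative Stage 3 sketch: re-ask with paraphrased prompts over the same
# evidence and require majority agreement on the critical claim.
from typing import Callable


def probe_consistency(original_answer: str, paraphrased_prompts: list[str],
                      evidence: str,
                      ask_model: Callable[[str, str], str],
                      same_claim: Callable[[str, str], bool]) -> dict:
    probe_answers = [ask_model(prompt, evidence) for prompt in paraphrased_prompts]
    agreements = [same_claim(original_answer, answer) for answer in probe_answers]
    # Majority agreement across probes; a single probe reduces to exact agreement.
    agreement = sum(agreements) > len(agreements) / 2
    return {"original": original_answer,
            "probe_answers": probe_answers,
            "agreement": agreement}
```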
Stage 4 — Gating
The detector MUST gate the final answer using the combined signals from Stages 1-3:
- emit — overall groundedness ≥ emit_threshold AND no critical-claim divergence AND no Stage-1 rejections in the final turn.
- revise — at least one claim below revise_threshold but above block_threshold. The agent re-runs with a feedback prompt that names the unsupported claims.
- block — at least one critical claim below block_threshold, or repeated revise loops have failed. The agent returns a refusal or hedge.
The thresholds are deployment-specific. The spec recommends emit_threshold = 0.85, revise_threshold = 0.6, block_threshold = 0.4 as defaults for general-purpose agents; high-stakes deployments tighten these.
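A minimal sketch of the gating rule under the default thresholds; the revision-cap handling, parameter names, and return shape are illustrative, not part of the spec:

```python
# Illustrative Stage 4 sketch: map the combined stage signals onto
# emit / revise / block using the spec's default thresholds.
def gate(claims: list[dict], critical_divergence: bool,
         stage1_rejections_final_turn: int, revise_attempts: int,
         max_revisions: int = 2, emit_threshold: float = 0.85,
         revise_threshold: float = 0.6, block_threshold: float = 0.4) -> dict:
    overall = min((c["score"] for c in claims), default=0.0)  # worst-claim aggregation
    critical_floor = min((c["score"] for c in claims if c.get("critical")), default=1.0)

    # block: a critical claim falls below the floor, or revision is exhausted.
    if critical_floor < block_threshold or revise_attempts >= max_revisions:
        return {"action": "block"}
    # emit: every signal is clean.
    if (overall >= emit_threshold and not critical_divergence
            and stage1_rejections_final_turn == 0):
        return {"action": "emit"}
    # revise: name the under-supported claims so the feedback prompt can
    # point the next model turn at them.
    weak = [c["text"] for c in claims if c["score"] < revise_threshold]
    return {"action": "revise", "unsupported_claims": weak}
```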
Output: HallucinationReport
A compliant detector produces a HallucinationReport with:
- run_id — opaque identifier for the agent run.
- claims[] — array of { text, evidence_spans[], score, critical, status }.
- tool_call_validations[] — array of { tool, args, status, errors[] }.
- consistency_probes[] — array of { claim, original, probe_answers[], agreement }.
- overall_score — the gating score.
- action — emit, revise, or block.
- version — spec version implemented.
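One possible rendering of the report as Python dataclasses; the spec does not mandate a serialization format, and the example status strings are illustrative:

```python
# Illustrative dataclass rendering of the HallucinationReport fields.
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    evidence_spans: list[str]
    score: float
    critical: bool
    status: str                      # e.g. "grounded" | "unsupported" (example values)


@dataclass
class ToolCallValidation:
    tool: str
    args: dict
    status: str                      # "ok" | "rejected"
    errors: list[str] = field(default_factory=list)


@dataclass
class ConsistencyProbe:
    claim: str
    original: str
    probe_answers: list[str]
    agreement: bool


@dataclass
class HallucinationReport:
    run_id: str
    claims: list[Claim]
    tool_call_validations: list[ToolCallValidation]
    consistency_probes: list[ConsistencyProbe]
    overall_score: float
    action: str                      # "emit" | "revise" | "block"
    version: str
```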
Practical application
A typical implementation wires the detector as middleware between the agent's final-turn model call and the user-facing emit. Steps:
- Capture the full run trace (system prompt, user input, tool calls, tool results, model turns).
- After the model proposes a final answer, invoke the detector synchronously.
- If action = revise, prepend the unsupported-claim feedback to the next model turn and re-run.
- If action = block, emit a refusal template that names the failure mode without revealing internal details.
- Persist the HallucinationReport to the run log for audit.
Latency budget is real: a full detection pass adds 200-1,500 ms depending on claim count and probe configuration. Production deployments often run Stages 1 and 2 inline and Stage 3 asynchronously for non-critical paths.
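A minimal sketch of that middleware wiring, with the runtime hooks (run_agent_turn, detect, persist_report, refusal_template) passed in as callables because the spec does not name them; the status check assumes the example values from the report sketch above:

```python
# Illustrative middleware sketch: detect synchronously after the proposed
# final answer, revise with claim-level feedback, and block after the cap.
def guarded_final_turn(trace, run_agent_turn, detect, persist_report,
                       refusal_template, max_revisions: int = 2):
    answer = run_agent_turn(trace)
    for attempt in range(max_revisions + 1):
        report = detect(trace, answer)        # synchronous detection pass
        persist_report(report)                # audit artifact for the run log
        if report.action == "emit":
            return answer
        if report.action == "block" or attempt == max_revisions:
            # Refusal template names the failure mode, never internal details.
            return refusal_template(report)
        # revise: name the unsupported claims in the next model turn's feedback.
        unsupported = [c.text for c in report.claims if c.status != "grounded"]
        answer = run_agent_turn(trace, feedback=unsupported)
```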
Common mistakes
Treating embedding similarity as groundedness. Cosine similarity above a threshold does not mean entailment. Use NLI or LLM-as-judge for the final groundedness call.
Decomposing too coarsely. A paragraph-level claim hides multiple atomic claims. Decompose to sentence-or-finer granularity.
Skipping self-consistency on numbers and citations. These are the highest-impact hallucination classes; probing them is non-optional.
Conflating refusal with low confidence. A blocked answer should be a deliberate refusal template, not a low-quality emit with a hedge.
No retry loop for tool-call validation failures. Stage 1 only works if the runtime feeds the validation error back to the model and allows a corrected call.
FAQ
Q: How is this different from search-side hallucination detection?
Search-side detectors check claims against the public web; agent-side detectors check claims against the run's own tool results. The methodology overlaps (entailment, NLI), but the evidence scope, latency budget, and acceptable false-positive rates differ.
Q: Do I need all four stages?
Stages 1, 2, and 4 are required for a compliant detector. Stage 3 (self-consistency) is recommended for critical claims and required for high-stakes deployments.
Q: What groundedness score method should I use?
NLI entailment is the strongest signal but the most compute-intensive. LLM-as-judge with a strict rubric is the common production choice. Embedding similarity is acceptable only as a pre-filter, never as the final groundedness call.
Q: How do I tune the thresholds?
Start with the defaults (0.85 / 0.6 / 0.4), measure precision/recall on a labeled set of historical runs, and tighten for high-stakes verticals. Threshold tuning is a continuous process, not a one-time setup.
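A minimal sketch of such a sweep, assuming each historical run has been reduced to an overall score and a human hallucination label; the grid and metric definitions are illustrative:

```python
# Illustrative threshold sweep: a run is "flagged" when its overall score
# falls below the candidate emit threshold; compare flags to human labels.
def sweep_emit_threshold(labeled_runs: list[tuple[float, bool]]) -> None:
    total_bad = sum(1 for _, hallucinated in labeled_runs if hallucinated)
    for threshold in [x / 100 for x in range(50, 100, 5)]:
        flagged = [(score, bad) for score, bad in labeled_runs if score < threshold]
        true_pos = sum(1 for _, bad in flagged if bad)
        precision = true_pos / len(flagged) if flagged else 0.0
        recall = true_pos / total_bad if total_bad else 0.0
        print(f"emit_threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```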
Q: How does this interact with agent self-correction loops?
The detector's revise action is the trigger for a self-correction loop. The unsupported-claim feedback is the input to the next model turn. The two specs are complementary: detection identifies the failure, self-correction fixes it.
Q: What about non-factual outputs (creative, brainstorm, draft)?
The detector is configurable per task type. For creative or brainstorm tasks, groundedness scoring is typically disabled or set to a permissive threshold; tool-call verification still runs because action safety is independent of factuality.
Related Articles
What Is Answer Grounding? Definition, Mechanism, Examples
Answer grounding is how AI systems anchor generated responses to specific source documents and citations. Definition, mechanism, and content implications.
Agent Output Validation Documentation Specification
A specification for validating AI agent outputs against JSON Schema with runtime hooks, error formats, and partial-output handling for tool builders.
Agent Self-Correction Loop: Critique, Revise, and Converge
Spec for agent self-correction loops: critique step, revision policy, max iterations cap, convergence test, and same-model vs separate-critic tradeoffs.