Agent Error Recovery Patterns Specification

Agent error recovery is the runtime discipline of classifying failures, retrying safely with idempotency, compensating partial side effects through Sagas, and isolating poison inputs in dead letter queues — all surfaced through a stable error-code taxonomy.

TL;DR

An LLM agent runtime must treat errors as a first-class signal, not an exception to be swallowed. This specification defines five mandatory patterns: classified retries with jittered exponential backoff, idempotency keys for safe retry, Saga-style compensation for multi-step side-effecting workflows, dead letter queues for poison inputs, and a stable error-code taxonomy that operators and the agent itself can reason about.

Scope

This specification covers the runtime patterns an agent platform must implement to recover from tool failures, model failures, and workflow interruptions. It is distinct from the agent-error-handling-docs format (which covers how errors are documented for human operators); this document covers the runtime patterns themselves.

1. Error Classification

Every error surfaced inside the agent loop must carry a classification before any recovery decision is made.

Class	Examples	Default action
transient	HTTP 408, 429, 5xx, network timeout, connection reset	Retry with backoff
permanent	HTTP 4xx (except 408/429), validation error, auth failure	No retry; escalate or fail
semantic	Tool returned syntactically valid but wrong-content output (per arxiv 2508.07935)	Re-plan with constraints, do not blindly retry
policy	Safety / content / quota refusal	Fallback path; do not retry same input
state	Lost context, expired session, corrupted memory	Recover from checkpoint, then retry

Classification must be derived from a structured error envelope (HTTP code + provider error code + tool error code), not from string matching on the message body.
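The classification rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: the ErrorEnvelope field names and the specific provider and tool codes are hypothetical, and treating unknown errors as permanent is an assumption of this sketch rather than a requirement stated in this specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    SEMANTIC = "semantic"
    POLICY = "policy"
    STATE = "state"


@dataclass(frozen=True)
class ErrorEnvelope:
    # Hypothetical envelope shape; the field names are assumptions.
    http_status: Optional[int] = None
    provider_code: Optional[str] = None
    tool_code: Optional[str] = None
    message: str = ""  # kept for humans; classify() never inspects it


# Hypothetical structured codes, for illustration only.
POLICY_CODES = {"content_filter", "quota_exceeded"}
STATE_CODES = {"session_expired", "checkpoint_missing"}
SEMANTIC_CODES = {"output_mismatch"}
NETWORK_CODES = {"timeout", "connection_reset"}


def classify(env: ErrorEnvelope) -> ErrorClass:
    # Structured provider/tool codes take precedence over the HTTP status.
    codes = {env.provider_code, env.tool_code}
    if codes & POLICY_CODES:
        return ErrorClass.POLICY
    if codes & STATE_CODES:
        return ErrorClass.STATE
    if codes & SEMANTIC_CODES:
        return ErrorClass.SEMANTIC
    if codes & NETWORK_CODES:
        return ErrorClass.TRANSIENT
    status = env.http_status
    if status is not None:
        if status in (408, 429) or 500 <= status <= 599:
            return ErrorClass.TRANSIENT
        if 400 <= status <= 499:
            return ErrorClass.PERMANENT
    # Assumption: unknown errors are permanent, so they are never retried blindly.
    return ErrorClass.PERMANENT
```

Because the message field is ignored, an envelope whose only content is the text "429 rate limited" is not treated as transient; only the structured fields drive the decision.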

2. Retry Strategy

For transient errors, the runtime MUST implement exponential backoff with full jitter. The canonical formula, per the AWS Architecture Blog "Exponential Backoff and Jitter" guidance, is:

sleep = random_between(0, min(cap, base * 2 ** attempt))

Default parameters:

base: 250 ms for tool calls, 1,000 ms for LLM calls
cap: 30,000 ms
max_attempts: 5 for tool calls, 3 for LLM calls
retry_budget: a per-run cap of total retry seconds, default 60s, to prevent retry storms inside long workflows

The runtime MUST honour Retry-After headers when present. The runtime MUST NOT retry permanent errors. The runtime MUST emit one telemetry span per attempt (see agent-tracing-and-spans-spec).
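A minimal sketch of the retry rules in this section, using the tool-call defaults. The function names, the is_transient callback, and the injectable sleep are hypothetical. The sketch assumes Retry-After has already been parsed into seconds (the header may also carry an HTTP date) and omits the per-attempt telemetry span.

```python
import random
import time


def backoff_delay_ms(attempt, base_ms=250.0, cap_ms=30_000.0,
                     retry_after_s=None, rng=None):
    """Delay before the retry that follows 0-based attempt `attempt`."""
    if retry_after_s is not None:
        # A server-provided Retry-After overrides the computed delay.
        return retry_after_s * 1000.0
    # Full jitter: uniform between 0 and the capped exponential ceiling.
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return (rng or random).uniform(0.0, ceiling)


def call_with_retry(fn, is_transient, max_attempts=5, budget_s=60.0,
                    sleep=time.sleep):
    """Call fn(), retrying transient failures within attempt and time budgets."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    spent_s = 0.0
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not is_transient(exc):
                raise  # permanent errors are never retried
            delay_s = backoff_delay_ms(attempt) / 1000.0
            if spent_s + delay_s > budget_s:
                raise  # retry budget for this run is exhausted
            spent_s += delay_s
            sleep(delay_s)
```

Passing sleep as a parameter keeps the loop testable without real waiting; a production runtime would more likely share one retry budget across every step of a run instead of tracking it per call.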

3. Idempotency

Any tool that creates, updates, or deletes external state MUST be invoked with an idempotency key. The runtime MUST generate a deterministic key per logical action — typically hash(run_id + step_id + tool_name + tool_args) — and pass it through the tool boundary.
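One way to derive such a key, as a sketch. The function name is hypothetical, and the choice of SHA-256 over canonical JSON is an assumption; the specification only requires a deterministic hash of the four inputs. Tool arguments are assumed to be JSON-serialisable.

```python
import hashlib
import json


def idempotency_key(run_id: str, step_id: str, tool_name: str, tool_args: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that equal arguments
    # always serialise to the same bytes regardless of dict ordering.
    # Encoding the parts as a list avoids the ambiguity of plain string
    # concatenation, where ("ab", "c") and ("a", "bc") would collide.
    payload = json.dumps(
        [run_id, step_id, tool_name, tool_args],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same logical action always yields the same key across retries, while a change to any of the four inputs yields a different one.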

The target service is expected to honour the key in the manner of Stripe-style idempotent requests: the first request with a given key is processed, and subsequent requests with the same key return the stored response (status code and body), even if the original request failed. This makes a retry safe after a mid-flight network failure, when the agent cannot know whether the original call succeeded.

For tools that do not support idempotency keys natively, the runtime MUST implement a client-side idempotency cache — keyed by the same hash — that records the outcome of each attempt and short-circuits duplicate calls within a TTL (default 24 hours). The AWS Builders' Library article on idempotent APIs is the canonical reference for the design tradeoffs.
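A client-side cache of this kind might look like the sketch below. The class and method names are hypothetical. The sketch keeps entries in process memory and records only successful outcomes, so a call that raises is executed again on retry; a real deployment would need a shared, durable store and a policy for recording failed or ambiguous attempts.

```python
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyCache:
    """In-memory sketch of a client-side idempotency cache with a TTL."""

    def __init__(self, ttl_s: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def call_once(self, key: str, fn: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # duplicate within the TTL: short-circuit
        result = fn()  # exceptions propagate and are not cached
        self._entries[key] = (now, result)
        return result
```

The key passed to call_once is the same deterministic hash used for native idempotency keys, so the tool wrapper does not need to know which mechanism is in effect.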

4. Compensation (Saga)

For any agent workflow that produces more than one externally visible side effect, the runtime MUST implement compensation via the Saga pattern. Each side-effecting step is paired with a compensating action that semantically reverses it (refund instead of charge; release instead of reserve; archive instead of delete). The Azure Architecture Center Saga reference and the compensating transaction pattern describe the variants.

Design rules:

Compensations MUST themselves be idempotent and retryable.
Compensations SHOULD execute in reverse order of the original side effects.
A compensation that fails after exhausting its own retry budget MUST be written to the dead letter queue and an operator MUST be alerted; the failure MUST NOT be silently ignored.
The runtime SHOULD use orchestration over choreography for agent workflows — a single orchestrator (the agent loop or a workflow engine like Temporal or LangGraph) is easier to reason about than emergent message-bus choreography.
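The design rules above can be illustrated with a small in-process orchestrator. This is a sketch under assumptions: SagaStep, run_saga, and CompensationFailed are hypothetical names, compensations are attempted once rather than retried, and the caller is expected to route a CompensationFailed error to the dead letter queue and alert an operator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


class CompensationFailed(Exception):
    """Raised when at least one compensation could not be completed."""

    def __init__(self, failed: List[str]):
        super().__init__(f"compensation failed for: {failed}")
        self.failed = failed


def run_saga(steps: List[SagaStep]) -> None:
    done: List[SagaStep] = []
    try:
        for step in steps:
            step.action()
            done.append(step)
    except Exception as exc:
        failed: List[str] = []
        # Undo completed steps in reverse order of the original side effects.
        for step in reversed(done):
            try:
                step.compensate()
            except Exception:
                # Keep going so one failure does not block the other compensations.
                failed.append(step.name)
        if failed:
            raise CompensationFailed(failed) from exc
        raise  # all compensations succeeded; surface the original error
```

Only steps whose action completed are compensated, so a step that fails partway must itself leave no side effect or be covered by its own idempotency key.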

5. Poison Inputs and Dead Letter Queues

Not every failure is recoverable. Inputs that repeatedly cause the agent to fail — malformed payloads, references to deleted resources, content that triggers policy refusals — are poison messages in the classical queue sense.

The runtime MUST:

Bound the per-input retry budget. Default: 5 total attempts across the lifetime of the input.
After the budget is exhausted, move the input to a dead letter queue with full metadata: original payload, attempt count, error trail, last error envelope, timestamps.
Emit a metric per DLQ write and an alert when DLQ depth exceeds a configured threshold.
Provide a documented replay path that re-injects DLQ items after a fix, with a fresh idempotency key.

DLQs MUST NOT be treated as graveyards. Each DLQ MUST have a named owner and a runbook entry.
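The requirements in this section can be sketched as a bounded processing loop that hands exhausted inputs to a queue object. All names here are hypothetical, the queue is an in-memory list, and the error trail stores repr(exc) where a real runtime would store the structured error envelope. Backoff between attempts and the replay path are left out for brevity.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DeadLetter:
    payload: Any
    attempts: int
    error_trail: List[Dict[str, Any]]
    last_error: Dict[str, Any]
    first_seen: float
    dead_at: float


@dataclass
class DeadLetterQueue:
    alert_depth: int = 100  # hypothetical alert threshold
    items: List[DeadLetter] = field(default_factory=list)
    writes: int = 0  # stand-in for a per-write metric

    def put(self, item: DeadLetter) -> bool:
        """Store the item; return True when depth exceeds the alert threshold."""
        self.items.append(item)
        self.writes += 1
        return len(self.items) > self.alert_depth


def process(payload: Any, handler: Callable[[Any], Any], dlq: DeadLetterQueue,
            max_attempts: int = 5, clock: Callable[[], float] = time.time) -> Any:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    first_seen = clock()
    trail: List[Dict[str, Any]] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(payload)
        except Exception as exc:
            trail.append({"attempt": attempt, "error": repr(exc), "at": clock()})
    # Budget exhausted: preserve everything an operator needs for triage.
    dlq.put(DeadLetter(payload, len(trail), trail, trail[-1], first_seen, clock()))
    return None
```

The boolean returned by put is a placeholder for the alerting hook; process ignores it here, and a production system would wire it to its alerting pipeline.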

6. Error Code Taxonomy

Every error surface — tool, LLM provider, runtime — MUST be normalised into a stable error_code of the form ... Examples:

tool.http.429_rate_limited
tool.http.503_unavailable
llm.policy.refusal
llm.context.overflow
runtime.state.checkpoint_missing
runtime.budget.retry_exhausted

The codes MUST be:

Stable across versions — additions allowed, renames forbidden without deprecation.
Documented in a single registry, one row per code, with cause + recovery guidance.
Machine-readable by the agent itself — included in the tool result envelope so the model can plan around them.
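A registry and a machine-readable result envelope could be sketched as below. The codes are taken from the examples above; the cause and recovery wording, the ErrorCodeEntry type, and the envelope shape ("ok", "error", "detail") are illustrative assumptions, not part of this specification.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ErrorCodeEntry:
    code: str
    cause: str
    recovery: str


# One row per code; the wording of each row is illustrative only.
REGISTRY = {
    entry.code: entry
    for entry in (
        ErrorCodeEntry(
            "tool.http.429_rate_limited",
            "The downstream tool rate-limited the call.",
            "Retry with backoff and honour Retry-After.",
        ),
        ErrorCodeEntry(
            "llm.context.overflow",
            "The prompt exceeded the model context window.",
            "Shrink or summarise the context, then re-plan.",
        ),
        ErrorCodeEntry(
            "runtime.budget.retry_exhausted",
            "The retry budget for this input was used up.",
            "Run compensations and write the input to the DLQ.",
        ),
    )
}


def tool_result_error(code: str, detail: str = "") -> str:
    """Build a JSON tool-result envelope the model can read and plan around."""
    entry = REGISTRY[code]  # KeyError: unregistered codes are rejected
    return json.dumps({"ok": False, "error": {**asdict(entry), "detail": detail}})
```

Rejecting unregistered codes at envelope-construction time is one way to keep the registry authoritative; the free-form detail field carries run-specific context without affecting the stable code.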

7. Recovery Decision Flow

The runtime MUST evaluate every failure through the following decision flow:

Classify the error.
If permanent → fail the step, surface the error code, do not retry.
If transient → check retry budget; if available, sleep with backoff+jitter, retry with the same idempotency key.
If semantic → re-plan the step with the error code as additional input to the model; do not blindly retry.
If policy → trigger the fallback chain (see agent-fallback-strategies-spec).
If state → restore from the latest checkpoint (see agent-state-management-patterns-spec) and retry.
On retry exhaustion → run the compensation chain in reverse order, then write the input to the DLQ.
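The decision flow above can be expressed as a pure function from error class and attempt count to the next action. The Action names are hypothetical, and applying the retry budget to state errors as well as transient ones is an assumption of this sketch, made because both branches end in a retry.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    SEMANTIC = "semantic"
    POLICY = "policy"
    STATE = "state"


class Action(Enum):
    FAIL = "fail"
    RETRY = "retry"
    REPLAN = "replan"
    FALLBACK = "fallback"
    RESTORE_AND_RETRY = "restore_and_retry"
    COMPENSATE_AND_DLQ = "compensate_and_dlq"


def decide(error_class: ErrorClass, attempts_used: int, max_attempts: int) -> Action:
    if error_class is ErrorClass.PERMANENT:
        return Action.FAIL  # surface the error code, never retry
    if error_class is ErrorClass.SEMANTIC:
        return Action.REPLAN  # feed the error code back to the model
    if error_class is ErrorClass.POLICY:
        return Action.FALLBACK  # do not resend the same input
    # Transient and state errors both lead to a retry, so both are bounded.
    if attempts_used >= max_attempts:
        return Action.COMPENSATE_AND_DLQ
    if error_class is ErrorClass.STATE:
        return Action.RESTORE_AND_RETRY  # restore the checkpoint first
    return Action.RETRY  # same idempotency key, after backoff with jitter
```

Keeping the decision free of side effects makes the flow easy to unit-test; the runtime then executes the returned action (sleeping, restoring a checkpoint, running compensations) in a separate step.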

Layered resilience — retries inside fallbacks inside circuit breakers — is the production norm and is covered in the related agent-circuit-breaker-pattern-spec.

8. Observability Requirements

Every recovery action MUST emit telemetry:

One span per attempt with error_code, attempt_number, delay_ms, idempotency_key_hash.
One counter per error_code per minute for alerting.
One log line per DLQ write with the full error envelope.
One trace per compensation chain showing the original steps and their compensations.
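As a sketch of the per-attempt span fields, the helper below builds the attribute set as a plain dictionary. The function name is hypothetical, no particular tracing library is assumed, and the use of SHA-256 for idempotency_key_hash is an assumption; the point is that the span records a hash of the key, not the key itself.

```python
import hashlib
from typing import Any, Dict


def attempt_span_attributes(error_code: str, attempt_number: int,
                            delay_ms: float, idempotency_key: str) -> Dict[str, Any]:
    """Attributes for one retry-attempt span, ready to attach to a tracer."""
    return {
        "error_code": error_code,
        "attempt_number": attempt_number,
        "delay_ms": delay_ms,
        # Hash the key so attempts can be correlated without logging the raw key.
        "idempotency_key_hash": hashlib.sha256(
            idempotency_key.encode("utf-8")
        ).hexdigest(),
    }
```

Because the hash is deterministic, every attempt of the same logical action carries the same idempotency_key_hash and can be grouped in a trace viewer.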

FAQ

Q: Should agents retry on HTTP 4xx errors?

No, except for 408 (Request Timeout) and 429 (Too Many Requests). All other 4xx errors are permanent: they indicate a problem with the request itself, so retrying without changing the request is wasteful and can mask bugs.

Q: How is this different from a circuit breaker?

Retries handle the per-call failure of a single attempt. A circuit breaker handles the systemic state of a downstream — when failures cross a threshold, it short-circuits all calls to that target for a cooldown window. They are complementary. See agent-circuit-breaker-pattern-spec.

Q: Do I need Saga if my agent only calls one tool?

No. Saga is only required when an agent run has more than one externally-visible side effect that must be undone together on failure. Single-tool agents only need retry + idempotency.

Q: What goes in the dead letter queue?

The original input plus an error envelope with error_code, attempt history, last error message, and run/step IDs. Enough metadata for an operator to triage without reproducing the run.