Agent Retry Strategy Specification

An agent retry strategy specification classifies errors into transient and permanent, mandates exponential backoff with jitter, requires idempotency keys for side-effectful tool calls, honors Retry-After, and caps retry consumption per request and per second so the agent does not amplify the outage it is trying to recover from.

TL;DR

Agents call slow, rate-limited, occasionally-failing dependencies on every turn. Naive retry logic — retry on every error, no delay, no cap — turns a downstream blip into a thundering herd that knocks the dependency back down each time it almost recovers (Hermanto, 2026). This spec defines retry classes, an exponential-backoff-with-full-jitter formula, idempotency keys for side-effectful calls, the rule that Retry-After is always honored, and per-request and per-second retry budgets that cap fan-out. It is enforceable at the model client, the tool client, and the agent loop.

Definition

An agent retry strategy specification is the contract that says, for every kind of failure an agent encounters, whether to retry, how long to wait, how many times, and with what idempotency guarantee. The spec is layered: the model client retries model calls, the tool client retries tool calls, and the agent loop retries turns. Each layer follows the same formula but with different budgets (AWS Builders' Library, 2024).
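
As a concrete sketch of the layering, here is an illustrative per-layer policy in Python. The field names and the agent-loop delays are assumptions for illustration; the other values follow the budgets given later in this spec.

```python
# Illustrative per-layer retry policy: every layer applies the same
# full-jitter backoff formula, but with its own budget.
RETRY_POLICY = {
    "model_client": {"max_attempts": 3, "base_delay_ms": 500,   "max_delay_ms": 30_000},
    "tool_client":  {"max_attempts": 3, "base_delay_ms": 100,   "max_delay_ms": 10_000},
    "agent_loop":   {"max_attempts": 2, "base_delay_ms": 1_000, "max_delay_ms": 30_000},
}
```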

Why this matters

Retries are the single most common cause of self-inflicted outages in distributed systems. The classic failure mode is the thundering herd: every client retrying in lockstep at the same fixed interval, with each retry simultaneously hammering a recovering downstream service back to failure (Hermanto, 2026). Agents amplify this risk for three reasons:

  1. Multiplicative fan-out. A single user turn can produce 5-20 model and tool calls, so a 1% per-call failure rate already compounds to roughly a 5-20% per-turn failure rate. Add undisciplined retries and each failure multiplies into a sustained ~10x load amplification against the struggling dependency.
  2. Long turns are expensive to lose. A failed turn can mean 30 seconds to 30 minutes of wasted work, so the temptation is to retry aggressively. Without budgets, that temptation produces retry storms.
  3. Side-effectful tool calls. Retrying a non-idempotent tool call double-fires emails, payments, or writes. Idempotency must be a first-class part of the retry contract, not an afterthought.

The AWS pattern (capped exponential backoff with jitter, plus a retry budget) has served as the foundation for resilient client libraries for over eight years (AWS Architecture Blog, 2023).

Retry classes

Every error is classified at the boundary into one of three classes. Class determines whether to retry.

| Class | Examples | Retry? |
| --- | --- | --- |
| Transient | 429 rate limit, 503 service unavailable, network timeout, connection reset, model 500 | Yes, with backoff |
| Permanent | 400 bad request, 401 unauthorized, 403 forbidden, 404 not found, 422 validation | No |
| Ambiguous | 502, 504, gateway-level errors, timeouts on non-idempotent calls | Only with idempotency key |

Ambiguous errors are the dangerous middle: the call may have succeeded server-side even though the client saw a failure. Retrying without idempotency double-executes side effects; refusing to retry sacrifices availability. The spec resolves this by requiring an idempotency key for any retryable side-effectful call.
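
A minimal classification sketch in Python. The status extraction, the is_timeout and side_effectful flags, and the status sets are illustrative assumptions drawn from the table above, not normative spec fields.

```python
from enum import Enum

class RetryClass(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    AMBIGUOUS = "ambiguous"

TRANSIENT_STATUSES = {429, 500, 503}  # per the table above
AMBIGUOUS_STATUSES = {502, 504}

def classify(status: int | None, is_timeout: bool, side_effectful: bool) -> RetryClass:
    """Classify a failure at the boundary; status is None for pure network errors."""
    if is_timeout or status is None:
        # The response was lost: a side-effectful call may have succeeded server-side.
        return RetryClass.AMBIGUOUS if side_effectful else RetryClass.TRANSIENT
    if status in TRANSIENT_STATUSES:
        return RetryClass.TRANSIENT
    if status in AMBIGUOUS_STATUSES:
        return RetryClass.AMBIGUOUS
    return RetryClass.PERMANENT  # 4xx and anything unrecognized: never retry

def should_retry(retry_class: RetryClass, has_idempotency_key: bool) -> bool:
    if retry_class is RetryClass.TRANSIENT:
        return True
    if retry_class is RetryClass.AMBIGUOUS:
        return has_idempotency_key  # only safe when the server deduplicates by key
    return False
```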

Backoff formula

The spec mandates exponential backoff with full jitter:

  • base_delay_ms — starting delay (e.g., 100 ms for fast tools, 500 ms for LLM calls).
  • max_delay_ms — cap on a single delay (e.g., 10,000 ms for tools, 30,000 ms for LLM).
  • delay = random_uniform(0, min(max_delay_ms, base_delay_ms * 2^attempt))

Full jitter (uniform random between 0 and the computed exponential) is preferred over equal jitter or no jitter because it best decorrelates retries across clients (AWS SDK FullJitterBackoffStrategy). Decorrelation is the entire point: synchronized retries are the thundering herd.
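
In code, full jitter is a one-liner. This sketch assumes a 0-indexed attempt counter (attempt 0 is the first retry):

```python
import random

def full_jitter_delay_ms(attempt: int, base_delay_ms: float, max_delay_ms: float) -> float:
    """Sample uniformly from [0, min(max_delay_ms, base_delay_ms * 2^attempt)]."""
    ceiling = min(max_delay_ms, base_delay_ms * (2 ** attempt))
    return random.uniform(0, ceiling)
```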

Max attempts:

  • Model calls: 3 attempts.
  • Tool calls: 3-5 attempts depending on tool tier (Tier 3 augmenting tools: 2 attempts).
  • Agent turn loop: 1 retry on a hard failure, then refuse.

When attempts are exhausted, fall through to the graceful degradation spec rather than re-retrying.
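
Putting classes, backoff, and the attempt cap together, a sketch of the inner loop. TransientError is a hypothetical exception raised by the boundary classifier, and full_jitter_delay_ms is the helper from the previous sketch.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for failures the classifier deemed retryable."""

def call_with_retries(fn, *, max_attempts: int = 3,
                      base_delay_ms: float = 100, max_delay_ms: float = 10_000):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: fall through to graceful degradation
            time.sleep(full_jitter_delay_ms(attempt, base_delay_ms, max_delay_ms) / 1000)
```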

Honoring Retry-After

When a server returns a Retry-After header (HTTP 429 or 503 in particular), the agent must use the server-supplied delay, not the computed backoff. Two rules:

  • Use the header value verbatim. If the server says wait 30 seconds, wait at least 30 seconds before the next attempt.
  • Respect it across retries. A subsequent Retry-After extends the wait further; do not fall back to the shorter computed backoff, because the server is signaling actual capacity.

Ignoring Retry-After is a documented cause of provider rate-limit blocking on agent platforms, and provider SDKs increasingly honor it by default (AWS SDK retry behavior, 2024).
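
A parsing sketch: per RFC 9110, Retry-After may carry either delta-seconds or an HTTP-date, and the server's value wins whenever the header is present.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def next_delay_ms(retry_after: str | None, computed_backoff_ms: float) -> float:
    """Prefer the server-supplied Retry-After over the computed backoff."""
    if retry_after is None:
        return computed_backoff_ms
    try:
        delay_s = float(retry_after)                   # delta-seconds form
    except ValueError:
        when = parsedate_to_datetime(retry_after)      # HTTP-date form
        delay_s = (when - datetime.now(timezone.utc)).total_seconds()
    return max(delay_s, 0.0) * 1000
```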

Idempotency keys

Every side-effectful tool call must be issued with an idempotency key. The spec mandates:

  • Key derivation. Key = stable hash of (tenant_id, turn_id, tool_call_id) so the same logical call produces the same key on retry, but distinct logical calls (even within one turn) produce distinct keys.
  • Persistence. The key is persisted alongside the pending tool call so a checkpoint-then-resume cycle (per the agent startup/shutdown spec) reuses the same key rather than generating a new one.
  • Server-side dedup window. The tool server must deduplicate by key for at least the longest legitimate retry window (typically 24 hours).

Idempotent tools (pure reads, GET-style queries) do not require keys but should accept and ignore them to keep the client API uniform.
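
A derivation sketch matching the key rule above; the separator byte and the choice of SHA-256 are assumptions, not spec mandates.

```python
import hashlib

def idempotency_key(tenant_id: str, turn_id: str, tool_call_id: str) -> str:
    # Stable hash: the same logical call yields the same key on every retry,
    # while distinct tool calls within one turn yield distinct keys.
    raw = f"{tenant_id}\x1f{turn_id}\x1f{tool_call_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```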

Retry budgets

Retry budgets cap the aggregate retry rate so a single sick dependency cannot consume unbounded resources:

  • Per-request budget. No single turn exceeds N retries across all calls (e.g., 10). Past the budget, the turn fails fast with a degradation event.
  • Per-second budget. Across all in-flight turns, the retry rate to any single downstream is capped at K% of the success rate (the AWS adaptive retry mode pattern). When success drops, the retry rate drops proportionally rather than amplifying the outage.
  • Per-tenant budget. Retries do not bypass the per-tenant rate limit defined in the multi-tenant isolation spec. A failing tenant cannot exhaust the global LLM budget by retrying.

Without budgets, retries become the outage. With budgets, retries shave the long tail of transient failures off the success rate without ever pushing the system into self-inflicted instability.
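
A sketch of the per-second budget in the adaptive style described above; the class shape and the default ratio are assumptions. Successes earn fractional retry tokens and each retry spends a whole one, so retry throughput tracks success throughput.

```python
class RetryBudget:
    """Client-side retry budget: retries are paid for by recent successes."""

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0):
        self.ratio = ratio            # retry tokens earned per success
        self.tokens = max_tokens      # start full so cold starts can retry
        self.max_tokens = max_tokens

    def record_success(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_acquire(self) -> bool:
        # Each retry costs one token. When the downstream is sick, successes
        # stop refilling the bucket and retries throttle toward zero.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When the dependency is healthy the bucket stays full and every legitimate retry passes; during an outage the budget starves retries instead of letting them amplify the failure.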

Cost-aware retry

LLM retries cost real money. The spec layers cost controls on top of the budget:

  • A failed LLM call has already consumed input tokens (and possibly a partial completion); the retry pays for them again. Budget per dollar, not just per attempt (see the sketch after this list).
  • On rate-limit failures, prefer the secondary model (per the graceful degradation spec) over a long backoff. The secondary's cost is usually less than the lost user latency.
  • For background or batch agents, raise max_delay_ms and lower max attempts — there is no user waiting, and a failed turn can be retried by the queue minutes later.
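
A minimal dollar-budget gate, assuming the caller tracks spend per turn; all names here are hypothetical.

```python
def may_retry_llm_call(spent_usd: float, est_retry_cost_usd: float,
                       turn_budget_usd: float) -> bool:
    # Budget per dollar, not just per attempt: refuse a retry that would
    # push the turn past its dollar cap even if attempts remain.
    return spent_usd + est_retry_cost_usd <= turn_budget_usd
```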

Common pitfalls (anti-patterns)

  • Fixed-interval retries. Synchronized clients form a thundering herd that knocks recovering services back down (Hermanto, 2026).
  • No jitter, even with backoff. Exponential without jitter still synchronizes; full jitter is required.
  • Unlimited attempts. With no cap, the system retries forever and pins resources. Always set max attempts and a per-request budget.
  • Retrying 4xx errors. A 400 will not become a 200 on retry; the request body is wrong. Classify 4xx as permanent.
  • Retries without idempotency keys on side-effectful calls. Double-executes emails, payments, writes — the most user-visible failure mode.
  • Ignoring Retry-After. Providers explicitly tell you the safe wait; computing your own delay retries either too soon (and gets rate-limited again) or too late (wasting user time).
  • No per-second retry budget. A sick downstream becomes a self-inflicted outage; success rate drops, retry rate climbs, the dependency stays down.

FAQ

Q: How does this interact with the circuit breaker spec?

The circuit breaker is the upstream gate — it stops new calls (and therefore retries) to a dependency that is failing too often. The retry strategy applies between breaker checks. When the breaker is open, retries are skipped entirely and the call falls through to graceful degradation.

Q: Should I retry inside the model client, the tool client, or the agent loop?

At the lowest layer that has the necessary context. Transient HTTP errors retry inside the SDK client; agent-level failures (tool timeout exhausted, ambiguous validation) retry at the agent loop with a new model call. Stacked retries multiply attempts — if both layers retry 3 times, you get 9 attempts. Coordinate budgets across layers explicitly.

Q: How do I pick base_delay_ms and max_delay_ms?

Use the dependency's typical recovery time. For LLM rate limits, 500-800 ms base, 30 s max. For internal tools, 50-100 ms base, 5-10 s max. For external HTTP APIs, 200-500 ms base, 10-20 s max.

Q: What about streaming model responses?

Mid-stream failures are ambiguous — part of the response was emitted, the rest was lost. Treat them as ambiguous and retry only with idempotency: replay the prompt with the partial output as a prefix and a stable request_id so the model server can deduplicate. Without deduplication, retrying re-bills the prefix.

Q: How is this enforced for third-party tool servers we don't control?

Wrap them in a client that enforces the spec: classify their errors, apply backoff with jitter, attach idempotency keys, honor Retry-After, and apply budgets. The third-party server cannot be made to enforce idempotency, but the client can refuse to retry on ambiguous errors when the server is known not to deduplicate.

Q: What metrics should I expose for retry behavior?

retry_rate per dependency, retry_attempts_per_request distribution, retry_budget_exhaustion_rate, and retry_after_honored_rate. Alert on rate-of-change, not absolute numbers — a spike in retry_rate is the leading indicator of a downstream incident.

Related Articles

Agent Circuit Breaker Specification

Specification for circuit breakers protecting AI agent calls to LLM providers and tools, including state transitions, threshold tuning, fallback strategies, and observability hooks.

Agent Graceful Degradation Specification

Specification for graceful degradation when AI agent dependencies fail: model fallback chains, tool-skip policies, cached-response serving, and user-facing failure messaging.

Agent Health Check Specification

Specification for liveness, readiness, and startup probes in production AI agents, including LLM-provider ping patterns, dependency probing, and degraded-mode signaling.
