Agent Graceful Degradation Specification
An agent graceful degradation specification turns implicit, scattered catch-block behavior into a reviewable contract: dependency tiers, a model fallback chain, tool-skip versus tool-mock policies, cached-answer eligibility, user messaging tone, and observability for degraded turns.
TL;DR
When an LLM provider rate-limits, a tool times out, or a vector store returns stale data, agents either fail loudly or hallucinate quietly. Graceful degradation is the design mode where the agent serves a reduced but truthful answer instead. This spec defines four service levels (full, reduced, fallback, refusal), a model fallback chain, the rules for skipping versus mocking tools, when cached answers are safe to serve, and what the user must be told. Without it, individual engineers ship five different fallback behaviors and the agent's overall posture under failure is whatever those choices add up to.
Definition
Graceful degradation for an agent is the design discipline of treating dependency failure as a normal operating mode rather than an exception. Each dependency — the primary model, the secondary model, each tool, the memory store, the retrieval index — has a documented fallback that yields a degraded but still-useful answer when it is unavailable. The spec is the artifact that makes those fallbacks reviewable in a single document instead of buried in catch blocks (Pan, 2026).
Why this matters
Agent platforms have more dependencies than typical web apps: the model itself, often a backup model on a different vendor, several tool servers, a vector store, and a memory store. The probability that at least one is degraded at any moment is substantially higher than for a single-vendor web service. Without an explicit spec, three failure modes emerge:
- Cascading hard failure. A tool timeout bubbles up as an unhandled exception and the whole turn fails. Users see a generic error.
- Silent fabrication. The agent retries with no guardrail, the model fills the gap with parametric memory, and the user gets a confident but wrong answer.
- Inconsistent posture. Each tool wrapper handles failure differently; the same outage produces a refusal in one path and a hallucination in another.
Netflix codified the alternative two decades ago: when recommendations fail, streaming and search keep working. Recommendations degrade; the revenue path survives (Aggarwal, 2023). The same principle applies inside an agent.
Service levels
Four levels, declared in order of preference. Each turn is served at the highest level its dependencies allow.
| Level | Conditions | What the user gets | Disclosure |
|---|---|---|---|
| Full | All dependencies healthy | Best-quality answer with all tools | None |
| Reduced | One non-critical tool degraded | Answer with that tool skipped or mocked | Footnote-level disclosure |
| Fallback | Primary model degraded | Answer from secondary model or cached decision | Explicit disclosure in answer |
| Refusal | Primary and secondary models both unavailable | Refusal with retry guidance | Explicit, primary message |
A fifth level — cached-answer-only with no live LLM — is permitted only for a curated set of high-confidence intents where the cached decision is the safe fallback (Reddit r/mlops, 2026).
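As a concrete sketch, the level ordering can be encoded directly so selection logic lives in one place. The names below (`ServiceLevel`, `select_service_level`, the health flags) are illustrative, not from any particular framework, and this simplified selector ignores Tier 2 dependencies, which the tier example later handles:

```python
from enum import IntEnum

class ServiceLevel(IntEnum):
    """Ordered so a larger value means a more degraded turn."""
    FULL = 0
    REDUCED = 1
    FALLBACK = 2
    REFUSAL = 3

def select_service_level(primary_ok: bool, secondary_ok: bool,
                         degraded_tools: set[str]) -> ServiceLevel:
    """Serve each turn at the highest level its dependencies allow."""
    if not primary_ok and not secondary_ok:
        return ServiceLevel.REFUSAL
    if not primary_ok:
        return ServiceLevel.FALLBACK
    if degraded_tools:                 # one or more non-critical tools down
        return ServiceLevel.REDUCED
    return ServiceLevel.FULL
```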
Dependency tier classification
Every dependency is tagged at deploy time:
- Tier 1 — critical. Primary model, secondary model. Failure of all Tier 1 dependencies forces refusal.
- Tier 2 — important. Memory store, retrieval index. Failure forces fallback level: the agent must still answer, but with a disclosure that recall is reduced.
- Tier 3 — augmenting. Per-task tools (calendar, CRM, code execution). Failure forces reduced level: the agent skips or mocks the tool and explains in a footnote.
Classification is deliberate, not derived from frequency of use. A rarely-called tool can still be Tier 3 if its absence is acceptable; a frequently-called tool can be Tier 1 if the answer is not safe without it.
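One way to make the classification reviewable is a deploy-time tier map that the degradation logic reads, instead of per-wrapper decisions. This sketch reuses the `ServiceLevel` enum from above; the dependency names are placeholders:

```python
# Deploy-time tier map (illustrative names). Classification is a deliberate
# review decision, not something derived from call frequency.
DEPENDENCY_TIERS: dict[str, int] = {
    "primary_model":   1,   # Tier 1: all Tier 1 down -> refusal
    "secondary_model": 1,
    "memory_store":    2,   # Tier 2: down -> fallback level, disclosed
    "retrieval_index": 2,
    "calendar_tool":   3,   # Tier 3: down -> reduced level, skip or mock
    "crm_tool":        3,
}

def forced_level(failed: set[str]) -> ServiceLevel:
    """Map the set of currently failed dependencies to the forced level."""
    tier1 = {d for d, t in DEPENDENCY_TIERS.items() if t == 1}
    if tier1 <= failed:                        # every Tier 1 dependency down
        return ServiceLevel.REFUSAL
    failed_tiers = {DEPENDENCY_TIERS[d] for d in failed}
    if "primary_model" in failed or 2 in failed_tiers:
        return ServiceLevel.FALLBACK
    if 3 in failed_tiers:
        return ServiceLevel.REDUCED
    return ServiceLevel.FULL
```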
Model fallback chain
The spec requires an ordered chain, declared in configuration:
- Primary model. Highest-quality, default first call.
- Secondary model. A different vendor where possible, for provider-outage independence. Equivalent or smaller capability is acceptable, as is a cheaper model.
- Cached decision. Only for intents covered by a structured-intent cache. Cached decisions are safer than cached responses because they store what to do, not what was said (Reddit r/mlops, 2026).
- Refusal. Templated message containing a retry hint and an incident reference if applicable.
Falling through the chain emits a degradation_event with the level reached, the failure reason, and the latency cost. These events are the single most important observability signal during incidents (Tombas, 2025).
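A minimal sketch of the chain walk, assuming `primary` and `secondary` are model-client callables that raise on failure, `intent_cache` returns a cached decision or `None`, and `emit` ships events to a telemetry pipeline (all four are stand-ins):

```python
import time

REFUSAL_TEMPLATE = ("I can't answer this right now because the model is "
                    "unavailable. Please retry in a few minutes.")

def answer_with_fallback(prompt, primary, secondary, intent_cache, emit):
    """Walk the declared chain; emit a degradation_event on any fall-through."""
    start = time.monotonic()
    failures = []
    for name, attempt in (("primary", primary), ("secondary", secondary)):
        try:
            answer = attempt(prompt)
            level = "full" if name == "primary" else "fallback"
            break
        except Exception as exc:               # rate limit, timeout, 5xx
            failures.append((name, repr(exc)))
    else:                                      # both models unavailable
        decision = intent_cache(prompt)        # structured-intent cache
        answer = decision if decision is not None else REFUSAL_TEMPLATE
        level = "cached_decision" if decision is not None else "refusal"
    if failures:
        emit({"event": "degradation_event", "level_reached": level,
              "failures": failures,
              "latency_ms": round((time.monotonic() - start) * 1000)})
    return answer, level
```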
Tool-skip versus tool-mock
When a Tier 3 tool fails, the spec offers two policies:
- Tool-skip. Remove the tool from the working set and re-prompt the model with a system note that the tool is unavailable. Use this when the answer is still useful without the tool (e.g., a calendar tool failing on a general question).
- Tool-mock. Return a deterministic mock value that the model is taught (in the system prompt) to interpret as "unknown". Use this when the agent's reasoning depends on the tool's type signature, not its data — for example, an agent that always checks weather before recommending an outdoor activity must still produce an answer when the weather tool is down.
Never return synthetic data from a mocked tool. Mocks must be sentinel values; otherwise the model treats them as truth.
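The two policies can be expressed as a small handler invoked by the tool wrapper. In this sketch, `working_set` and `system_notes` are hypothetical structures for the turn's available tools and prompt additions:

```python
SENTINEL_UNAVAILABLE = {"status": "tool_unavailable"}   # never synthetic data

def handle_tool_failure(tool_name: str, policy: str,
                        working_set: dict, system_notes: list[str]):
    """Apply the declared Tier 3 policy when a tool call fails."""
    if policy == "skip":
        # Drop the tool and re-prompt with a note that it is unavailable.
        working_set.pop(tool_name, None)
        system_notes.append(f"The {tool_name} tool is unavailable this turn; "
                            "answer without it and disclose in a footnote.")
        return None
    if policy == "mock":
        # Deterministic sentinel the system prompt teaches the model to read
        # as "unknown"; a fabricated value would be treated as truth.
        return SENTINEL_UNAVAILABLE
    raise ValueError(f"no declared degradation policy for {tool_name}")
```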
Cached-answer eligibility
Serving a cached answer during degradation is permitted only if all four conditions hold:
- The intent is on the curated cached-answer allowlist.
- The cache key was derived from a structured intent decomposition (not a raw-text hash); response caching keyed on raw text alone yields 10-30% hit rates and high false-positive rates (Reddit r/mlops, 2026).
- The cached decision is younger than the intent's freshness budget (e.g., 24 hours for a docs lookup, 5 minutes for a status query).
- The user is told the answer was served from cache.
Violating any condition produces silent staleness, which is one of the easiest ways for an agent to feel smart in demos and unreliable in production (Thinking Loop, 2026).
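The first three conditions are mechanically checkable at serve time; the fourth is enforced where the answer is rendered. A sketch with illustrative intents and budgets:

```python
from datetime import datetime, timedelta, timezone

CACHED_ANSWER_ALLOWLIST = {"docs_lookup", "status_query"}   # curated intents
FRESHNESS_BUDGET = {
    "docs_lookup":  timedelta(hours=24),
    "status_query": timedelta(minutes=5),
}

def cache_serving_allowed(intent: str, key_from_intent_decomposition: bool,
                          cached_at: datetime) -> bool:
    """Conditions 1-3; condition 4 (disclosure) lives in the messaging layer."""
    return (intent in CACHED_ANSWER_ALLOWLIST
            and key_from_intent_decomposition
            and datetime.now(timezone.utc) - cached_at
                <= FRESHNESS_BUDGET[intent])
```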
User messaging tone
Degraded answers are still user-facing, so the spec mandates messaging conventions:
- Footnote tone for reduced level. "I couldn't reach the calendar tool, so this estimate doesn't account for your schedule." Single sentence, end of answer.
- Inline tone for fallback level. "I'm using a backup model right now and confidence is lower than usual." Lead with the disclosure; do not bury it.
- Primary tone for refusal. "I can't answer this right now because [model] is unavailable. Please retry in a few minutes; reference incident ID xyz."
Never apologize without saying what failed and what the user can do. Generic apologies erode trust faster than the original failure.
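One way to keep the tone rules enforceable is to centralize the disclosure templates rather than letting each code path improvise them; the placeholder fields here are illustrative:

```python
# Disclosure templates by service level: each names what failed and what the
# user can do, never a bare apology.
DISCLOSURES = {
    "reduced":  "I couldn't reach the {tool} tool, so this answer doesn't "
                "account for {what_it_provides}.",    # one sentence, at the end
    "fallback": "I'm using a backup model right now and confidence is lower "
                "than usual.",                        # leads the answer
    "refusal":  "I can't answer this right now because {dependency} is "
                "unavailable. Please retry in a few minutes; reference "
                "incident ID {incident_id}.",         # the primary message
}
```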
Observability for degraded mode
Four metrics must be exposed:
- degraded_turn_rate — fraction of turns served below full level, by level.
- fallback_chain_depth — distribution of how far each turn fell through the chain.
- tool_skip_rate and tool_mock_rate — per-tool, per-route.
- degradation_user_impact — measured by abandonment, retry, or thumbs-down rate on degraded turns versus full turns.
Alerts fire on rate-of-change, not absolute thresholds. A spike in fallback_chain_depth is the leading indicator of a vendor outage; treat it as a P1 signal.
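A rate-of-change alert can be as simple as comparing the latest sample to a rolling baseline; the window and jump factor below are placeholder values to tune per metric:

```python
from collections import deque

class RateOfChangeAlert:
    """Fire when a metric jumps relative to its recent baseline, rather than
    crossing an absolute threshold (minimal sketch)."""
    def __init__(self, window: int = 60, jump_factor: float = 3.0):
        self.samples = deque(maxlen=window)
        self.jump_factor = jump_factor

    def observe(self, value: float) -> bool:
        baseline = (sum(self.samples) / len(self.samples)
                    if self.samples else value)
        self.samples.append(value)
        return baseline > 0 and value >= self.jump_factor * baseline

# depth_alert = RateOfChangeAlert()
# if depth_alert.observe(mean_fallback_chain_depth_this_minute):
#     page_oncall("fallback_chain_depth spiking; possible vendor outage")
```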
Common pitfalls
- Implicit fallbacks in tool wrappers. Each engineer ships their own behavior; nobody can describe the agent's posture under outage end-to-end.
- Caching the response, not the decision. Stale text leaks personal data and feels obviously copy-pasted.
- No refusal template. Generic 500 errors send users to support; a templated refusal with retry guidance does not.
- Same-vendor secondary model. Provider outages take both down. Choose a different vendor or a self-hosted model for the secondary.
- Silent degradation. Users notice answer-quality drops and lose trust; explicit disclosure preserves trust even when the answer is reduced.
- No observability on tool_skip_rate. Tools quietly stay broken because nobody sees the skip-rate climb.
FAQ
Q: How does this spec interact with the circuit breaker spec?
The circuit breaker decides whether to call a dependency; this spec decides what to do when the breaker is open. The two are paired: every breaker-open transition triggers a degradation event handled by this spec.
Q: Can the secondary model be the same vendor as the primary?
Only if there is no alternative. Same-vendor outages take both models down. The reliability gain comes from vendor independence, not just model independence.
Q: When is a refusal better than a degraded answer?
When the user-visible quality of the degraded answer would be misleading — financial calculations without live data, medical guidance without retrieval, legal answers without source citation. Refuse loudly; do not degrade silently.
Q: How do I test graceful degradation?
Chaos-style integration tests: kill each Tier 1, 2, and 3 dependency in turn and assert the agent reaches the documented service level. Run the suite on every release, and sample it in production via fault injection.
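A sketch of such a suite with pytest; the `agent` and `faults` fixtures and the `faults.kill` fault-injection context manager are hypothetical stand-ins for your harness:

```python
import pytest

EXPECTED_LEVEL = {
    "calendar_tool": "reduced",    # Tier 3: skip or mock
    "memory_store":  "fallback",   # Tier 2: reduced recall, disclosed
    "primary_model": "fallback",   # Tier 1: secondary model serves
}

@pytest.mark.parametrize("dependency,expected", EXPECTED_LEVEL.items())
def test_single_dependency_failure(agent, faults, dependency, expected):
    with faults.kill(dependency):          # chaos-style fault injection
        answer, level = agent.run("What's on my plate this week?")
    assert level == expected
    assert answer                          # still a truthful, non-empty reply

def test_refusal_when_all_tier1_down(agent, faults):
    with faults.kill("primary_model"), faults.kill("secondary_model"):
        answer, level = agent.run("Quick status check?")
    assert level == "refusal" and "retry" in answer.lower()
```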
Q: Does graceful degradation reduce cost?
It can. Falling back to a cheaper secondary model on overload is a cost-control mechanism as well as a reliability mechanism. The same observability metrics inform both axes (Tombas, 2025).
Q: How does this differ from a circuit breaker fallback?
A circuit breaker fallback is one entry in this spec — the action taken when the breaker for a specific dependency is open. This spec is the broader contract that includes fallback chains, tool policies, cached-answer rules, and user messaging.
Related Articles
Agent Circuit Breaker Specification
Specification for circuit breakers protecting AI agent calls to LLM providers and tools, including state transitions, threshold tuning, fallback strategies, and observability hooks.
Agent Health Check Specification
Specification for liveness, readiness, and startup probes in production AI agents, including LLM-provider ping patterns, dependency probing, and degraded-mode signaling.
Agent Retry Strategy Specification
Retry-strategy specification for AI agents covering retry classes, exponential backoff with jitter, idempotency keys, Retry-After honoring, and per-tenant retry budgets.