
Agent Circuit Breaker Specification



An agent circuit breaker wraps every LLM-provider and external tool call in a three-state machine — closed, open, and half-open — that short-circuits requests when failure rates or latency exceed thresholds, then probes the dependency before letting traffic flow again. Per-dependency scoping, error-rate plus latency triggers, and a defined fallback are the three configuration decisions every agent platform must make.

TL;DR

  • A circuit breaker has three states: closed (calls flow), open (calls fail fast), half-open (calls test recovery).
  • Trigger on error rate plus latency, not raw error count — both signal degradation.
  • Scope breakers per LLM provider and per tool, never one global breaker.
  • Always define a fallback — a cheaper model, cached answer, or graceful 503 — before opening the breaker.
  • Pair with retries inside the breaker, not around it; otherwise retries amplify outages.

Definition

An agent circuit breaker is a state machine that sits between an AI agent and an external dependency (LLM provider, vector store, tool API) and prevents cascading failures by short-circuiting calls when the dependency is unhealthy. It is a direct adaptation of the Circuit Breaker pattern popularized by Martin Fowler (martinfowler.com) and is implemented in production libraries including resilience4j (JVM), Polly and cockatiel (.NET, Node.js), opossum (Node.js), and pybreaker (Python).

The breaker exposes three states (Wikipedia); a minimal code sketch of the transitions follows the list:

  1. Closed — calls pass through to the dependency. Successes and failures are recorded.
  2. Open — calls fail immediately without contacting the dependency. After a cooldown window, the breaker transitions to half-open.
  3. Half-open — a small number of probe calls are allowed through. If they succeed, the breaker closes; if they fail, it reopens.
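
To make the transitions concrete, here is a minimal, library-free TypeScript sketch of the state machine. It uses a simple consecutive-failure count rather than the sliding-window error rate discussed later, and the class and parameter names are illustrative, not taken from any of the libraries above.

// Minimal three-state breaker sketch: closed -> open on repeated failures,
// open -> half-open after a cooldown, half-open -> closed after enough
// successful probes (or back to open on any failed probe).
type State = "closed" | "open" | "half-open"

class SimpleBreaker {
  private state: State = "closed"
  private failures = 0
  private probeSuccesses = 0
  private openedAt = 0

  constructor(
    private maxFailures = 5,      // consecutive failures that open the breaker
    private cooldownMs = 30_000,  // how long to stay open before probing
    private probesToClose = 3,    // successful probes required to close again
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast")
      }
      this.state = "half-open"    // cooldown elapsed: allow probe calls through
      this.probeSuccesses = 0
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (err) {
      this.onFailure()
      throw err
    }
  }

  private onSuccess(): void {
    if (this.state === "half-open" && ++this.probeSuccesses >= this.probesToClose) {
      this.state = "closed"
    }
    this.failures = 0
  }

  private onFailure(): void {
    if (this.state === "half-open" || ++this.failures >= this.maxFailures) {
      this.state = "open"         // a failed probe reopens immediately
      this.openedAt = Date.now()
    }
  }
}

Production libraries add sliding windows, slow-call tracking, and limits on concurrent half-open probes, but the transition logic is the same.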

For AI agents, the pattern matters more than for traditional services because LLM calls are slow (seconds), expensive (cents per call), and prone to provider-wide outages. Without a breaker, an agent retries failed calls into a degraded provider, exhausts its token budget, and propagates timeouts up the call stack into user-facing latency.

Why it matters

LLM and tool failures are not independent Bernoulli trials — they are correlated and bursty. A single provider rate-limit event can cause every agent in a fleet to retry simultaneously, multiplying load on a system that is already at capacity. A circuit breaker is the most effective control for this failure mode because it stops new calls before they enter the queue (Portkey).

The cost dimension is also distinctive. A naive retry policy of "three attempts with exponential backoff" against a per-call LLM endpoint can multiply inference cost during an outage while delivering zero successful responses. A breaker that opens after a sustained error rate caps that wastage and frees the agent to use a cheaper fallback model.

User-facing latency is the third reason. Agents that block on LLM calls until they exceed timeouts produce 30-60 second latencies on every request during a degradation. With a breaker open, the same requests fail fast in tens of milliseconds, allowing a fallback handler to respond inside the user's tolerance window.

How it works

The breaker tracks two metrics in a sliding window: error rate and call latency. Either can trip the breaker:

Trigger               | Threshold (default) | Window                 | Action
Error rate            | ≥50%                | last 30s or 100 calls  | Open
Slow-call rate (P95)  | ≥3x baseline        | last 30s               | Open
Cooldown timer        | 30s                 | n/a                    | Open → Half-open
Half-open probes      | 5 calls             | n/a                    | Decide closed vs open

Implementations vary on which trigger they expose. Resilience4j supports both error-rate and slow-call-rate windows configurable independently. Polly and cockatiel provide advanced circuit breakers with sliding-window aggregation. Opossum exposes error-percentage thresholds with a rolling time window.

Per-dependency scoping is the most important design choice. A single global breaker conflates failures across providers — when one LLM provider degrades, fallbacks to a different provider should still work. Maintain one breaker instance per (provider, model, region) tuple at minimum; for tools, one per tool API endpoint.
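
A library-agnostic sketch of that scoping, where `Breaker` and `makeBreaker` are placeholders for whichever breaker class and factory the platform actually uses:

// One breaker instance per (provider, model, region) tuple. `Breaker` and
// `makeBreaker` are placeholders for the chosen library's class and factory.
interface Breaker {
  fire(...args: unknown[]): Promise<unknown>
}

declare function makeBreaker(key: string): Breaker

const breakers = new Map<string, Breaker>()

function breakerFor(provider: string, model: string, region: string): Breaker {
  const key = `${provider}:${model}:${region}`
  let breaker = breakers.get(key)
  if (!breaker) {
    breaker = makeBreaker(key)  // independent state, window, and fallback per dependency
    breakers.set(key, breaker)
  }
  return breaker
}

// Example (illustrative names): a degradation of openai/gpt-4o in us-east-1
// does not open the breaker for the same model in eu-west-1, nor for a
// different provider's model in us-east-1.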

The fallback policy decides what happens when the breaker is open. Common choices in agent systems (Maxim AI), with a small chooser sketch after the list:

  • Model downgrade — switch from a primary model to a cheaper or alternative model on a different provider.
  • Cached answer — serve a previously computed response when freshness allows.
  • Graceful 503 — return a typed error to the orchestrator so retries happen at a higher layer.
  • Skip and continue — for non-critical tools, omit the tool call from the agent loop and proceed with reduced capability.
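
A small sketch of a fallback chooser along these lines; the decision shape and the inputs (cache entry, criticality flag, downgrade availability) are illustrative, not a library API:

// A fallback chooser matching the options above. Inputs and names are
// illustrative; real systems would derive them from routing config and cache state.
type FallbackDecision =
  | { kind: "downgrade" }              // route to a cheaper or alternative model
  | { kind: "cached"; value: string }  // serve a previously computed answer
  | { kind: "unavailable" }            // surfaced to the orchestrator as a 503
  | { kind: "skip" }                   // omit a non-critical tool call

function chooseFallback(opts: {
  downgradeAvailable: boolean
  cached?: { value: string; ageMs: number }
  maxCacheAgeMs: number
  toolIsCritical: boolean
}): FallbackDecision {
  if (opts.downgradeAvailable) return { kind: "downgrade" }
  if (opts.cached && opts.cached.ageMs <= opts.maxCacheAgeMs) {
    return { kind: "cached", value: opts.cached.value }
  }
  if (!opts.toolIsCritical) return { kind: "skip" }
  return { kind: "unavailable" }
}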

Retries belong inside the breaker, not around it. The pattern is: retry once or twice on transient errors, and count only the final outcome toward the breaker's window. Wrapping the breaker in an outer retry loop defeats it: while the breaker is closed, each logical request becomes several physical calls against an already degraded dependency, and once it opens, the outer retries hammer the fast-fail path instead of backing off.
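
A sketch of that ordering; `llmCall` and `isTransient` are hypothetical helpers, and the request type is a stand-in:

// The retry loop lives inside the function the breaker wraps: the breaker
// records one outcome per logical request, and an open breaker can never be
// bypassed by retries.
type ChatRequest = Record<string, unknown>   // stand-in for the module's real request type
declare function llmCall(req: ChatRequest): Promise<unknown>
declare function isTransient(err: unknown): boolean

async function llmCallWithRetry(req: ChatRequest, attempts = 2): Promise<unknown> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await llmCall(req)
    } catch (err) {
      lastError = err
      if (!isTransient(err)) break  // do not retry permanent failures
    }
  }
  throw lastError                    // this single failure is what the breaker counts
}

// Wrapped by the breaker, never the other way around:
// const breaker = new CircuitBreaker(llmCallWithRetry, options)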

Practical application

A reference TypeScript example using opossum:

import CircuitBreaker from "opossum"
import OpenAI from "openai"

// `fallbackModel` and `metrics` stand in for the deployment's secondary
// provider client and metrics sink; ChatRequest is whatever request type the
// surrounding module defines for chat-completion calls.
const openai = new OpenAI()

const llmCall = async (req: ChatRequest) =>
  openai.chat.completions.create(req)

// One breaker per (provider, model, region); the options mirror the defaults
// in the trigger table above.
const breaker = new CircuitBreaker(llmCall, {
  timeout: 15000,               // per-call timeout in ms; timed-out calls count as failures
  errorThresholdPercentage: 50, // error rate that opens the breaker
  rollingCountTimeout: 30000,   // length of the sliding window in ms
  rollingCountBuckets: 10,      // buckets the window is divided into
  resetTimeout: 30000,          // cooldown before transitioning to half-open
})

// When the breaker is open (or the primary call fails), route to the fallback model.
breaker.fallback((req: ChatRequest) => fallbackModel.chat.completions.create(req))

// Emit state-transition metrics so dashboards and alerts can track breaker health.
breaker.on("open", () => metrics.increment("llm.breaker.open"))
breaker.on("halfOpen", () => metrics.increment("llm.breaker.halfopen"))

export const protectedLlm = (req: ChatRequest) => breaker.fire(req)
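
Call sites then go through the wrapper rather than the raw client. A short usage sketch; the model name and request shape are illustrative:

// The breaker decides whether this call reaches the provider, falls back,
// or fails fast; callers see a single promise either way.
try {
  const completion = await protectedLlm({
    model: "gpt-4o-mini",   // illustrative model name
    messages: [{ role: "user", content: "Summarize today's incidents." }],
  })
  console.log(completion)
} catch (err) {
  // Reached only if both the primary call and the fallback failed,
  // or the breaker is open and no fallback is registered.
  console.error("LLM unavailable:", err)
}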

In Python with the pybreaker library, the structure is similar: instantiate a breaker per provider, attach a listener for state transitions, and wire the fallback into the call site rather than the breaker itself.

For observability, emit four signals per breaker: state-transition events, a current-state gauge, call counts by outcome (success / failure / short-circuited), and time in state. Pair these with the agent's overall request rate to distinguish a localized provider issue from an agent-side regression.
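
With opossum, those signals map onto events the breaker already emits ("open", "halfOpen", "close", "success", "failure", and "reject" for short-circuited calls); the sketch below assumes a hypothetical metrics client exposing increment and gauge:

// Map breaker events to the four signals: transitions, current state,
// outcome counts, and time in state. `metrics` is an assumed client.
let lastTransition = Date.now()
const stateValue = { closed: 0, half_open: 0.5, open: 1 } as const

function onTransition(state: keyof typeof stateValue) {
  metrics.increment(`llm.breaker.transition.${state}`)   // transition events
  metrics.gauge("llm.breaker.state", stateValue[state])  // current-state gauge
  metrics.gauge("llm.breaker.time_in_state_ms", Date.now() - lastTransition)
  lastTransition = Date.now()
}

breaker.on("open", () => onTransition("open"))
breaker.on("halfOpen", () => onTransition("half_open"))
breaker.on("close", () => onTransition("closed"))

breaker.on("success", () => metrics.increment("llm.call.success"))
breaker.on("failure", () => metrics.increment("llm.call.failure"))
breaker.on("reject", () => metrics.increment("llm.call.short_circuited")) // failed fast while open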

For tuning, start with conservative defaults — 50 percent error rate, 30-second window, 30-second cooldown — and adjust after observing one or two real incidents. Aggressive thresholds (e.g., 20 percent error rate) cause flapping during normal noise; permissive ones (e.g., 80 percent) let outages propagate too far before tripping.

Common mistakes

  • Single global breaker. One breaker for all dependencies hides which one failed and prevents per-provider fallbacks. Scope per dependency.
  • Outer-loop retries. Retrying around the breaker negates it. Retry inside, count results outward.
  • No fallback. A breaker without a fallback just fails faster. Always pair with a model downgrade, cache, or graceful error.
  • Tuning to cost, not failure. Setting thresholds based on dollars rather than reliability metrics often opens the breaker on healthy traffic.
  • Skipping half-open probes. A breaker that goes straight from open to closed after the cooldown can re-trigger immediately. Always probe with a small number of calls before closing fully.

FAQ

Q: Should I use one circuit breaker per provider or per model?

Per (provider, model, region) tuple. A model-specific outage in one region should not block calls to the same model in another region, or to a different model on the same provider.

Q: How does a circuit breaker differ from a retry policy?

Retries handle transient failures (one bad call, immediate recovery). A circuit breaker handles sustained degradation (many bad calls in a window). Use both: retries inside the breaker, breaker around the call site.

Q: What's the right error threshold to open the breaker?

Start at 50 percent error rate over 30 seconds and adjust from there. Lower is more aggressive (more flapping); higher lets more bad calls through. The right number is workload-specific and best tuned with one or two real incidents.

Q: Can I use a circuit breaker for token budget enforcement?

Indirectly. The breaker reduces wasted token spend during outages but is not a budget enforcer. Use a separate rate-limiter or budget guard for hard spend caps; let the breaker handle reliability.

Q: Does the half-open state need its own timeout?

Yes. If half-open probes hang, the breaker can stall. Set a probe timeout equal to or shorter than the closed-state timeout, and treat hung probes as failures.

