Agent Rate Limit Documentation Checklist: Disclosing Quotas, Retries, and Burst Limits to AI Agents
A complete agent-ready rate-limit disclosure ships four things — machine-readable quotas in the API spec, standard RateLimit and Retry-After response headers, hierarchical limits (user → agent → tool), and explicit burst guardrails. Publishers who document all four cut 429 cascades and let autonomous agents back off without human intervention.
TL;DR
AI agents do not feel throttled the way humans do. They retry deterministically, parallelize aggressively, and amplify failures. If your API does not publish quotas, headers, and hierarchical limits in a parseable form, every agent that hits your tool will eventually trigger a runaway. This checklist is the minimum disclosure surface a tool publisher should ship before exposing an endpoint to AI agents.
Why this matters
Traditional rate-limit docs were written for humans skimming a getting-started page. AI agents do not skim — they parse. Three behaviors break human-era assumptions:
- Autonomous retries. Agents retry faster and more uniformly than any human ever will.
- Parallel fan-out. A single task can fan out into dozens of concurrent calls across multiple tools.
- Cascading 429s. When throttled, naive agents amplify load instead of backing off.
Nordic APIs and Fast.io both report that legitimate agent traffic now resembles DDoS patterns. Without explicit, machine-readable disclosure, even a well-intentioned agent will exhaust your quota in seconds.
The checklist
Use this as a pre-launch gate for any endpoint or MCP tool you want safe for AI consumption.
1. Publish machine-readable quotas
- [ ] Quotas are declared in your OpenAPI / MCP tool manifest, not only in prose.
- [ ] Each endpoint or tool lists at least: requests per minute, requests per day, and (for LLM-style endpoints) tokens per minute.
- [ ] Burst capacity is stated separately from sustained capacity.
- [ ] Quotas are versioned alongside the API spec so agents detect changes on schema refresh.
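One way to satisfy this item is a vendor extension on each operation. The sketch below is illustrative only: `x-rate-limit` and its field names are hypothetical, not part of the OpenAPI specification, and the numbers are placeholders.

```yaml
# Hypothetical OpenAPI vendor extension for machine-readable quotas.
paths:
  /v1/search:
    get:
      summary: Search records
      x-rate-limit:
        requests_per_minute: 600
        requests_per_day: 100000
        tokens_per_minute: 60000   # only for LLM-style endpoints
        burst: 50                  # above-sustained capacity, stated separately
        window: sliding            # sliding | fixed
```

Because the extension lives in the spec, it is versioned with the schema and agents pick up quota changes on refresh.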
2. Emit standard rate-limit headers on every response
- [ ] Successful responses include RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset (per the IETF draft, Polli & Martinez).
- [ ] 429 responses additionally include Retry-After as integer seconds (Retry-After is defined in RFC 9110, which obsoletes RFC 7231; the 429 status itself comes from RFC 6585).
- [ ] Header names and units are documented — agents cannot guess between X-RateLimit- and RateLimit-.
- [ ] If you also expose token budgets (LLMs), document X-RateLimit-Remaining-Tokens and X-RateLimit-Reset-Tokens.
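From the client side, a minimal sketch of how an agent might read these headers defensively, falling back from the IETF-draft names to the legacy `X-` prefix (the function and dict shape are illustrative, not a standard API):

```python
def read_rate_limit(headers: dict) -> dict:
    """Parse IETF-draft RateLimit headers, falling back to the
    legacy X-RateLimit- prefix. Missing values come back as None."""
    def get(name):
        return headers.get(name) or headers.get("X-" + name)

    def as_int(value):
        return int(value) if value is not None else None

    return {
        "limit": as_int(get("RateLimit-Limit")),
        "remaining": as_int(get("RateLimit-Remaining")),
        "reset_seconds": as_int(get("RateLimit-Reset")),
        "retry_after": as_int(headers.get("Retry-After")),
    }
```

Documenting the canonical names first (see the FAQ) is what makes a fallback like this safe rather than guesswork.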
3. Document hierarchical limits
- [ ] You list limits at every layer enforcement happens: workspace → user → agent → tool/function.
- [ ] You state which layer triggered a 429 (for example via RateLimit-Policy or a vendor *-Error-Code header).
- [ ] High-risk tools (send_email, delete_*, make_payment) declare separate, tighter limits, as recommended by Pignati's hierarchical pattern.
- [ ] You name the shared resource: a multi-agent system sharing one quota must know it is shared (Dresher, 9 AI Agents, One API Quota).
4. Specify burst and concurrency guardrails
- [ ] Maximum concurrent in-flight requests per credential is documented.
- [ ] Sliding-window vs. fixed-window behavior is explicit — agents tune backoff differently for each.
- [ ] Quantization is disclosed (OpenAI, for example, enforces 60k/min as 1k/sec). Without this, short bursts surprise compliant clients.
- [ ] Batch / queue endpoints disclose their own queue depth and TPD ceilings, not just RPM.
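The quantization point is easy to make concrete. A sketch of the arithmetic, assuming a per-minute quota enforced in 1-second buckets:

```python
def effective_per_second_cap(per_minute_limit: int, quantum_seconds: int = 1) -> int:
    """A per-minute quota quantized into fixed buckets admits only a
    fraction of the minute budget in any single bucket."""
    return per_minute_limit * quantum_seconds // 60

# A client that assumes the whole minute budget is available up front
# gets surprised: 60,000/min quantized per second admits only 1,000
# requests in the first second, so a 5,000-request burst sees 4,000
# rejections even though the minute-level quota is untouched.
burst = 5_000
cap = effective_per_second_cap(60_000)
rejected = max(0, burst - cap)
```

This is exactly why the bucket size, not just the headline rate, belongs in the docs.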
5. Define the retry contract
- [ ] You specify a recommended backoff (exponential with jitter) and reference Retry-After as authoritative.
- [ ] You state which 4xx codes are retryable (429, 408) and which are not (400, 403).
- [ ] Idempotency keys are documented for any write endpoint that may be retried.
- [ ] You clarify whether retried requests count against quota.
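The retry contract above can be sketched in a few lines. This is one reasonable shape, not a prescribed implementation: exponential backoff with full jitter, with Retry-After treated as authoritative when present.

```python
import random

RETRYABLE = {408, 429}  # per the contract; 400/403 are permanent failures

def should_retry(status: int) -> bool:
    return status in RETRYABLE

def next_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Retry-After, when the server sends it, wins outright; otherwise
    use exponential backoff with full jitter, capped."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Pairing this with documented idempotency keys is what makes retried writes safe, and stating whether retries count against quota tells agents whether backing off actually buys headroom.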
6. Make the 429 body machine-friendly
- [ ] 429 responses use application/problem+json (RFC 7807) with type, title, and detail.
- [ ] The body includes retry_after_seconds and the limit class that fired (per_user, per_tool, per_org).
- [ ] Error codes are stable strings agents can branch on, not free-form prose.
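A sketch of what such a body might look like, and how an agent branches on it. The `code`, `limit_class`, and `retry_after_seconds` members are illustrative vendor extensions; only `type`, `title`, and `detail` come from RFC 7807.

```python
import json

# Illustrative problem-document body for a 429.
body = json.loads("""{
  "type": "https://api.example.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "detail": "Per-tool limit for send_email exhausted.",
  "code": "rate_limited.per_tool",
  "limit_class": "per_tool",
  "retry_after_seconds": 30
}""")

def plan_backoff(problem: dict):
    """Branch on the stable error code, never on free-form prose."""
    if problem["code"].startswith("rate_limited."):
        return problem["limit_class"], problem["retry_after_seconds"]
    raise ValueError("not a rate-limit problem")
```

Knowing the limit class lets the agent pause only the offending tool instead of freezing the whole task.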
7. Provide a programmatic budget endpoint
- [ ] A GET /usage (or equivalent) returns current consumption per quota class.
- [ ] The response is versioned and rate-limit-exempt (or has a generous floor) so agents can poll safely.
- [ ] The budget endpoint mirrors the same hierarchy as enforcement.
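As an illustration of mirroring the enforcement hierarchy, here is a hypothetical `/usage` payload and a client-side check against it. The payload shape, field names, and `safe_to_call` helper are all assumptions for the sketch:

```python
usage = {  # hypothetical GET /usage response body
    "version": "2025-01",
    "quotas": {
        "per_org":  {"limit": 10000, "used": 9100},
        "per_user": {"limit": 1000,  "used": 120},
        "per_tool": {"send_email": {"limit": 60, "used": 58}},
    },
}

def headroom(quota: dict) -> float:
    return 1.0 - quota["used"] / quota["limit"]

def safe_to_call(tool: str, floor: float = 0.1) -> bool:
    """Defer work when ANY layer in the hierarchy is inside the
    fair-use floor, not only the per-key total."""
    layers = [usage["quotas"]["per_org"], usage["quotas"]["per_user"],
              usage["quotas"]["per_tool"][tool]]
    return all(headroom(q) >= floor for q in layers)
```

Here the org-level quota is 91% consumed, so a well-behaved agent defers even though its own per-user budget looks healthy.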
8. Document multi-agent coordination expectations
- [ ] You state whether quotas are per credential, per IP, or per tenant.
- [ ] You recommend a coordination pattern (centralized gateway, token-bucket service) for fleets of agents sharing one key.
- [ ] You publish a soft "fair-use" target separate from the hard ceiling so well-behaved agents stay clear of throttling.
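The coordination pattern you recommend can be as small as a shared token bucket. A minimal in-process sketch (a fleet spanning machines would need the same logic behind a network service):

```python
import threading
import time

class SharedTokenBucket:
    """One bucket per shared credential; every agent in the fleet
    acquires here instead of keeping its own retry state."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.updated = float(burst), time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, n: int = 1) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at burst.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False
```

Agents that fail `try_acquire` wait locally instead of sending a doomed request, which is what keeps independent retry loops from converging into a thundering herd.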
9. Surface limits in the tool / function description
- [ ] If the API is exposed via an MCP server or function-calling schema, the rate-limit summary appears in the tool's description.
- [ ] The description includes the most likely failure mode and the recommended backoff.
- [ ] The description fits in the system prompt without truncation (typically ≤ 300 tokens).
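A sketch of what this looks like in a function-calling schema. The tool name, limits, and wording are illustrative; the point is that the summary, failure mode, and backoff all fit inside the description string:

```python
tool = {
    "name": "send_email",
    "description": (
        "Send one email. Rate limit: 60 requests/min per tool, burst 5. "
        "On 429, honor Retry-After; otherwise back off exponentially "
        "with jitter starting at 1s. Most likely failure: per-tool "
        "quota exhausted during parallel fan-out."
    ),
    "parameters": {"type": "object"},  # schema elided for brevity
}
```

Keeping the summary this short is deliberate: a description that gets truncated out of the system prompt discloses nothing.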
10. Version, change-log, and deprecate
- [ ] A dedicated rate-limit changelog records every quota change with date and reason.
- [ ] Deprecated headers are kept for at least one major version with Deprecation and Sunset headers.
- [ ] You announce tightening changes ≥ 30 days in advance so agent developers can adjust.
Minimum viable disclosure (for tight launches)
If you cannot ship the full checklist, the smallest safe surface is:
- RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset on every response.
- Retry-After on every 429.
- RFC 7807 problem-document body on 429.
- A documented hard ceiling and recommended backoff in your public docs.
Anything less, and autonomous agents have no signal to throttle on.
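The minimum surface above fits in one server-side helper. A framework-agnostic sketch (the function name and return shape are illustrative):

```python
import json

def with_rate_limit_headers(status: int, body: dict, limit: int,
                            remaining: int, reset_s: int):
    """Attach the minimum viable disclosure to any response:
    RateLimit-* always, Retry-After plus a problem document on 429."""
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
        "RateLimit-Reset": str(reset_s),
    }
    if status == 429:
        headers["Retry-After"] = str(reset_s)
        headers["Content-Type"] = "application/problem+json"
        body = {"type": "about:blank", "title": "Too Many Requests",
                "detail": body.get("detail", "Rate limit exceeded")}
    return status, headers, json.dumps(body)
```

Wrapping every response this way guarantees agents always have a throttle signal, even on the happy path.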
Common mistakes
- Returning -1 or 0 in RateLimit-* headers. Several Azure OpenAI users reported broken backoff loops because their clients treated -1 as "infinite remaining." Never emit sentinel values without documenting them.
- Documenting limits in prose only. Agents do not read your pricing page. Limits must live in the spec.
- Single global limit. A flat per-key limit lets one runaway tool starve every other tool the agent depends on.
- Silent quota changes. Tightening a limit without a changelog entry breaks every fielded agent overnight.
- No 429 body. A bare 429 with no JSON forces agents to fall back to fixed backoff and ignore your Retry-After.
FAQ
Q: Which rate-limit headers should I prefer — X-RateLimit- or RateLimit-?
Prefer the unprefixed RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset from the IETF draft (Polli & Martinez). They are becoming the convergent standard. Keep X-RateLimit-* aliases for backward compatibility, but document the canonical names first.
Q: Should Retry-After be in seconds or HTTP-date?
For agents, integer seconds is safer — date parsing introduces timezone bugs. MDN and RFC 7231 allow both, but prose docs should commit to one. State it explicitly in your reference.
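Even after committing to seconds in your docs, defensive clients should accept both forms RFC 9110 allows. A sketch using only the standard library:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value: str, now=None) -> int:
    """Normalize Retry-After to non-negative integer seconds,
    accepting both the delta-seconds and HTTP-date forms."""
    value = value.strip()
    if value.isdigit():
        return int(value)
    target = parsedate_to_datetime(value)  # HTTP-date form
    now = now or datetime.now(timezone.utc)
    return max(0, int((target - now).total_seconds()))
```

Normalizing at the edge keeps the rest of the agent's backoff logic working in one unit.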
Q: Do I need hierarchical limits for a small API?
Yes, if any single credential can drive multiple distinct tools. Even a two-level split (per-key plus per-tool) prevents one runaway action from starving everything else. Without it, an agent that loops on delete_record will also block its own read_record recovery calls.
Q: How do I rate-limit a fleet of agents sharing one API key?
Document that the quota is per credential and recommend a centralized gateway (LiteLLM, Zuplo, an internal token-bucket service). Independent retry logic across agents that share a key always devolves into thundering-herd 429s; coordination must live outside the agent.
Q: What about LLM-specific token limits?
Document tokens-per-minute (TPM) alongside requests-per-minute (RPM), expose X-RateLimit-Remaining-Tokens, and clarify whether prompt + completion counts together. Token budgets are the dominant constraint on LLM endpoints, and counting calls alone misleads agents into over-committing context.
Related Articles
Agent Authentication Documentation Spec
Document authentication for autonomous agents: OAuth flows, API keys, scopes, error states, and consent UX patterns AI agents need to operate safely.
Agent Circuit Breaker Specification
Specification for circuit breakers protecting AI agent calls to LLM providers and tools, including state transitions, threshold tuning, fallback strategies, and observability hooks.
Agent Citation Attribution Specification: Verifiable Source Tracking for Autonomous AI Agents
Specification defining HTTP headers, provenance manifests, and chain-of-citation markup so autonomous AI agents produce verifiable citations to source content.