Agent Rate Limiting Documentation Specification

An agent rate limiting documentation specification defines the headers, JSON quota descriptors, and retry semantics an API must publish so autonomous AI agents — Claude with tool use, ChatGPT actions, MCP clients, and custom agent frameworks — can comply with limits without human intervention. The core levers are standard RateLimit headers, an explicit Retry-After, a JSON-described quota schema, and machine-readable burst rules linked from OpenAPI.

TL;DR

Treat rate limits as a contract that AI agents read at runtime. Adopt the IETF RateLimit header family, return a typed 429 problem detail, expose a /.well-known/ai-rate-limits.json quota descriptor, and reference all of it from your OpenAPI spec so any agent — not just yours — can plan request rates and resume gracefully on throttle.

Why Agents Need a Dedicated Rate Limit Spec

Human developers can read prose API docs and adjust manually. Autonomous agents cannot. They make N+1 calls in tight loops, parallelize tools across threads, and have no visual feedback when they cross a quota line. If your rate limit policy is hidden inside marketing copy and inconsistent JSON shapes, an agent will hammer your service, get blocked, and either stop or retry forever.

Agent traffic also has new shapes:

Bursty plan-execute cycles. Agents fan out a plan into many parallel tool calls.
Long-running sessions. A single conversation can span hours, steadily consuming quota the whole time.
Multi-tenant headers. Hosts (OpenAI, Anthropic, MCP gateways) multiplex many users behind a single API key.

A documentation spec gives agents a deterministic place to find limits and a predictable shape to parse them.

Required Headers

Every rate-limited response — both successful and throttled — MUST include the IETF draft RateLimit header family:

RateLimit-Limit: 100, 100;w=60
RateLimit-Remaining: 42
RateLimit-Reset: 17
RateLimit-Policy: "default"; q=100; w=60

Header	Meaning
RateLimit-Limit	Maximum requests allowed in the current window
RateLimit-Remaining	Requests remaining in the current window
RateLimit-Reset	Seconds until the window resets
RateLimit-Policy	Named policy (default, burst, premium…) plus quota and window
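As a rough illustration of the client side of this table, the sketch below reads the three state headers into a small record. The names `RateLimitState` and `parse_rate_limit_headers` are hypothetical, and the parsing is deliberately naive: it assumes the first item of RateLimit-Limit is the limit in effect, as in the example above, and it is not a full structured-field parser.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RateLimitState:
    limit: int          # first item of RateLimit-Limit
    remaining: int      # RateLimit-Remaining
    reset_seconds: int  # RateLimit-Reset


def parse_rate_limit_headers(headers: Mapping[str, str]) -> Optional[RateLimitState]:
    """Return the current window state, or None if a header is missing or malformed."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        # "100, 100;w=60" -> take the first list item and drop any parameters.
        first_item = lowered["ratelimit-limit"].split(",")[0]
        limit = int(first_item.split(";")[0].strip())
        remaining = int(lowered["ratelimit-remaining"].strip())
        reset_seconds = int(lowered["ratelimit-reset"].strip())
    except (KeyError, ValueError):
        return None
    return RateLimitState(limit, remaining, reset_seconds)


state = parse_rate_limit_headers({
    "RateLimit-Limit": "100, 100;w=60",
    "RateLimit-Remaining": "42",
    "RateLimit-Reset": "17",
})
```

Returning None rather than raising lets a caller treat "no usable headers" as one case and fall back to a conservative default.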

Throttled responses (HTTP 429 Too Many Requests) MUST also return:

Retry-After: 17

The Retry-After header MUST be expressed in seconds (not HTTP-date) so agents can compute backoff with a single integer parse.
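The single integer parse can be sketched as follows. The function name `parse_retry_after_seconds` is hypothetical, and treating an HTTP-date value as "absent" is an assumption of this sketch, not something the spec mandates.

```python
from typing import Optional


def parse_retry_after_seconds(value: Optional[str]) -> Optional[int]:
    """Return Retry-After as whole seconds, or None if absent or not a plain integer."""
    if value is None:
        return None
    text = value.strip()
    # isdigit() rejects signs, decimals, and HTTP-dates such as
    # "Wed, 21 Oct 2015 07:28:00 GMT"; isascii() rejects non-ASCII digits.
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)
```

A caller that receives None would then use whatever fallback backoff it has configured.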

Required Response Body for 429

Throttled responses MUST return an application/problem+json body conforming to RFC 9457 (which obsoletes RFC 7807), extended with rate-limit-specific fields:

{
  "type": "https://api.example.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "detail": "You exceeded the 100 requests per minute limit on the default policy.",
  "policy": "default",
  "limit": 100,
  "remaining": 0,
  "reset_seconds": 17,
  "retry_after_seconds": 17,
  "scope": "tenant"
}

scope MUST take one of three values: request, user, tenant. This lets agents decide whether to retry the same request, switch users, or back off the whole tenant.
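One way an agent might turn the scope field into a decision is sketched below. The `Action` names and `plan_after_429` are hypothetical, and defaulting to the widest scope when the field is missing or unknown is a conservative assumption of this sketch.

```python
import json
from enum import Enum
from typing import Tuple


class Action(Enum):
    RETRY_REQUEST = "retry_request"  # scope == "request"
    PAUSE_USER = "pause_user"        # scope == "user"
    PAUSE_TENANT = "pause_tenant"    # scope == "tenant"


_SCOPE_TO_ACTION = {
    "request": Action.RETRY_REQUEST,
    "user": Action.PAUSE_USER,
    "tenant": Action.PAUSE_TENANT,
}


def plan_after_429(body: str) -> Tuple[Action, int]:
    """Return (action, seconds to wait) for a 429 problem-detail body."""
    problem = json.loads(body)
    if problem.get("status") != 429:
        raise ValueError("not a rate-limit problem detail")
    # Assumption: an unknown or missing scope is treated as tenant-wide.
    action = _SCOPE_TO_ACTION.get(problem.get("scope"), Action.PAUSE_TENANT)
    return action, int(problem["retry_after_seconds"])
```

Applied to the example body above, this yields a tenant-wide pause of 17 seconds.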

Quota Descriptor: /.well-known/ai-rate-limits.json

Headers describe the current state, but agents also need the full policy at planning time. Publish a discovery document at /.well-known/ai-rate-limits.json:

{
  "version": "1.0",
  "policies": [
    {
      "name": "default",
      "quota": 100,
      "window_seconds": 60,
      "burst_quota": 20,
      "burst_window_seconds": 1,
      "scope": "tenant",
      "applies_to": ["*"]
    },
    {
      "name": "search",
      "quota": 30,
      "window_seconds": 60,
      "scope": "user",
      "applies_to": ["GET /search", "POST /search"]
    }
  ],
  "backoff": {
    "strategy": "exponential",
    "base_seconds": 1,
    "max_seconds": 60,
    "jitter": "full"
  },
  "contact": "mailto:api@example.com"
}

The descriptor MUST include, for every policy, its name, quota, window, and scope, plus burst rules where the policy has them, along with a default backoff strategy. Agents that cannot reach the descriptor MUST fall back to header-based behavior.
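At planning time an agent needs to map an operation onto one of these policies. The sketch below works on an already-parsed descriptor (trimmed from the example above); `select_policy` is a hypothetical helper, and letting an exact "METHOD /path" entry take precedence over the "*" wildcard is an assumption, since the descriptor format does not state a precedence rule.

```python
from typing import Any, Dict, Optional

# Trimmed copy of the example descriptor, as it would look after json.loads().
DESCRIPTOR: Dict[str, Any] = {
    "version": "1.0",
    "policies": [
        {"name": "default", "quota": 100, "window_seconds": 60,
         "burst_quota": 20, "burst_window_seconds": 1,
         "scope": "tenant", "applies_to": ["*"]},
        {"name": "search", "quota": 30, "window_seconds": 60,
         "scope": "user", "applies_to": ["GET /search", "POST /search"]},
    ],
}


def select_policy(descriptor: Dict[str, Any], method: str, path: str) -> Optional[Dict[str, Any]]:
    """Return the policy for an operation, or None if nothing applies."""
    operation = f"{method.upper()} {path}"
    wildcard = None
    for policy in descriptor.get("policies", []):
        applies_to = policy.get("applies_to", [])
        if operation in applies_to:
            return policy  # assumption: an exact match beats the wildcard
        if "*" in applies_to and wildcard is None:
            wildcard = policy
    return wildcard
```

A None result is the signal to fall back to header-based behavior.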

OpenAPI Integration

Reference the quota descriptor and the per-operation policy from your OpenAPI spec so SDK generators and agent runtimes pick it up automatically:

info:
  x-rate-limit-discovery: /.well-known/ai-rate-limits.json

paths:
  /search:
    get:
      x-rate-limit-policy: search
      responses:
        '200':
          description: OK
        '429':
          description: Rate limit exceeded
          headers:
            RateLimit-Limit: { schema: { type: string } }
            RateLimit-Remaining: { schema: { type: integer } }
            RateLimit-Reset: { schema: { type: integer } }
            Retry-After: { schema: { type: integer } }
          content:
            application/problem+json:
              schema:
                $ref: '#/components/schemas/RateLimitError'
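To show how a runtime might consume these extensions, the sketch below reads them from an OpenAPI document that has already been loaded into a dict (YAML loading is left out to keep the example dependency-free). `rate_limit_hints` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional, Tuple

# The relevant parts of the spec above, as a parsed dict.
SPEC: Dict[str, Any] = {
    "info": {"x-rate-limit-discovery": "/.well-known/ai-rate-limits.json"},
    "paths": {"/search": {"get": {"x-rate-limit-policy": "search"}}},
}


def rate_limit_hints(spec: Dict[str, Any], path: str, method: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (discovery URL, policy name) for an operation; either may be None."""
    discovery = spec.get("info", {}).get("x-rate-limit-discovery")
    # OpenAPI keys HTTP methods in lowercase under each path item.
    operation = spec.get("paths", {}).get(path, {}).get(method.lower(), {})
    return discovery, operation.get("x-rate-limit-policy")
```

The returned policy name can then be looked up in the quota descriptor fetched from the discovery URL.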

Backoff and Burst Rules

Agents MUST follow this contract on 429:

Read Retry-After. If present, sleep for at least that many seconds, adding random jitter on top so parallel workers do not all retry at the same instant.
If absent, fall back to the descriptor's exponential backoff (base_seconds * 2^attempt, capped at max_seconds).
After 5 consecutive 429s, escalate to a circuit-breaker open state for max_seconds * 2.
When 429 is scoped to tenant, halt the agent's tenant-wide concurrency, not just the failing call.
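The delay and circuit-breaker rules above can be sketched as two small functions. The constants mirror the example descriptor's backoff block; the function names are hypothetical. This sketch reads the Retry-After rule as "never wake before the server's deadline", so it adds up to one second of jitter on top rather than randomizing downward, which is an interpretation and not something the contract spells out.

```python
import random
from typing import Callable, Optional

BASE_SECONDS = 1       # backoff.base_seconds in the example descriptor
MAX_SECONDS = 60       # backoff.max_seconds in the example descriptor
BREAKER_THRESHOLD = 5  # consecutive 429s before the circuit opens


def delay_after_429(attempt: int, retry_after: Optional[int],
                    rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Assumption: jitter is added on top of Retry-After, never subtracted.
        return retry_after + rng() * BASE_SECONDS
    ceiling = min(MAX_SECONDS, BASE_SECONDS * 2 ** attempt)
    # Full jitter: uniform between 0 and the capped exponential ceiling.
    return rng() * ceiling


def breaker_open_seconds(consecutive_429s: int) -> int:
    """Return how long the circuit stays open, or 0 while it is still closed."""
    return MAX_SECONDS * 2 if consecutive_429s >= BREAKER_THRESHOLD else 0
```

Passing `rng` as a parameter keeps the function deterministic under test; production callers can rely on the default.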

Bursts MUST be documented separately from sustained quotas because agents often parallelize tool calls. A 100/min sustained quota with a 20/sec burst limit must publish both numbers; otherwise an agent's plan-execute fan-out trips the burst rule invisibly.
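To make the interaction between the two numbers concrete, here is a sketch of how a client could spread a fan-out across time so it stays inside both limits. `plan_batches` is a hypothetical helper, and it assumes fixed windows aligned to the moment the plan starts; a server using sliding windows would need a more conservative schedule.

```python
from typing import List, Tuple


def plan_batches(n_calls: int, quota: int = 100, window_seconds: int = 60,
                 burst_quota: int = 20, burst_window_seconds: int = 1) -> List[Tuple[int, int]]:
    """Return (start_offset_seconds, batch_size) pairs respecting both limits."""
    if quota < 1 or burst_quota < 1:
        raise ValueError("quota and burst_quota must be at least 1")
    batches: List[Tuple[int, int]] = []
    offset = 0          # when the next batch may start
    window_start = 0    # start of the current sustained window
    used_in_window = 0  # calls already scheduled in that window
    remaining = n_calls
    while remaining > 0:
        if offset >= window_start + window_seconds:
            # Time has moved into a later window: its quota is fresh.
            window_start += window_seconds * ((offset - window_start) // window_seconds)
            used_in_window = 0
        if used_in_window >= quota:
            # Sustained quota exhausted: wait for the next window.
            window_start += window_seconds
            offset = window_start
            used_in_window = 0
        size = min(remaining, burst_quota, quota - used_in_window)
        batches.append((offset, size))
        remaining -= size
        used_in_window += size
        offset += burst_window_seconds
    return batches
```

With the defaults, a 50-call fan-out becomes three batches one second apart, while a 120-call fan-out sends 100 calls in the first five seconds and holds the last 20 until the next minute begins.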

Required Documentation Sections

Every rate-limited API SHOULD publish a single canonical "Rate Limits" page that contains:

Quick reference table of policies, quotas, windows, burst limits, and scopes.
Header contract with example responses for 200 and 429.
Backoff algorithm in pseudo-code.
Discovery URL for the JSON descriptor.
OpenAPI extensions used.
Changelog with versioned policy updates.
FAQ addressing tenant scoping, burst behavior, and key rotation.

The page MUST live at a stable URL (for example, /docs/rate-limits) and SHOULD be referenced from the API's root, OpenAPI info, and /.well-known/ai-plugin.json if applicable.

Common Mistakes

Custom headers without the standard. Shipping X-RateLimit-Foo without the IETF RateLimit-* family forces every agent to special-case your API.
HTTP-date Retry-After. Forces agents to parse HTTP dates and correct for clock skew. Use seconds.
Hidden burst limits. Agents parallelize aggressively; an undocumented burst rule causes mysterious throttling.
Missing scope. Without a scope field, agents do not know whether to switch users, switch keys, or stop entirely.
Documentation drift. Updating quotas in code without bumping the descriptor version leaves agents planning against stale cached quotas.

FAQ

Q: Why use the IETF RateLimit-* headers instead of X-RateLimit-*?

The IETF draft headers (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, RateLimit-Policy) are converging into a standard, so an agent runtime can implement one parser and reuse it across every API that adopts them. Custom X-RateLimit-* headers force every consumer to write a per-API adapter.

Q: Do I need both response headers and a /.well-known descriptor?

Yes. Headers describe the current window; the descriptor describes the full policy at planning time. Agents need both: headers to react, the descriptor to plan.

Q: How do I document burst limits separately from sustained quotas?

In the descriptor, give each policy a burst_quota and burst_window_seconds. In headers, RateLimit-Policy can list multiple policies ("default";q=100;w=60, "burst";q=20;w=1). Document both numbers explicitly in the human-readable docs.
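A multi-policy RateLimit-Policy value can be split with a few lines of string handling, as sketched below. `parse_policies` is a hypothetical name, and the approach is intentionally simple: it assumes policy names contain no commas or semicolons and that every parameter value is an integer, so it is not a substitute for a full structured-field parser.

```python
from typing import Dict, List, Union


def parse_policies(value: str) -> List[Dict[str, Union[str, int]]]:
    """Split a RateLimit-Policy value into one dict per policy."""
    policies: List[Dict[str, Union[str, int]]] = []
    for item in value.split(","):
        name, *params = [part.strip() for part in item.split(";")]
        if not name:
            continue  # skip empty list items such as a trailing comma
        policy: Dict[str, Union[str, int]] = {"name": name.strip('"')}
        for param in params:
            key, _, raw = param.partition("=")
            if raw:
                policy[key.strip()] = int(raw)
        policies.append(policy)
    return policies
```

An agent would typically pick out the entry with the smallest window to learn the burst rule and the largest to learn the sustained quota.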

Q: What HTTP status should I use besides 429?

Use only 429 Too Many Requests for sustained or burst limits. Use 503 Service Unavailable with Retry-After when the throttle is server-wide rather than client-specific. Never use 403 for rate limiting: it conflates authorization with throttling and breaks agent retry logic.
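From the agent's side, that status-code split reduces to a small classification step. The sketch below uses hypothetical labels and a hypothetical function name; it only encodes the three cases described in this answer.

```python
def throttle_kind(status: int, has_retry_after: bool) -> str:
    """Classify a response as client-specific throttling, server-wide throttling, or neither."""
    if status == 429:
        return "client_limit"    # back off according to the scope in the body
    if status == 503 and has_retry_after:
        return "server_wide"     # the whole service is shedding load
    return "not_throttling"      # e.g. 403 is an authorization failure, not a retry signal
```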

Q: Should rate limits differ for AI agent traffic vs human traffic?

You can offer named policies with higher quotas for verified agent integrations, but the contract MUST be uniform — same headers, same descriptor shape, same backoff rules — so agents do not have to detect their own context.