AI Crawler Rate Limiting Reference: Throttling GPTBot, ClaudeBot, and PerplexityBot Without Losing Citations
Different AI crawlers serve different purposes. Training crawlers (GPTBot, ClaudeBot, Google-Extended) can be throttled aggressively without losing AI citations, while user-triggered agents (ChatGPT-User, Perplexity-User, Claude-User) must remain nearly unrestricted to preserve real-time citation share. This reference lists safe thresholds per bot and tags each with its citation-impact tradeoff.
TL;DR
Rate limiting AI crawlers is not one decision but two. Throttle training crawlers per bot and per IP range, with crawl-delay directives and edge rules tuned for stable origin cost. Leave user-triggered crawlers nearly unthrottled, because every blocked request is a missing AI citation in real time. The thresholds below are starting points based on observed traffic patterns; tune them against your origin capacity and your AI citation telemetry.
Crawler classes (read this before the table)
AI crawlers split into three classes with different rate-limit calculus:
- Training crawlers — GPTBot, ClaudeBot, Google-Extended, CCBot. Their job is bulk corpus collection. Blocking or throttling them affects future model versions, not today's citations. Aggressive throttling is acceptable.
- Search-index crawlers — OAI-SearchBot, PerplexityBot. They feed AI search retrieval indexes. Throttling them quietly degrades current citation share. Throttle moderately.
- User-triggered agents — ChatGPT-User, Perplexity-User, Claude-User, Google-Agent. They fetch on behalf of a real user mid-conversation. Throttling them returns 429s in front of a human asking a question. Avoid throttling unless the source IP is clearly abusive.
Cloudflare's Q1 2026 traffic analysis reports that user-triggered agent traffic grew roughly 15x in 2025 to ~3% of AI bot traffic, and Search Engine Journal documented that around 30% of all web traffic now comes from bots, with AI's share growing.
Reference table: safe rate limits per bot
| Bot | Class | User-Agent contains | Recommended limit | Aggressive limit | Citation impact if throttled |
|---|---|---|---|---|---|
| GPTBot | Training | GPTBot/ | 60 req/min/IP | 20 req/min/IP | None today; affects future GPT training corpus |
| ClaudeBot | Training | ClaudeBot or anthropic-ai | 60 req/min/IP | 20 req/min/IP | None today; affects future Claude training |
| Google-Extended | Training | Google-Extended | 120 req/min/IP | 30 req/min/IP | None today; affects future Gemini training |
| CCBot | Training | CCBot | 30 req/min/IP | 10 req/min/IP | Indirect (Common Crawl feeds many models) |
| Bytespider | Training | Bytespider | 10 req/min/IP | Block | Minimal; widely considered abusive |
| OAI-SearchBot | Search-index | OAI-SearchBot/ | 120 req/min/IP | 60 req/min/IP | Reduces ChatGPT Search citation freshness |
| PerplexityBot | Search-index | PerplexityBot/ | 120 req/min/IP | 60 req/min/IP | Reduces Perplexity citation share within ~1-2 weeks |
| ChatGPT-User | User-triggered | ChatGPT-User/ | No throttle | 300 req/min/IP | Direct: 429s become missing answers in live ChatGPT chats |
| Perplexity-User | User-triggered | Perplexity-User/ | No throttle | 300 req/min/IP | Direct: missing live Perplexity answers |
| Claude-User / Claude-Web | User-triggered | Claude-User/ or Claude-Web/ | No throttle | 300 req/min/IP | Direct: missing live Claude.ai answers |
| Google-Agent | User-triggered | Google-Agent | No throttle | 300 req/min/IP | Direct: missing AI Overviews live fetches |
Notes:
- Limits are per source IP. Most AI crawler operators rotate IPs frequently, so combine with cf.unique_visitor_id or fingerprint-based identifiers as Cloudflare recommends in its WAF parameters guide.
- Recommended limits keep AI corpora refreshing at full speed; aggressive limits roughly halve origin crawl cost but slow corpus refresh.
- Always pair rate limits with an HTTP Retry-After header so well-behaved crawlers back off cleanly; the token bucket sketch under Edge recipes shows the pairing.
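For origin-side enforcement, the table translates directly into a lookup. A minimal Python sketch of that mapping, using the User-Agent substrings above; the helper name and return convention are illustrative, not any vendor's API:

```python
# Map a User-Agent string to its crawler class and recommended
# per-IP limit (req/min) from the reference table. None = no throttle.
AI_CRAWLER_RULES = [
    ("GPTBot/",          "training",        60),
    ("ClaudeBot",        "training",        60),
    ("anthropic-ai",     "training",        60),
    ("Google-Extended",  "training",       120),
    ("CCBot",            "training",        30),
    ("Bytespider",       "training",        10),
    ("OAI-SearchBot/",   "search-index",   120),
    ("PerplexityBot/",   "search-index",   120),
    ("ChatGPT-User/",    "user-triggered", None),
    ("Perplexity-User/", "user-triggered", None),
    ("Claude-User/",     "user-triggered", None),
    ("Claude-Web/",      "user-triggered", None),
    ("Google-Agent",     "user-triggered", None),
]

def classify_ai_crawler(user_agent: str):
    """Return (class, recommended limit) for a known AI crawler, else (None, None)."""
    for substring, crawler_class, limit in AI_CRAWLER_RULES:
        if substring in user_agent:
            return crawler_class, limit
    return None, None
```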
robots.txt directives that complement rate limiting
Publish Crawl-delay for the bots that respect it, even though OpenAI's official GPTBot docs do not explicitly commit to honoring it.
```
User-agent: GPTBot
Crawl-delay: 5

User-agent: ClaudeBot
Crawl-delay: 5

User-agent: Google-Extended
Crawl-delay: 2

User-agent: PerplexityBot
Crawl-delay: 2

User-agent: Bytespider
Disallow: /
```
Crawl-delay does not bind every bot; treat it as a polite signal, not enforcement. Use Disallow: / for clearly abusive crawlers and rate limit the rest at the edge.
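Whichever directives you publish, read the live file back the way a crawler would. A minimal sketch using Python's standard-library robotparser, with yoursite.com as a placeholder; it also catches the Cloudflare robots.txt override described under Edge recipes:

```python
# Fetch the live robots.txt and report what each AI crawler would see.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://yoursite.com/robots.txt")  # placeholder domain
parser.read()

for bot in ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "Bytespider"):
    allowed = parser.can_fetch(bot, "/")
    delay = parser.crawl_delay(bot)  # None when no Crawl-delay applies
    print(f"{bot}: allowed(/)={allowed} crawl-delay={delay}")
```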
Edge recipes
Cloudflare WAF rate limiting rule
Field: Verified Bot Category equals AI Crawler. Threshold: 60 requests per 1 minute per cf.unique_visitor_id. Action: Managed challenge. Per Cloudflare's WAF best practices, prefer challenge over outright block to preserve verified-bot traffic when limits are crossed.
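In rule-expression terms, the configuration looks roughly like this. This is a sketch of the dashboard settings, not an exported rule; cf.verified_bot_category and cf.unique_visitor_id require a plan with Bot Management, and field availability may vary:

```
Expression:       (cf.verified_bot_category eq "AI Crawler")
Characteristics:  cf.unique_visitor_id
Period:           60 seconds
Requests:         60
Action:           Managed Challenge
```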
Watch for the Cloudflare AI Scrapers and Crawlers toggle: enabling it silently injects Disallow: / into your robots.txt. Verify with curl https://yoursite.com/robots.txt | grep "Cloudflare Managed" before you trust your published policy.
ModSecurity / Apache .htaccess
A conservative ModSecurity rule from InMotion keeps GPTBot and ClaudeBot to one request every three seconds. That is too aggressive for sites that depend on AI citations — use it only as a temporary brake during incidents. Long term, prefer per-bot rules with the thresholds in the reference table.
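For reference, the one-request-per-three-seconds pattern looks roughly like this in ModSecurity 2.x syntax. A sketch, assuming the persistent IP collection is available; rule IDs are arbitrary and the bot list should match your own policy:

```
# Count GPTBot/ClaudeBot hits per client IP; more than one hit
# inside a 3-second window is denied with 429.
SecAction "id:900100,phase:1,pass,nolog,initcol:ip=%{REMOTE_ADDR}"
SecRule REQUEST_HEADERS:User-Agent "@rx (?i)(GPTBot|ClaudeBot)" \
    "id:900101,phase:1,pass,nolog,setvar:ip.ai_hits=+1,expirevar:ip.ai_hits=3"
SecRule IP:AI_HITS "@gt 1" \
    "id:900102,phase:1,deny,status:429,log,msg:'AI crawler over per-3s limit'"
```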
Origin-level token bucket
For Node.js / Python origins, implement a token bucket keyed by (user_agent_class, source_ip_prefix) rather than raw IP. Match user-agent class to the table above; user-triggered agents bypass the bucket.
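A minimal Python sketch of that bucket, reusing the classify_ai_crawler helper from the reference-table section; the /24 prefix key, capacities, and Retry-After value are illustrative defaults, not measured constants:

```python
import time

class TokenBucket:
    """Refill at rate_per_min tokens/minute up to capacity; one token per request."""
    def __init__(self, rate_per_min: float, capacity: float):
        self.rate = rate_per_min / 60.0        # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[tuple[str, str], TokenBucket] = {}

def should_throttle(user_agent: str, source_ip: str) -> tuple[bool, dict]:
    """Return (throttle?, extra headers). User-triggered and unknown agents pass."""
    crawler_class, limit = classify_ai_crawler(user_agent)
    if limit is None:                            # unknown bot or user-triggered agent
        return False, {}
    prefix = ".".join(source_ip.split(".")[:3])  # crude /24 key for IPv4
    bucket = buckets.setdefault((crawler_class, prefix), TokenBucket(limit, capacity=limit))
    if bucket.allow():
        return False, {}
    return True, {"Retry-After": "60"}           # serve 429 with this header
```

Keying on the /24 prefix rather than the raw IP blunts simple rotation within a range; pair it with the fingerprint identifiers discussed in the table notes for rotating fleets.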
Anti-patterns
- Blocking all AI crawlers at the edge. Tencent Cloud flags this as the most common mistake — it kills AI search visibility entirely.
- Treating every AI bot the same. Training and user-triggered crawlers have opposite cost profiles.
- Relying solely on robots.txt. It is advisory; aggressive bots ignore it.
- IP-based rate limiting against IP-rotating crawlers. Use unique-visitor identifiers or behavioral fingerprints; raw IP keys are easily evaded, as Cloudflare's own community guidance warns.
- Throttling without monitoring AI citation share. Citation share telemetry is the ground truth for whether a rate limit is too tight.
Internal links
- Hub: GEO for Developers: Technical Implementation
- Sibling: AI Crawl Budget
- Sibling: robots.txt for AI Crawlers
- Sibling: AI Search Crawler User-Agents
- Reference: AI Citation Patterns
FAQ
Q: Will throttling GPTBot or ClaudeBot affect my ChatGPT or Claude citations today?
No. GPTBot and ClaudeBot are training crawlers; their fetches feed future model versions, not the live retrieval index used for current answers. ChatGPT's live answers come from OAI-SearchBot and ChatGPT-User. Throttle training crawlers freely; preserve user-triggered ones.
Q: What rate limit makes PerplexityBot citations decline?
Field data is noisy, but Perplexity's vector-based retrieval refreshes quickly. Sustained limits below ~30 req/min/IP for one to two weeks correlate with measurable drops in Perplexity citation share for fast-moving content. The 60 req/min/IP recommended threshold gives a comfortable margin.
Q: Should I serve 429 Too Many Requests or 503 Service Unavailable to over-limit AI crawlers?
Use 429 with a Retry-After header. 429 is the documented signal that the bot is exceeding rate limits; well-behaved bots will reduce their crawl rate. 503 implies a server-wide problem and may cause crawlers to deprioritize your domain entirely.
Q: How do I rate limit user-triggered agents like ChatGPT-User without breaking live AI answers?
Prefer not to. If you must, set a high ceiling (~300 req/min/IP) and only after observing real abuse from a specific IP range. ChatGPT-User and Perplexity-User fetches are tied to live user prompts; a 429 is a missed answer.
Q: Is Cloudflare's AI Scrapers and Crawlers toggle safe to leave on?
Only if you genuinely want to block training crawlers from reading your content. Many GEO incidents stem from this toggle being enabled by default and silently overriding the site's published robots.txt. Verify by fetching /robots.txt from outside your origin and checking for the # BEGIN Cloudflare Managed content marker.
Related Articles
AI Citation Patterns: How AI Engines Cite Sources (2026)
Reference of how ChatGPT, Perplexity, Google AI Overviews, Google AI Mode, Gemini, Microsoft Copilot, and Claude attribute sources in 2026 — with platform-specific optimization tactics.
AI Crawl Signals: How AI Discovers Content
Technical reference for the signals AI systems use to discover, access, and prioritize web content — including sitemaps, llms.txt, robots.txt, structured data, and HTTP headers.