AI Crawler Rate Limiting Reference: Throttling GPTBot, ClaudeBot, and PerplexityBot Without Losing Citations
Different AI crawlers serve different purposes. Training crawlers (GPTBot, ClaudeBot, Google-Extended) can be throttled aggressively without losing AI citations, while user-triggered agents (ChatGPT-User, Perplexity-User, Claude-User) must remain nearly unrestricted to preserve real-time citation share. This reference lists safe thresholds per bot and tags each with its citation-impact tradeoff.
TL;DR
Rate limiting AI crawlers is not one decision but two. Throttle training crawlers per bot and per IP range, with crawl-delay directives and edge rules tuned for stable origin cost. Leave user-triggered crawlers nearly unthrottled, because every blocked request is a missing AI citation in real time. The thresholds below are starting points based on observed traffic patterns; tune them against your origin capacity and your AI citation telemetry.
Crawler classes (read this before the table)
AI crawlers split into three classes with different rate-limit calculus:
- Training crawlers — GPTBot, ClaudeBot, Google-Extended, CCBot. Their job is bulk corpus collection. Blocking or throttling them affects future model versions, not today's citations. Aggressive throttling is acceptable.
- Search-index crawlers — OAI-SearchBot, PerplexityBot. They feed AI search retrieval indexes. Throttling them quietly degrades current citation share. Throttle moderately.
- User-triggered agents — ChatGPT-User, Perplexity-User, Claude-User, Google-Agent. They fetch on behalf of a real user mid-conversation. Throttling them returns 429s in front of a human asking a question. Avoid throttling unless the source IP is clearly abusive.
Cloudflare's Q1 2026 traffic analysis reports that user-triggered agent traffic grew roughly 15x in 2025 to ~3% of AI bot traffic, and Search Engine Journal documented that around 30% of all web traffic now comes from bots, with AI's share growing.
Reference table: safe rate limits per bot
| Bot | Class | User-Agent contains | Recommended limit | Aggressive limit | Citation impact if throttled |
|---|---|---|---|---|---|
| GPTBot | Training | GPTBot/ | 60 req/min/IP | 20 req/min/IP | None today; affects future GPT training corpus |
| ClaudeBot | Training | ClaudeBot or anthropic-ai | 60 req/min/IP | 20 req/min/IP | None today; affects future Claude training |
| Google-Extended | Training | Google-Extended | 120 req/min/IP | 30 req/min/IP | None today; affects future Gemini training |
| CCBot | Training | CCBot | 30 req/min/IP | 10 req/min/IP | Indirect (Common Crawl feeds many models) |
| Bytespider | Training | Bytespider | 10 req/min/IP | Block | Minimal; widely considered abusive |
| OAI-SearchBot | Search-index | OAI-SearchBot/ | 120 req/min/IP | 60 req/min/IP | Reduces ChatGPT Search citation freshness |
| PerplexityBot | Search-index | PerplexityBot/ | 120 req/min/IP | 60 req/min/IP | Reduces Perplexity citation share within ~1-2 weeks |
| ChatGPT-User | User-triggered | ChatGPT-User/ | No throttle | 300 req/min/IP | Direct: 429s become missing answers in live ChatGPT chats |
| Perplexity-User | User-triggered | Perplexity-User/ | No throttle | 300 req/min/IP | Direct: missing live Perplexity answers |
| Claude-User / Claude-Web | User-triggered | Claude-User/ or Claude-Web/ | No throttle | 300 req/min/IP | Direct: missing live Claude.ai answers |
| Google-Agent | User-triggered | Google-Agent | No throttle | 300 req/min/IP | Direct: missing AI Overviews live fetches |
Notes:
- Limits are per source IP. Most AI crawler operators rotate IPs frequently, so combine with cf.unique_visitor_id or fingerprint-based identifiers as Cloudflare recommends in its WAF parameters guide.
- Recommended limits keep AI corpora refreshing at full speed; aggressive limits roughly halve origin crawl cost but slow corpus refresh.
- Always pair rate limits with an HTTP Retry-After header so well-behaved crawlers back off cleanly; the token bucket sketch under Edge recipes shows the pairing.
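For origin-side enforcement, the table translates directly into a lookup. A minimal Python sketch of that mapping, using the User-Agent substrings above; the helper name and return convention are illustrative, not any vendor's API:

```python
# Map a User-Agent string to its crawler class and recommended
# per-IP limit (req/min) from the reference table. None = no throttle.
AI_CRAWLER_RULES = [
    ("GPTBot/",          "training",        60),
    ("ClaudeBot",        "training",        60),
    ("anthropic-ai",     "training",        60),
    ("Google-Extended",  "training",       120),
    ("CCBot",            "training",        30),
    ("Bytespider",       "training",        10),
    ("OAI-SearchBot/",   "search-index",   120),
    ("PerplexityBot/",   "search-index",   120),
    ("ChatGPT-User/",    "user-triggered", None),
    ("Perplexity-User/", "user-triggered", None),
    ("Claude-User/",     "user-triggered", None),
    ("Claude-Web/",      "user-triggered", None),
    ("Google-Agent",     "user-triggered", None),
]

def classify_ai_crawler(user_agent: str):
    """Return (class, recommended limit) for a known AI crawler, else (None, None)."""
    for substring, crawler_class, limit in AI_CRAWLER_RULES:
        if substring in user_agent:
            return crawler_class, limit
    return None, None
```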
robots.txt directives that complement rate limiting
Publish Crawl-delay for the bots that respect it, even though OpenAI's official GPTBot docs do not explicitly commit to honoring it.
```
User-agent: GPTBot
Crawl-delay: 5

User-agent: ClaudeBot
Crawl-delay: 5

User-agent: Google-Extended
Crawl-delay: 2

User-agent: PerplexityBot
Crawl-delay: 2

User-agent: Bytespider
Disallow: /
```
Crawl-delay does not bind every bot; treat it as a polite signal, not enforcement. Use Disallow: / for clearly abusive crawlers and rate limit the rest at the edge.
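Whichever directives you publish, read the live file back the way a crawler would. A minimal sketch using Python's standard-library robotparser, with yoursite.com as a placeholder; it also catches the Cloudflare robots.txt override described under Edge recipes:

```python
# Fetch the live robots.txt and report what each AI crawler would see.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://yoursite.com/robots.txt")  # placeholder domain
parser.read()

for bot in ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "Bytespider"):
    allowed = parser.can_fetch(bot, "/")
    delay = parser.crawl_delay(bot)  # None when no Crawl-delay applies
    print(f"{bot}: allowed(/)={allowed} crawl-delay={delay}")
```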
Edge recipes
Cloudflare WAF rate limiting rule
Field: Verified Bot Category equals AI Crawler. Threshold: 60 requests per 1 minute per cf.unique_visitor_id. Action: Managed challenge. Per Cloudflare's WAF best practices, prefer challenge over outright block to preserve verified-bot traffic when limits are crossed.
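In rule-expression terms, the configuration looks roughly like this. This is a sketch of the dashboard settings, not an exported rule; cf.verified_bot_category and cf.unique_visitor_id require a plan with Bot Management, and field availability may vary:

```
Expression:       (cf.verified_bot_category eq "AI Crawler")
Characteristics:  cf.unique_visitor_id
Period:           60 seconds
Requests:         60
Action:           Managed Challenge
```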
Watch for the Cloudflare AI Scrapers and Crawlers toggle: enabling it silently injects Disallow: / into your robots.txt. Verify with curl https://yoursite.com/robots.txt | grep "Cloudflare Managed" before you trust your published policy.
ModSecurity / Apache .htaccess
A conservative ModSecurity rule from InMotion keeps GPTBot and ClaudeBot to one request every three seconds. That is too aggressive for sites that depend on AI citations — use it only as a temporary brake during incidents. Long term, prefer per-bot rules with the thresholds in the reference table.
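For reference, the one-request-per-three-seconds pattern looks roughly like this in ModSecurity 2.x syntax. A sketch, assuming the persistent IP collection is available; rule IDs are arbitrary and the bot list should match your own policy:

```
# Count GPTBot/ClaudeBot hits per client IP; more than one hit
# inside a 3-second window is denied with 429.
SecAction "id:900100,phase:1,pass,nolog,initcol:ip=%{REMOTE_ADDR}"
SecRule REQUEST_HEADERS:User-Agent "@rx (?i)(GPTBot|ClaudeBot)" \
    "id:900101,phase:1,pass,nolog,setvar:ip.ai_hits=+1,expirevar:ip.ai_hits=3"
SecRule IP:AI_HITS "@gt 1" \
    "id:900102,phase:1,deny,status:429,log,msg:'AI crawler over per-3s limit'"
```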
Origin-level token bucket
For Node.js / Python origins, implement a token bucket keyed by (user_agent_class, source_ip_prefix) rather than raw IP. Match user-agent class to the table above; user-triggered agents bypass the bucket.
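A minimal Python sketch of that bucket, reusing the classify_ai_crawler helper from the reference-table section; the /24 prefix key, capacities, and Retry-After value are illustrative defaults, not measured constants:

```python
import time

class TokenBucket:
    """Refill at rate_per_min tokens/minute up to capacity; one token per request."""
    def __init__(self, rate_per_min: float, capacity: float):
        self.rate = rate_per_min / 60.0        # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[tuple[str, str], TokenBucket] = {}

def should_throttle(user_agent: str, source_ip: str) -> tuple[bool, dict]:
    """Return (throttle?, extra headers). User-triggered and unknown agents pass."""
    crawler_class, limit = classify_ai_crawler(user_agent)
    if limit is None:                            # unknown bot or user-triggered agent
        return False, {}
    prefix = ".".join(source_ip.split(".")[:3])  # crude /24 key for IPv4
    bucket = buckets.setdefault((crawler_class, prefix), TokenBucket(limit, capacity=limit))
    if bucket.allow():
        return False, {}
    return True, {"Retry-After": "60"}           # serve 429 with this header
```

Keying on the /24 prefix rather than the raw IP blunts simple rotation within a range; pair it with the fingerprint identifiers discussed in the table notes for rotating fleets.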
Anti-patterns
- Blocking all AI crawlers at the edge. Tencent Cloud flags this as the most common mistake — it kills AI search visibility entirely.
- Treating every AI bot the same. Training and user-triggered crawlers have opposite cost profiles.
- Relying solely on robots.txt. It is advisory; aggressive bots ignore it.
- IP-based rate limiting against IP-rotating crawlers. Use unique-visitor identifiers or behavioral fingerprints; raw IP keys are easily evaded, as Cloudflare's own community guidance warns.
- Throttling without monitoring AI citation share. Citation share telemetry is the ground truth for whether a rate limit is too tight.
Internal links
- Hub: GEO for Developers: Technical Implementation
- Sibling: AI Crawl Budget
- Sibling: robots.txt for AI Crawlers
- Sibling: AI Search Crawler User-Agents
- Reference: AI Citation Patterns
FAQ
Q: Will throttling GPTBot or ClaudeBot affect my ChatGPT or Claude citations today?
No. GPTBot and ClaudeBot are training crawlers; their fetches feed future model versions, not the live retrieval index used for current answers. ChatGPT's live answers come from OAI-SearchBot and ChatGPT-User. Throttle training crawlers freely; preserve user-triggered ones.
Q: What rate limit makes PerplexityBot citations decline?
Field data is noisy, but Perplexity's vector-based retrieval refreshes quickly. Sustained limits below ~30 req/min/IP for one to two weeks correlate with measurable drops in Perplexity citation share for fast-moving content. The 60 req/min/IP recommended threshold gives a comfortable margin.
Q: Should I serve 429 Too Many Requests or 503 Service Unavailable to over-limit AI crawlers?
Use 429 with a Retry-After header. 429 is the documented signal that the bot is exceeding rate limits; well-behaved bots will reduce their crawl rate. 503 implies a server-wide problem and may cause crawlers to deprioritize your domain entirely.
Q: How do I rate limit user-triggered agents like ChatGPT-User without breaking live AI answers?
Prefer not to. If you must, set a high ceiling (~300 req/min/IP) and only after observing real abuse from a specific IP range. ChatGPT-User and Perplexity-User fetches are tied to live user prompts; a 429 is a missed answer.
Q: Is Cloudflare's AI Scrapers and Crawlers toggle safe to leave on?
Only if you genuinely want to block training crawlers from reading your content. Many GEO incidents stem from this toggle being enabled by default and silently overriding the site's published robots.txt. Verify by fetching /robots.txt from outside your origin and checking for the # BEGIN Cloudflare Managed content marker.
Related Articles
AI Citation Patterns: How AI Engines Cite Sources (2026)
Reference of how ChatGPT, Perplexity, Google AI Overviews, Google AI Mode, Gemini, Microsoft Copilot, and Claude attribute sources in 2026 — with platform-specific optimization tactics.
AI Crawl Signals: How AI Discovers Content
Technical reference for the signals AI systems use to discover, access, and prioritize web content — including sitemaps, llms.txt, robots.txt, structured data, and HTTP headers.