AI Crawler IP Allowlist Reference
Major AI crawlers publish official IP ranges as JSON endpoints (OpenAI, Perplexity, Google, Apple), while Anthropic relies on user-agent matching plus robots.txt because its crawlers run from shared public-cloud IP space. Use reverse-DNS verification for Googlebot and JSON allowlists for the rest, refreshed at least weekly.
TL;DR
OpenAI, Perplexity, Google, and Apple publish CIDR ranges via versioned JSON endpoints; allowlist them at the WAF and refresh on a schedule. Anthropic does not publish ranges — verify ClaudeBot, Claude-SearchBot, and Claude-User through user-agent and request behavior, not IP filtering. Googlebot, Google-Extended, and OAI-SearchBot also support reverse-DNS verification.
Why this reference exists
User-agent strings can be spoofed in seconds, so any allowlist that depends on User-Agent: GPTBot alone is unreliable. Production teams that want to allow legitimate AI training and answer-grounding crawlers — while rejecting impersonators — pair user-agent matching with either an authoritative IP allowlist or a reverse-DNS check. This page consolidates the canonical sources for both paths so you can build WAF rules, log filters, and analytics segments without scraping forum posts.
Quick reference table
| Crawler family | Bot name | Published IP source | Reverse-DNS hostname | Respects robots.txt | Purpose |
|---|---|---|---|---|---|
| OpenAI | GPTBot | https://openai.com/gptbot.json | n/a | Yes | Foundation-model training data. |
| OpenAI | OAI-SearchBot | https://openai.com/searchbot.json | openai.com suffix | Yes | ChatGPT search index. |
| OpenAI | ChatGPT-User | https://openai.com/chatgpt-user.json | n/a | Yes | On-demand user-initiated fetch. |
| Anthropic | ClaudeBot | Not published | Not published | Yes | Training data. |
| Anthropic | Claude-SearchBot | Not published | Not published | Yes | Claude web search. |
| Anthropic | Claude-User | Not published | Not published | Yes | User-initiated fetch. |
| Perplexity | PerplexityBot | https://www.perplexity.ai/perplexitybot.json | n/a | Yes | Indexing crawler. |
| Perplexity | Perplexity-User | https://www.perplexity.ai/perplexity-user.json | n/a | User-initiated; may bypass robots.txt as a user action. | On-demand fetch. |
| Google | Googlebot | https://developers.google.com/static/search/apis/ipranges/googlebot.json | googlebot.com / google.com | Yes | Search crawler. |
| Google | Google-Extended | Same Googlebot ranges + robots token | Same as Googlebot | Yes (separate token) | Gemini and AI Overviews training opt-out. |
| Google | Special-case fetchers | https://developers.google.com/static/search/apis/ipranges/special-crawlers.json | google.com | Yes | AdsBot, FeedFetcher, etc. |
| Apple | Applebot / Applebot-Extended | https://search.developer.apple.com/applebot.json | applebot.apple.com | Yes | Applebot-Extended is the AI-training opt-out token. |
JSON endpoints are the source of truth — treat the table above as a starting map, then pull live ranges before building WAF rules.
Verification approaches
There are three accepted ways to verify that a crawler is who it claims to be.
1. JSON IP allowlist (OpenAI, Perplexity, Google, Apple)
The publishing vendor exposes a small JSON document whose prefixes entries carry CIDR ranges. Fetch it from your build pipeline, normalize it to a flat list, and push the result into your CDN or WAF as an allowlist. Re-run the job at least every seven days; OpenAI in particular has rotated ranges multiple times in the last 18 months, and stale lists silently drop legitimate crawl traffic.
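The fetch-and-normalize step can be sketched in Python. This assumes the Google-style document shape (a `prefixes` array of `ipv4Prefix`/`ipv6Prefix` objects), which the googlebot.json endpoint uses and the OpenAI endpoints appear to mirror; verify the schema of each vendor feed before relying on it:

```python
import ipaddress
import json
from urllib.request import urlopen


def flatten_prefixes(doc: dict) -> list[str]:
    """Normalize a Google-style ip-ranges document to a flat CIDR list.

    Assumes {"prefixes": [{"ipv4Prefix": ...} | {"ipv6Prefix": ...}]}.
    Malformed entries are skipped so one bad record cannot break the
    whole refresh job.
    """
    cidrs = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if not cidr:
            continue
        try:
            cidrs.append(str(ipaddress.ip_network(cidr, strict=False)))
        except ValueError:
            continue
    return sorted(set(cidrs))


def fetch_ranges(url: str) -> list[str]:
    """Pull a vendor endpoint and return its CIDR list (network call)."""
    with urlopen(url, timeout=10) as resp:
        return flatten_prefixes(json.load(resp))


# Hypothetical document, shaped like the real feeds, for offline testing:
sample = {"prefixes": [{"ipv4Prefix": "20.42.10.176/28"},
                       {"ipv6Prefix": "2a01:111::/32"},
                       {"note": "no prefix key"}]}
```

Feed the output of `fetch_ranges` into whatever mechanism populates your WAF's named IP list.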
A typical Cloudflare WAF expression looks like:
```
(http.user_agent contains "GPTBot" and not ip.src in $openai_gptbot_ranges)
```

The expression matches requests that claim to be GPTBot but originate from outside the allowlist, then takes a Block or Managed Challenge action.
2. Reverse-DNS plus forward confirmation (Google, OAI-SearchBot)
Google's official guidance is the four-step reverse-DNS dance: take the source IP, run a reverse lookup on it, confirm the returned hostname ends in googlebot.com or google.com, then run a forward lookup on that hostname and confirm it resolves back to the original IP. OAI-SearchBot supports the same pattern with hostnames ending in openai.com.
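The four-step check can be sketched in Python. The resolver functions are injectable so the logic is testable offline; the trusted-suffix list below reflects the hostnames this page describes and should be checked against current vendor documentation:

```python
import socket

# Suffixes per Google's Googlebot verification docs plus openai.com for
# OAI-SearchBot (assumption: confirm against current vendor guidance).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".openai.com")


def verify_crawler_ip(ip: str,
                      suffixes=TRUSTED_SUFFIXES,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex) -> bool:
    """Reverse-DNS plus forward-confirmation check.

    1. Reverse-resolve the source IP to a hostname.
    2. Require the hostname to end in a trusted suffix.
    3. Forward-resolve that hostname.
    4. Confirm the original IP is among the forward results.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.lower().endswith(suffixes):
        return False
    try:
        _, _, addrs = forward(hostname)
    except OSError:
        return False
    return ip in addrs
```

Because both lookups must agree, a spoofer who controls reverse DNS for their own IP space still fails step 4.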
3. User-agent + behavioral signals (Anthropic)
Anthropic explicitly does not publish IP ranges because ClaudeBot runs on shared cloud infrastructure where blocking the underlying ranges would also prevent the bot from fetching robots.txt. Site owners should rely on the user-agent strings (ClaudeBot/1.0, Claude-SearchBot/1.0, Claude-User/1.0), enforce policy through robots.txt, and use rate limiting plus behavioral fingerprints (request cadence, header order, TLS JA3) for defense in depth.
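Since enforcement for Anthropic is policy-side, it helps to state that policy explicitly in robots.txt. A minimal sketch covering all three agents; the paths are illustrative, not a recommendation:

```
# Explicit policy for Anthropic's crawlers (adjust paths to your site)
User-agent: ClaudeBot
Disallow: /private/

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /
```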
WAF allowlist recipes
The recipes below assume you have already imported the JSON ranges into a named list (openai_gptbot_ranges, perplexity_bot_ranges, etc.).
Cloudflare WAF custom rule
```
(http.user_agent contains "GPTBot" and not ip.src in $openai_gptbot_ranges) or
(http.user_agent contains "OAI-SearchBot" and not ip.src in $openai_searchbot_ranges) or
(http.user_agent contains "PerplexityBot" and not ip.src in $perplexity_bot_ranges)
```

Action: Block. The rule targets impersonators only; legitimate bots fall through to the default allow path.
AWS WAFv2 rule sketch
Use an IPSet per crawler, a RegexPatternSet for user agents, and combine them in a Statement of type AndStatement with a NotStatement around the IP match. For each crawler, action Allow on match, then a fallback Block for any UA that contains the bot name without the IP membership.
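One such impersonator-blocking rule might look like the following WAFv2 JSON, sketched for GPTBot only; the IPSet ARN is a placeholder for the set you populate from the OpenAI feed:

```json
{
  "Name": "block-gptbot-impersonators",
  "Priority": 10,
  "Action": { "Block": {} },
  "Statement": {
    "AndStatement": {
      "Statements": [
        {
          "ByteMatchStatement": {
            "SearchString": "GPTBot",
            "FieldToMatch": { "SingleHeader": { "Name": "user-agent" } },
            "TextTransformations": [ { "Priority": 0, "Type": "NONE" } ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "NotStatement": {
            "Statement": {
              "IPSetReferenceStatement": {
                "ARN": "arn:aws:wafv2:REGION:ACCOUNT:regional/ipset/openai-gptbot-ranges/ID"
              }
            }
          }
        }
      ]
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "gptbot-impersonators"
  }
}
```

Repeat the pattern per crawler, one IPSet each, so a stale range for one vendor never affects the others.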
Akamai / Fastly
Both platforms support ACLs sourced from external JSON. Set the refresh interval to 24 hours; Akamai's siteshield-managed-list and Fastly's acl_cidrs accept CIDR notation directly from the OpenAI and Perplexity feeds.
Refresh cadence
Treat AI crawler ranges as living data:
- OpenAI: re-pull every 24 hours; ranges have changed multiple times per year.
- Perplexity: weekly minimum; the Perplexity-User agent rotates Azure ranges frequently.
- Google: weekly is sufficient; ranges are stable but updates happen when new regions are added.
- Apple: monthly is fine for Applebot.
- Anthropic: re-read the help-center article each quarter; if Anthropic begins publishing ranges, update this reference.
Set a recurring task in your editorial or DevOps calendar tied to this page's review_cycle_days (90).
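A crontab sketch of that schedule, assuming a hypothetical refresh script that pulls the vendor JSON and pushes the ranges into your WAF lists:

```
# Daily pull for OpenAI, weekly for the rest (script name is illustrative)
15 3 * * *  /opt/waf/update-crawler-ranges.sh openai
30 3 * * 1  /opt/waf/update-crawler-ranges.sh perplexity google apple
```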
Common mistakes
- Hard-coding IPs from blog posts. Lists copied from third-party blogs go stale within weeks. Always pull from the vendor JSON.
- Blocking entire cloud /16s. Anthropic and OpenAI both run on Azure ranges shared with other tenants. Wide blocks kill legitimate user traffic.
- Forgetting Google-Extended. Google-Extended uses the same IP ranges as Googlebot but a different robots.txt token; allowing the IP without honoring the token still excludes you from AI training.
- Allowing UA only. Without an IP or DNS check, every spoofer claiming GPTBot walks straight in.
- Skipping Anthropic on the policy side. Because there is no IP allowlist, teams sometimes forget to publish robots.txt rules for ClaudeBot at all, leaving training inclusion ambiguous.
FAQ
Q: Why doesn't Anthropic publish IP ranges for ClaudeBot?
Anthropic runs ClaudeBot on shared public-cloud IP space, so publishing a list would either be incomplete or accidentally allowlist non-Anthropic tenants. Their official guidance is to use robots.txt plus user-agent matching and to avoid IP-based blocking that could also prevent the bot from fetching robots.txt itself.
Q: Can I use Googlebot's reverse-DNS method for OAI-SearchBot?
Yes — OpenAI's search crawler advertises hostnames terminating in openai.com. Run a reverse lookup, verify the suffix, then forward-resolve to confirm the IP matches. JSON allowlists are still the authoritative source, but DNS is a useful runtime fallback.
Q: How often do AI crawler IP ranges change?
OpenAI's ranges have changed multiple times in the past year. Perplexity rotates its user-fetch ranges frequently. Google's ranges are stable but expand when new data centers come online. Plan for weekly refresh as a baseline and daily for OpenAI if your traffic depends on training-data inclusion.
Q: Do I need to allowlist Perplexity-User to be cited?
Citations come from the indexing crawler PerplexityBot, not from the on-demand Perplexity-User agent. However, blocking Perplexity-User will prevent your page from being summarized when a user pastes a URL into Perplexity. Most sites allow both.
Q: Is Google-Extended a separate crawler from Googlebot?
No. Google-Extended is a robots.txt user-agent token that controls whether your content is used to train Gemini and improve Google AI Overviews. The actual fetching is done by Googlebot using the standard Googlebot IP ranges.
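In robots.txt terms, the two are controlled by separate tokens even though the fetcher and IP ranges are the same. For example, to stay in Search while opting out of AI training:

```
# Keep Search indexing, opt out of Gemini / AI Overviews training
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
```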
Related Articles
What Is GEO? Generative Engine Optimization Defined
GEO (Generative Engine Optimization) is the practice of structuring content so AI search engines retrieve, understand, synthesize, and cite it in generated answers.
AggregateRating Schema for AI Citations
AggregateRating schema specification for AI citations: required fields, decimal handling, parent-type pairings (Product, Course, SoftwareApplication, LocalBusiness), Google policy violations.
Canonical Tag for AI Search
Specification for rel=canonical implementation across HTML and HTTP-header methods, with guidance on how AI engines resolve canonicals for parameterized URLs and AMP variants.