AI Crawler IP Allowlist Reference
Major AI crawlers publish official IP ranges as JSON endpoints (OpenAI, Perplexity, Google, Apple), while Anthropic relies on user-agent matching plus robots.txt because its crawlers run from shared public-cloud IP space. Use reverse-DNS verification for Googlebot and JSON allowlists for the rest, refreshed at least weekly.
TL;DR
OpenAI, Perplexity, Google, and Apple publish CIDR ranges via versioned JSON endpoints; allowlist them at the WAF and refresh on a schedule. Anthropic does not publish ranges — verify ClaudeBot, Claude-SearchBot, and Claude-User through user-agent and request behavior, not IP filtering. Googlebot, Google-Extended, and OAI-SearchBot also support reverse-DNS verification.
Why this reference exists
User-agent strings can be spoofed in seconds, so any allowlist that depends on User-Agent: GPTBot alone is unreliable. Production teams that want to allow legitimate AI training and answer-grounding crawlers — while rejecting impersonators — pair user-agent matching with either an authoritative IP allowlist or a reverse-DNS check. This page consolidates the canonical sources for both paths so you can build WAF rules, log filters, and analytics segments without scraping forum posts.
Quick reference table
| Crawler family | Bot name | Published IP source | Reverse-DNS hostname | Respects robots.txt | Purpose |
|---|---|---|---|---|---|
| OpenAI | GPTBot | https://openai.com/gptbot.json | n/a | Yes | Foundation-model training data. |
| OpenAI | OAI-SearchBot | https://openai.com/searchbot.json | openai.com suffix | Yes | ChatGPT search index. |
| OpenAI | ChatGPT-User | https://openai.com/chatgpt-user.json | n/a | Yes | On-demand user-initiated fetch. |
| Anthropic | ClaudeBot | Not published | Not published | Yes | Training data. |
| Anthropic | Claude-SearchBot | Not published | Not published | Yes | Claude web search. |
| Anthropic | Claude-User | Not published | Not published | Yes | User-initiated fetch. |
| Perplexity | PerplexityBot | https://www.perplexity.ai/perplexitybot.json | n/a | Yes | Indexing crawler. |
| Perplexity | Perplexity-User | https://www.perplexity.ai/perplexity-user.json | n/a | User-initiated; may bypass robots.txt as a user action. | On-demand fetch. |
| Google | Googlebot | https://developers.google.com/static/search/apis/ipranges/googlebot.json | googlebot.com / google.com | Yes | Search crawler. |
| Google | Google-Extended | Same Googlebot ranges + robots token | Same as Googlebot | Yes (separate token) | Gemini and AI Overviews training opt-out. |
| Google | Special-case fetchers | https://developers.google.com/static/search/apis/ipranges/special-crawlers.json | google.com | Yes | AdsBot, FeedFetcher, etc. |
| Apple | Applebot / Applebot-Extended | https://search.developer.apple.com/applebot.json | applebot.apple.com | Yes | Applebot-Extended is the AI-training opt-out token. |
JSON endpoints are the source of truth — treat the table above as a starting map, then pull live ranges before building WAF rules.
Verification approaches
There are three accepted ways to verify that a crawler is who it claims to be.
1. JSON IP allowlist (OpenAI, Perplexity, Google, Apple)
The publishing vendor exposes a small JSON document whose prefixes entries carry CIDR ranges. Fetch it from your build pipeline, normalize it to a flat list, and push the result into your CDN or WAF as an allowlist. Re-run the job at least every seven days; OpenAI in particular has rotated ranges multiple times in the last 18 months, and stale lists silently drop legitimate crawl traffic.
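The fetch-and-normalize step can be sketched in Python. This assumes the Google-style document shape (a `prefixes` array of `ipv4Prefix`/`ipv6Prefix` objects), which the googlebot.json endpoint uses and the OpenAI endpoints appear to mirror; verify the schema of each vendor feed before relying on it:

```python
import ipaddress
import json
from urllib.request import urlopen


def flatten_prefixes(doc: dict) -> list[str]:
    """Normalize a Google-style ip-ranges document to a flat CIDR list.

    Assumes {"prefixes": [{"ipv4Prefix": ...} | {"ipv6Prefix": ...}]}.
    Malformed entries are skipped so one bad record cannot break the
    whole refresh job.
    """
    cidrs = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if not cidr:
            continue
        try:
            cidrs.append(str(ipaddress.ip_network(cidr, strict=False)))
        except ValueError:
            continue
    return sorted(set(cidrs))


def fetch_ranges(url: str) -> list[str]:
    """Pull a vendor endpoint and return its CIDR list (network call)."""
    with urlopen(url, timeout=10) as resp:
        return flatten_prefixes(json.load(resp))


# Hypothetical document, shaped like the real feeds, for offline testing:
sample = {"prefixes": [{"ipv4Prefix": "20.42.10.176/28"},
                       {"ipv6Prefix": "2a01:111::/32"},
                       {"note": "no prefix key"}]}
```

Feed the output of `fetch_ranges` into whatever mechanism populates your WAF's named IP list.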
A typical Cloudflare WAF expression looks like:
```
(http.user_agent contains "GPTBot" and not ip.src in $openai_gptbot_ranges)
```

The expression matches requests that claim to be GPTBot but originate from outside the allowlist, then takes a Block or Managed Challenge action.
2. Reverse-DNS plus forward confirmation (Google, OAI-SearchBot)
Google's official guidance is the four-step reverse-DNS dance: take the source IP, run a reverse lookup on it, confirm the returned hostname ends in googlebot.com or google.com, then run a forward lookup on that hostname and confirm it resolves back to the original IP. OAI-SearchBot supports the same pattern with hostnames ending in openai.com.
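The four-step check can be sketched in Python. The resolver functions are injectable so the logic is testable offline; the trusted-suffix list below reflects the hostnames this page describes and should be checked against current vendor documentation:

```python
import socket

# Suffixes per Google's Googlebot verification docs plus openai.com for
# OAI-SearchBot (assumption: confirm against current vendor guidance).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".openai.com")


def verify_crawler_ip(ip: str,
                      suffixes=TRUSTED_SUFFIXES,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex) -> bool:
    """Reverse-DNS plus forward-confirmation check.

    1. Reverse-resolve the source IP to a hostname.
    2. Require the hostname to end in a trusted suffix.
    3. Forward-resolve that hostname.
    4. Confirm the original IP is among the forward results.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.lower().endswith(suffixes):
        return False
    try:
        _, _, addrs = forward(hostname)
    except OSError:
        return False
    return ip in addrs
```

Because both lookups must agree, a spoofer who controls reverse DNS for their own IP space still fails step 4.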
3. User-agent + behavioral signals (Anthropic)
Anthropic explicitly does not publish IP ranges because ClaudeBot runs on shared cloud infrastructure where blocking the underlying ranges would also prevent the bot from fetching robots.txt. Site owners should rely on the user-agent strings (ClaudeBot/1.0, Claude-SearchBot/1.0, Claude-User/1.0), enforce policy through robots.txt, and use rate limiting plus behavioral fingerprints (request cadence, header order, TLS JA3) for defense in depth.
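Since enforcement for Anthropic is policy-side, it helps to state that policy explicitly in robots.txt. A minimal sketch covering all three agents; the paths are illustrative, not a recommendation:

```
# Explicit policy for Anthropic's crawlers (adjust paths to your site)
User-agent: ClaudeBot
Disallow: /private/

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /
```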
WAF allowlist recipes
The recipes below assume you have already imported the JSON ranges into a named list (openai_gptbot_ranges, perplexity_bot_ranges, etc.).
Cloudflare WAF custom rule
```
(http.user_agent contains "GPTBot" and not ip.src in $openai_gptbot_ranges) or
(http.user_agent contains "OAI-SearchBot" and not ip.src in $openai_searchbot_ranges) or
(http.user_agent contains "PerplexityBot" and not ip.src in $perplexity_bot_ranges)
```

Action: Block. The rule targets impersonators only; legitimate bots fall through to the default allow path.
AWS WAFv2 rule sketch
Use an IPSet per crawler, a RegexPatternSet for user agents, and combine them in a Statement of type AndStatement with a NotStatement around the IP match. For each crawler, action Allow on match, then a fallback Block for any UA that contains the bot name without the IP membership.
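One such impersonator-blocking rule might look like the following WAFv2 JSON, sketched for GPTBot only; the IPSet ARN is a placeholder for the set you populate from the OpenAI feed:

```json
{
  "Name": "block-gptbot-impersonators",
  "Priority": 10,
  "Action": { "Block": {} },
  "Statement": {
    "AndStatement": {
      "Statements": [
        {
          "ByteMatchStatement": {
            "SearchString": "GPTBot",
            "FieldToMatch": { "SingleHeader": { "Name": "user-agent" } },
            "TextTransformations": [ { "Priority": 0, "Type": "NONE" } ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "NotStatement": {
            "Statement": {
              "IPSetReferenceStatement": {
                "ARN": "arn:aws:wafv2:REGION:ACCOUNT:regional/ipset/openai-gptbot-ranges/ID"
              }
            }
          }
        }
      ]
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "gptbot-impersonators"
  }
}
```

Repeat the pattern per crawler, one IPSet each, so a stale range for one vendor never affects the others.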
Akamai / Fastly
Both platforms support ACLs sourced from external JSON. Set the refresh interval to 24 hours; Akamai's siteshield-managed-list and Fastly's acl_cidrs accept CIDR notation directly from the OpenAI and Perplexity feeds.
Refresh cadence
Treat AI crawler ranges as living data:
- OpenAI: re-pull every 24 hours; ranges have changed multiple times per year.
- Perplexity: weekly minimum; the Perplexity-User agent rotates Azure ranges frequently.
- Google: weekly is sufficient; ranges are stable but updates happen when new regions are added.
- Apple: monthly is fine for Applebot.
- Anthropic: re-read the help-center article each quarter; if Anthropic begins publishing ranges, update this reference.
Set a recurring task in your editorial or DevOps calendar tied to this page's review_cycle_days (90).
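A crontab sketch of that schedule, assuming a hypothetical refresh script that pulls the vendor JSON and pushes the ranges into your WAF lists:

```
# Daily pull for OpenAI, weekly for the rest (script name is illustrative)
15 3 * * *  /opt/waf/update-crawler-ranges.sh openai
30 3 * * 1  /opt/waf/update-crawler-ranges.sh perplexity google apple
```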
Common mistakes
- Hard-coding IPs from blog posts. Lists copied from third-party blogs go stale within weeks. Always pull from the vendor JSON.
- Blocking entire cloud /16s. Anthropic and OpenAI both run on Azure ranges shared with other tenants. Wide blocks kill legitimate user traffic.
- Forgetting Google-Extended. Google-Extended uses the same IP ranges as Googlebot but a different robots.txt token; allowing the IP without honoring the token still excludes you from AI training.
- Allowing UA only. Without an IP or DNS check, every spoofer claiming GPTBot walks straight in.
- Skipping Anthropic on the policy side. Because there is no IP allowlist, teams sometimes forget to publish robots.txt rules for ClaudeBot at all, leaving training inclusion ambiguous.
FAQ
Q: Why doesn't Anthropic publish IP ranges for ClaudeBot?
Anthropic runs ClaudeBot on shared public-cloud IP space, so publishing a list would either be incomplete or accidentally allowlist non-Anthropic tenants. Their official guidance is to use robots.txt plus user-agent matching and to avoid IP-based blocking that could also prevent the bot from fetching robots.txt itself.
Q: Can I use Googlebot's reverse-DNS method for OAI-SearchBot?
Yes — OpenAI's search crawler advertises hostnames terminating in openai.com. Run a reverse lookup, verify the suffix, then forward-resolve to confirm the IP matches. JSON allowlists are still the authoritative source, but DNS is a useful runtime fallback.
Q: How often do AI crawler IP ranges change?
OpenAI's ranges have changed multiple times in the past year. Perplexity rotates its user-fetch ranges frequently. Google's ranges are stable but expand when new data centers come online. Plan for weekly refresh as a baseline and daily for OpenAI if your traffic depends on training-data inclusion.
Q: Do I need to allowlist Perplexity-User to be cited?
Citations come from the indexing crawler PerplexityBot, not from the on-demand Perplexity-User agent. However, blocking Perplexity-User will prevent your page from being summarized when a user pastes a URL into Perplexity. Most sites allow both.
Q: Is Google-Extended a separate crawler from Googlebot?
No. Google-Extended is a robots.txt user-agent token that controls whether your content is used to train Gemini and improve Google AI Overviews. The actual fetching is done by Googlebot using the standard Googlebot IP ranges.
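In robots.txt terms, the two are controlled by separate tokens even though the fetcher and IP ranges are the same. For example, to stay in Search while opting out of AI training:

```
# Keep Search indexing, opt out of Gemini / AI Overviews training
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
```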
Related Articles
What Is GEO? Generative Engine Optimization Defined
GEO (Generative Engine Optimization) is the practice of structuring content so AI search engines retrieve, understand, synthesize, and cite it in generated answers.
AggregateRating Schema for AI Citations
AggregateRating schema specification for AI citations: required fields, decimal handling, parent-type pairings (Product, Course, SoftwareApplication, LocalBusiness), Google policy violations.
Canonical Tag for AI Search
Specification for rel=canonical implementation across HTML and HTTP-header methods, with guidance on how AI engines resolve canonicals for parameterized URLs and AMP variants.