Geodocs.dev

CDN Configuration Checklist for AI Crawler Discoverability

Configure your CDN to allow verified AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot) by tuning bot management policies, adding user-agent and IP allowlists, adjusting cache headers and TTLs for repeat-scan traffic, removing geographic restrictions on public docs, ensuring critical content is server-rendered, and verifying with origin logs. Most "invisible-to-AI" sites are blocked by default CDN bot rules — not by their robots.txt.

TL;DR

If your site is public on Google but missing from ChatGPT, Claude, and Perplexity answers, the cause is usually your CDN — not your content. Default bot management presets on Cloudflare, Akamai, Fastly, and AWS CloudFront often classify AI crawlers as "scrapers" and silently block, challenge, or rate-limit them. Run this checklist to confirm that every AI crawler that should be reading your content actually can.

How to use this checklist

Work through each section in order. Every check is binary: it either passes or fails. Fix any failure before moving on, because most CDNs evaluate rules as a cascade and short-circuit on the first deny.

You will need:

  • Admin access to your CDN dashboard (Cloudflare, Akamai, Fastly, AWS CloudFront, etc.)
  • Access to origin server logs for at least the last 7 days (a log-scanning sketch follows this list)
  • A user-agent test tool such as curl -A or a hosted crawlability checker
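
To confirm that crawler traffic actually reaches your origin, a short script over the access logs is often enough. The sketch below assumes a plain-text access log in which the user-agent string appears somewhere on each line (true for the common and combined formats); the log path is passed as a command-line argument.

```python
import collections
import sys

# AI crawler tokens to look for in the User-Agent field of each log line.
BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
              "Claude-Web", "PerplexityBot", "Perplexity-User",
              "Google-Extended", "Amazonbot", "Meta-ExternalAgent"]

counts = collections.Counter()
with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        for token in BOT_TOKENS:
            if token in line:
                counts[token] += 1

for token in BOT_TOKENS:
    print(f"{token:20s} {counts[token]:>8d} requests")
```

Run it against at least a week of logs. Zero hits for a crawler that should be reading your docs usually means the CDN is absorbing or blocking those requests before they ever reach the origin.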

For the policy layer that sits above this checklist, see the Robots.txt configuration guide for AI crawlers and the Technical SEO hub.

Section 1 — Robots.txt at the edge

  • [ ] robots.txt is reachable at https://yourdomain.com/robots.txt and returns 200
  • [ ] robots.txt is not cached at the CDN longer than 1 hour
  • [ ] Crawler rules explicitly Allow: / for the bots you want (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, ChatGPT-User, Perplexity-User, Claude-Web)
  • [ ] No CDN-managed robots.txt is overriding your file (on Cloudflare, the is_robots_txt_managed zone flag is false if you maintain your own)
  • [ ] The file is served as Content-Type: text/plain and is not double-compressed (e.g., gzip applied at both origin and edge)

Cloudflare's "AI Scrapers and Crawlers" toggle under Security → Bots can silently inject Disallow: rules for AI bots without changing your file. Disable it if you want AI discoverability.

Section 2 — Bot management policies

  • [ ] AI crawler user-agents are categorized as verified bots, not as "web scrapers"
  • [ ] Action for verified AI bots is set to Allow, not Challenge, JS Challenge, or Managed Challenge
  • [ ] Rate limits for verified AI bots are at least 100 requests/minute per IP for documentation domains
  • [ ] Custom WAF rules do not block on User-Agent contains "bot" (a frequent false-positive pattern)
  • [ ] Bot Fight Mode / Super Bot Fight Mode does not apply to AI crawler IP ranges

Per-CDN quick checks:

  • Cloudflare: Security → Bots → "AI Scrapers and Crawlers" disabled; AI Crawl Control → set verified crawlers to Allow.
  • Akamai: Bot Manager → reclassify ChatGPT Agent, GPTBot, ClaudeBot, and PerplexityBot from Web Scrapers to a custom Allow category.
  • Fastly: Add a VCL allowlist that sets X-Bot-Verified: ai-crawler for matching User-Agents and bypasses bot-protection logic downstream.
  • AWS CloudFront + WAF: Exclude AI crawler user-agent strings from the AWS Managed Bot Control rule group.
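
You can spot-check the "Allow, not Challenge" behavior from outside the CDN by requesting the same URL with a browser user-agent and with AI crawler user-agents, then comparing the responses. This is a sketch with two caveats: the user-agent strings below are abbreviated stand-ins for the full published strings, and rules that also validate the source IP will still block requests from your workstation even when the real crawler is allowed.

```python
import urllib.error
import urllib.request

URL = "https://yourdomain.com/docs/"   # placeholder URL
AGENTS = {
    "browser":       "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "GPTBot":        "GPTBot/1.1",
    "ClaudeBot":     "ClaudeBot/1.0",
    "PerplexityBot": "PerplexityBot/1.0",
}

for name, ua in AGENTS.items():
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            status = resp.status
            mitigated = resp.headers.get("cf-mitigated", "-")
    except urllib.error.HTTPError as err:
        status = err.code
        mitigated = err.headers.get("cf-mitigated", "-")
    except urllib.error.URLError as err:
        status = f"connection failed ({err.reason})"
        mitigated = "-"
    # A 403/429/503, or any challenge marker, means a bot rule fired for this UA.
    print(f"{name:15s} status={status} mitigation={mitigated}")
```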

Section 3 — User-agent and IP allowlists

  • [ ] User-agent allowlist contains exact strings for: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, Perplexity-User, Google-Extended, Amazonbot, Meta-ExternalAgent
  • [ ] IP allowlists pull from the official sources (e.g., openai.com/gptbot.json, openai.com/chatgpt-user.json, plus published Anthropic and Perplexity ranges)
  • [ ] Allowlist is refreshed at least monthly — AI crawler IP ranges change frequently
  • [ ] You verify reverse DNS or request signatures before allowing traffic; never trust the user-agent string alone

ChatGPT Agent signs every outbound HTTP request, so platforms like Akamai can verify authenticity without you maintaining IP lists yourself.
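
For the allowlist and verification checks, the sketch below pulls the published GPTBot ranges and tests whether an IP taken from your origin logs falls inside them, plus a generic reverse-DNS forward-confirm helper for crawlers that document an rDNS suffix. It assumes the feed uses the same prefixes/ipv4Prefix schema Google publishes for Googlebot; adjust the parsing if the format differs.

```python
import ipaddress
import json
import socket
import urllib.request

GPTBOT_FEED = "https://openai.com/gptbot.json"   # official published ranges

def load_gptbot_networks():
    """Fetch the published GPTBot ranges.

    Assumes a {"prefixes": [{"ipv4Prefix": ...}, ...]} schema; adjust if
    the feed differs.
    """
    with urllib.request.urlopen(GPTBOT_FEED, timeout=15) as resp:
        feed = json.load(resp)
    nets = []
    for entry in feed.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            nets.append(ipaddress.ip_network(prefix))
    return nets

def ip_in_ranges(ip, networks):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

def reverse_dns_confirms(ip, expected_suffix):
    """Generic rDNS forward-confirm check for crawlers that document a suffix."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith(expected_suffix):
        return False
    forward = socket.getaddrinfo(host, None)
    return any(info[4][0] == ip for info in forward)

if __name__ == "__main__":
    nets = load_gptbot_networks()
    candidate = "203.0.113.7"   # an IP pulled from your origin logs (example value)
    print("claims GPTBot, in published ranges:", ip_in_ranges(candidate, nets))
```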

Section 4 — Cache headers and TTLs

  • [ ] HTML pages return Cache-Control: public, max-age=3600 or longer so AI crawler scans hit cache
  • [ ] Vary: User-Agent is not set on public documentation (it fragments cache and starves AI crawlers)
  • [ ] CF-Cache-Status / X-Cache headers show HIT for AI-crawler requests on repeat scans
  • [ ] Origin shielding or tiered cache is enabled if AI crawler traffic exceeds 1% of total requests
  • [ ] You have a separate, higher-TTL cache key for verified AI bot traffic if your CDN supports it

AI crawlers issue many repeat scans of the same URL. Without aggressive caching, every scan becomes an origin request, raising both latency and egress costs — Cloudflare has documented measurable cache-hit-rate drops under AI crawler load.
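
To verify the cache behavior end to end, request the same page twice with an AI crawler user-agent and inspect the headers on the repeat request. The sketch below uses placeholder URL and user-agent values and looks for a max-age of at least 3600, an edge HIT, and the absence of User-Agent in Vary.

```python
import re
import urllib.request

URL = "https://yourdomain.com/docs/getting-started/"   # placeholder URL
UA = "GPTBot/1.1"                                       # abbreviated UA for illustration

def fetch(url, ua):
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    return urllib.request.urlopen(req, timeout=15)

# First request warms the edge cache; the second should be served from it.
with fetch(URL, UA) as warm:
    warm.read()
with fetch(URL, UA) as repeat:
    headers = repeat.headers        # case-insensitive header lookup

cache_control = headers.get("Cache-Control", "")
edge_status = headers.get("CF-Cache-Status") or headers.get("X-Cache") or "unknown"
vary = headers.get("Vary", "")

max_age = re.search(r"max-age=(\d+)", cache_control)
print("Cache-Control:", cache_control or "(none)")
print("max-age >= 3600:", bool(max_age and int(max_age.group(1)) >= 3600))
print("edge cache status on repeat request:", edge_status)
print("Vary mentions User-Agent (should not):", "user-agent" in vary.lower())
```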

Section 5 — Geographic and protocol restrictions

  • [ ] No geo-block rules drop traffic from AWS, GCP, Azure, or OCI ranges (most AI crawlers run there)
  • [ ] HTTP/2 and HTTP/3 are enabled at the edge
  • [ ] Both TLS 1.2 and 1.3 are supported (some crawlers still negotiate 1.2)
  • [ ] IPv6 is enabled — several AI crawlers prefer or require it
  • [ ] No country-level firewalls drop U.S., EU, or Singapore data-center traffic
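
Protocol support can be probed from any workstation with the Python standard library. The hostname below is a placeholder; HTTP/2 support is inferred from ALPN negotiation, and HTTP/3 has to be checked separately with a QUIC-capable client because the standard library cannot speak it.

```python
import socket
import ssl

HOST = "yourdomain.com"   # placeholder hostname

def probe_tls(version):
    """Force a single TLS version and report what the edge negotiates (plus ALPN)."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version
    ctx.maximum_version = version
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    try:
        with socket.create_connection((HOST, 443), timeout=10) as raw:
            with ctx.wrap_socket(raw, server_hostname=HOST) as tls:
                return f"{tls.version()} (ALPN: {tls.selected_alpn_protocol()})"
    except (ssl.SSLError, OSError) as exc:
        return f"handshake failed: {exc}"

print("TLS 1.2:", probe_tls(ssl.TLSVersion.TLSv1_2))
print("TLS 1.3:", probe_tls(ssl.TLSVersion.TLSv1_3))

# IPv6 reachability: the host should publish an AAAA record and accept connections on it.
try:
    info = socket.getaddrinfo(HOST, 443, socket.AF_INET6, socket.SOCK_STREAM)[0]
    with socket.create_connection((info[4][0], 443), timeout=10):
        print("IPv6: reachable at", info[4][0])
except OSError as exc:
    print("IPv6: not reachable:", exc)
```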

Section 6 — JavaScript rendering at the edge

  • [ ] Critical content (title, headings, body copy, FAQ blocks, schema.org JSON-LD) is present in the initial HTML payload, not injected by client-side JS
  • [ ] If the site is a SPA, you have SSR or static prerendering for crawler user-agents
  • [ ] Edge functions or workers do not strip