Robots.txt for AI Crawlers: Specification & Configuration
Robots.txt for AI crawlers uses the same Robots Exclusion Protocol syntax (User-agent, Allow, Disallow, Sitemap) as traditional SEO, but each AI vendor publishes its own user-agent strings — GPTBot and OAI-SearchBot for OpenAI, ClaudeBot and Claude-User for Anthropic, PerplexityBot and Perplexity-User for Perplexity, Google-Extended for Google's Gemini training, and CCBot for Common Crawl. Vendors honor the standard directives; controlling training and live retrieval requires separate User-agent groups.
TL;DR
Use the standard robots.txt syntax with one User-agent group per AI bot. Block training-only bots (GPTBot, Google-Extended, CCBot, Bytespider, Applebot-Extended) when you do not want your content used to train foundation models. Allow live-retrieval bots (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-User) when you want to be cited in AI search answers. Treat the two decisions as independent: you can disallow training while allowing citation.
Definition
/robots.txt is the canonical file at the root of a domain that implements the Robots Exclusion Protocol (REP, formalized in RFC 9309). It instructs compliant crawlers which paths they may fetch. AI crawlers honor REP voluntarily, like traditional bots; the file is advisory, not enforcing. For enforcement, use IP-level blocking, WAF rules, or authentication.
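Because compliance is voluntary, any hard guarantee has to come from the server. As a minimal illustrative sketch (not a production WAF rule), the Python WSGI middleware below refuses requests whose User-Agent contains a blocked token; the token list is an example policy, and since user-agent strings are trivially spoofed, real enforcement should also verify the vendor's published IP ranges or reverse DNS.

```python
# Minimal sketch: server-side enforcement of an AI-crawler block.
# Assumes a WSGI app; BLOCKED_TOKENS is an example policy, not a standard.

BLOCKED_TOKENS = ("GPTBot", "CCBot", "Bytespider")

class BlockAICrawlers:
    def __init__(self, app):
        self.app = app  # the wrapped WSGI application

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in BLOCKED_TOKENS):
            # Refuse outright; robots.txt alone cannot do this.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
```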
Why it matters
A misconfigured robots.txt has two failure modes for AI search:
- Default-block CMS/CDN templates silently ship Disallow: / under User-agent: *, removing your content from AI search citations entirely.
- A file with no vendor-specific groups forces you to choose between blocking everything and allowing everything, instead of separating training (a long-term IP risk) from live retrieval (a visibility upside).
Getting robots.txt right is the cheapest, highest-leverage AI search lever. It costs minutes to edit and has 24-hour propagation on most vendors (OpenAI explicitly documents ~24h propagation for robots.txt updates).
Supported directive syntax
AI crawlers support the same four core directives as Googlebot:
| Directive | Required | Description |
|---|---|---|
| User-agent: | Yes | Names the bot group. * matches any bot not explicitly named. |
| Allow: | No | Permits crawling of the given path. Wins over Disallow when it is the more specific (longer) match. |
| Disallow: | No | Forbids crawling. Disallow: (empty) is equivalent to allow all. |
| Sitemap: | No | Absolute URL to a sitemap. Not tied to a User-agent group. |
Google's robots.txt parser, which most AI vendors emulate, formally supports only these four fields. Crawl-delay is not supported by Google or by GPTBot; it is honored by Bing, Yandex, and some smaller AI crawlers but should not be relied upon.
Path matching rules (a runnable sketch follows this list):
- Paths are case-sensitive and must start with /.
- * is a wildcard for any sequence of characters.
- $ anchors the end of the URL.
- More specific paths win when both Allow and Disallow match.
- An empty Disallow: means allow everything; an empty Allow: is meaningless.
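The following self-contained Python sketch makes the precedence rules concrete. It is not a full RFC 9309 parser (Python's standard urllib.robotparser ignores * and $ patterns, so the matcher is hand-rolled here); the longest-match tie-break follows Google's documented behavior, where an Allow and a Disallow of equal specificity resolve to Allow.

```python
import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    """REP path matching: '*' spans any characters, '$' anchors the end,
    comparison is case-sensitive and starts at the beginning of the path."""
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None

def is_allowed(rules, url_path: str) -> bool:
    """Longest-match-wins over (directive, path) pairs; ties between Allow
    and Disallow go to Allow. No matching rule means the path is allowed."""
    best_len, allowed = -1, True
    for directive, path in rules:
        if not path or not rule_matches(path, url_path):
            continue  # an empty Disallow: matches nothing, i.e. allow all
        if len(path) > best_len or (len(path) == best_len and directive == "allow"):
            best_len, allowed = len(path), directive == "allow"
    return allowed

rules = [("allow", "/blog/"), ("disallow", "/blog/drafts/"), ("disallow", "/*.pdf$")]
print(is_allowed(rules, "/blog/post"))            # True: /blog/ is the only match
print(is_allowed(rules, "/blog/drafts/x"))        # False: longer Disallow wins
print(is_allowed(rules, "/blog/whitepaper.pdf"))  # False: $-anchored rule matches
```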
AI crawler user-agent reference
The following table covers the major AI crawler user-agents observed in production and documented by the operating vendors. Each row notes the bot's purpose: train (foundation model training), search (real-time index for AI search products), or fetch (on-demand fetch from a user prompt).
| User-agent | Vendor | Purpose | Honors robots.txt | Source |
|---|---|---|---|---|
| GPTBot | OpenAI | train | Yes | OpenAI bots docs |
| OAI-SearchBot | OpenAI | search | Yes | OpenAI bots docs |
| ChatGPT-User | OpenAI | fetch (user-initiated) | Yes | OpenAI bots docs |
| OAI-AdsBot | OpenAI | ad landing-page validation | Yes | OpenAI bots docs |
| ClaudeBot | Anthropic | train | Yes | Anthropic published guidance |
| Claude-Web | Anthropic | legacy (no longer in use) | Yes | Anthropic published guidance |
| Claude-User | Anthropic | fetch (user-initiated) | Yes | Anthropic published guidance |
| anthropic-ai | Anthropic | legacy alias | Yes | Anthropic published guidance |
| PerplexityBot | Perplexity | search/index | Yes | Perplexity bots docs |
| Perplexity-User | Perplexity | fetch (user-initiated) | Limited (user-initiated) | Perplexity bots docs |
| Google-Extended | Google | train (Gemini, Vertex AI) | Yes | Google docs |
| Googlebot | Google | search/index (also feeds AI Overviews) | Yes | Google docs |
| CCBot | Common Crawl | train (open dataset) | Yes | Common Crawl docs |
| Bytespider | ByteDance | train | Partial — known violations reported | Community reports |
| Applebot-Extended | Apple | train (Apple Intelligence) | Yes | Apple docs |
| Applebot | Apple | search/index | Yes | Apple docs |
| Meta-ExternalAgent | Meta | train | Yes | Meta docs |
| Meta-ExternalFetcher | Meta | fetch (user-initiated) | Limited (user-initiated) | Meta docs |
| Amazonbot | Amazon | search/Alexa | Yes | Amazon docs |
Key insight: most vendors split the bot fleet so training and search/retrieval can be controlled independently. Blocking GPTBot while allowing OAI-SearchBot is a coherent strategy — you opt out of training while staying visible in ChatGPT Search citations.
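For log tooling, the table collapses into a small lookup. The sketch below is an illustrative Python dictionary built from the rows above; Google-Extended and Applebot-Extended are omitted because they are robots.txt control tokens rather than crawling user-agents that appear in logs.

```python
# Token -> (vendor, purpose), derived from the reference table above.
# Tokens appear as substrings of the full User-Agent header.
AI_CRAWLERS = {
    "GPTBot":               ("OpenAI", "train"),
    "OAI-SearchBot":        ("OpenAI", "search"),
    "ChatGPT-User":         ("OpenAI", "fetch"),
    "ClaudeBot":            ("Anthropic", "train"),
    "Claude-User":          ("Anthropic", "fetch"),
    "PerplexityBot":        ("Perplexity", "search"),
    "Perplexity-User":      ("Perplexity", "fetch"),
    "CCBot":                ("Common Crawl", "train"),
    "Bytespider":           ("ByteDance", "train"),
    "Meta-ExternalAgent":   ("Meta", "train"),
    "Meta-ExternalFetcher": ("Meta", "fetch"),
    "Amazonbot":            ("Amazon", "search"),
    "Applebot":             ("Apple", "search"),
}

def classify(user_agent_header: str):
    """Return (token, vendor, purpose) for the first known token found,
    or None for an unrecognized agent."""
    for token, meta in AI_CRAWLERS.items():
        if token in user_agent_header:
            return (token, *meta)
    return None

print(classify("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))
# ('GPTBot', 'OpenAI', 'train')
```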
Recommended configurations
Configuration A: Allow all AI crawlers (maximum AI visibility)
Use this when you want maximum exposure in AI search results and are comfortable with content being used for foundation model training.
```
# AI search visibility: allow all
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
Configuration B: Block training, allow citation (recommended for most publishers)
Use this to prevent foundation-model training while remaining citable in ChatGPT Search, Perplexity, Claude, and Google AI Overviews.
```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow live-retrieval / citation crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
Note: Google-Extended controls Gemini training and grounding but does not control Google AI Overviews, which fall under Googlebot. Disallowing Google-Extended does not remove your content from AI Overviews.
Configuration C: Block sensitive paths only
Use this when you want broad AI access but need to protect specific sections (admin, gated content, internal APIs).
```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /*?session=

User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /paid/

Sitemap: https://example.com/sitemap.xml
```
The User-agent: * group does not propagate to a named user-agent group — you must repeat the disallows in each group, or the named group will inherit no restrictions.
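You can check this inheritance behavior with Python's standard urllib.robotparser (it ignores wildcard patterns, so the check below sticks to plain prefix rules, which is enough to demonstrate group selection):

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Disallow: /paid/
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# The named group fully replaces the wildcard group: /admin/ is NOT inherited.
print(rp.can_fetch("GPTBot", "https://example.com/admin/"))    # True (!)
print(rp.can_fetch("GPTBot", "https://example.com/paid/x"))    # False
print(rp.can_fetch("OtherBot", "https://example.com/admin/"))  # False
```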
Validation pipeline
- Syntax validation: Use the Search Console robots.txt report or the open-source google/robotstxt parser (the standalone robots.txt Tester has been retired); both are backed by Google's RFC 9309 implementation.
- Bot-specific test: Fetch the file with curl -A "GPTBot/1.0" https://example.com/robots.txt and confirm a 200 response served as Content-Type: text/plain, the media type RFC 9309 expects.
- Live behavior check: OpenAI documents ~24-hour propagation for robots.txt changes; Anthropic and Perplexity do not document propagation, so assume 24-48 hours.
- CI test: Add a build step that diffs robots.txt against a golden file and fails the build on unexpected changes; a single misplaced Disallow: / can deindex an entire site (a sketch follows this list).
- Monitoring: Track crawler hits per user-agent in CDN logs. Sudden drops in named AI crawlers indicate misconfiguration.
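A minimal version of the CI guard from the list above, assuming a committed golden copy at ops/robots.golden.txt and the Configuration B policy (both are example choices):

```python
import sys
import urllib.request
from urllib.robotparser import RobotFileParser

LIVE_URL = "https://example.com/robots.txt"  # replace with your domain
GOLDEN_PATH = "ops/robots.golden.txt"        # reviewed, committed copy

live = urllib.request.urlopen(LIVE_URL, timeout=10).read().decode("utf-8")
with open(GOLDEN_PATH, encoding="utf-8") as fh:
    golden = fh.read()

if live != golden:
    sys.exit("FAIL: live robots.txt differs from the golden copy")

# Behavioral assertions for Configuration B: training blocked, citation allowed.
rp = RobotFileParser()
rp.parse(live.splitlines())
assert not rp.can_fetch("GPTBot", "https://example.com/"), "GPTBot should be blocked"
assert rp.can_fetch("OAI-SearchBot", "https://example.com/"), "OAI-SearchBot should be allowed"
print("robots.txt OK")
```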
Differences vs traditional SEO robots.txt
| Dimension | Traditional SEO | AI crawler config |
|---|---|---|
| Number of named user-agents | 1-3 (Googlebot, Bingbot, AdsBot) | 10-20 (per-vendor split) |
| Crawl frequency cadence | Predictable | Bursty, can be 100x normal volume |
| Training vs serving split | Not applicable | Critical — separate user agents per vendor |
| Crawl-delay support | Bing yes, Google no | Mostly no |
| Compliance | Mature, near-100% | Voluntary; some bots ignore robots.txt |
| File propagation | Hours | 24-48 hours |
| Default behavior | If allowed implicitly, crawl | If allowed implicitly, train |
Misconceptions
- "AI bots ignore robots.txt." Most major AI vendors (OpenAI, Anthropic, Perplexity, Google, Apple, Meta) honor robots.txt. Bytespider and some smaller crawlers have a track record of partial compliance.
- "Disallowing GPTBot stops ChatGPT from citing me." No. GPTBot controls training only. ChatGPT Search citations are governed by OAI-SearchBot.
- "Google-Extended controls AI Overviews." No. Google-Extended controls Gemini training and grounding, not AI Overviews.
- "User-agent: * covers all AI bots." Technically yes if no specific group is named, but vendor-specific groups override the wildcard. Always be explicit for the bots you care about.
- "DisallowAITraining is a supported directive." It is a Microsoft draft IETF proposal, not a deployed standard. Do not rely on it.
Common mistakes
- Blocking User-agent: * in CMS templates and forgetting to whitelist AI crawlers
- Putting Sitemap directives inside a User-agent group (Sitemap is global)
- Using uppercase USER-AGENT: (field names are case-insensitive per RFC 9309, but inconsistent casing trips older tooling)
- Relying on Crawl-delay against GPTBot or Googlebot — not supported
- Editing robots.txt without a CI golden-file diff, allowing a typo to deindex the site
- Confusing training opt-out with citation opt-out (they are separate controls)
How to apply
- Audit your current /robots.txt and inventory which AI bot user-agents are present.
- Pick a configuration profile (A, B, or C above) based on whether you prioritize visibility or training opt-out.
- Stage the new file in a non-production branch, run a CI golden-file diff, and review with stakeholders.
- Deploy to production and verify with curl -A "GPTBot/1.0" https://example.com/robots.txt.
- Wait 24-48 hours, then check CDN logs for the expected user-agent traffic patterns (a log-scan sketch follows this list).
- Add a quarterly calendar reminder to refresh the user-agent list — new AI crawlers are launched every quarter.
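For step 5, a rough log scan like the Python sketch below is enough to spot a sudden drop. It assumes one request per line with the user-agent somewhere in the line (adjust for your CDN's log schema); the filename is a placeholder.

```python
from collections import Counter

# Crawling tokens only: Google-Extended and Applebot-Extended are robots.txt
# controls and never appear as user-agents in logs.
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "Claude-User", "PerplexityBot", "Perplexity-User",
             "CCBot", "Bytespider", "Applebot", "Meta-ExternalAgent",
             "Meta-ExternalFetcher", "Amazonbot"]

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        for token in AI_TOKENS:
            if token in line:
                counts[token] += 1
                break  # count each request once

for token, n in counts.most_common():
    print(f"{token:22} {n}")
```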
FAQ
Q: Does blocking GPTBot remove me from ChatGPT Search citations?
No. GPTBot controls foundation-model training. ChatGPT Search citations are governed by OAI-SearchBot. To opt out of training while staying citable, disallow GPTBot and allow OAI-SearchBot.
Q: Does Google-Extended control AI Overviews?
No. Google-Extended controls whether your content is used to train Gemini models and to ground Gemini answers. AI Overviews in Google Search are powered by Googlebot-indexed content; there is no robots.txt directive to opt out of AI Overviews specifically without also opting out of Google Search.
Q: Is Crawl-delay supported by AI crawlers?
Mostly no. Google's parser ignores Crawl-delay, and OpenAI does not document support. Bing and Yandex honor it. For AI crawl rate control, use server-side rate limiting or CDN rules instead.
Q: How long does a robots.txt change take to propagate?
OpenAI documents ~24 hours for robots.txt changes to take effect. Anthropic and Perplexity do not publish a propagation SLA; assume 24-48 hours.
Q: Should I add a Sitemap directive for AI crawlers?
Yes. The Sitemap directive is global (not tied to a User-agent group) and gives any compliant bot a fallback URL inventory, whether or not it also reads llms.txt.
Q: Do agentic browsers like ChatGPT Atlas or Perplexity Comet honor robots.txt?
User-initiated browsing by agentic browsers is treated like a human user; robots.txt enforcement is limited or non-existent for these on-demand fetches. Use authentication or rate limiting to control access.
Sources
1. Google, "How Google Interprets the robots.txt Specification" — verified 2026-05-03 — supports four-directive REP definition. https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec
2. Wikipedia, "robots.txt" — verified 2026-05-03 — supports REP history and voluntary-compliance nature. https://en.wikipedia.org/wiki/Robots.txt
3. OpenAI, "Overview of OpenAI Crawlers" — verified 2026-05-03 — supports GPTBot/OAI-SearchBot/ChatGPT-User split and 24h propagation. https://developers.openai.com/api/docs/bots
4. Scrunch, "Guide to AI User Agents" — verified 2026-05-03 — supports PerplexityBot and Perplexity-User behavior. https://scrunch.com/resources/guides/guide-to-ai-user-agents/
5. Marie Haynes, "Should you block Google Extended in Robots.txt?" — verified 2026-05-03 — supports Google-Extended scope (training, not AI Overviews). https://www.mariehaynes.com/should-you-use-google-extended-in-robots-txt/
6. Cite.sh, "GPTBot, ClaudeBot, PerplexityBot: The AI Crawler Guide" — verified 2026-05-03 — supports user-agent group inheritance behavior. https://www.cite.sh/blog/ai-crawler-guide/
7. Taskade, "11 Best AI Robots.txt & SEO Config Generators" — verified 2026-05-03 — supports CI golden-file diff caution. https://www.taskade.com/blog/ai-robots-txt-generators
8. robotstxt.com, "AI / LLM User-Agents: Blocking Guide" — verified 2026-05-03 — supports DisallowAITraining draft status. https://robotstxt.com/ai
Related Articles
AI Crawler IP Allowlist Reference
Reference list of official AI crawler IP range endpoints, user agents, and reverse-DNS verification methods for GPTBot, ClaudeBot, PerplexityBot, Googlebot, and more.
How to Create llms.txt: Step-by-Step Tutorial for AI Search
Step-by-step tutorial for creating, deploying, and validating an llms.txt file so AI systems and LLMs can discover your site's most important content.
HTTP/2 vs HTTP/3 for AI Crawlers
HTTP/3 AI crawlers support is uneven: GPTBot and most AI bots still default to HTTP/2 over TCP. Compare protocols, fallback behavior, and CDN config.