Geodocs.dev

Robots.txt for AI Crawlers: Specification & Configuration


Robots.txt for AI crawlers uses the same Robots Exclusion Protocol syntax (User-agent, Allow, Disallow, Sitemap) as traditional SEO, but each AI vendor publishes its own user-agent strings — GPTBot and OAI-SearchBot for OpenAI, ClaudeBot and Claude-User for Anthropic, PerplexityBot and Perplexity-User for Perplexity, Google-Extended for Google's Gemini training, and CCBot for Common Crawl. Vendors honor the standard directives; controlling training and live retrieval requires separate User-agent groups.

TL;DR

Use the standard robots.txt syntax with one User-agent group per AI bot. Block training-only bots (GPTBot, Google-Extended, CCBot, Bytespider, Applebot-Extended) when you do not want your content used to train foundation models. Allow live-retrieval bots (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-User) when you want to be cited in AI search answers. Treat the two decisions as independent: you can disallow training while allowing citation.

Definition

/robots.txt is the canonical file at the root of a domain that implements the Robots Exclusion Protocol (REP, formalized in RFC 9309). It instructs compliant crawlers which paths they may fetch. AI crawlers honor REP voluntarily, like traditional bots; the file is advisory, not enforcing. For enforcement, use IP-level blocking, WAF rules, or authentication.

Why it matters

A misconfigured robots.txt has two failure modes for AI search:

  1. Default-block CMS/CDN templates silently ship a blanket Disallow under User-agent: *, removing your content from AI search citations entirely.
  2. Without vendor-specific groups, you are forced to choose between blocking everything and allowing everything, instead of separating training (long-term IP risk) from live retrieval (visibility upside).

Getting robots.txt right is the cheapest, highest-leverage AI search lever: the file takes minutes to edit, and changes propagate within roughly 24 hours for most vendors (OpenAI explicitly documents ~24h propagation for robots.txt updates).

Supported directive syntax

AI crawlers support the same four core directives as Googlebot:

| Directive | Required | Description |
| --- | --- | --- |
| User-agent: | Yes | Names the bot group. * matches any bot not explicitly named. |
| Allow: | No | Permits crawling of the given path. Overrides Disallow on more specific paths. |
| Disallow: | No | Forbids crawling. Disallow: (empty) is equivalent to allow all. |
| Sitemap: | No | Absolute URL to a sitemap. Not tied to a User-agent group. |

Google's robots.txt parser, which most AI vendors emulate, formally supports only these four fields. Crawl-delay is not supported by Google or by GPTBot; it is honored by Bing, Yandex, and some smaller AI crawlers but should not be relied upon.

Path matching rules:

  • Paths are case-sensitive and must start with /.
  • * is a wildcard for any sequence of characters.
  • $ anchors the end of the URL.
  • More specific paths win when both Allow and Disallow match.
  • An empty Disallow: means allow everything; an empty Allow: is meaningless.
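The precedence rules above can be sketched as a small matcher. This is an illustrative sketch of Google-style longest-match semantics, not any vendor's actual parser; `is_allowed` and `_pattern_to_regex` are hypothetical helper names:

```python
import re

def _pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any character sequence; '$' anchors the end of the URL.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(regex)

def is_allowed(rules, path):
    """rules: list of ('allow'|'disallow', pattern) pairs.
    Longest matching pattern wins; on a tie, Allow beats Disallow
    (Google-style semantics). No matching rule means allowed."""
    best = None  # (pattern length, verdict-is-allow)
    for verdict, pattern in rules:
        if not pattern:  # empty Disallow: means allow everything
            continue
        if _pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), verdict == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]
```

For example, with `[("disallow", "/private/"), ("allow", "/private/public/")]`, the longer Allow pattern wins for `/private/public/page`, while `/private/x` stays blocked.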

AI crawler user-agent reference

The following table covers the major AI crawler user-agents observed in production and documented by the operating vendors. Each row notes the bot's purpose: train (foundation model training), search (real-time index for AI search products), or fetch (on-demand fetch from a user prompt).

| User-agent | Vendor | Purpose | Honors robots.txt | Source |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | train | Yes | OpenAI bots docs |
| OAI-SearchBot | OpenAI | search | Yes | OpenAI bots docs |
| ChatGPT-User | OpenAI | fetch (user-initiated) | Yes | OpenAI bots docs |
| OAI-AdsBot | OpenAI | ad landing-page validation | Yes | OpenAI bots docs |
| ClaudeBot | Anthropic | train | Yes | Anthropic published guidance |
| Claude-Web | Anthropic | search/fetch | Yes | Anthropic published guidance |
| Claude-User | Anthropic | fetch (user-initiated) | Yes | Anthropic published guidance |
| anthropic-ai | Anthropic | legacy alias | Yes | Anthropic published guidance |
| PerplexityBot | Perplexity | search/index | Yes | Perplexity bots docs |
| Perplexity-User | Perplexity | fetch (user-initiated) | Limited (user-initiated) | Perplexity bots docs |
| Google-Extended | Google | train (Gemini, Vertex AI) | Yes | Google docs |
| Googlebot | Google | search/index (also feeds AI Overviews) | Yes | Google docs |
| CCBot | Common Crawl | train (open dataset) | Yes | Common Crawl docs |
| Bytespider | ByteDance | train | Partial (known violations reported) | Vendor docs |
| Applebot-Extended | Apple | train (Apple Intelligence) | Yes | Apple docs |
| Applebot | Apple | search/index | Yes | Apple docs |
| Meta-ExternalAgent | Meta | train | Yes | Meta docs |
| Meta-ExternalFetcher | Meta | fetch | Yes | Meta docs |
| Amazonbot | Amazon | search/Alexa | Yes | Amazon docs |

Key insight: most vendors split the bot fleet so training and search/retrieval can be controlled independently. Blocking GPTBot while allowing OAI-SearchBot is a coherent strategy — you opt out of training while staying visible in ChatGPT Search citations.

Configuration A: Allow all AI crawlers (maximum AI visibility)

Use this when you want maximum exposure in AI search results and are comfortable with content being used for foundation model training.

AI search visibility — allow all

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://example.com/sitemap.xml

Configuration B: Block training, allow citations

Use this when you want to prevent foundation-model training while remaining citable in ChatGPT Search, Perplexity, Claude, and Google AI Overviews.

Block training crawlers

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Allow live-retrieval / citation crawlers

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml

Note: Google-Extended controls Gemini training and grounding but does not control Google AI Overviews, which fall under Googlebot. Disallowing Google-Extended does not remove your content from AI Overviews.
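Hand-maintaining one group per bot invites drift as vendors add crawlers. The training-vs-retrieval split can instead be generated from two lists; this is a sketch, and `TRAINING_BOTS`, `RETRIEVAL_BOTS`, and `render_robots` are illustrative names, not part of any standard tooling:

```python
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "Meta-ExternalAgent",
]
RETRIEVAL_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "Claude-User", "Googlebot",
]

def render_robots(sitemap_url: str) -> str:
    """Emit a Configuration-B style robots.txt: training bots blocked,
    live-retrieval bots allowed, one User-agent group per bot."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in TRAINING_BOTS]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in RETRIEVAL_BOTS]
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap_url}\n"
```

Regenerating the file from lists also makes the quarterly user-agent refresh a one-line change.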

Configuration C: Block sensitive paths only

Use this when you want broad AI access but need to protect specific sections (admin, gated content, internal APIs).

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /*?session=

User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /paid/

Sitemap: https://example.com/sitemap.xml

The User-agent: * group does not cascade into named groups: once a crawler matches a group that names its user-agent, it ignores the wildcard group entirely. Repeat the disallows in every named group, or that bot will crawl with no restrictions at all.
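You can observe this non-inheritance with Python's standard-library parser (urllib.robotparser implements the original REP without wildcard-path support, which is enough for this check):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
""".splitlines())

# GPTBot matches its own named group, which carries no /admin/ rule,
# so the wildcard group's Disallow never applies to it.
print(rp.can_fetch("GPTBot", "https://example.com/admin/"))        # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/"))  # False
```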

Validation pipeline

  1. Syntax validation: Use the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired) or the open-source google/robotstxt parser; both are backed by Google's RFC 9309 implementation.
  2. Bot-specific test: Fetch the file via curl -A "GPTBot/1.0" https://example.com/robots.txt and verify the response. The Content-Type must be text/plain.
  3. Live behavior check: OpenAI documents ~24 hours for robots.txt changes to take effect. Anthropic and Perplexity do not document propagation; assume 24-48 hours.
  4. CI test: Add a build step that diffs robots.txt against a golden file and fails the build on unexpected changes — a single misplaced Disallow: / can deindex an entire site.
  5. Monitoring: Track crawler hits per user-agent in CDN logs. Sudden drops in named AI crawlers indicate misconfiguration.
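The golden-file gate in step 4 needs nothing beyond the standard library. A minimal sketch, where `robots_diff` is an illustrative helper name:

```python
import difflib

def robots_diff(current: str, golden: str) -> list[str]:
    """Unified diff between the robots.txt about to be deployed and the
    reviewed golden copy. An empty list means nothing changed; the CI
    step should fail whenever this returns any lines."""
    return list(difflib.unified_diff(
        golden.splitlines(), current.splitlines(),
        fromfile="robots.golden.txt", tofile="robots.txt", lineterm=""))
```

Wire it into the build so any non-empty diff blocks deployment until the golden file itself is updated and reviewed.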

Differences vs traditional SEO robots.txt

| Dimension | Traditional SEO | AI crawler config |
| --- | --- | --- |
| Number of named user-agents | 1-3 (Googlebot, Bingbot, AdsBot) | 10-20 (per-vendor split) |
| Crawl frequency cadence | Predictable | Bursty, can be 100x normal volume |
| Training vs serving split | Not applicable | Critical; separate user agents per vendor |
| Crawl-delay support | Bing yes, Google no | Mostly no |
| Compliance | Mature, near-100% | Voluntary; some bots ignore robots.txt |
| File propagation | Hours | 24-48 hours |
| Default behavior | If allowed implicitly, crawl | If allowed implicitly, train |

Misconceptions

  • "AI bots ignore robots.txt." Most major AI vendors (OpenAI, Anthropic, Perplexity, Google, Apple, Meta) honor robots.txt. Bytespider and some smaller crawlers have a track record of partial compliance.
  • "Disallowing GPTBot stops ChatGPT from citing me." No. GPTBot controls training only. ChatGPT Search citations are governed by OAI-SearchBot.
  • "Google-Extended controls AI Overviews." No. Google-Extended controls Gemini training and grounding, not AI Overviews.
  • "User-agent: * covers all AI bots." Technically yes if no specific group is named, but vendor-specific groups override the wildcard. Always be explicit for the bots you care about.
  • "DisallowAITraining is a supported directive." It is a Microsoft draft IETF proposal, not a deployed standard. Do not rely on it.

Common mistakes

  • Blocking User-agent: * in CMS templates and forgetting to whitelist AI crawlers
  • Putting Sitemap directives inside a User-agent group (Sitemap is global)
  • Using uppercase USER-AGENT: (case in field names is tolerated, but inconsistency causes parser bugs in older tooling)
  • Relying on Crawl-delay against GPTBot or Googlebot — not supported
  • Editing robots.txt without a CI golden-file diff, allowing a typo to deindex the site
  • Confusing training opt-out with citation opt-out (they are separate controls)

How to apply

  1. Audit your current /robots.txt and inventory which AI bot user-agents are present.
  2. Pick a configuration profile (A, B, or C above) based on whether you prioritize visibility or training opt-out.
  3. Stage the new file in a non-production branch, run a CI golden-file diff, and review with stakeholders.
  4. Deploy to production and verify with curl -A "GPTBot/1.0" https://example.com/robots.txt.
  5. Wait 24-48 hours, then check CDN logs for the expected user-agent traffic patterns.
  6. Add a quarterly calendar reminder to refresh the user-agent list — new AI crawlers are launched every quarter.

FAQ

Q: Does blocking GPTBot remove me from ChatGPT Search citations?

No. GPTBot controls foundation-model training. ChatGPT Search citations are governed by OAI-SearchBot. To opt out of training while staying citable, disallow GPTBot and allow OAI-SearchBot.

Q: Does Google-Extended control AI Overviews?

No. Google-Extended controls whether your content is used to train Gemini models and to ground Gemini answers. AI Overviews in Google Search are powered by Googlebot-indexed content; there is no robots.txt directive to opt out of AI Overviews specifically without also opting out of Google Search.

Q: Is Crawl-delay supported by AI crawlers?

Mostly no. Google's parser ignores Crawl-delay, and OpenAI does not document support. Bing and Yandex honor it. For AI crawl rate control, use server-side rate limiting or CDN rules instead.

Q: How long does a robots.txt change take to propagate?

OpenAI documents ~24 hours for robots.txt changes to take effect. Anthropic and Perplexity do not publish a propagation SLA; assume 24-48 hours.

Q: Should I add a Sitemap directive for AI crawlers?

Yes. AI crawlers benefit from a Sitemap directive even when they prefer llms.txt. The Sitemap directive is global (not user-agent-specific) and provides a fallback URL inventory for any compliant bot.

Q: Do agentic browsers like ChatGPT Atlas or Perplexity Comet honor robots.txt?

User-initiated browsing by agentic browsers is treated like a human user; robots.txt enforcement is limited or non-existent for these on-demand fetches. Use authentication or rate limiting to control access.

Sources

  • Google, "How Google Interprets the robots.txt Specification" (verified 2026-05-03; supports the four-directive REP definition). https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec
  • Wikipedia, "robots.txt" (verified 2026-05-03; supports REP history and voluntary compliance). https://en.wikipedia.org/wiki/Robots.txt
  • OpenAI, "Overview of OpenAI Crawlers" (verified 2026-05-03; supports the GPTBot/OAI-SearchBot/ChatGPT-User split and 24h propagation). https://developers.openai.com/api/docs/bots
  • Scrunch, "Guide to AI User Agents" (verified 2026-05-03; supports PerplexityBot and Perplexity-User behavior). https://scrunch.com/resources/guides/guide-to-ai-user-agents/
  • Marie Haynes, "Should you block Google Extended in Robots.txt?" (verified 2026-05-03; supports Google-Extended scope: training, not AI Overviews). https://www.mariehaynes.com/should-you-use-google-extended-in-robots-txt/
  • Cite.sh, "GPTBot, ClaudeBot, PerplexityBot: The AI Crawler Guide" (verified 2026-05-03; supports user-agent group inheritance behavior). https://www.cite.sh/blog/ai-crawler-guide/
  • Taskade, "11 Best AI Robots.txt & SEO Config Generators" (verified 2026-05-03; supports the CI golden-file diff caution). https://www.taskade.com/blog/ai-robots-txt-generators
  • robotstxt.com, "AI / LLM User-Agents: Blocking Guide" (verified 2026-05-03; supports DisallowAITraining draft status). https://robotstxt.com/ai

