Geodocs.dev

Robots.txt for AI Crawlers: Specification & Configuration


Robots.txt for AI crawlers uses the same Robots Exclusion Protocol syntax (User-agent, Allow, Disallow, Sitemap) as traditional SEO, but each AI vendor publishes its own user-agent strings — GPTBot and OAI-SearchBot for OpenAI, ClaudeBot and Claude-User for Anthropic, PerplexityBot and Perplexity-User for Perplexity, Google-Extended for Google's Gemini training, and CCBot for Common Crawl. Vendors honor the standard directives; controlling training and live retrieval requires separate User-agent groups.

TL;DR

Use the standard robots.txt syntax with one User-agent group per AI bot. Block training-only bots (GPTBot, Google-Extended, CCBot, Bytespider, Applebot-Extended) when you do not want your content used to train foundation models. Allow live-retrieval bots (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-User) when you want to be cited in AI search answers. Treat the two decisions as independent: you can disallow training while allowing citation.

Definition

/robots.txt is the canonical file at the root of a domain that implements the Robots Exclusion Protocol (REP, formalized in RFC 9309). It instructs compliant crawlers which paths they may fetch. AI crawlers honor REP voluntarily, like traditional bots; the file is advisory, not enforcing. For enforcement, use IP-level blocking, WAF rules, or authentication.

Why it matters

A misconfigured robots.txt has two failure modes for AI search:

  1. Default-block CMS/CDN templates silently ship a blanket Disallow under User-agent: *, removing your content from AI search citations entirely.
  2. Without vendor-specific groups, you are forced to choose between blocking everything and allowing everything, instead of separating training (long-term IP risk) from live retrieval (visibility upside).

Getting robots.txt right is the cheapest, highest-leverage AI search lever: the file takes minutes to edit, and changes propagate within roughly 24 hours for most vendors (OpenAI explicitly documents ~24h propagation for robots.txt updates).

Supported directive syntax

AI crawlers support the same four core directives as Googlebot:

| Directive | Required | Description |
| --- | --- | --- |
| User-agent: | Yes | Names the bot group. * matches any bot not explicitly named. |
| Allow: | No | Permits crawling of the given path. Overrides Disallow on more specific paths. |
| Disallow: | No | Forbids crawling. Disallow: (empty) is equivalent to allow all. |
| Sitemap: | No | Absolute URL to a sitemap. Not tied to a User-agent group. |

Google's robots.txt parser, which most AI vendors emulate, formally supports only these four fields. Crawl-delay is not supported by Google or by GPTBot; it is honored by Bing, Yandex, and some smaller AI crawlers but should not be relied upon.

Path matching rules:

  • Paths are case-sensitive and must start with /.
  • * is a wildcard for any sequence of characters.
  • $ anchors the end of the URL.
  • More specific paths win when both Allow and Disallow match.
  • An empty Disallow: means allow everything; an empty Allow: is meaningless.
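The precedence rules above can be sketched as a small matcher. This is an illustrative sketch of Google-style longest-match semantics, not any vendor's actual parser; `is_allowed` and `_pattern_to_regex` are hypothetical helper names:

```python
import re

def _pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any character sequence; '$' anchors the end of the URL.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(regex)

def is_allowed(rules, path):
    """rules: list of ('allow'|'disallow', pattern) pairs.
    Longest matching pattern wins; on a tie, Allow beats Disallow
    (Google-style semantics). No matching rule means allowed."""
    best = None  # (pattern length, verdict-is-allow)
    for verdict, pattern in rules:
        if not pattern:  # empty Disallow: means allow everything
            continue
        if _pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), verdict == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]
```

For example, with `[("disallow", "/private/"), ("allow", "/private/public/")]`, the longer Allow pattern wins for `/private/public/page`, while `/private/x` stays blocked.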

AI crawler user-agent reference

The following table covers the major AI crawler user-agents observed in production and documented by the operating vendors. Each row notes the bot's purpose: train (foundation model training), search (real-time index for AI search products), or fetch (on-demand fetch from a user prompt).

| User-agent | Vendor | Purpose | Honors robots.txt | Source |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | train | Yes | OpenAI bots docs |
| OAI-SearchBot | OpenAI | search | Yes | OpenAI bots docs |
| ChatGPT-User | OpenAI | fetch (user-initiated) | Yes | OpenAI bots docs |
| OAI-AdsBot | OpenAI | ad landing-page validation | Yes | OpenAI bots docs |
| ClaudeBot | Anthropic | train | Yes | Anthropic published guidance |
| Claude-Web | Anthropic | search/fetch | Yes | Anthropic published guidance |
| Claude-User | Anthropic | fetch (user-initiated) | Yes | Anthropic published guidance |
| anthropic-ai | Anthropic | legacy alias | Yes | Anthropic published guidance |
| PerplexityBot | Perplexity | search/index | Yes | Perplexity bots docs |
| Perplexity-User | Perplexity | fetch (user-initiated) | Limited (user-initiated) | Perplexity bots docs |
| Google-Extended | Google | train (Gemini, Vertex AI) | Yes | Google docs |
| Googlebot | Google | search/index (also feeds AI Overviews) | Yes | Google docs |
| CCBot | Common Crawl | train (open dataset) | Yes | Common Crawl docs |
| Bytespider | ByteDance | train | Partial (known violations reported) | Vendor docs |
| Applebot-Extended | Apple | train (Apple Intelligence) | Yes | Apple docs |
| Applebot | Apple | search/index | Yes | Apple docs |
| Meta-ExternalAgent | Meta | train | Yes | Meta docs |
| Meta-ExternalFetcher | Meta | fetch | Yes | Meta docs |
| Amazonbot | Amazon | search/Alexa | Yes | Amazon docs |

Key insight: most vendors split the bot fleet so training and search/retrieval can be controlled independently. Blocking GPTBot while allowing OAI-SearchBot is a coherent strategy — you opt out of training while staying visible in ChatGPT Search citations.

Configuration A: Allow all AI crawlers (maximum AI visibility)

Use this when you want maximum exposure in AI search results and are comfortable with content being used for foundation model training.

AI search visibility — allow all

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://example.com/sitemap.xml

Configuration B: Block training, allow citations

Use this when you want to prevent foundation-model training while remaining citable in ChatGPT Search, Perplexity, Claude, and Google AI Overviews.

Block training crawlers

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Allow live-retrieval / citation crawlers

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml

Note: Google-Extended controls Gemini training and grounding but does not control Google AI Overviews, which fall under Googlebot. Disallowing Google-Extended does not remove your content from AI Overviews.
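Hand-maintaining one group per bot invites drift as vendors add crawlers. The training-vs-retrieval split can instead be generated from two lists; this is a sketch, and `TRAINING_BOTS`, `RETRIEVAL_BOTS`, and `render_robots` are illustrative names, not part of any standard tooling:

```python
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "Meta-ExternalAgent",
]
RETRIEVAL_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "Claude-User", "Googlebot",
]

def render_robots(sitemap_url: str) -> str:
    """Emit a Configuration-B style robots.txt: training bots blocked,
    live-retrieval bots allowed, one User-agent group per bot."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in TRAINING_BOTS]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in RETRIEVAL_BOTS]
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap_url}\n"
```

Regenerating the file from lists also makes the quarterly user-agent refresh a one-line change.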

Configuration C: Block sensitive paths only

Use this when you want broad AI access but need to protect specific sections (admin, gated content, internal APIs).

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /*?session=

User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /paid/

Sitemap: https://example.com/sitemap.xml

The User-agent: * group does not cascade into named groups: once a crawler matches a group that names its user-agent, it ignores the wildcard group entirely. Repeat the disallows in every named group, or that bot will crawl with no restrictions at all.
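You can observe this non-inheritance with Python's standard-library parser (urllib.robotparser implements the original REP without wildcard-path support, which is enough for this check):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
""".splitlines())

# GPTBot matches its own named group, which carries no /admin/ rule,
# so the wildcard group's Disallow never applies to it.
print(rp.can_fetch("GPTBot", "https://example.com/admin/"))        # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/"))  # False
```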

Validation pipeline

  1. Syntax validation: Use the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired) or the open-source google/robotstxt parser; both are backed by Google's RFC 9309 implementation.
  2. Bot-specific test: Fetch the file via curl -A "GPTBot/1.0" https://example.com/robots.txt and verify the response. The Content-Type must be text/plain.
  3. Live behavior check: OpenAI documents ~24 hours for robots.txt changes to take effect. Anthropic and Perplexity do not document propagation; assume 24-48 hours.
  4. CI test: Add a build step that diffs robots.txt against a golden file and fails the build on unexpected changes — a single misplaced Disallow: / can deindex an entire site.
  5. Monitoring: Track crawler hits per user-agent in CDN logs. Sudden drops in named AI crawlers indicate misconfiguration.
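The golden-file gate in step 4 needs nothing beyond the standard library. A minimal sketch, where `robots_diff` is an illustrative helper name:

```python
import difflib

def robots_diff(current: str, golden: str) -> list[str]:
    """Unified diff between the robots.txt about to be deployed and the
    reviewed golden copy. An empty list means nothing changed; the CI
    step should fail whenever this returns any lines."""
    return list(difflib.unified_diff(
        golden.splitlines(), current.splitlines(),
        fromfile="robots.golden.txt", tofile="robots.txt", lineterm=""))
```

Wire it into the build so any non-empty diff blocks deployment until the golden file itself is updated and reviewed.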

Differences vs traditional SEO robots.txt

| Dimension | Traditional SEO | AI crawler config |
| --- | --- | --- |
| Number of named user-agents | 1-3 (Googlebot, Bingbot, AdsBot) | 10-20 (per-vendor split) |
| Crawl frequency cadence | Predictable | Bursty, can be 100x normal volume |
| Training vs serving split | Not applicable | Critical; separate user agents per vendor |
| Crawl-delay support | Bing yes, Google no | Mostly no |
| Compliance | Mature, near-100% | Voluntary; some bots ignore robots.txt |
| File propagation | Hours | 24-48 hours |
| Default behavior | If allowed implicitly, crawl | If allowed implicitly, train |

Misconceptions

  • "AI bots ignore robots.txt." Most major AI vendors (OpenAI, Anthropic, Perplexity, Google, Apple, Meta) honor robots.txt. Bytespider and some smaller crawlers have a track record of partial compliance.
  • "Disallowing GPTBot stops ChatGPT from citing me." No. GPTBot controls training only. ChatGPT Search citations are governed by OAI-SearchBot.
  • "Google-Extended controls AI Overviews." No. Google-Extended controls Gemini training and grounding, not AI Overviews.
  • "User-agent: * covers all AI bots." Technically yes if no specific group is named, but vendor-specific groups override the wildcard. Always be explicit for the bots you care about.
  • "DisallowAITraining is a supported directive." It is a Microsoft draft IETF proposal, not a deployed standard. Do not rely on it.

Common mistakes

  • Blocking User-agent: * in CMS templates and forgetting to whitelist AI crawlers
  • Putting Sitemap directives inside a User-agent group (Sitemap is global)
  • Using uppercase USER-AGENT: (case in field names is tolerated, but inconsistency causes parser bugs in older tooling)
  • Relying on Crawl-delay against GPTBot or Googlebot — not supported
  • Editing robots.txt without a CI golden-file diff, allowing a typo to deindex the site
  • Confusing training opt-out with citation opt-out (they are separate controls)

How to apply

  1. Audit your current /robots.txt and inventory which AI bot user-agents are present.
  2. Pick a configuration profile (A, B, or C above) based on whether you prioritize visibility or training opt-out.
  3. Stage the new file in a non-production branch, run a CI golden-file diff, and review with stakeholders.
  4. Deploy to production and verify with curl -A "GPTBot/1.0" https://example.com/robots.txt.
  5. Wait 24-48 hours, then check CDN logs for the expected user-agent traffic patterns.
  6. Add a quarterly calendar reminder to refresh the user-agent list — new AI crawlers are launched every quarter.

FAQ

Q: Does blocking GPTBot remove me from ChatGPT Search citations?

No. GPTBot controls foundation-model training. ChatGPT Search citations are governed by OAI-SearchBot. To opt out of training while staying citable, disallow GPTBot and allow OAI-SearchBot.

Q: Does Google-Extended control AI Overviews?

No. Google-Extended controls whether your content is used to train Gemini models and to ground Gemini answers. AI Overviews in Google Search are powered by Googlebot-indexed content; there is no robots.txt directive to opt out of AI Overviews specifically without also opting out of Google Search.

Q: Is Crawl-delay supported by AI crawlers?

Mostly no. Google's parser ignores Crawl-delay, and OpenAI does not document support. Bing and Yandex honor it. For AI crawl rate control, use server-side rate limiting or CDN rules instead.

Q: How long does a robots.txt change take to propagate?

OpenAI documents ~24 hours for robots.txt changes to take effect. Anthropic and Perplexity do not publish a propagation SLA; assume 24-48 hours.

Q: Should I add a Sitemap directive for AI crawlers?

Yes. AI crawlers benefit from a Sitemap directive even when they prefer llms.txt. The Sitemap directive is global (not user-agent-specific) and provides a fallback URL inventory for any compliant bot.

Q: Do agentic browsers like ChatGPT Atlas or Perplexity Comet honor robots.txt?

User-initiated browsing by agentic browsers is treated like a human user; robots.txt enforcement is limited or non-existent for these on-demand fetches. Use authentication or rate limiting to control access.

Sources

  • Google, "How Google Interprets the robots.txt Specification" (verified 2026-05-03; supports the four-directive REP definition). https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec
  • Wikipedia, "robots.txt" (verified 2026-05-03; supports REP history and voluntary compliance). https://en.wikipedia.org/wiki/Robots.txt
  • OpenAI, "Overview of OpenAI Crawlers" (verified 2026-05-03; supports the GPTBot/OAI-SearchBot/ChatGPT-User split and 24h propagation). https://developers.openai.com/api/docs/bots
  • Scrunch, "Guide to AI User Agents" (verified 2026-05-03; supports PerplexityBot and Perplexity-User behavior). https://scrunch.com/resources/guides/guide-to-ai-user-agents/
  • Marie Haynes, "Should you block Google Extended in Robots.txt?" (verified 2026-05-03; supports Google-Extended scope: training, not AI Overviews). https://www.mariehaynes.com/should-you-use-google-extended-in-robots-txt/
  • Cite.sh, "GPTBot, ClaudeBot, PerplexityBot: The AI Crawler Guide" (verified 2026-05-03; supports user-agent group inheritance behavior). https://www.cite.sh/blog/ai-crawler-guide/
  • Taskade, "11 Best AI Robots.txt & SEO Config Generators" (verified 2026-05-03; supports the CI golden-file diff caution). https://www.taskade.com/blog/ai-robots-txt-generators
  • robotstxt.com, "AI / LLM User-Agents: Blocking Guide" (verified 2026-05-03; supports DisallowAITraining draft status). https://robotstxt.com/ai

