robots.txt for AI Crawlers

robots.txt for AI crawlers uses per-user-agent Allow and Disallow rules to control how AI training crawlers (GPTBot, Google-Extended, Applebot-Extended, ClaudeBot, CCBot, Bytespider) and AI retrieval crawlers (ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot) access a site. Most teams allow retrieval bots to preserve AI search visibility and selectively block training-only bots when content protection is the priority.

TL;DR

Treat AI crawlers in two buckets. Retrieval bots fetch your pages so AI engines can answer a user's live question and usually link back — you almost always want to allow them. Training bots fetch pages to feed model training data with no per-query attribution — here you make a content-protection choice. Use explicit per-user-agent blocks, allow your sitemap, and verify enforcement in server logs. For the wider crawler-control stack, see the technical hub and AI Crawl Signals.

Definition

robots.txt for AI crawlers is the application of the Robots Exclusion Protocol (REP, formalized in RFC 9309) to the new generation of AI-specific user-agents that emerged after 2022. Where the original REP governed search engine indexers like Googlebot and Bingbot, today's robots.txt also has to address AI training crawlers (which feed model weights), AI retrieval crawlers (which fetch pages live during user queries), and training opt-out tokens such as Google-Extended and Applebot-Extended that gate downstream training use without changing crawl behavior.

A correctly configured robots.txt for AI is a per-user-agent policy that distinguishes training from retrieval, names the specific bots that matter for your content strategy, exposes your sitemap, and is verified against real server logs. It is the lowest-cost lever in the AI access-control stack — and, because it is honor-system, it must be paired with verification and (for high-stakes content) edge-layer enforcement.

Why it matters

The robots.txt file has quietly become the single most consequential configuration knob a website operator can turn for AI visibility. Three trends make it more important than at any point in the last decade:

  1. AI search has become a meaningful traffic and citation channel. ChatGPT search, Perplexity, Google AI Overviews, Claude, Gemini, Apple Intelligence, and DuckAssist all rely on retrieval bots. Blocking them — accidentally or otherwise — removes you from the answer surface entirely.
  2. Training and retrieval are now distinct business decisions. Allowing GPTBot affects model training. Allowing OAI-SearchBot affects whether ChatGPT can cite you tomorrow. These are no longer the same setting, and treating them as one is the most common mistake in the field.
  3. Honor-system compliance is more widespread than it used to be. Major AI vendors publish their user-agent strings and state that they honor robots.txt. According to a Q1 2026 Cloudflare report, GPTBot is the most-blocked AI crawler on the internet — evidence that publishers are exercising the lever and AI vendors are responding to it.

Getting this file wrong has two failure modes that look identical from the outside: silently being excluded from AI answers (visibility loss), or silently feeding free training data while assuming you were protected (content-value loss). Either failure compounds for months before anyone notices, which is why a thoughtful policy plus a 60-day review cadence is the minimum bar.

Two buckets: training vs. retrieval

The single most useful framing for AI crawler policy is the split between training and retrieval.

  • Training crawlers fetch your content to build or update model weights. The user does not see your URL when the model later answers a related question. Examples: GPTBot, Google-Extended, Applebot-Extended, ClaudeBot, CCBot, Bytespider, Cohere-AI, Diffbot.
  • Retrieval crawlers fetch your content in response to a live user query. Citations and links back to your site are the norm. Examples: ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot, MistralAI-User, Meta-ExternalAgent.

Some bots straddle both modes (Amazonbot, GoogleOther). The decision matrix below assumes you allow retrieval and treat training as a content-strategy decision.

AI crawler user-agents (2026 reference)

| User-Agent | Operator | Mode | Notes |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Original OpenAI training crawler. |
| OAI-SearchBot | OpenAI | Retrieval | Powers ChatGPT search index. |
| ChatGPT-User | OpenAI | Retrieval | Triggered by user-initiated browsing. |
| Google-Extended | Google | Training opt-out | Token, not a crawler; signals training use of pages crawled by Googlebot. |
| GoogleOther | Google | Mixed | Internal Google product fetches. |
| PerplexityBot | Perplexity | Retrieval | Indexer for Perplexity answers. |
| Perplexity-User | Perplexity | Retrieval | Live user-action fetches. |
| ClaudeBot | Anthropic | Training | Active Anthropic crawler. |
| anthropic-ai | Anthropic | Training | Legacy token; kept for backward compatibility. |
| Applebot-Extended | Apple | Training opt-out | Token gating training use of Applebot-fetched pages. |
| MistralAI-User | Mistral | Retrieval | Le Chat user-action fetches. |
| Amazonbot | Amazon | Mixed | Search and AI-product fetches. |
| DuckAssistBot | DuckDuckGo | Retrieval | DuckAssist answer crawler. |
| Meta-ExternalAgent | Meta | Mixed | Meta AI assistant fetches. |
| Cohere-AI | Cohere | Training | Cohere model training crawler. |
| Bytespider | ByteDance | Training | Often blocked due to crawl volume. |
| CCBot | Common Crawl | Training | Powers many third-party LLM training corpora. |
| Diffbot | Diffbot | Mixed | Knowledge graph + AI extraction. |
| ImagesiftBot | Imagesift / TheHive | Mixed | Multimodal image and text training. |

New tokens appear roughly every quarter. Treat this list as a starting point, not a closed set. The community-maintained ai.robots.txt GitHub repository is a useful reference for staying current.
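
If you maintain the training/retrieval split as data, generating the per-user-agent blocks becomes mechanical and stays consistent as the list grows. A minimal Python sketch, with bot lists taken from the table above (trim or extend them to match your own policy):

```python
# Sketch: render a robots.txt policy from the training/retrieval split.
# Bot lists mirror the reference table above; adjust them to your policy.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "Cohere-AI", "Diffbot",
]
RETRIEVAL_BOTS = [
    "ChatGPT-User", "OAI-SearchBot", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User",
]

def render_robots(allow_training: bool, sitemap: str) -> str:
    lines = []
    for agent in RETRIEVAL_BOTS:              # retrieval: almost always allow
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    training_rule = "Allow: /" if allow_training else "Disallow: /"
    for agent in TRAINING_BOTS:               # training: content-protection choice
        lines += [f"User-agent: {agent}", training_rule, ""]
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap}", ""]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_robots(allow_training=False,
                        sitemap="https://example.com/sitemap.xml"))
```

Running it with allow_training=False yields the shape of Pattern 2 below; flipping the flag gives a fully open policy.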

Allow vs. Disallow tradeoffs

Each Allow and Disallow line is a tradeoff between three competing goals: citation reach, content protection, and operational cost.

  • Allowing retrieval bots maximizes the chance your URL appears as a cited answer in ChatGPT, Perplexity, DuckAssist, and similar surfaces. Cost: none beyond the small crawl bandwidth.
  • Allowing training bots lets your content shape future model behavior. The upside is influence over how models talk about your domain; the downside is no per-query attribution and no opt-out once weights are trained.
  • Blocking training bots preserves the commercial value of long-form, expensive-to-produce content (newsroom investigations, research reports, paid courses). Cost: your content shapes the model less, and competitors who allow training may be over-represented in answers about your space.
  • Blocking retrieval bots removes you from AI answer surfaces entirely. This is rarely the right choice unless you have legal, contractual, or paywall reasons.
  • Selective path-level rules (Allow: /blog/, Disallow: /admin/) let you publish public knowledge while protecting customer data, internal APIs, and gated assets. Cost: more rules to maintain.

A good default for content publishers in 2026: allow all retrieval bots, block training bots whose content you want to protect, and verify every quarter. Sites whose entire business model is content visibility (developer docs, open-knowledge wikis) should usually allow everything; sites whose long-form content is the product (newsrooms, research firms) should block training and stay open to retrieval.

Configuration patterns

Pattern 1: Allow everything (open-content sites)

```txt
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Use this when discoverability outweighs content-protection concerns and you want maximum AI visibility.

Pattern 2: Allow retrieval, block training (most common)

```txt
# --- Allow retrieval (preserve AI search visibility) ---
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: MistralAI-User
Allow: /

# --- Block training ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Cohere-AI
Disallow: /

User-agent: Diffbot
Disallow: /

# --- Default ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

This is the dominant pattern for publishers, SaaS docs, and content-protected sites that still want to be cited.

Pattern 3: Selective access by section

```txt
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Disallow: /customer/

User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
```

Per RFC 9309, the most specific (longest) matching path rule wins, regardless of order within the group. Allow the public knowledge base; protect customer data, admin surfaces, and internal APIs.
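
To sanity-check path rules like these before (or after) deploying, Python's standard-library robots.txt parser can report what a given user-agent may fetch. A minimal sketch; note that urllib.robotparser implements basic REP prefix matching rather than Google-style wildcard extensions, so treat it as a smoke test, not a full validator:

```python
from urllib.robotparser import RobotFileParser

# Smoke test: ask the live robots.txt what each AI user-agent may fetch.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

checks = [
    ("GPTBot", "/blog/launch-post"),
    ("GPTBot", "/admin/users"),
    ("PerplexityBot", "/docs/quickstart"),
    ("PerplexityBot", "/api/internal"),
]
for agent, path in checks:
    verdict = "allowed" if rp.can_fetch(agent, path) else "blocked"
    print(f"{agent:>14} {path:<22} {verdict}")
```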

Real-world examples

Patterns from major sites illustrate the spectrum of policies in production today. (Check live robots.txt for the latest content; policies evolve quickly.)

  1. The New York Times — strict training block. NYT was an early and aggressive blocker. Its robots.txt explicitly disallows GPTBot, ClaudeBot, anthropic-ai, Google-Extended, CCBot, Applebot-Extended, and several others, while leaving traditional Googlebot and major retrieval bots allowed. The pattern matches NYT's pending litigation posture: protect long-form journalism from training corpora, retain search and citation visibility.
  2. Reddit — gated by partnership. Reddit blocks generic AI training crawlers in robots.txt while licensing access through paid agreements (notably with Google and OpenAI). The robots.txt is essentially a fence that pushes AI vendors toward commercial deals.
  3. Stack Overflow — license-first model. Stack Overflow disallows GPTBot and similar training crawlers in robots.txt and offers a paid Stack Overflow for Teams + API license for AI training data. Retrieval bots are allowed so answers can still be cited live.
  4. BBC and major UK publishers — full training block. BBC blocks GPTBot, Google-Extended, ClaudeBot, CCBot, Applebot-Extended, and others. Retrieval bots remain allowed to keep BBC content surfaceable in answer engines.
  5. NPR — selective retrieval. NPR's robots.txt allows ChatGPT-User and PerplexityBot while disallowing GPTBot and Google-Extended, demonstrating the canonical training-vs-retrieval split.
  6. Developer documentation sites (Stripe, Vercel, Cloudflare docs) — allow everything. Developer documentation sites generally allow all AI bots. The strategic logic: the more frequently models cite your docs as the authoritative source, the more developer mindshare you capture.
  7. E-commerce platforms — selective by path. Product detail pages are typically allowed to retrieval bots; account, checkout, and customer-data paths are disallowed across the board.

The lesson across all of these: there is no universal answer. The right configuration is a function of business model, content economics, and regulatory exposure.

Where robots.txt fits in the stack

robots.txt is one layer in a small stack of AI access controls.

  • robots.txt — the standards-based REP file, honored by major AI vendors but enforced on the honor system.
  • ai.txt — a human-readable AI policy file that complements robots.txt; see ai.txt Reference.
  • llms.txt — a positive index that tells AI engines which pages you want indexed; see llms.txt Reference.
  • Edge controls — Cloudflare AI Audit and 'Block AI bots', Fastly bot management, Akamai Bot Manager, and CDN-level rules enforce policy when honor-system compliance is not enough.
  • Page-level meta tags — noai, noimageai, nocache, and noarchive directives where supported by the consumer.
  • Server-side IP verification — major AI vendors publish IP ranges (OpenAI, Anthropic, Google, Apple) so you can confirm the user-agent is genuine, not spoofed; a verification sketch follows at the end of this section.
  • Terms of service and licensing — robots.txt is not a contract; explicit ToS clauses about AI training are the legal backstop.

robots.txt remains the lowest-cost first line; edge controls and ToS are the enforcement layer.
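
Because the major vendors publish their crawler IP ranges, a request that claims to be an AI crawler can be cross-checked against those ranges at the server or edge. A minimal Python sketch, assuming a JSON file of CIDR prefixes at a placeholder URL; the real URL and response schema differ per vendor, so adapt the parsing to whatever the vendor documents:

```python
import ipaddress
import json
import urllib.request

# Hypothetical ranges URL; substitute the file the vendor actually publishes
# (OpenAI, Anthropic, Google, and Apple each document their own).
RANGES_URL = "https://vendor.example/ai-crawler-ip-ranges.json"

def load_networks(url: str) -> list:
    """Fetch published CIDR prefixes; the schema below is an assumption."""
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_genuine(client_ip: str, networks: list) -> bool:
    """True if the request IP falls inside any published range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in networks)

networks = load_networks(RANGES_URL)
print(is_genuine("203.0.113.7", networks))  # TEST-NET-3 address; expect False
```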

Decision matrix

| Goal | Suggested policy |
| --- | --- |
| Maximize AI search citations | Allow retrieval and training; expose llms.txt. |
| Preserve content value, stay citable | Allow retrieval; block training-only crawlers. |
| Strict content protection | Block all AI crawlers; add edge enforcement. |
| Public docs only, paid product gated | Selective access by path; sensitive sections disallowed. |
| Image-heavy site | Add ImagesiftBot and similar to your training-block list. |
| News publisher with litigation posture | Block all training, allow retrieval, document policy in ToS. |
| Regulated content (health, finance) | Block training, allow retrieval, layer page-level disclaimers. |

How to verify your policy is enforced

robots.txt is the policy; enforcement requires verification.

  1. Tail server logs for the user-agent strings you listed and confirm that the 200 vs. 403 patterns match your intent; a log-scan sketch follows this list. A user-agent appearing in your access log with a 200 status after you Disallowed it means the rule is not being honored: either it never loaded (usually a syntax issue) or the crawler is ignoring it.
  2. Inspect Search Console crawl reports for any anomalies after the change.
  3. Re-run your AI prompt library (see AI Search Reporting: Dashboard Setup) to confirm citations did not drop unintentionally after a block.
  4. Validate syntax with the Google robots.txt tester or any community validator. A single misplaced blank line can merge two User-agent blocks.
  5. Watch for impostors. Some crawlers spoof a user-agent; pair high-stakes blocks with edge-layer IP verification or vendor-published IP ranges.
  6. Schedule a 60-day review. New AI crawlers and tokens appear roughly every quarter; your policy needs an explicit review cadence.
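
Step 1 can be automated with a short log scan. A minimal sketch, assuming a combined-format (Nginx/Apache) access log; adjust the path and the agent list to your setup:

```python
import re
from collections import Counter

# Real AI crawler user-agents to count (tokens like Google-Extended never
# appear in logs because they are not crawlers).
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider", "OAI-SearchBot",
    "ChatGPT-User", "PerplexityBot", "Perplexity-User", "DuckAssistBot",
]
LOG_PATH = "/var/log/nginx/access.log"  # adjust to your server

# Combined log format: ... "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
line_re = re.compile(r'"\S+ \S+ [^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"\s*$')

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = line_re.search(line)
        if not m:
            continue
        ua, status = m.group("ua"), m.group("status")
        for agent in AI_AGENTS:
            if agent.lower() in ua.lower():
                hits[(agent, status)] += 1

# A 200 for an agent you Disallowed means the rule is not being honored.
for (agent, status), count in sorted(hits.items()):
    print(f"{agent:>16}  {status}  {count}")
```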

Common misconfigurations

  • Blocking everything with a blanket User-agent: * / Disallow: / block and forgetting AI retrieval bots. This silently removes you from AI search. The default rule applies only when no more specific rule matches; if you intend to allow retrieval bots, name them explicitly in their own User-agent groups.
  • Mixing training and retrieval into one block. Different crawlers have different business implications; separate them. A User-agent: GPTBot block does not affect ChatGPT-User or OAI-SearchBot.
  • Forgetting the Sitemap line. Always include Sitemap: https://yourdomain/sitemap.xml. AI crawlers use the sitemap as the canonical inventory of indexable URLs.
  • Treating Google-Extended or Applebot-Extended as crawlers. They are training opt-out tokens; the actual crawl is done by Googlebot or Applebot. Disallowing the token does not stop the crawl — it stops downstream training use.
  • Skipping the verification step. A robots.txt that looks correct but is malformed silently allows everything. Always verify against real logs.
  • Stale crawler lists. New tokens appear quarterly. Schedule a 60-day review.
  • Blocking by IP instead of user-agent. AI vendor IP ranges change frequently. Use user-agent rules as the primary policy and IP allow-lists only for high-stakes verification.
  • Putting AI rules behind a CDN that strips comments. Some CDN minifiers strip # comment lines and break User-agent block boundaries. Verify the served file matches the source; a fetch-and-compare sketch follows this list.
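
A minimal fetch-and-compare sketch for that last check, assuming the deployed source lives at public/robots.txt in your repository (both values are placeholders):

```python
import sys
import urllib.request

# Compare the robots.txt your CDN actually serves with the file you deploy.
LIVE_URL = "https://example.com/robots.txt"
SOURCE_PATH = "public/robots.txt"  # adjust to where the file lives in your repo

served = urllib.request.urlopen(LIVE_URL).read().decode("utf-8")
with open(SOURCE_PATH, encoding="utf-8") as fh:
    source = fh.read()

if served.strip() != source.strip():
    print("MISMATCH: served robots.txt differs from the source file", file=sys.stderr)
    sys.exit(1)
print("robots.txt served as deployed")
```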

Complete example

```txt
# robots.txt for example.com
# Updated: 2026-05-01

# --- Traditional search ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- AI retrieval (allow to preserve citations) ---
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

# --- AI training (block to protect content value) ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Cohere-AI
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

# --- Default ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

FAQ

Q: Do AI bots actually obey robots.txt?

Major AI vendors (OpenAI, Google, Anthropic, Apple, Perplexity, Mistral, DuckDuckGo, Meta, Cohere) publicly state they honor robots.txt. Compliance is on the honor system, so high-stakes policies should be paired with edge-layer enforcement and vendor IP-range verification.

Q: Will blocking GPTBot remove me from ChatGPT answers?

Not necessarily. ChatGPT-User and OAI-SearchBot drive live retrieval and search; GPTBot drives training. Blocking GPTBot only removes you from future training data and does not block live citations as long as the retrieval bots remain allowed.

Q: What is the difference between Google-Extended and Googlebot?

Google-Extended is a training opt-out token, not a separate crawler. Googlebot still fetches the page; setting Disallow: / for Google-Extended tells Google not to use those pages for Gemini and Vertex AI training. Search ranking is unaffected.

Q: Should I add Applebot-Extended even if Apple isn't a search priority?

Yes, if you want explicit training opt-out coverage. Applebot-Extended is the training-use token for content fetched by Applebot, which powers Apple Intelligence and Siri suggestions.

Q: How often should I review my robots.txt for AI?

Every 60 days, or after any major model release. New crawler tokens appear about once a quarter, and existing vendors occasionally rename or split their bots.

Q: Does robots.txt protect against scraping for unauthorized training?

No. robots.txt is a request, not an enforcement mechanism. For protection against non-compliant scrapers, layer in CDN bot management, rate limiting, and contractual restrictions in your terms of service.

Q: Should I block CCBot (Common Crawl)?

It depends on your content economics. Common Crawl is a public web archive used by many third-party LLMs and research projects. Allowing CCBot maximizes downstream model exposure; blocking it is the most efficient way to remove your content from a wide swath of training pipelines, since many models rely on Common Crawl rather than crawling directly.

Q: Is there a difference between Disallow: / and removing the user-agent block entirely?

Yes. Disallow: / is an explicit instruction the crawler must follow. Omitting the block means the crawler falls back to your User-agent: * rule, which usually allows access. Always be explicit with the bots you care about.

Q: What about images, video, and multimodal content?

Add ImagesiftBot and similar multimodal crawlers to your training-block list if you want to protect visual assets. Some vendors also honor noimageai page-level meta tags as a complement to robots.txt rules.

Related Articles

  • AI Crawl Signals: How AI Discovers Content (reference). Technical reference for the signals AI systems use to discover, access, and prioritize web content, including sitemaps, llms.txt, robots.txt, structured data, and HTTP headers.
  • ai.txt: AI Agent Access Policy Reference (reference). An emerging root-level file that declares site-wide permissions and attribution rules for AI training, citation, and inference.
  • HTML Semantic Structure for AI Readability (guide). Use HTML5 semantic elements like article, section, nav, and proper heading hierarchy to improve AI crawler extraction and citation probability.
