robots.txt for AI Crawlers
robots.txt for AI crawlers uses per-user-agent Allow and Disallow rules to control how AI training crawlers (GPTBot, Google-Extended, Applebot-Extended, ClaudeBot, CCBot, Bytespider) and AI retrieval crawlers (ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot) access a site. Most teams allow retrieval bots to preserve AI search visibility and selectively block training-only bots when content protection is the priority.
TL;DR
Treat AI crawlers in two buckets. Retrieval bots fetch your pages so AI engines can answer a user's live question and usually link back — you almost always want to allow them. Training bots fetch pages to feed model training data with no per-query attribution — here you make a content-protection choice. Use explicit per-user-agent blocks, allow your sitemap, and verify enforcement in server logs. For the wider crawler-control stack, see the technical hub and AI Crawl Signals.
Definition
robots.txt for AI crawlers is the application of the Robots Exclusion Protocol (REP, formalized in RFC 9309) to the new generation of AI-specific user-agents that emerged after 2022. Where the original REP governed search engine indexers like Googlebot and Bingbot, today's robots.txt also has to address AI training crawlers (which feed model weights), AI retrieval crawlers (which fetch pages live during user queries), and training opt-out tokens such as Google-Extended and Applebot-Extended that gate downstream training use without changing crawl behavior.
A correctly configured robots.txt for AI is a per-user-agent policy that distinguishes training from retrieval, names the specific bots that matter for your content strategy, exposes your sitemap, and is verified against real server logs. It is the lowest-cost lever in the AI access-control stack — and, because it is honor-system, it must be paired with verification and (for high-stakes content) edge-layer enforcement.
Why it matters
The robots.txt file has quietly become the single most consequential configuration knob a website operator can turn for AI visibility. Three trends make it more important than at any point in the last decade:
- AI search has become a meaningful traffic and citation channel. ChatGPT search, Perplexity, Google AI Overviews, Claude, Gemini, Apple Intelligence, and DuckAssist all rely on retrieval bots. Blocking them — accidentally or otherwise — removes you from the answer surface entirely.
- Training and retrieval are now distinct business decisions. Allowing GPTBot affects model training. Allowing OAI-SearchBot affects whether ChatGPT can cite you tomorrow. These are no longer the same setting, and treating them as one is the most common mistake in the field.
- Honor-system compliance is broader than it used to be. Major AI vendors publish their user-agent strings and state that they honor robots.txt. According to a Q1 2026 Cloudflare report, GPTBot is the most-blocked AI crawler on the internet, evidence that publishers are actively exercising this lever.
Getting this file wrong has two failure modes that look identical from the outside: silently being excluded from AI answers (visibility loss), or silently feeding free training data while assuming you were protected (content-value loss). Either failure compounds for months before anyone notices, which is why a thoughtful policy plus a 60-day review cadence is the minimum bar.
Two buckets: training vs. retrieval
The single most useful framing for AI crawler policy is the split between training and retrieval.
- Training crawlers fetch your content to build or update model weights. The user does not see your URL when the model later answers a related question. Examples: GPTBot, Google-Extended, Applebot-Extended, ClaudeBot, CCBot, Bytespider, Cohere-AI, Diffbot.
- Retrieval crawlers fetch your content in response to a live user query. Citations and links back to your site are the norm. Examples: ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot, MistralAI-User, Meta-ExternalAgent.
Some bots straddle both modes (Amazonbot, GoogleOther). The decision matrix below assumes you allow retrieval and treat training as a content-strategy decision.
AI crawler user-agents (2026 reference)
| User-Agent | Operator | Mode | Notes |
|---|---|---|---|
| GPTBot | OpenAI | Training | Original OpenAI training crawler. |
| OAI-SearchBot | OpenAI | Retrieval | Powers ChatGPT search index. |
| ChatGPT-User | OpenAI | Retrieval | Triggered by user-initiated browsing. |
| Google-Extended | Google | Training opt-out | Token, not a crawler; governs training use of pages crawled by Googlebot. |
| GoogleOther | Google | Mixed | Internal Google product fetches. |
| PerplexityBot | Perplexity | Retrieval | Indexer for Perplexity answers. |
| Perplexity-User | Perplexity | Retrieval | Live user-action fetches. |
| ClaudeBot | Anthropic | Training | Active Anthropic crawler. |
| anthropic-ai | Anthropic | Training | Legacy token; kept for backward compatibility. |
| Applebot-Extended | Apple | Training opt-out | Token gating training use of Applebot-fetched pages. |
| MistralAI-User | Mistral | Retrieval | Le Chat user-action fetches. |
| Amazonbot | Amazon | Mixed | Search and AI-product fetches. |
| DuckAssistBot | DuckDuckGo | Retrieval | DuckAssist answer crawler. |
| Meta-ExternalAgent | Meta | Mixed | Meta AI assistant fetches. |
| Cohere-AI | Cohere | Training | Cohere model training crawler. |
| Bytespider | ByteDance | Training | Often blocked due to crawl volume. |
| CCBot | Common Crawl | Training | Powers many third-party LLM training corpora. |
| Diffbot | Diffbot | Mixed | Knowledge graph + AI extraction. |
| ImagesiftBot | Imagesift / TheHive | Mixed | Multimodal image and text training. |
New tokens appear roughly every quarter. Treat this list as a starting point, not a closed set. The community-maintained ai.robots.txt GitHub repository is a useful reference for staying current.
Allow vs deny tradeoffs
Each Allow and Disallow line is a tradeoff between three competing goals: citation reach, content protection, and operational cost.
- Allowing retrieval bots maximizes the chance your URL appears as a cited answer in ChatGPT, Perplexity, DuckAssist, and similar surfaces. Cost: none beyond the small crawl bandwidth.
- Allowing training bots lets your content shape future model behavior. The upside is influence over how models talk about your domain; the downside is no per-query attribution and no opt-out once weights are trained.
- Blocking training bots preserves the commercial value of long-form, expensive-to-produce content (newsroom investigations, research reports, paid courses). Cost: your content shapes the model less, and competitors who allow training may be over-represented in answers about your space.
- Blocking retrieval bots removes you from AI answer surfaces entirely. This is rarely the right choice unless you have legal, contractual, or paywall reasons.
- Selective path-level rules (Allow: /blog/, Disallow: /admin/) let you publish public knowledge while protecting customer data, internal APIs, and gated assets. Cost: more rules to maintain.
A good default for content publishers in 2026: allow all retrieval bots, block training bots whose content you want to protect, and verify every quarter. Sites whose entire business model is content visibility (developer docs, open-knowledge wikis) should usually allow everything; sites whose long-form content is the product (newsrooms, research firms) should block training and stay open to retrieval.
Configuration patterns
Pattern 1: Allow everything (open-content sites)
```txt
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
Sitemap: https://example.com/sitemap.xml
```
Use this when discoverability outweighs content-protection concerns and you want maximum AI visibility.
Pattern 2: Allow retrieval, block training (most common)
```txt
# --- Allow retrieval (preserve AI search visibility) ---
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: DuckAssistBot
Allow: /
User-agent: MistralAI-User
Allow: /

# --- Block training ---
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Cohere-AI
Disallow: /
User-agent: Diffbot
Disallow: /

# --- Default ---
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
This is the dominant pattern for publishers, SaaS docs, and content-protected sites that still want to be cited.
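Maintaining a dozen near-identical blocks by hand invites typos. One option is to generate the file from a single policy list. A minimal sketch in Python; the bot names come from the reference table above, and everything else (function name, sitemap URL) is illustrative:

```python
# Sketch: render a Pattern 2 robots.txt from two policy lists.
# Adjust the lists to your own allow/block decisions.

RETRIEVAL_BOTS = [
    "ChatGPT-User", "OAI-SearchBot", "PerplexityBot",
    "Perplexity-User", "DuckAssistBot", "MistralAI-User",
]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "Cohere-AI", "Diffbot",
]

def render_robots_txt(sitemap_url: str) -> str:
    lines = ["# --- Allow retrieval (preserve AI search visibility) ---"]
    for bot in RETRIEVAL_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    lines.append("# --- Block training ---")
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines += ["# --- Default ---", "User-agent: *", "Allow: /", "",
              f"Sitemap: {sitemap_url}"]
    return "\n".join(lines)

print(render_robots_txt("https://example.com/sitemap.xml"))
```

Generating the file also makes the 60-day review mechanical: the diff of the policy lists is the diff of the policy.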
Pattern 3: Selective access by section
```txt
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Disallow: /customer/
User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
```
Under RFC 9309, the most specific (longest) matching path wins regardless of rule order, though listing specific rules first keeps the file readable. Allow the public knowledge base; protect customer data, admin surfaces, and internal APIs.
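Path-level rules are easy to get subtly wrong, so it helps to test them before deploying. A quick sketch using Python's stdlib urllib.robotparser; note that parsers differ on precedence for overlapping paths (Google applies longest-match), so treat this as a sanity check rather than a conformance test:

```python
import urllib.robotparser

# Sketch: sanity-check the Pattern 3 rules against concrete URLs
# before deploying the file.
rules = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Disallow: /customer/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))    # True
print(rp.can_fetch("GPTBot", "https://example.com/admin/panel"))  # False
```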
Real-world examples
Patterns from major sites illustrate the spectrum of policies in production today. (Check live robots.txt for the latest content; policies evolve quickly.)
- The New York Times — strict training block. NYT was an early and aggressive blocker. Its robots.txt explicitly disallows GPTBot, ClaudeBot, anthropic-ai, Google-Extended, CCBot, Applebot-Extended, and several others, while leaving traditional Googlebot and major retrieval bots allowed. The pattern matches NYT's pending litigation posture: protect long-form journalism from training corpora, retain search and citation visibility.
- Reddit — gated by partnership. Reddit blocks generic AI training crawlers in robots.txt while licensing access through paid agreements (notably with Google and OpenAI). The robots.txt is essentially a fence that pushes AI vendors toward commercial deals.
- Stack Overflow — license-first model. Stack Overflow disallows GPTBot and similar training crawlers in robots.txt and offers a paid Stack Overflow for Teams + API license for AI training data. Retrieval bots are allowed so answers can still be cited live.
- BBC and major UK publishers — full training block. BBC blocks GPTBot, Google-Extended, ClaudeBot, CCBot, Applebot-Extended, and others. Retrieval bots remain allowed to keep BBC content surfaceable in answer engines.
- NPR — selective retrieval. NPR's robots.txt allows ChatGPT-User and PerplexityBot while disallowing GPTBot and Google-Extended, demonstrating the canonical training-vs-retrieval split.
- Developer documentation sites (Stripe, Vercel, Cloudflare docs) — allow everything. Developer documentation sites generally allow all AI bots. The strategic logic: the more frequently models cite your docs as the authoritative source, the more developer mindshare you capture.
- E-commerce platforms — selective by path. Product detail pages are typically allowed to retrieval bots; account, checkout, and customer-data paths are disallowed across the board.
The lesson across all of these: there is no universal answer. The right configuration is a function of business model, content economics, and regulatory exposure.
How robots.txt fits with related controls
robots.txt is one layer in a small stack of AI access controls.
- robots.txt — the standards-based REP file, honored by major AI vendors but enforced on the honor system.
- ai.txt — a human-readable AI policy file that complements robots.txt; see ai.txt Reference.
- llms.txt — a positive index that tells AI engines which pages you want indexed; see llms.txt Reference.
- Edge controls — Cloudflare AI Audit and 'Block AI bots', Fastly bot management, Akamai Bot Manager, and CDN-level rules enforce policy when honor-system compliance is not enough.
- Page-level meta tags — noai, noimageai, nocache, and noarchive directives where supported by the consumer.
- Server-side IP verification — major AI vendors publish IP ranges (OpenAI, Anthropic, Google, Apple) so you can confirm the user-agent is genuine, not spoofed.
- Terms of service and licensing — robots.txt is not a contract; explicit ToS clauses about AI training are the legal backstop.
robots.txt remains the lowest-cost first line; edge controls and ToS are the enforcement layer.
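When honor-system compliance is in doubt, a claimed crawler identity can be checked against vendor-published IP ranges. A minimal sketch using Python's ipaddress module; the CIDR ranges below are documentation placeholders (TEST-NET blocks), not real vendor ranges, so substitute the lists each vendor actually publishes:

```python
import ipaddress

# Placeholder ranges for illustration only. Fetch the real CIDR lists
# from each vendor's published documentation and refresh them regularly.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],        # NOT real OpenAI ranges
    "ClaudeBot": ["198.51.100.0/24"],  # NOT real Anthropic ranges
}

def is_verified_crawler(bot: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside a range published for this bot."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in ipaddress.ip_network(cidr)
               for cidr in PUBLISHED_RANGES.get(bot, []))

print(is_verified_crawler("GPTBot", "192.0.2.17"))   # inside the range
print(is_verified_crawler("GPTBot", "203.0.113.5"))  # outside: likely spoofed
```

A request whose user-agent claims GPTBot but whose IP fails this check can be rate-limited or blocked at the edge without touching robots.txt.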
Decision matrix
| Goal | Suggested policy |
|---|---|
| Maximize AI search citations | Allow retrieval and training; expose llms.txt. |
| Preserve content value, stay citable | Allow retrieval; block training-only crawlers. |
| Strict content protection | Block all AI crawlers; add edge enforcement. |
| Public docs only, paid product gated | Selective access by path; sensitive sections disallowed. |
| Image-heavy site | Add ImagesiftBot and similar to your training-block list. |
| News publisher with litigation posture | Block all training, allow retrieval, document policy in ToS. |
| Regulated content (health, finance) | Block training, allow retrieval, layer page-level disclaimers. |
How to verify your policy is enforced
robots.txt is the policy; enforcement requires verification.
- Tail server logs for the user-agent strings you listed and confirm that 200 vs. 403 patterns match your intent. A user-agent appearing in your access log with 200 status after you Disallowed it means either your rule did not load (usually a syntax issue) or the crawler is ignoring it.
- Inspect Search Console crawl reports for any anomalies after the change.
- Re-run your AI prompt library (see AI Search Reporting: Dashboard Setup) to confirm citations did not drop unintentionally after a block.
- Validate syntax with Google Search Console's robots.txt report or a community validator. A stray blank line inside a group can cause some parsers to end the group early.
- Watch for impostors. Some scrapers spoof well-known crawler user-agents; pair high-stakes blocks with edge-layer IP verification against vendor-published IP ranges.
- Schedule a 60-day review. New AI crawlers and tokens appear roughly every quarter; your policy needs an explicit review cadence.
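The log-tailing step can be automated. A sketch that scans combined-format access logs for blocked AI user-agents that still received 200 responses; the bot list, sample log line, and regex are illustrative assumptions, so adapt them to your log format:

```python
import re

# Bots you have Disallowed; a 200 for any of them is worth investigating.
BLOCKED_UAS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]

# Combined log format: ... "METHOD /path HTTP/x" status size "referer" "ua"
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def flag_violations(log_lines):
    """Return (user-agent, path) pairs for blocked bots that got a 200."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if not m:
            continue
        if m.group("status") == "200" and any(
            bot in m.group("ua") for bot in BLOCKED_UAS
        ):
            hits.append((m.group("ua"), m.group("path")))
    return hits

sample = [
    '203.0.113.9 - - [01/May/2026:12:00:00 +0000] "GET /blog/post HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"'
]
print(flag_violations(sample))
```

Running this daily over fresh logs turns the 60-day review from a manual audit into a glance at an alert.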
Common misconfigurations
- Blocking everything with a User-agent: * catch-all and forgetting AI retrieval bots. This silently removes you from AI search. The wildcard group applies only to crawlers with no group of their own; if you intend to allow retrieval bots, name them explicitly in their own groups.
- Mixing training and retrieval into one block. Different crawlers have different business implications; separate them. A User-agent: GPTBot block does not affect ChatGPT-User or OAI-SearchBot.
- Forgetting the Sitemap line. Always include Sitemap: https://yourdomain/sitemap.xml. AI crawlers use the sitemap as the canonical inventory of indexable URLs.
- Treating Google-Extended or Applebot-Extended as crawlers. They are training opt-out tokens; the actual crawl is done by Googlebot or Applebot. Disallowing the token does not stop the crawl — it stops downstream training use.
- Skipping the verification step. A robots.txt that looks correct but is malformed silently allows everything. Always verify against real logs.
- Stale crawler lists. New tokens appear quarterly. Schedule a 60-day review.
- Blocking by IP instead of user-agent. AI vendor IP ranges change frequently. Use user-agent rules as the primary policy and IP allow-lists only for high-stakes verification.
- Putting AI rules behind a CDN that strips comments. Some CDN minifiers strip # comment lines and break User-agent block boundaries. Verify the served file matches the source.
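The last point, confirming that the served file matches the source, can be scripted. A minimal sketch; the SOURCE snippet stands in for your deployed file, and the commented-out fetch uses a placeholder domain:

```python
SOURCE = """\
# --- Block training ---
User-agent: GPTBot
Disallow: /
"""

def normalized(text: str) -> str:
    # Strip trailing whitespace so cosmetic differences don't trigger
    # false alarms; anything else (e.g. stripped # comments) should.
    return "\n".join(line.rstrip() for line in text.strip().splitlines())

def served_matches_source(served: str, source: str = SOURCE) -> bool:
    return normalized(served) == normalized(source)

# Fetch the live file and compare (example.com is a placeholder):
# import urllib.request
# with urllib.request.urlopen("https://example.com/robots.txt") as resp:
#     assert served_matches_source(resp.read().decode("utf-8"))
```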
Complete example
```txt
# robots.txt for example.com
# Updated: 2026-05-01

# --- Traditional search ---
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

# --- AI retrieval (allow to preserve citations) ---
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: DuckAssistBot
Allow: /
User-agent: MistralAI-User
Allow: /
User-agent: Meta-ExternalAgent
Allow: /

# --- AI training (block to protect content value) ---
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Cohere-AI
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: ImagesiftBot
Disallow: /

# --- Default ---
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
FAQ
Q: Do AI bots actually obey robots.txt?
Major AI vendors (OpenAI, Google, Anthropic, Apple, Perplexity, Mistral, DuckDuckGo, Meta, Cohere) publicly state they honor robots.txt. Compliance is on the honor system, so high-stakes policies should be paired with edge-layer enforcement and vendor IP-range verification.
Q: Will blocking GPTBot remove me from ChatGPT answers?
Not necessarily. ChatGPT-User and OAI-SearchBot drive live retrieval and search; GPTBot drives training. Blocking GPTBot only removes you from future training data and does not block live citations as long as the retrieval bots remain allowed.
Q: What is the difference between Google-Extended and Googlebot?
Google-Extended is a training opt-out token, not a separate crawler. Googlebot still fetches the page; setting Disallow: / for Google-Extended tells Google not to use those pages for Gemini and Vertex AI training. Search ranking is unaffected.
Q: Should I add Applebot-Extended even if Apple isn't a search priority?
Yes, if you want explicit training opt-out coverage. Applebot-Extended is the training-use token for content fetched by Applebot, which powers Apple Intelligence and Siri suggestions.
Q: How often should I review my robots.txt for AI?
Every 60 days, or after any major model release. New crawler tokens appear about once a quarter, and existing vendors occasionally rename or split their bots.
Q: Does robots.txt protect against scraping for unauthorized training?
No. robots.txt is a request, not an enforcement mechanism. For protection against non-compliant scrapers, layer in CDN bot management, rate limiting, and contractual restrictions in your terms of service.
Q: Should I block CCBot (Common Crawl)?
It depends on your content economics. Common Crawl is a public web archive used by many third-party LLMs and research projects. Allowing CCBot maximizes downstream model exposure; blocking it is the most efficient way to remove your content from a wide swath of training pipelines, since many models rely on Common Crawl rather than crawling directly.
Q: Is there a difference between Disallow: / and removing the user-agent block entirely?
Yes. Disallow: / is an explicit instruction the crawler must follow. Omitting the block means the crawler falls back to your User-agent: * rule, which usually allows access. Always be explicit with the bots you care about.
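The difference can be demonstrated with Python's stdlib parser; the sketch below uses GPTBot as the explicitly blocked bot and a made-up "SomeNewBot" to show the wildcard fallback:

```python
import urllib.robotparser

# A named group blocks GPTBot explicitly; every other bot falls back
# to the permissive wildcard group.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/page"))      # explicit block
print(rp.can_fetch("SomeNewBot", "https://example.com/page"))  # wildcard fallback
```

Delete the GPTBot group and it, too, falls back to the wildcard and is allowed, which is why omission and Disallow: / are not equivalent.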
Q: What about images, video, and multimodal content?
Add ImagesiftBot and similar multimodal crawlers to your training-block list if you want to protect visual assets. Some vendors also honor noimageai page-level meta tags as a complement to robots.txt rules.
Related Articles
AI Crawl Signals: How AI Discovers Content
Technical reference for the signals AI systems use to discover, access, and prioritize web content — including sitemaps, llms.txt, robots.txt, structured data, and HTTP headers.
ai.txt: AI Agent Access Policy Reference
ai.txt is an emerging root-level file that declares site-wide permissions and attribution rules for AI training, citation, and inference.
HTML Semantic Structure for AI Readability
Use HTML5 semantic elements like article, section, nav, and proper heading hierarchy to improve AI crawler extraction and citation probability.