robots.txt for AI Crawlers

robots.txt for AI crawlers extends the traditional Robots Exclusion Protocol to control how AI systems access, crawl, and index your content. In practice, this means adding explicit user-agent rules for GPTBot, PerplexityBot, Google-Extended, and other AI crawlers that spell out what each one may access and index.

AI Crawler User-Agents

User-Agent         AI System           Purpose
GPTBot             OpenAI/ChatGPT      Web browsing, training
ChatGPT-User       ChatGPT             Real-time browsing
Google-Extended    Google AI           Training data
GoogleOther        Google              AI features
PerplexityBot      Perplexity          Search indexing
Anthropic-AI       Claude/Anthropic    Web access
ClaudeBot          Claude              Web browsing
Bytespider         ByteDance           AI training
CCBot              Common Crawl        Training datasets
FacebookBot        Meta AI             AI features
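
To see which of these crawlers already visit your site, a quick pass over the server access log is usually enough. The sketch below is illustrative: it assumes a Python environment and an nginx-style combined log at /var/log/nginx/access.log, so adjust the path and format for your own setup.

# Count requests from known AI crawlers in a web server access log.
from collections import Counter
from pathlib import Path

AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "GoogleOther",
    "PerplexityBot", "Anthropic-AI", "ClaudeBot", "Bytespider",
    "CCBot", "FacebookBot",
]

LOG_PATH = Path("/var/log/nginx/access.log")  # assumption: adjust to your server

counts = Counter()
with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
    for line in log:
        lowered = line.lower()
        for crawler in AI_CRAWLERS:
            # The combined log format quotes the user-agent string at the end of each line.
            if crawler.lower() in lowered:
                counts[crawler] += 1
                break

for crawler, hits in counts.most_common():
    print(f"{crawler}: {hits} requests")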

Configuration Examples

Allow All AI Crawlers

# AI Crawlers - Allow all
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

Block Training, Allow Retrieval

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search/retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
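
Before deploying a split policy like this, you can sanity-check it with urllib.robotparser from the Python standard library. This is a minimal sketch: the rules are pasted inline and the page URL is a placeholder.

# Verify that training crawlers are blocked while retrieval crawlers are allowed.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

page = "https://yoursite.com/blog/some-post"  # placeholder URL
print(parser.can_fetch("GPTBot", page))         # False: training crawler blocked
print(parser.can_fetch("ChatGPT-User", page))   # True: retrieval crawler allowed
print(parser.can_fetch("PerplexityBot", page))  # True: retrieval crawler allowed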

Selective Access

# Allow AI to access public content only
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Disallow: /user/

User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/

Best Practices

Practice                            Why
Be explicit with each user-agent    Wildcard rules may not apply
Separate training from retrieval    Different business implications
Allow sitemap access                Helps AI index efficiently
Test after changes                  Verify crawlers respect rules
Document your policy                Use ai.txt for human-readable version
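
For the "test after changes" practice, one option is a short script that fetches the live file and confirms the intended split between retrieval and training crawlers. The sketch below uses urllib.robotparser from the Python standard library; yoursite.com is a placeholder, the expected values match the block-training, allow-retrieval policy shown earlier, and site_maps() requires Python 3.8 or newer. Adapt the expectations to your own policy.

# Fetch the live robots.txt and check that each crawler gets the intended answer.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://yoursite.com/robots.txt")  # placeholder domain
parser.read()  # downloads and parses the live file

page = "https://yoursite.com/blog/example-post"
expected = {
    "ChatGPT-User": True,      # retrieval crawler should stay allowed
    "PerplexityBot": True,
    "GPTBot": False,           # training crawler should stay blocked
    "Google-Extended": False,
    "SomeOtherBot": True,      # generic crawlers should not be blanket-blocked
}

for agent, want in expected.items():
    got = parser.can_fetch(agent, page)
    status = "ok" if got == want else "MISMATCH"
    print(f"{agent}: allowed={got} ({status})")

# site_maps() lists the Sitemap: URLs from robots.txt, or returns None if missing.
print("Sitemaps:", parser.site_maps())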

Common Mistakes

  1. Blocking all bots — Also blocks beneficial AI search citation
  2. Not listing specific user-agents — Generic rules may not apply
  3. Forgetting sitemap — Always include Sitemap: URL
  4. Mixing training and retrieval — Different crawlers, different purposes

Complete Example

# robots.txt for yoursite.com
# Updated: 2025-04-25

# Traditional crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI Retrieval (allow - benefits search visibility)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# AI Training (block - protect content value)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# All other bots
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Related Articles

AI Crawl Signals: How AI Discovers Content

A technical reference of the signals AI systems use to discover, crawl, and index web content.

ai.txt Reference

ai.txt is a proposed standard file that defines access policies and attribution requirements specifically for AI agents, chatbots, and LLM-powered systems.

llms.txt Reference

llms.txt is a proposed standard file that provides a machine-readable index of site content for AI crawlers. It tells LLMs what a site contains and how to navigate it.
