robots.txt for AI Crawlers
robots.txt for AI crawlers extends the traditional Robots Exclusion Protocol to control how AI systems access, crawl, and index your content. In practice, this means adding user-agent rules for GPTBot, PerplexityBot, Google-Extended, and other AI crawlers that specify what content each one may access and index.
AI Crawler User-Agents
| User-Agent | AI System | Purpose |
|---|---|---|
| GPTBot | OpenAI/ChatGPT | Model training |
| ChatGPT-User | ChatGPT | Real-time browsing |
| Google-Extended | Google AI | Training data |
| GoogleOther | Google | AI features |
| PerplexityBot | Perplexity | Search indexing |
| Anthropic-AI | Claude/Anthropic | Web access |
| ClaudeBot | Claude | Web browsing |
| Bytespider | ByteDance | AI training |
| CCBot | Common Crawl | Training datasets |
| FacebookBot | Meta AI | AI features |
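To see how a given rule set affects these crawlers, Python's standard-library urllib.robotparser can evaluate robots.txt rules against any user-agent string. The sketch below uses a hypothetical rule set and a placeholder URL purely for illustration.

from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block GPTBot, allow PerplexityBot (illustration only)
rules = """User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Check which AI crawlers may fetch a sample URL under these rules
for agent in ["GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot", "CCBot"]:
    allowed = parser.can_fetch(agent, "https://yoursite.com/blog/post")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

Note that crawlers with no matching group and no wildcard group are treated as allowed by the parser, which mirrors how most crawlers interpret an absent rule.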
Configuration Examples
Allow All AI Crawlers
# AI Crawlers - Allow all
User-agent: GPTBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ClaudeBot
Allow: /
Block Training, Allow Retrieval
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Allow search/retrieval crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
Selective Access
# Allow AI to access public content only
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Disallow: /user/
User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Best Practices
| Practice | Why |
|---|---|
| Be explicit with each user-agent | Wildcard rules may not apply |
| Separate training from retrieval | Different business implications |
| Allow sitemap access | Helps AI index efficiently |
| Test after changes | Verify crawlers respect rules (see the sketch below) |
| Document your policy | Use ai.txt for human-readable version |
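A simple way to test after changes is to fetch the deployed robots.txt and assert the expected outcome for each AI user-agent. The sketch below assumes the block-training/allow-retrieval policy and the /blog/ and /docs/ paths from the examples above, with yoursite.com as a placeholder domain.

from urllib.robotparser import RobotFileParser

# Fetch the deployed robots.txt (yoursite.com is a placeholder)
parser = RobotFileParser()
parser.set_url("https://yoursite.com/robots.txt")
parser.read()

# (user-agent, URL, expected result) under a block-training / allow-retrieval policy
checks = [
    ("GPTBot", "https://yoursite.com/blog/post", False),
    ("ChatGPT-User", "https://yoursite.com/blog/post", True),
    ("PerplexityBot", "https://yoursite.com/docs/page", True),
]

for agent, url, expected in checks:
    allowed = parser.can_fetch(agent, url)
    status = "OK" if allowed == expected else "MISMATCH"
    print(f"{status}: {agent} -> {url} (allowed={allowed})")

# site_maps() lists any Sitemap: URLs declared in the file (Python 3.8+)
print("Sitemaps:", parser.site_maps())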
Common Mistakes
- Blocking all bots — also blocks the retrieval crawlers that cite your content in AI search results
- Not listing specific user-agents — generic rules may not apply to every AI crawler
- Forgetting the sitemap — always include a Sitemap: URL line
- Mixing training and retrieval — different crawlers, different purposes
Complete Example
# robots.txt for yoursite.com
# Updated: 2025-04-25
# Traditional crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI Retrieval (allow - benefits search visibility)
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
# AI Training (block - protect content value)
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# All other bots
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Related Articles
- ai.txt Reference — AI access policy file
- llms.txt Reference — AI content guide
- AI Crawl Signals — How AI discovers content