robots.txt for AI Crawlers

robots.txt for AI crawlers extends the traditional Robots Exclusion Protocol to control how AI systems access, crawl, and index your content. In practice, this means adding explicit user-agent rules for GPTBot, PerplexityBot, Google-Extended, and other AI crawlers that spell out what each one may access and index.

AI Crawler User-Agents

User-Agent         AI System           Purpose
GPTBot             OpenAI/ChatGPT      Web browsing, training
ChatGPT-User       ChatGPT             Real-time browsing
Google-Extended    Google AI           Training data
GoogleOther        Google              AI features
PerplexityBot      Perplexity          Search indexing
Anthropic-AI       Claude/Anthropic    Web access
ClaudeBot          Claude              Web browsing
Bytespider         ByteDance           AI training
CCBot              Common Crawl        Training datasets
FacebookBot        Meta AI             AI features
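
To see which of these crawlers already visit your site, a quick pass over the server access log is usually enough. The sketch below is illustrative: it assumes a Python environment and an nginx-style combined log at /var/log/nginx/access.log, so adjust the path and format for your own setup.

# Count requests from known AI crawlers in a web server access log.
from collections import Counter
from pathlib import Path

AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "GoogleOther",
    "PerplexityBot", "Anthropic-AI", "ClaudeBot", "Bytespider",
    "CCBot", "FacebookBot",
]

LOG_PATH = Path("/var/log/nginx/access.log")  # assumption: adjust to your server

counts = Counter()
with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
    for line in log:
        lowered = line.lower()
        for crawler in AI_CRAWLERS:
            # The combined log format quotes the user-agent string at the end of each line.
            if crawler.lower() in lowered:
                counts[crawler] += 1
                break

for crawler, hits in counts.most_common():
    print(f"{crawler}: {hits} requests")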

Configuration Examples

Allow All AI Crawlers

# AI Crawlers - Allow all
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

Block Training, Allow Retrieval

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search/retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
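
Before deploying a split policy like this, you can sanity-check it with urllib.robotparser from the Python standard library. This is a minimal sketch: the rules are pasted inline and the page URL is a placeholder.

# Verify that training crawlers are blocked while retrieval crawlers are allowed.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

page = "https://yoursite.com/blog/some-post"  # placeholder URL
print(parser.can_fetch("GPTBot", page))         # False: training crawler blocked
print(parser.can_fetch("ChatGPT-User", page))   # True: retrieval crawler allowed
print(parser.can_fetch("PerplexityBot", page))  # True: retrieval crawler allowed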

Selective Access

# Allow AI to access public content only
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
Disallow: /user/

User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/

Best Practices

Practice                            Why
Be explicit with each user-agent    Wildcard rules may not apply
Separate training from retrieval    Different business implications
Allow sitemap access                Helps AI index efficiently
Test after changes                  Verify crawlers respect rules
Document your policy                Use ai.txt for human-readable version
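
For the "test after changes" practice, one option is a short script that fetches the live file and confirms the intended split between retrieval and training crawlers. The sketch below uses urllib.robotparser from the Python standard library; yoursite.com is a placeholder, the expected values match the block-training, allow-retrieval policy shown earlier, and site_maps() requires Python 3.8 or newer. Adapt the expectations to your own policy.

# Fetch the live robots.txt and check that each crawler gets the intended answer.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://yoursite.com/robots.txt")  # placeholder domain
parser.read()  # downloads and parses the live file

page = "https://yoursite.com/blog/example-post"
expected = {
    "ChatGPT-User": True,      # retrieval crawler should stay allowed
    "PerplexityBot": True,
    "GPTBot": False,           # training crawler should stay blocked
    "Google-Extended": False,
    "SomeOtherBot": True,      # generic crawlers should not be blanket-blocked
}

for agent, want in expected.items():
    got = parser.can_fetch(agent, page)
    status = "ok" if got == want else "MISMATCH"
    print(f"{agent}: allowed={got} ({status})")

# site_maps() lists the Sitemap: URLs from robots.txt, or returns None if missing.
print("Sitemaps:", parser.site_maps())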

Common Mistakes

  1. Blocking all bots — Also blocks beneficial AI search citation
  2. Not listing specific user-agents — Generic rules may not apply
  3. Forgetting sitemap — Always include Sitemap: URL
  4. Mixing training and retrieval — Different crawlers, different purposes

Complete Example

# robots.txt for yoursite.com
# Updated: 2025-04-25

# Traditional crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI Retrieval (allow - benefits search visibility)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# AI Training (block - protect content value)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# All other bots
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Related Articles

AI Crawl Signals: How AI Discovers Content

A technical reference of the signals AI systems use to discover, crawl, and index web content.

ai.txt Reference

ai.txt is a proposed standard file that defines access policies and attribution requirements specifically for AI agents, chatbots, and LLM-powered systems.

llms.txt Reference

llms.txt is a proposed standard file that provides a machine-readable index of site content for AI crawlers. It tells LLMs what a site contains and how to navigate it.
