Geodocs.dev

AI Search Crawler User-Agents: Complete 2026 Reference


AI search crawlers identify themselves with distinct user-agents: GPTBot and OAI-SearchBot from OpenAI, ClaudeBot and Claude-User from Anthropic, PerplexityBot and Perplexity-User from Perplexity, Google-Extended from Google, plus others. Allow or block them in robots.txt, and verify identity via reverse DNS or, where vendors publish them, official IP ranges.

TL;DR

Most AI engines run multiple crawlers with distinct purposes: training, retrieval, and live answering. Block training-only bots if you wish; almost always allow retrieval bots if you want to be cited. Verify identity with reverse DNS or published IP ranges.

Definition

An AI search crawler user-agent is the string an AI engine's HTTP client sends in the User-Agent header when fetching content for training, retrieval, indexing, or live answering. Each major engine ships multiple bots scoped to different purposes.

Reference table

| Vendor | User-Agent | Purpose | robots.txt name | Verification |
|---|---|---|---|---|
| OpenAI | GPTBot/1.x | Training | GPTBot | IP ranges published |
| OpenAI | OAI-SearchBot/1.x | Search index for ChatGPT search | OAI-SearchBot | IP ranges published |
| OpenAI | ChatGPT-User/1.x | On-demand fetch from a user prompt | ChatGPT-User | IP ranges published |
| Anthropic | ClaudeBot/1.x | Training | ClaudeBot | IP ranges published |
| Anthropic | Claude-User/1.x | On-demand fetch from a user prompt | Claude-User | IP ranges published |
| Anthropic | Claude-SearchBot | Search/retrieval indexing | Claude-SearchBot | IP ranges published |
| Perplexity | PerplexityBot/1.x | Search/retrieval indexing | PerplexityBot | IP ranges published |
| Perplexity | Perplexity-User/1.x | On-demand fetch from a user prompt | Perplexity-User | IP ranges published |
| Google | Google-Extended | Gemini training opt-out token (no separate UA) | Google-Extended | n/a (declared) |
| Google | GoogleOther | Various Google AI fetches | GoogleOther | RDNS |
| Microsoft | Bingbot (also AI Mode/Copilot) | Search + AI | Bingbot | RDNS |
| Apple | Applebot (with Applebot-Extended opt-out) | Apple Intelligence | Applebot / Applebot-Extended | RDNS |
| Meta | Meta-ExternalAgent | Meta AI | Meta-ExternalAgent | IP ranges published |
| Cohere | cohere-ai | Training/retrieval | cohere-ai | n/a |
| Common Crawl | CCBot | Open dataset many LLMs train on | CCBot | RDNS |
| You.com | YouBot | Search/retrieval | YouBot | RDNS |
| Mistral | MistralAI-User | On-demand fetch | MistralAI-User | Declared |
| ByteDance | Bytespider | Doubao training | Bytespider | RDNS |

Allow citation bots, block training-only

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Allow live retrieval (these power citations)

```
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /
```

Allow everything

```
User-agent: *
Allow: /
```

Allow all retrieval/citation bots and decline training

Use explicit per-vendor Disallow rules for GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, and Bytespider while allowing the search/retrieval bots.
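A sketch of such a combined robots.txt (your own paths and policy may differ):

```
# Decline training / dataset crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval/citation bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /
```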

Verifying crawler identity

User-agents are spoofable. Verify by:

  1. Checking the request IP against the vendor's published IP range list (OpenAI, Anthropic, Perplexity, Meta, Google publish theirs).
  2. Reverse DNS lookups where vendors use verifiable hostnames (e.g., Googlebot hosts resolve under googlebot.com; Bingbot hosts under search.msn.com).
  3. Logging unverified hits as suspicious traffic.
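Steps 1 and 2 can be sketched in Python. The CIDR ranges below are placeholder documentation addresses, not real vendor ranges; in practice, fetch the machine-readable IP range files the vendors publish.

```python
import ipaddress
import socket

# Illustrative placeholders -- replace with the vendor's published CIDR list.
PUBLISHED_RANGES = [
    ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.0/24")
]

def ip_in_published_ranges(ip: str) -> bool:
    """Step 1: check the request IP against published CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)

def forward_confirmed_rdns(ip: str, allowed_suffixes: tuple) -> bool:
    """Step 2: reverse-resolve the IP, check the hostname suffix, then
    forward-resolve the hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname.endswith(allowed_suffixes):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return ip in forward_ips
```

Requests that fail both checks fall under step 3: log them as unverified and treat the traffic as suspicious.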

Common misconceptions

  • "Blocking GPTBot blocks ChatGPT search." False. GPTBot is training; OAI-SearchBot and ChatGPT-User power search and live citations. Block OAI-SearchBot and ChatGPT-User only if you do not want to be cited.
  • "User-agents are reliable." False on their own. Always combine UA checks with IP or RDNS verification.
  • "There is one bot per vendor." False. OpenAI, Anthropic, and Perplexity each ship 2-3 distinct bots.

How to apply

  1. Audit current robots.txt and confirm citation bots are allowed.
  2. Decide on training-bot policy and codify it.
  3. Implement IP/RDNS verification for high-traffic endpoints.
  4. Re-check vendor-published IP ranges quarterly.
  5. Log AI bot traffic separately for visibility analytics.
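For step 5, a minimal sketch of classifying user-agents into training/retrieval/on-demand buckets for separate logging. The category map is an illustrative subset of the reference table, and substring matching is used because the bot token usually appears mid-string with a version suffix (e.g., "GPTBot/1.2"):

```python
# Illustrative subset of the reference table above.
BOT_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "ChatGPT-User": "on-demand",
    "Claude-User": "on-demand",
    "Perplexity-User": "on-demand",
}

def classify_ua(user_agent: str) -> str:
    """Bucket a raw User-Agent string; tolerant of version suffixes."""
    for token, category in BOT_CATEGORIES.items():
        if token in user_agent:
            return category
    return "other"
```

Feeding each request's UA through this classifier lets you aggregate AI bot traffic separately from ordinary visitors in your analytics.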

FAQ

Q: If I block GPTBot, do I lose ChatGPT citations?

No. GPTBot is training-only. Citations come via OAI-SearchBot and ChatGPT-User. Blocking GPTBot opts you out of training but preserves citations.

Q: Does Google have a separate AI crawler?

Google does not run a separate AI crawler. Instead, Google-Extended is a robots-token opt-out from Gemini training; classical Googlebot still drives indexing and AI Overviews.

Q: Can I block Common Crawl?

Yes, via User-agent: CCBot with Disallow: /. Note that historical Common Crawl snapshots may already include your content; the block only prevents future inclusion.

Q: Are these user-agents stable?

Vendor names are stable, but version suffixes change; match by prefix in your robots.txt and analytics rules.

Q: What about Apple Intelligence?

Apple uses Applebot for indexing and offers Applebot-Extended as an AI-training opt-out token, similar to Google's pattern.
