AI Search Crawler User-Agents: Complete 2026 Reference
AI search crawlers identify themselves with distinct user-agents: GPTBot and OAI-SearchBot from OpenAI, ClaudeBot and Claude-User from Anthropic, PerplexityBot and Perplexity-User from Perplexity, Google-Extended from Google, plus others. Allow or block them in robots.txt, and verify their identity via published IP ranges or reverse DNS.
TL;DR
Most AI engines run multiple crawlers with distinct purposes: training, retrieval, and live answering. Block training-only bots if you wish; almost always allow retrieval bots if you want to be cited. Verify identity with reverse DNS or published IP ranges.
Definition
An AI search crawler user-agent is the string an AI engine's HTTP client sends in the User-Agent header when fetching content for training, retrieval, indexing, or live answering. Each major engine ships multiple bots scoped to different purposes.
Reference table
| Vendor | User-Agent | Purpose | robots.txt name | Verification |
|---|---|---|---|---|
| OpenAI | GPTBot/1.x | Training | GPTBot | IP ranges published |
| OpenAI | OAI-SearchBot/1.x | Search index for ChatGPT search | OAI-SearchBot | IP ranges published |
| OpenAI | ChatGPT-User/1.x | On-demand fetch from a user prompt | ChatGPT-User | IP ranges published |
| Anthropic | ClaudeBot/1.x | Training | ClaudeBot | IP ranges published |
| Anthropic | Claude-User/1.x | On-demand fetch from a user prompt | Claude-User | IP ranges published |
| Anthropic | Claude-SearchBot | Search/retrieval indexing | Claude-SearchBot | IP ranges published |
| Perplexity | PerplexityBot/1.x | Search/retrieval indexing | PerplexityBot | IP ranges published |
| Perplexity | Perplexity-User/1.x | On-demand fetch from a user prompt | Perplexity-User | IP ranges published |
| Google | Google-Extended (robots token, no separate UA) | Gemini training opt-out | Google-Extended | n/a (declared token) |
| Google | GoogleOther | Various Google AI fetches | GoogleOther | RDNS |
| Microsoft | Bingbot (also AI Mode/Copilot) | Search + AI | Bingbot | RDNS |
| Apple | Applebot (with Applebot-Extended opt-out) | Apple Intelligence | Applebot/Applebot-Extended | RDNS |
| Meta | Meta-ExternalAgent | Meta AI | Meta-ExternalAgent | IP ranges published |
| Cohere | cohere-ai | Training/retrieval | cohere-ai | n/a |
| Common Crawl | CCBot | Open dataset many LLMs train on | CCBot | RDNS |
| You.com | YouBot | Search/retrieval | YouBot | RDNS |
| Mistral | MistralAI-User | On-demand fetch | MistralAI-User | declared |
| Bytedance | Bytespider | Doubao training | Bytespider | RDNS |
Recommended robots.txt patterns
Block training-only bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
Allow live retrieval (these power citations)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
Allow everything
User-agent: *
Allow: /
Allow all retrieval/citation bots and decline training
Combine explicit per-vendor Disallow rules for GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, and Bytespider with Allow rules for the search/retrieval bots.
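Putting that together, a combined policy might look like the following sketch (bot names per the table above; adjust the list to your own policy):

```
# Decline training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval/citation bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Everyone else: default policy
User-agent: *
Allow: /
```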
Verifying crawler identity
User-agents are spoofable. Verify by:
- Checking the request IP against the vendor's published IP range list (OpenAI, Anthropic, Perplexity, Meta, Google publish theirs).
- Reverse DNS lookups with forward confirmation (e.g., Googlebot PTR hostnames end in googlebot.com or google.com; Bingbot's end in search.msn.com).
- Logging unverified hits as suspicious traffic.
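Both checks can be sketched in a few lines of Python. The CIDR block below is a placeholder (an RFC 5737 documentation range), not a real vendor range; in production, load the vendor's current published list instead of hardcoding anything:

```python
import ipaddress
import socket

# Placeholder CIDRs for illustration only -- fetch the vendor's
# current published list rather than hardcoding values like these.
VENDOR_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],  # RFC 5737 example block
}

def ip_in_published_ranges(ip: str, networks) -> bool:
    """True if the request IP falls inside one of the vendor's ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

def verify_rdns(ip: str, allowed_suffixes,
                reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                forward=socket.gethostbyname) -> bool:
    """Forward-confirmed reverse DNS: the PTR hostname must end in an
    allowed suffix AND resolve back to the same IP (defeats spoofed PTRs)."""
    try:
        hostname = reverse(ip)
        return hostname.endswith(tuple(allowed_suffixes)) and forward(hostname) == ip
    except OSError:
        return False
```

The `reverse`/`forward` hooks exist so the logic can be exercised without live DNS; in production the `socket` defaults apply.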
Common misconceptions
- "Blocking GPTBot blocks ChatGPT search." False. GPTBot is training; OAI-SearchBot and ChatGPT-User power search and live citations. Block them only if you do not want to be cited.
- "User-agents are reliable." False alone. Always combine with IP/RDNS verification.
- "There is one bot per vendor." False. OpenAI, Anthropic, and Perplexity each ship 2-3 distinct bots.
How to apply
- Audit current robots.txt and confirm citation bots are allowed.
- Decide on training-bot policy and codify it.
- Implement IP/RDNS verification for high-traffic endpoints.
- Re-check vendor-published IP ranges quarterly.
- Log AI bot traffic separately for visibility analytics.
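The last step can be sketched as a small log tally. The token list is drawn from the reference table above; the log lines are assumed to be generic combined-log text containing the User-Agent string:

```python
from collections import Counter

# Substring tokens for AI crawlers, per the reference table above.
AI_BOT_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider", "Meta-ExternalAgent",
)

def tally_ai_hits(log_lines):
    """Count access-log lines per AI bot (first matching token wins)."""
    counts = Counter()
    for line in log_lines:
        for token in AI_BOT_TOKENS:
            if token in line:
                counts[token] += 1
                break
    return counts
```

Feeding a day's access log through this gives a per-bot hit count you can chart separately from human and classic-crawler traffic.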
FAQ
Q: If I block GPTBot, do I lose ChatGPT citations?
No. GPTBot is training-only. Citations come via OAI-SearchBot and ChatGPT-User. Blocking GPTBot opts you out of training but preserves citations.
Q: Does Google have a separate AI crawler?
Google has no dedicated AI-search crawler. Google-Extended is a robots-token opt-out from Gemini training; standard Googlebot (plus GoogleOther for some fetches) still drives indexing and AI Overviews.
Q: Can I block Common Crawl?
Yes, with `User-agent: CCBot` / `Disallow: /`. Note that historical Common Crawl snapshots may already include your content; the block only prevents future inclusion.
Q: Are these user-agents stable?
Bot names are stable; version suffixes change over time. Match on the name prefix (e.g., `GPTBot` rather than `GPTBot/1.2`) in your robots.txt and analytics rules.
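For example, a match that tolerates version churn can be written as a regex over the stable names (this covers only a subset of the bots in the table; extend the alternation as needed):

```python
import re

# Stable bot names; the /x.y version suffix is allowed to vary.
BOT_RE = re.compile(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User"
    r"|PerplexityBot|Perplexity-User)/(\d+(?:\.\d+)*)"
)

def match_bot(user_agent: str):
    """Return (bot_name, version) if an AI bot token is present, else None."""
    m = BOT_RE.search(user_agent)
    return (m.group(1), m.group(2)) if m else None
```

Keying analytics on the returned name (not the full matched string) keeps dashboards stable when vendors bump versions.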
Q: What about Apple Intelligence?
Apple uses Applebot for indexing and offers Applebot-Extended as an AI-training opt-out token, similar to Google's pattern.
Related Articles
Agent Citation Attribution Specification: Verifiable Source Tracking for Autonomous AI Agents
Specification defining HTTP headers, provenance manifests, and chain-of-citation markup so autonomous AI agents produce verifiable citations to source content.
Browser Agent Crawl Etiquette: A Specification for Polite Autonomous AI Browsing
A specification defining how browser-based AI agents should identify themselves, throttle requests, and respect publisher signals to maintain citation trust.
Verified Agent Identity for Citation Trust: A Specification for Authenticated AI Crawlers
Specification for verified agent identity: how publishers authenticate AI crawlers via cryptographic signatures so citation trust survives spoofing.