AI Search Crawler User-Agents: Complete 2026 Reference
AI search crawlers identify themselves with distinct user-agents: GPTBot and OAI-SearchBot from OpenAI, ClaudeBot and Claude-User from Anthropic, PerplexityBot and Perplexity-User from Perplexity, Google-Extended from Google, plus others. Allow or block them in robots.txt, and verify their identity via published IP ranges or reverse DNS.
TL;DR
Most AI engines run multiple crawlers with distinct purposes: training, retrieval, and live answering. Block training-only bots if you wish; almost always allow retrieval bots if you want to be cited. Verify identity with reverse DNS or published IP ranges.
Definition
An AI search crawler user-agent is the string an AI engine's HTTP client sends in the User-Agent header when fetching content for training, retrieval, indexing, or live answering. Each major engine ships multiple bots scoped to different purposes.
Reference table
| Vendor | User-Agent | Purpose | robots.txt name | Verification |
|---|---|---|---|---|
| OpenAI | GPTBot/1.x | Training | GPTBot | IP ranges published |
| OpenAI | OAI-SearchBot/1.x | Search index for ChatGPT search | OAI-SearchBot | IP ranges published |
| OpenAI | ChatGPT-User/1.x | On-demand fetch from a user prompt | ChatGPT-User | IP ranges published |
| Anthropic | ClaudeBot/1.x | Training | ClaudeBot | IP ranges published |
| Anthropic | Claude-User/1.x | On-demand fetch from a user prompt | Claude-User | IP ranges published |
| Anthropic | Claude-SearchBot | Search/retrieval indexing | Claude-SearchBot | IP ranges published |
| Perplexity | PerplexityBot/1.x | Search/retrieval indexing | PerplexityBot | IP ranges published |
| Perplexity | Perplexity-User/1.x | On-demand fetch from a user prompt | Perplexity-User | IP ranges published |
| Google | Google-Extended (robots token, no separate UA) | Gemini training opt-out | Google-Extended | n/a (declared token) |
| Google | GoogleOther | Various Google AI fetches | GoogleOther | RDNS |
| Microsoft | Bingbot (also AI Mode/Copilot) | Search + AI | Bingbot | RDNS |
| Apple | Applebot (with Applebot-Extended opt-out) | Apple Intelligence | Applebot/Applebot-Extended | RDNS |
| Meta | Meta-ExternalAgent | Meta AI | Meta-ExternalAgent | IP ranges published |
| Cohere | cohere-ai | Training/retrieval | cohere-ai | n/a |
| Common Crawl | CCBot | Open dataset many LLMs train on | CCBot | RDNS |
| You.com | YouBot | Search/retrieval | YouBot | RDNS |
| Mistral | MistralAI-User | On-demand fetch | MistralAI-User | declared |
| Bytedance | Bytespider | Doubao training | Bytespider | RDNS |
Recommended robots.txt patterns
Block training-only bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
Allow live retrieval (these power citations)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
Allow everything
User-agent: *
Allow: /
Allow all retrieval/citation bots and decline training
Combine explicit per-vendor Disallow rules for GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, and Bytespider with Allow rules for the search/retrieval bots.
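Putting that together, a combined policy might look like the following sketch (bot names per the table above; adjust the list to your own policy):

```
# Decline training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval/citation bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Everyone else: default policy
User-agent: *
Allow: /
```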
Verifying crawler identity
User-agents are spoofable. Verify by:
- Checking the request IP against the vendor's published IP range list (OpenAI, Anthropic, Perplexity, Meta, Google publish theirs).
- Reverse DNS lookups with forward confirmation (e.g., Googlebot PTR hostnames end in googlebot.com or google.com; Bingbot's end in search.msn.com).
- Logging unverified hits as suspicious traffic.
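Both checks can be sketched in a few lines of Python. The CIDR block below is a placeholder (an RFC 5737 documentation range), not a real vendor range; in production, load the vendor's current published list instead of hardcoding anything:

```python
import ipaddress
import socket

# Placeholder CIDRs for illustration only -- fetch the vendor's
# current published list rather than hardcoding values like these.
VENDOR_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],  # RFC 5737 example block
}

def ip_in_published_ranges(ip: str, networks) -> bool:
    """True if the request IP falls inside one of the vendor's ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

def verify_rdns(ip: str, allowed_suffixes,
                reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                forward=socket.gethostbyname) -> bool:
    """Forward-confirmed reverse DNS: the PTR hostname must end in an
    allowed suffix AND resolve back to the same IP (defeats spoofed PTRs)."""
    try:
        hostname = reverse(ip)
        return hostname.endswith(tuple(allowed_suffixes)) and forward(hostname) == ip
    except OSError:
        return False
```

The `reverse`/`forward` hooks exist so the logic can be exercised without live DNS; in production the `socket` defaults apply.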
Common misconceptions
- "Blocking GPTBot blocks ChatGPT search." False. GPTBot is training; OAI-SearchBot and ChatGPT-User power search and live citations. Block them only if you do not want to be cited.
- "User-agents are reliable." False alone. Always combine with IP/RDNS verification.
- "There is one bot per vendor." False. OpenAI, Anthropic, and Perplexity each ship 2-3 distinct bots.
How to apply
- Audit current robots.txt and confirm citation bots are allowed.
- Decide on training-bot policy and codify it.
- Implement IP/RDNS verification for high-traffic endpoints.
- Re-check vendor-published IP ranges quarterly.
- Log AI bot traffic separately for visibility analytics.
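The last step can be sketched as a small log tally. The token list is drawn from the reference table above; the log lines are assumed to be generic combined-log text containing the User-Agent string:

```python
from collections import Counter

# Substring tokens for AI crawlers, per the reference table above.
AI_BOT_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider", "Meta-ExternalAgent",
)

def tally_ai_hits(log_lines):
    """Count access-log lines per AI bot (first matching token wins)."""
    counts = Counter()
    for line in log_lines:
        for token in AI_BOT_TOKENS:
            if token in line:
                counts[token] += 1
                break
    return counts
```

Feeding a day's access log through this gives a per-bot hit count you can chart separately from human and classic-crawler traffic.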
FAQ
Q: If I block GPTBot, do I lose ChatGPT citations?
No. GPTBot is training-only. Citations come via OAI-SearchBot and ChatGPT-User. Blocking GPTBot opts you out of training but preserves citations.
Q: Does Google have a separate AI crawler?
Google has no dedicated AI-search crawler. Google-Extended is a robots-token opt-out from Gemini training; standard Googlebot (plus GoogleOther for some fetches) still drives indexing and AI Overviews.
Q: Can I block Common Crawl?
Yes, with `User-agent: CCBot` / `Disallow: /`. Note that historical Common Crawl snapshots may already include your content; the block only prevents future inclusion.
Q: Are these user-agents stable?
Bot names are stable; version suffixes change over time. Match on the name prefix (e.g., `GPTBot` rather than `GPTBot/1.2`) in your robots.txt and analytics rules.
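For example, a match that tolerates version churn can be written as a regex over the stable names (this covers only a subset of the bots in the table; extend the alternation as needed):

```python
import re

# Stable bot names; the /x.y version suffix is allowed to vary.
BOT_RE = re.compile(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User"
    r"|PerplexityBot|Perplexity-User)/(\d+(?:\.\d+)*)"
)

def match_bot(user_agent: str):
    """Return (bot_name, version) if an AI bot token is present, else None."""
    m = BOT_RE.search(user_agent)
    return (m.group(1), m.group(2)) if m else None
```

Keying analytics on the returned name (not the full matched string) keeps dashboards stable when vendors bump versions.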
Q: What about Apple Intelligence?
Apple uses Applebot for indexing and offers Applebot-Extended as an AI-training opt-out token, similar to Google's pattern.
Related Articles
Agent Citation Attribution Specification: Verifiable Source Tracking for Autonomous AI Agents
Specification defining HTTP headers, provenance manifests, and chain-of-citation markup so autonomous AI agents produce verifiable citations to source content.
Browser Agent Crawl Etiquette: A Specification for Polite Autonomous AI Browsing
A specification defining how browser-based AI agents should identify themselves, throttle requests, and respect publisher signals to maintain citation trust.
Verified Agent Identity for Citation Trust: A Specification for Authenticated AI Crawlers
Specification for verified agent identity: how publishers authenticate AI crawlers via cryptographic signatures so citation trust survives spoofing.