HTTP Cache Headers for AI Crawlers

AI crawlers like GPTBot, ClaudeBot, PerplexityBot, and Google-Extended use HTTP cache headers — ETag, Last-Modified, and Cache-Control — to decide when to refresh content. Correct headers improve citation freshness and conserve crawl budget; misconfigured caches stall citations or serve stale facts to AI answers.

TL;DR

HTTP cache headers tell AI crawlers when content has changed and when to revisit. The four headers that matter most are ETag, Last-Modified, Cache-Control, and Vary. Major declared AI bots send If-Modified-Since and If-None-Match conditional requests and act on 304 Not Modified responses. Send strong validators on canonical content, align Cache-Control: max-age with the page's actual update cadence, and never serve Cache-Control: private on pages you want crawled. For the broader picture, see the Technical hub and Content Freshness Signals for AI Search.

Why HTTP cache headers matter for AI citation

AI search systems do not index pages once and forget them. They re-fetch on a cadence to keep training corpora and live retrieval indexes current. That fetch volume is large: Cloudflare reported in April 2026 that AI bot traffic now exceeds 10 billion requests per week across its network (Cloudflare blog).

Cache headers change three things AI bots care about:

  • Freshness signal. A clear Last-Modified or ETag lets the crawler skip re-downloading unchanged pages, freeing budget for changed pages.
  • Trust signal. Stable, validated cache metadata correlates with predictable origins. Engines treat predictable origins as lower-risk for citation.
  • Stale-citation risk. If a page changes but headers do not communicate the change, an AI cache may keep citing an outdated version for days or weeks.

Google's own crawler guidance specifically recommends ETag headers to reduce unnecessary refetching (Google Search Central, Dec 2024; Search Engine Journal). The same headers that help Googlebot help AI crawlers built on similar HTTP caching primitives.

Per-header reference

The contract for HTTP caches is defined by RFC 9111: HTTP Caching (June 2022, obsoletes RFC 7234). The headers below are the subset most relevant to AI crawler behavior.

| Header | Direction | What it does | AI crawler relevance |
| --- | --- | --- | --- |
| ETag | Response | Strong or weak validator identifying a specific representation | Enables If-None-Match revalidation; a stable ETag tells AI crawlers "nothing changed since you last saw this hash." |
| Last-Modified | Response | Wall-clock time the resource last changed | Pairs with If-Modified-Since; coarser than ETag but widely supported. |
| Cache-Control | Response | Directives controlling cacheability and lifetime | Sets max-age, s-maxage, must-revalidate, no-cache, private, public, stale-while-revalidate. Crawlers use this to choose between revalidating and refetching. |
| Vary | Response | Lists request headers that affect the response | Critical when serving different HTML based on User-Agent; missing Vary: User-Agent can cause CDNs to mix bot and human responses. |
| Age | Response | Seconds the resource has been in shared caches | Lets crawlers reason about origin freshness vs CDN freshness. |
| If-Modified-Since | Request | Conditional GET against Last-Modified | AI bots send this on revisit; expect 304 Not Modified when unchanged. |
| If-None-Match | Request | Conditional GET against ETag | Strong-validator equivalent; preferred when both are available. |
| Expires | Response | HTTP/1.0 absolute expiration date | Legacy; superseded by Cache-Control: max-age in modern stacks. |
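
To make the request/response pairing concrete, below is a minimal revalidation round trip in TypeScript (Node 18+, global fetch); the URL is a placeholder, and the validators come from whatever the first response returned.

// Sketch: revalidate the way a well-behaved crawler does. The first fetch
// captures the validators; the second sends them back as a conditional GET
// and expects 304 Not Modified if nothing has changed.
async function revalidate(url: string): Promise<void> {
  const first = await fetch(url);
  const etag = first.headers.get("etag");
  const lastModified = first.headers.get("last-modified");

  const conditional = await fetch(url, {
    headers: {
      ...(etag ? { "If-None-Match": etag } : {}),
      ...(lastModified ? { "If-Modified-Since": lastModified } : {}),
    },
  });

  // 304 means the cached copy is still valid and no body was re-downloaded.
  console.log(conditional.status === 304 ? "validators working" : `got ${conditional.status}`);
}

revalidate("https://example.com/evergreen-article"); // placeholder URL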

Cache-Control directives worth singling out for AI traffic:

  • public — explicitly allows shared caches (CDNs, AI fetch caches) to store the response.
  • private — disallows shared caching. Avoid on canonical content you want bots to crawl.
  • max-age=N — origin says "fresh for N seconds." Match this to your real publish cadence.
  • s-maxage=N — overrides max-age for shared caches only.
  • must-revalidate — once stale, the cache must revalidate; useful for facts you cannot risk serving outdated.
  • no-cache — must revalidate before each use; not the same as no-store.
  • no-store — bypass caches entirely; almost never appropriate for canonical content pages.
  • stale-while-revalidate=N — serve stale up to N seconds while a background refresh runs; helpful for fast TTFB without hurting freshness.
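
As a rough illustration of how these directives interact, the TypeScript sketch below mimics the freshness decision a shared cache (or a crawler's fetch cache) makes. It is a deliberate simplification of RFC 9111 (no heuristic freshness, no Vary handling, no request directives), using only the directive names described above.

type Freshness = "fresh" | "revalidate" | "do-not-store";

// Simplified freshness decision for a shared cache, loosely following RFC 9111.
// `ageSeconds` is how long the response has been cached; `cacheControl` is the
// response's Cache-Control value.
function decide(cacheControl: string, ageSeconds: number): Freshness {
  const directives = new Map<string, string | true>();
  for (const part of cacheControl.toLowerCase().split(",")) {
    const [name, value] = part.trim().split("=");
    if (name) directives.set(name, value ?? true);
  }

  if (directives.has("no-store") || directives.has("private")) return "do-not-store";
  if (directives.has("no-cache")) return "revalidate"; // stored, but revalidated before every use

  // For shared caches, s-maxage overrides max-age.
  const ttl = Number(directives.get("s-maxage") ?? directives.get("max-age") ?? 0);
  if (ageSeconds < ttl) return "fresh";

  // stale-while-revalidate lets the cache keep serving while it refreshes,
  // unless must-revalidate forbids serving stale at all.
  const swr = Number(directives.get("stale-while-revalidate") ?? 0);
  if (!directives.has("must-revalidate") && ageSeconds < ttl + swr) return "fresh";

  return "revalidate";
}

console.log(decide("public, max-age=86400, stale-while-revalidate=43200", 90000)); // "fresh" (inside the SWR window)
console.log(decide("private, max-age=600", 0)); // "do-not-store" for a shared cache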

How major AI crawlers handle cache headers

Public documentation and observed behavior across the four most-cited AI crawlers:

| Crawler | Honors conditional requests | Behavior on 304 | Refetch cadence | Source |
| --- | --- | --- | --- | --- |
| GPTBot (OpenAI) | Yes, observed in production logs | Skips re-download; queues next visit | Variable; commonly hours-to-days for high-priority content | OpenAI crawler docs |
| ClaudeBot (Anthropic) | Yes for ClaudeBot; Claude-User is user-triggered and bypasses caching signals | Skips re-download | ClaudeBot can be aggressive on first discovery; revisit cadence depends on robots.txt and server signals | Anthropic crawler docs; Search Engine Land |
| PerplexityBot | Yes for the indexer | Skips re-download | Changes are typically reflected within ~24 hours per Perplexity docs | Perplexity crawler docs |
| Google-Extended | Yes; uses Google's standard caching infrastructure | Skips re-download | Inherits Googlebot caching primitives, including ETag-first guidance | Google Search Central |

Two important nuances:

  • User-triggered fetchers behave differently. Perplexity-User, Claude-User, OAI-SearchBot, and similar user-action fetchers usually fire on demand and can ignore caching signals because the user is waiting for a real-time answer. Optimize for those by keeping origin response time fast and avoiding private or no-store on public pages.
  • Stealth crawlers exist. Cloudflare publicly accused Perplexity of using undeclared crawlers in August 2025 (Cloudflare blog). Treat documented behavior as a useful baseline, not a guarantee.

CDN gotchas

Cache headers are interpreted by every layer between origin and bot. Common pitfalls per CDN:

Cloudflare.

  • Free and Pro plans cache static assets aggressively but bypass HTML by default. Set up a Cache Rule to cache HTML for AI bots if your TTFB is high.
  • AI Crawl Control and managed robots.txt features can override or augment your origin headers. Verify the rendered headers, not just the origin's.
  • The cf-cache-status response header tells you whether Cloudflare returned HIT, MISS, EXPIRED, or REVALIDATED — useful for debugging crawler behavior (Cloudflare AI Gateway caching docs).
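
A quick way to see what the edge actually hands to crawlers is to fetch the page and read the response headers directly; a minimal TypeScript sketch (Node 18+, global fetch, placeholder URL):

// Sketch: inspect what the edge returns, not just what the origin sends.
// cf-cache-status is Cloudflare-specific; the others are standard headers.
async function inspectEdge(url: string): Promise<void> {
  const res = await fetch(url);
  for (const name of ["cf-cache-status", "cache-control", "age", "etag", "last-modified", "vary"]) {
    console.log(`${name}: ${res.headers.get(name) ?? "(not set)"}`);
  }
}

inspectEdge("https://example.com/article"); // placeholder URL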

Fastly.

  • Surrogate-Control and s-maxage rules on the edge can shadow Cache-Control for downstream caches.
  • VCL transformations can strip ETag. Confirm your final response retains the validator.

Vercel.

  • The s-maxage plus stale-while-revalidate pattern is idiomatic for incremental static regeneration but may yield identical ETags across revalidations; use a content hash to make ETags meaningful for change detection (see the sketch after this list).
  • Edge functions sometimes return different responses per region — pair with a Vary header where appropriate.
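
In practice, a content-hash ETag just means hashing the bytes you are about to send and answering 304 when the client already holds that hash. The sketch below is framework-agnostic, using only Node's built-in http and crypto modules; renderPage() and the port are placeholders, not any particular CDN or framework API.

import { createHash } from "node:crypto";
import { createServer } from "node:http";

// Placeholder for however the page HTML is actually produced.
function renderPage(): string {
  return "<html><body>canonical content</body></html>";
}

createServer((req, res) => {
  const body = renderPage();
  // Strong ETag derived from content, so it only changes when the bytes change.
  const etag = `"${createHash("sha256").update(body).digest("hex").slice(0, 32)}"`;

  res.setHeader("ETag", etag);
  res.setHeader("Cache-Control", "public, max-age=86400, stale-while-revalidate=43200");
  res.setHeader("Vary", "Accept-Encoding");

  // Honor conditional requests: an unchanged hash means 304 and no body.
  if (req.headers["if-none-match"] === etag) {
    res.statusCode = 304;
    res.end();
    return;
  }

  res.setHeader("Content-Type", "text/html; charset=utf-8");
  res.end(body);
}).listen(3000);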

Header recipes by update cadence

Pick a recipe by how often the page actually changes.

Evergreen reference content (changes monthly or less)

Cache-Control: public, max-age=86400, stale-while-revalidate=43200
ETag: "<content-hash>"
Last-Modified: <RFC 1123 date>
Vary: Accept-Encoding

Behavior: shared caches serve fresh for 24 hours, stale-but-served for an additional 12 hours while revalidating. AI bots revalidate cheaply with If-None-Match.

News and frequently-updated articles (changes daily or hourly)

Cache-Control: public, max-age=300, stale-while-revalidate=600
ETag: "<content-hash>"
Last-Modified: <RFC 1123 date>
Vary: Accept-Encoding, User-Agent

Behavior: short freshness window keeps stale citations rare; SWR keeps TTFB low under bot bursts. Vary: User-Agent matters if you serve different bot vs human variants.

API and JSON endpoints used by AI agents

Cache-Control: public, max-age=60, must-revalidate
ETag: "<entity-hash>"
Vary: Accept, Accept-Encoding, Authorization

Behavior: short TTL for accuracy; Vary: Authorization prevents leaking authorized responses to anonymous bots; ETags let agents poll cheaply.

Pages you want excluded from AI training but allowed for retrieval

Combine HTTP headers with a robots.txt and ai.txt strategy. Headers alone do not opt out of training — see the misconfiguration section below.

How stale content hurts AI citation freshness

When an AI engine has a stale snapshot of your page, three failure modes appear in production:

  1. Outdated factual citation. The AI quotes a now-wrong number or claim. Public corrections do not propagate until the next successful refetch.
  2. Citation churn. Engines drop sources whose freshness scores fall below a threshold and replace them with newer competitors.
  3. Wasted crawl budget. Without ETag or Last-Modified, every revisit downloads the full page. High-volume bots back off, slowing future updates.

The fix is rarely about adding more headers — it is about adding correct validators and aligning max-age with how often the underlying content actually changes.

Common misconfigurations

  • No ETag and no Last-Modified. Every crawl is a full download. Remediate at the framework layer (Next.js, Astro, Hugo, custom server). Hash content into a strong ETag.
  • Cache-Control: no-store on canonical HTML. Often inherited from auth-gated paths. Audit which routes carry it.
  • Vary: * or missing Vary. Vary: * defeats shared caching entirely; missing Vary: User-Agent when you serve a bot-only HTML variant causes humans and bots to swap responses.
  • Mismatched ETag across CDN PoPs. Different ETags for the same resource inflate cache misses. Fix by deriving the ETag from content, not from time.
  • Confusing private with no-cache. private blocks shared caches (which is what AI crawlers effectively are); no-cache requires revalidation but allows shared storage.
  • Treating cache headers as access control. Cache headers do not opt content out of AI training. Use robots.txt, ai.txt, and platform-specific opt-outs (for example, User-agent: GPTBot followed by Disallow: /).

How to apply

  1. Audit current headers on 20-30 canonical URLs. Use curl -I or redbot.org to inspect.
  2. Pick a recipe per content type: evergreen, news, or API.
  3. Implement strong ETags from content hashes; do not rely on inode-based defaults.
  4. Set Cache-Control: max-age to your real update cadence — actual, not aspirational.
  5. Add Vary: Accept-Encoding everywhere; add Vary: User-Agent only when you serve UA-specific HTML.
  6. Verify with a conditional GET: curl -H 'If-None-Match: "..."' should return 304.
  7. Track AI crawler hits in logs by user-agent; watch for 200 rates dropping as 304 rates rise after deploy (a log-analysis sketch follows this list).
  8. Re-audit quarterly; AI crawler behavior evolves as platforms publish new bot versions.

Decision flow for a single page

flowchart TD
  A["New or updated page"] --> B{"Update cadence?"}
  B -->|Monthly or less| C["Evergreen recipe<br/>max-age=86400 + SWR"]
  B -->|Daily or hourly| D["News recipe<br/>max-age=300 + SWR"]
  B -->|On every request| E["API recipe<br/>max-age=60 + must-revalidate"]
  C --> F{"Strong ETag set?"}
  D --> F
  E --> F
  F -->|No| G["Add content-hash ETag"]
  F -->|Yes| H["Verify 304 with conditional GET"]
  G --> H
  H --> I["Monitor 304 vs 200 ratio in logs"]

FAQ

Q: Do AI crawlers actually honor Cache-Control: max-age?

A: Major declared crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) honor standard HTTP caching primitives including max-age and conditional requests, per their public documentation. Stealth or undeclared crawlers may not — Cloudflare publicly documented Perplexity using stealth crawlers in August 2025.

Q: Should I send strong or weak ETags?

A: Strong validators (no W/ prefix) are preferred for canonical HTML when content is byte-identical across revisions. Weak ETags (W/"...") are appropriate when content is semantically equivalent but byte-different (for example, trivial whitespace changes). Per RFC 9110, weak ETags work for revalidation but not for range requests.

Q: What happens on 304 Not Modified for AI crawlers?

A: The bot treats the response as confirmation that its cached copy is current. The full body is not re-downloaded, the bot's freshness metadata for the URL is updated, and the next refetch is scheduled later in the queue.

Q: Does setting Cache-Control: no-store block AI training?

A: No. no-store only instructs caches not to retain the response; it does not instruct AI vendors to exclude the content from training. To opt out of training, use User-agent: GPTBot and Disallow: / (or platform equivalents) in robots.txt and a documented ai.txt.

Q: How long until cache header changes propagate to AI bots?

A: Origin changes are visible immediately to a bot's next request, but bots do not recrawl on demand. Typical visibility windows are minutes-to-hours for high-priority pages on major bots, around 24 hours for Perplexity per its docs, and days for low-priority pages on any bot.

Q: Does Vary: User-Agent hurt CDN cache efficiency?

A: Yes, in proportion to how many user-agent variants you serve. Use it only when you actually serve different HTML to bots vs humans. For most sites, Vary: Accept-Encoding is sufficient.

Related Articles

What Is GEO? Generative Engine Optimization Defined

GEO (Generative Engine Optimization) is the practice of structuring content so AI search engines retrieve, understand, synthesize, and cite it in generated answers.

robots.txt for AI Crawlers

How to configure robots.txt to control AI crawlers — GPTBot, PerplexityBot, Google-Extended, ClaudeBot, Applebot-Extended, and the rest — across training and retrieval use cases.
