HTTP Cache Headers for AI Crawlers

AI crawlers like GPTBot, ClaudeBot, PerplexityBot, and Google-Extended use HTTP cache headers — ETag, Last-Modified, and Cache-Control — to decide when to refresh content. Correct headers improve citation freshness and conserve crawl budget; misconfigured caches stall citations or serve stale facts to AI answers.

TL;DR

HTTP cache headers tell AI crawlers when content has changed and when to revisit. The four headers that matter most are ETag, Last-Modified, Cache-Control, and Vary. Major declared AI bots send If-Modified-Since and If-None-Match conditional requests and act on 304 Not Modified responses. Send strong validators on canonical content, align Cache-Control: max-age with the page's actual update cadence, and never serve Cache-Control: private on pages you want crawled. For the broader picture, see the Technical hub and Content Freshness Signals for AI Search.

Why HTTP cache headers matter for AI citation

AI search systems do not index pages once and forget them. They re-fetch on a cadence to keep training corpora and live retrieval indexes current. That fetch volume is large: Cloudflare reported in April 2026 that AI bot traffic now exceeds 10 billion requests per week across its network (Cloudflare blog).

Cache headers change three things AI bots care about:

  • Freshness signal. A clear Last-Modified or ETag lets the crawler skip re-downloading unchanged pages, freeing budget for changed pages.
  • Trust signal. Stable, validated cache metadata correlates with predictable origins. Engines treat predictable origins as lower-risk for citation.
  • Stale-citation risk. If a page changes but headers do not communicate the change, an AI cache may keep citing an outdated version for days or weeks.

Google's own crawler guidance specifically recommends ETag headers to reduce unnecessary refetching (Google Search Central, Dec 2024; Search Engine Journal). The same headers that help Googlebot help AI crawlers built on similar HTTP caching primitives.

Per-header reference

The contract for HTTP caches is defined by RFC 9111: HTTP Caching (June 2022, obsoletes RFC 7234). The headers below are the subset most relevant to AI crawler behavior.

| Header | Direction | What it does | AI crawler relevance |
| --- | --- | --- | --- |
| ETag | Response | Strong or weak validator identifying a specific representation | Enables If-None-Match revalidation; a stable ETag tells AI crawlers "nothing changed since you last saw this hash." |
| Last-Modified | Response | Wall-clock time the resource last changed | Pairs with If-Modified-Since; coarser than ETag but widely supported. |
| Cache-Control | Response | Directives controlling cacheability and lifetime | Sets max-age, s-maxage, must-revalidate, no-cache, private, public, stale-while-revalidate. Crawlers use this to choose between revalidating and refetching. |
| Vary | Response | Lists request headers that affect the response | Critical when serving different HTML based on User-Agent; missing Vary: User-Agent can cause CDNs to mix bot and human responses. |
| Age | Response | Seconds the resource has been in shared caches | Lets crawlers reason about origin freshness vs CDN freshness. |
| If-Modified-Since | Request | Conditional GET against Last-Modified | AI bots send this on revisit; expect 304 Not Modified when unchanged. |
| If-None-Match | Request | Conditional GET against ETag | Strong-validator equivalent; preferred when both are available. |
| Expires | Response | HTTP/1.0 absolute expiration date | Legacy; superseded by Cache-Control: max-age in modern stacks. |
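
To make the request/response pairing concrete, below is a minimal revalidation round trip in TypeScript (Node 18+, global fetch); the URL is a placeholder, and the validators come from whatever the first response returned.

// Sketch: revalidate the way a well-behaved crawler does. The first fetch
// captures the validators; the second sends them back as a conditional GET
// and expects 304 Not Modified if nothing has changed.
async function revalidate(url: string): Promise<void> {
  const first = await fetch(url);
  const etag = first.headers.get("etag");
  const lastModified = first.headers.get("last-modified");

  const conditional = await fetch(url, {
    headers: {
      ...(etag ? { "If-None-Match": etag } : {}),
      ...(lastModified ? { "If-Modified-Since": lastModified } : {}),
    },
  });

  // 304 means the cached copy is still valid and no body was re-downloaded.
  console.log(conditional.status === 304 ? "validators working" : `got ${conditional.status}`);
}

revalidate("https://example.com/evergreen-article"); // placeholder URL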

Cache-Control directives worth singling out for AI traffic:

  • public — explicitly allows shared caches (CDNs, AI fetch caches) to store the response.
  • private — disallows shared caching. Avoid on canonical content you want bots to crawl.
  • max-age=N — origin says "fresh for N seconds." Match this to your real publish cadence.
  • s-maxage=N — overrides max-age for shared caches only.
  • must-revalidate — once stale, the cache must revalidate; useful for facts you cannot risk serving outdated.
  • no-cache — must revalidate before each use; not the same as no-store.
  • no-store — bypass caches entirely; almost never appropriate for canonical content pages.
  • stale-while-revalidate=N — serve stale up to N seconds while a background refresh runs; helpful for fast TTFB without hurting freshness.
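
As a rough illustration of how these directives interact, the TypeScript sketch below mimics the freshness decision a shared cache (or a crawler's fetch cache) makes. It is a deliberate simplification of RFC 9111 (no heuristic freshness, no Vary handling, no request directives), using only the directive names described above.

type Freshness = "fresh" | "revalidate" | "do-not-store";

// Simplified freshness decision for a shared cache, loosely following RFC 9111.
// `ageSeconds` is how long the response has been cached; `cacheControl` is the
// response's Cache-Control value.
function decide(cacheControl: string, ageSeconds: number): Freshness {
  const directives = new Map<string, string | true>();
  for (const part of cacheControl.toLowerCase().split(",")) {
    const [name, value] = part.trim().split("=");
    if (name) directives.set(name, value ?? true);
  }

  if (directives.has("no-store") || directives.has("private")) return "do-not-store";
  if (directives.has("no-cache")) return "revalidate"; // stored, but revalidated before every use

  // For shared caches, s-maxage overrides max-age.
  const ttl = Number(directives.get("s-maxage") ?? directives.get("max-age") ?? 0);
  if (ageSeconds < ttl) return "fresh";

  // stale-while-revalidate lets the cache keep serving while it refreshes,
  // unless must-revalidate forbids serving stale at all.
  const swr = Number(directives.get("stale-while-revalidate") ?? 0);
  if (!directives.has("must-revalidate") && ageSeconds < ttl + swr) return "fresh";

  return "revalidate";
}

console.log(decide("public, max-age=86400, stale-while-revalidate=43200", 90000)); // "fresh" (inside the SWR window)
console.log(decide("private, max-age=600", 0)); // "do-not-store" for a shared cache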

How major AI crawlers handle cache headers

Public documentation and observed behavior across the four most-cited AI crawlers:

| Crawler | Honors conditional requests | Behavior on 304 | Refetch cadence | Source |
| --- | --- | --- | --- | --- |
| GPTBot (OpenAI) | Yes, observed in production logs | Skips re-download; queues next visit | Variable; commonly hours-to-days for high-priority content | OpenAI crawler docs |
| ClaudeBot (Anthropic) | Yes for ClaudeBot; Claude-User is user-triggered and bypasses caching signals | Skips re-download | ClaudeBot can be aggressive on first discovery; revisit cadence depends on robots.txt and server signals | Anthropic crawler docs; Search Engine Land |
| PerplexityBot | Yes for the indexer | Skips re-download | Changes are typically reflected within ~24 hours per Perplexity docs | Perplexity crawler docs |
| Google-Extended | Yes; uses Google's standard caching infrastructure | Skips re-download | Inherits Googlebot caching primitives, including ETag-first guidance | Google Search Central |

Two important nuances:

  • User-triggered fetchers behave differently. Perplexity-User, Claude-User, OAI-SearchBot, and similar user-action fetchers usually fire on demand and can ignore caching signals because the user is waiting for a real-time answer. Optimize for those by keeping origin response time fast and avoiding private or no-store on public pages.
  • Stealth crawlers exist. Cloudflare publicly accused Perplexity of using undeclared crawlers in August 2025 (Cloudflare blog). Treat documented behavior as a useful baseline, not a guarantee.

CDN gotchas

Cache headers are interpreted by every layer between origin and bot. Common pitfalls per CDN:

Cloudflare.

  • Free and Pro plans cache static assets aggressively but bypass HTML by default. Set up a Cache Rule to cache HTML for AI bots if your TTFB is high.
  • AI Crawl Control and managed robots.txt features can override or augment your origin headers. Verify the rendered headers, not just the origin's.
  • The cf-cache-status response header tells you whether Cloudflare returned HIT, MISS, EXPIRED, or REVALIDATED — useful for debugging crawler behavior (Cloudflare AI Gateway caching docs).
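
A quick way to see what the edge actually hands to crawlers is to fetch the page and read the response headers directly; a minimal TypeScript sketch (Node 18+, global fetch, placeholder URL):

// Sketch: inspect what the edge returns, not just what the origin sends.
// cf-cache-status is Cloudflare-specific; the others are standard headers.
async function inspectEdge(url: string): Promise<void> {
  const res = await fetch(url);
  for (const name of ["cf-cache-status", "cache-control", "age", "etag", "last-modified", "vary"]) {
    console.log(`${name}: ${res.headers.get(name) ?? "(not set)"}`);
  }
}

inspectEdge("https://example.com/article"); // placeholder URL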

Fastly.

  • Surrogate-Control and s-maxage rules on the edge can shadow Cache-Control for downstream caches.
  • VCL transformations can strip ETag. Confirm your final response retains the validator.

Vercel.

  • The s-maxage plus stale-while-revalidate pattern is idiomatic for incremental static regeneration but may yield identical ETags across revalidations; use a content hash to make ETags meaningful for change detection (see the sketch after this list).
  • Edge functions sometimes return different responses per region — pair with a Vary header where appropriate.
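
In practice, a content-hash ETag just means hashing the bytes you are about to send and answering 304 when the client already holds that hash. The sketch below is framework-agnostic, using only Node's built-in http and crypto modules; renderPage() and the port are placeholders, not any particular CDN or framework API.

import { createHash } from "node:crypto";
import { createServer } from "node:http";

// Placeholder for however the page HTML is actually produced.
function renderPage(): string {
  return "<html><body>canonical content</body></html>";
}

createServer((req, res) => {
  const body = renderPage();
  // Strong ETag derived from content, so it only changes when the bytes change.
  const etag = `"${createHash("sha256").update(body).digest("hex").slice(0, 32)}"`;

  res.setHeader("ETag", etag);
  res.setHeader("Cache-Control", "public, max-age=86400, stale-while-revalidate=43200");
  res.setHeader("Vary", "Accept-Encoding");

  // Honor conditional requests: an unchanged hash means 304 and no body.
  if (req.headers["if-none-match"] === etag) {
    res.statusCode = 304;
    res.end();
    return;
  }

  res.setHeader("Content-Type", "text/html; charset=utf-8");
  res.end(body);
}).listen(3000);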

Header recipes by update cadence

Pick a recipe by how often the page actually changes.

Evergreen reference content (changes monthly or less)

Cache-Control: public, max-age=86400, stale-while-revalidate=43200
ETag: "<content-hash>"
Last-Modified: <RFC 1123 date>
Vary: Accept-Encoding

Behavior: shared caches serve fresh for 24 hours, stale-but-served for an additional 12 hours while revalidating. AI bots revalidate cheaply with If-None-Match.

News and frequently-updated articles (changes daily or hourly)

Cache-Control: public, max-age=300, stale-while-revalidate=600
ETag: "<content-hash>"
Last-Modified: <RFC 1123 date>
Vary: Accept-Encoding, User-Agent

Behavior: short freshness window keeps stale citations rare; SWR keeps TTFB low under bot bursts. Vary: User-Agent matters if you serve different bot vs human variants.

API and JSON endpoints used by AI agents

Cache-Control: public, max-age=60, must-revalidate
ETag: "<entity-hash>"
Vary: Accept, Accept-Encoding, Authorization

Behavior: short TTL for accuracy; Vary: Authorization prevents leaking authorized responses to anonymous bots; ETags let agents poll cheaply.

Pages you want excluded from AI training but allowed for retrieval

Combine HTTP headers with a robots.txt and ai.txt strategy. Headers alone do not opt out of training — see the misconfiguration section below.

How stale content hurts AI citation freshness

When an AI engine has a stale snapshot of your page, three failure modes appear in production:

  1. Outdated factual citation. The AI quotes a now-wrong number or claim. Public corrections do not propagate until the next successful refetch.
  2. Citation churn. Engines drop sources whose freshness scores fall below a threshold and replace them with newer competitors.
  3. Wasted crawl budget. Without ETag or Last-Modified, every revisit downloads the full page. High-volume bots back off, slowing future updates.

The fix is rarely about adding more headers — it is about adding correct validators and aligning max-age with how often the underlying content actually changes.

Common misconfigurations

  • No ETag and no Last-Modified. Every crawl is a full download. Remediate at the framework layer (Next.js, Astro, Hugo, custom server). Hash content into a strong ETag.
  • Cache-Control: no-store on canonical HTML. Often inherited from auth-gated paths. Audit which routes carry it.
  • Vary: * or missing Vary. Vary: * defeats shared caching entirely; missing Vary: User-Agent when you serve a bot-only HTML variant causes humans and bots to swap responses.
  • Mismatched ETag across CDN PoPs. Different ETags for the same resource inflate cache misses. Fix by deriving the ETag from content, not from time.
  • Confusing private with no-cache. private blocks shared caches (which is what AI crawlers effectively are); no-cache requires revalidation but allows shared storage.
  • Treating cache headers as access control. Cache headers do not opt content out of AI training. Use robots.txt, ai.txt, and platform-specific opt-outs (for example, User-agent: GPTBot followed by Disallow: /).

How to apply

  1. Audit current headers on 20-30 canonical URLs. Use curl -I or redbot.org to inspect.
  2. Pick a recipe per content type: evergreen, news, or API.
  3. Implement strong ETags from content hashes; do not rely on inode-based defaults.
  4. Set Cache-Control: max-age to your real update cadence — actual, not aspirational.
  5. Add Vary: Accept-Encoding everywhere; add Vary: User-Agent only when you serve UA-specific HTML.
  6. Verify with a conditional GET: curl -H 'If-None-Match: "..."' should return 304.
  7. Track AI crawler hits in logs by user-agent; watch for 200 rates dropping as 304 rates rise after deploy (a log-analysis sketch follows this list).
  8. Re-audit quarterly; AI crawler behavior evolves as platforms publish new bot versions.

Decision flow for a single page

flowchart TD
  A["New or updated page"] --> B{"Update cadence?"}
  B -->|Monthly or less| C["Evergreen recipe<br/>max-age=86400 + SWR"]
  B -->|Daily or hourly| D["News recipe<br/>max-age=300 + SWR"]
  B -->|On every request| E["API recipe<br/>max-age=60 + must-revalidate"]
  C --> F{"Strong ETag set?"}
  D --> F
  E --> F
  F -->|No| G["Add content-hash ETag"]
  F -->|Yes| H["Verify 304 with conditional GET"]
  G --> H
  H --> I["Monitor 304 vs 200 ratio in logs"]

FAQ

Q: Do AI crawlers actually honor Cache-Control: max-age?

A: Major declared crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) honor standard HTTP caching primitives including max-age and conditional requests, per their public documentation. Stealth or undeclared crawlers may not — Cloudflare publicly documented Perplexity using stealth crawlers in August 2025.

Q: Should I send strong or weak ETags?

A: Strong validators (no W/ prefix) are preferred for canonical HTML when content is byte-identical across revisions. Weak ETags (W/"...") are appropriate when content is semantically equivalent but byte-different (for example, trivial whitespace changes). Per RFC 9110, weak ETags work for revalidation but not for range requests.

Q: What happens on 304 Not Modified for AI crawlers?

A: The bot treats the response as confirmation that its cached copy is current. The full body is not re-downloaded, the bot's freshness metadata for the URL is updated, and the next refetch is scheduled later in the queue.

Q: Does setting Cache-Control: no-store block AI training?

A: No. no-store only instructs caches not to retain the response; it does not instruct AI vendors to exclude the content from training. To opt out of training, use User-agent: GPTBot and Disallow: / (or platform equivalents) in robots.txt and a documented ai.txt.

Q: How long until cache header changes propagate to AI bots?

A: Origin changes are visible immediately to a bot's next request, but bots do not recrawl on demand. Typical visibility windows are minutes-to-hours for high-priority pages on major bots, around 24 hours for Perplexity per its docs, and days for low-priority pages on any bot.

Q: Does Vary: User-Agent hurt CDN cache efficiency?

A: Yes, in proportion to how many user-agent variants you serve. Use it only when you actually serve different HTML to bots vs humans. For most sites, Vary: Accept-Encoding is sufficient.

Related Articles

What Is GEO? Generative Engine Optimization Defined

GEO (Generative Engine Optimization) is the practice of structuring content so AI search engines retrieve, understand, synthesize, and cite it in generated answers.

robots.txt for AI Crawlers

How to configure robots.txt to control AI crawlers — GPTBot, PerplexityBot, Google-Extended, ClaudeBot, Applebot-Extended, and the rest — across training and retrieval use cases.
