AI Crawler Content Negotiation Specification

Use HTTP proactive content negotiation per RFC 9110 to serve LLM-friendly variants of your content (markdown, JSON-LD, plain JSON) when crawlers ask for them via the Accept header. Never branch on User-Agent alone, always set Vary: Accept, and keep content semantically equivalent across variants so you don't trigger cloaking penalties.

TL;DR

Most AI crawlers fetch HTML by default, but a growing set (Anthropic's ClaudeBot, markdown-first fetchers, agent-driven scrapers) send Accept: text/markdown or Accept: application/ld+json to skip the rendering step. Honour those headers and you save them parsing time and yourself bandwidth. Get the spec wrong and you either ignore the signal (missed efficiency) or cloak (SEO penalty). RFC 9110 defines the rules; this spec applies them to the AI-search context.

Why this matters

When an AI engine cites a page, the cited text was extracted somewhere in the engine's pipeline — often by stripping HTML, removing chrome, and converting to plain text or markdown. If you can serve a clean markdown variant directly, you (a) skip the engine's HTML-to-text conversion, (b) reduce extraction errors, and (c) signal that you're a structured-data-friendly publisher.

The risk is cloaking: serving substantially different content to bots than to humans, which Google and Bing have penalised for years and AI engines now also flag as low-trust. Content negotiation is not cloaking when the variants are semantically equivalent (the markdown variant is a faithful representation of the HTML). It is cloaking when the variants disagree.

Specification scope

This spec defines:

  1. Negotiable dimensions — what to negotiate (format, language, encoding).
  2. Selection rules — how to pick a variant.
  3. Cache safety — Vary header rules.
  4. Cloaking boundary — what is and isn't allowed.
  5. Fallback behaviour — when no acceptable variant exists.

Non-goals: language detection beyond Accept-Language, IP-geolocation routing, DRM/paywall negotiation.

1. Negotiable dimensions

1.1 Format (Accept)

Support these MIME types as variants of every article URL:

MIME type             Use
text/html             Default for browsers and most crawlers.
text/markdown         LLM-friendly text-only variant. Strip nav, ads, related posts.
application/ld+json   Pure structured-data variant for JSON-LD-only consumers.
application/json      Page-data API for headless / RSS-style consumers.
text/plain            Last-resort plain text.

Branch on the request's Accept header per RFC 9110 § 12. If the client sends Accept: text/markdown, text/html;q=0.5, return markdown. If the client sends Accept: */* or no Accept header, default to text/html.
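A minimal sketch of this branching in a Flask view, using werkzeug's built-in Accept parsing. The route shape and the stand-in renderer are illustrative assumptions, not part of this spec:

from flask import Flask, Response, request

app = Flask(__name__)

# Server preference order; earlier entries win q-value ties.
VARIANTS = ["text/html", "text/markdown", "application/ld+json",
            "application/json", "text/plain"]

def render_variant(slug, media_type):
    # Hypothetical single-source renderer; a real one would read the
    # article store and emit the requested representation.
    return f"({media_type} representation of {slug})"

@app.get("/articles/<slug>")
def article(slug):
    # best_match applies RFC 9110 q-value semantics; with no Accept
    # header it falls back to the default, text/html.
    chosen = request.accept_mimetypes.best_match(VARIANTS, default="text/html")
    resp = Response(render_variant(slug, chosen), mimetype=chosen)
    resp.headers["Vary"] = "Accept"
    return resp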

1.2 Language (Accept-Language)

Support per-locale variants by branching on Accept-Language. Honour the highest-quality match present in your translation set; fall back to the default locale otherwise. Always emit Content-Language on the response so caches and downstream proxies see the resolved locale.
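A sketch of the same matching for Accept-Language, again via werkzeug's parser; the supported locale set here is an assumption:

from werkzeug.datastructures import LanguageAccept
from werkzeug.http import parse_accept_header

def pick_locale(accept_language, supported=("en", "de", "fr"), default="en"):
    # Parse the raw header into (tag, q) pairs, then take the best match.
    accept = parse_accept_header(accept_language, LanguageAccept)
    return accept.best_match(list(supported), default=default)

print(pick_locale("de, en;q=0.7"))   # "de"
print(pick_locale("pt-BR"))          # "en" (falls back to the default locale)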

1.3 Encoding (Accept-Encoding)

Negotiate compression (gzip, br, zstd) per the RFC. AI crawlers handle gzip universally and Brotli widely; zstd is rare — don't make it the only option.
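Compression is usually delegated to the web server or CDN, but the decision itself is small. A minimal sketch, assuming the optional third-party brotli package for br (zstd omitted, per the advice above):

import gzip

def compress_body(body: bytes, accept_encoding: str):
    # Naive token check; a production version would parse q-values with
    # the same algorithm used for format negotiation.
    offered = {t.split(";")[0].strip() for t in accept_encoding.lower().split(",")}
    if "br" in offered:
        try:
            import brotli  # third-party package, an assumption here
            return brotli.compress(body), "br"
        except ImportError:
            pass  # brotli not installed: fall through to gzip
    if "gzip" in offered:
        return gzip.compress(body), "gzip"
    return body, "identity"

body, coding = compress_body(b"# Title\n", "br, gzip;q=0.8")  # -> br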

2. Selection rules

client sends:

GET /articles/answer-grounding HTTP/2
Accept: text/markdown, text/html;q=0.5
Accept-Language: de, en;q=0.7
Accept-Encoding: br, gzip;q=0.8

server decides:

format = text/markdown (q=1 > 0.5)
language = de (de present in the translation set)
encoding = br (highest-preference supported coding)

response:

HTTP/2 200 OK
Content-Type: text/markdown; charset=utf-8
Content-Language: de
Content-Encoding: br
Vary: Accept, Accept-Language, Accept-Encoding
Link: <https://example.com/articles/answer-grounding>; rel="canonical"

Follow RFC 9110's quality-value algorithm: highest q wins; ties broken by server preference. If no variant matches, return 406 Not Acceptable only as a last resort; many real-world clients handle 406 poorly. Most servers fall back to text/html and document the behaviour.
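A dependency-free sketch of that selection rule, including the lenient fallback. This is a simplification: q-values only, no Accept extension parameters, and wildcards limited to type/* and */*:

def negotiate(accept_header, supported):
    """Pick from `supported` (ordered by server preference) per q-values."""
    if not accept_header:
        return supported[0]                  # no Accept header: serve the default
    prefs = {}
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip().lower()
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media] = q
    best, best_q = None, 0.0
    for media in supported:                  # earlier entry wins on q ties
        wildcard = media.split("/")[0] + "/*"
        q = prefs.get(media, prefs.get(wildcard, prefs.get("*/*", 0.0)))
        if q > best_q:
            best, best_q = media, q
    return best or supported[0]              # lenient mode: fall back to default

VARIANTS = ["text/html", "text/markdown", "application/ld+json"]
print(negotiate("text/markdown, text/html;q=0.5", VARIANTS))  # text/markdown
print(negotiate("image/avif", VARIANTS))                      # text/html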

3. Cache safety

Vary is non-negotiable. Without it, a CDN may serve a markdown variant to a browser or vice versa. Set:

Vary: Accept, Accept-Language, Accept-Encoding

Keep the Vary set minimal — every header included multiplies cache key cardinality. Do not include User-Agent in Vary (see § 4 below). For high-traffic origins, consider normalising Accept at the CDN edge so text/markdown, text/html;q=0.5 and text/html;q=0.5, text/markdown map to the same cache key.
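A sketch of that normalisation, mapping any Accept header onto a small fixed key space. In practice this runs in CDN edge code (a worker or VCL); it is shown here as a plain function, and it deliberately ignores q-values, trading a little negotiation fidelity for cache hit rate:

def normalize_accept(accept_header,
                     known=("text/markdown", "application/ld+json",
                            "application/json", "text/plain")):
    # Collapse the header to the set of known non-HTML variants it names,
    # in a fixed order, so equivalent headers share one cache key.
    if not accept_header:
        return "text/html"
    named = {p.split(";")[0].strip().lower() for p in accept_header.split(",")}
    hits = [m for m in known if m in named]
    return "+".join(hits) if hits else "text/html"

# Both orderings collapse to the same cache key:
print(normalize_accept("text/markdown, text/html;q=0.5"))   # text/markdown
print(normalize_accept("text/html;q=0.5, text/markdown"))   # text/markdown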

4. The cloaking boundary

Google's spam policy treats cloaking as "presenting different content to users than to search engines" with intent to manipulate rankings. Content negotiation under RFC 9110 is not cloaking when:

  1. Semantic equivalence. The markdown / JSON-LD variant says the same things as the HTML. Text content matches; structured data is a strict subset.
  2. Triggered by Accept, not User-Agent. Branching on User-Agent: GPTBot to serve different content is cloaking. Branching on Accept: text/markdown is not, because any client — including a human's curl command — can ask for it.
  3. Same canonical URL. All variants share a single canonical URL, exposed via the HTML <link rel="canonical"> element and the Link: rel="canonical" HTTP header.
  4. No hidden incentives. Variants don't carry promotional content absent from the HTML, or omit content present in the HTML for SEO benefit.

Avoid these anti-patterns:

  • UA-only branching. Serve markdown only if User-Agent matches a bot pattern. Treats bots as a special class, the textbook cloaking definition.
  • Variant divergence. The markdown variant says more, less, or different things than the HTML.
  • Bot-only structured data. JSON-LD only present when the requester looks like a bot.
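A rough guard against the variant-divergence anti-pattern above, suitable for CI: extract visible text from the HTML, strip markdown syntax from the variant, and require a high similarity ratio. The skipped-element list, the crude markdown-stripping regex, and the threshold are all assumptions to tune:

import difflib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "aside"}  # page chrome to ignore

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting level inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)

def words(text):
    return re.findall(r"[\w']+", text.lower())

def variants_equivalent(html, md, threshold=0.9):
    extractor = TextExtractor()
    extractor.feed(html)
    html_words = words(" ".join(extractor.parts))
    md_words = words(re.sub(r"[#*_`>\[\]()!-]", " ", md))  # crude syntax strip
    ratio = difflib.SequenceMatcher(None, html_words, md_words).ratio()
    return ratio >= threshold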

5. Fallback behaviour

If the client requests a variant you don't support:

  • Strict mode: return 406 Not Acceptable with a Link header pointing to the available variants.
  • Lenient mode: return 200 with the text/html variant and Content-Type: text/html. Most clients handle this gracefully.

Lenient is the practical default for AI-crawler endpoints; strict is appropriate for API-only endpoints.
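In strict mode, make the 406 self-describing. For example (the suffix URLs follow Strategy B from § 6):

HTTP/2 406 Not Acceptable
Content-Type: text/plain; charset=utf-8
Link: <https://example.com/articles/x>; rel="alternate"; type="text/html"
Link: <https://example.com/articles/x.md>; rel="alternate"; type="text/markdown"
Vary: Accept

No acceptable representation. Available: text/html, text/markdown.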

6. URL design

Two viable strategies:

Strategy A: one URL, multiple variants

The URL is https://example.com/articles/x; the variant is selected by Accept. Cleanest from a canonical-URL perspective.

Strategy B: explicit suffix

Expose a parallel https://example.com/articles/x.md or https://example.com/articles/x.jsonld. Easier to debug and link from Link headers (Link: <...x.md>; rel="alternate"; type="text/markdown"). Set a Link: rel="canonical" header on each variant pointing back to the HTML URL so canonical-URL signals are unambiguous.

Most teams ship Strategy B first (lower risk), then add Strategy A negotiation on top.
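A minimal Flask sketch of Strategy B (the route shape and stand-in bodies are assumptions): the suffix picks the variant, and every variant carries the same canonical Link header.

from flask import Flask, Response

app = Flask(__name__)

def serve(slug: str, media_type: str, body: str) -> Response:
    resp = Response(body, mimetype=media_type)
    # Every variant points at the single HTML URL as canonical (§ 4, rule 3).
    resp.headers["Link"] = f'<https://example.com/articles/{slug}>; rel="canonical"'
    return resp

@app.get("/articles/<slug>")
def article(slug):
    if slug.endswith(".md"):
        # Explicit-suffix variant: /articles/x.md serves markdown.
        return serve(slug[:-3], "text/markdown", f"# {slug[:-3]}\n")  # stand-in
    return serve(slug, "text/html", f"<article>{slug}</article>")     # stand-in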

Common mistakes

  • Branching on User-Agent. Use Accept. UA strings are mutable and using them for content branching is the canonical cloaking pattern.
  • Forgetting Vary. Without Vary: Accept, a CDN will serve the wrong variant to the wrong client. This is the most common cause of mysterious "why is markdown showing up in browsers" tickets.
  • Variant drift. Markdown variant generated from a different source than HTML. Generate both from the same content store.
  • Different canonical per variant. Each variant points to itself as canonical, fragmenting authority. Always use the HTML URL as the shared canonical.
  • No 406 fallback signalled. If you do return 406, include a Link header listing supported variants — saves clients a follow-up round trip.

FAQ

Q: Which Accept types should I support first?

Start with text/markdown because it's the most-requested variant by current AI agents and the cheapest to generate. Add application/ld+json next if you publish structured data. application/json for full page data is optional and only worth it if you have a public API surface anyway.

Q: Is serving markdown to bots considered cloaking?

No, when triggered by Accept: text/markdown and the markdown is semantically equivalent to the HTML. Yes, when triggered by User-Agent: GPTBot and the markdown contains different facts or omits content from the HTML.

Q: How do I keep variants in sync?

Generate them from a single source (Markdown source files, a CMS field, or a structured-data store). Don't author HTML and markdown variants by hand — they'll drift.
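A sketch of single-source generation, assuming the third-party Python-Markdown package: author once in markdown, derive the HTML.

import markdown  # the Python-Markdown package, an assumption here

def build_variants(md_source: str):
    # One authored source; both served representations derive from it,
    # so they cannot drift apart.
    return {"text/markdown": md_source,
            "text/html": markdown.markdown(md_source)}

variants = build_variants("# Answer grounding\n\nBody text.")
print(variants["text/html"])  # <h1>Answer grounding</h1>\n<p>Body text.</p>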

Q: Should I include Vary: User-Agent?

No. Vary: User-Agent explodes cache cardinality and signals UA-based content branching, which is exactly the cloaking pattern to avoid. Negotiate on Accept instead.

Q: Do AI agents actually send Accept: text/markdown today?

Mixed. Some agent frameworks (LangChain crawlers, Crawl4AI in markdown mode, OpenAI's own evaluation tools) do; the major bot crawlers (GPTBot, ClaudeBot, PerplexityBot) mostly still use Accept: */*. Supporting markdown is forward-looking infrastructure that costs little and pays off as agents become the dominant fetch pattern.
