AI Agent Content Specification
This specification defines how to structure web content for autonomous AI agents — crawlers, chatbots, research assistants, browser-based agents like ChatGPT Atlas and Perplexity Comet, coding agents like Cursor and Claude Code, and any other AI system that discovers, parses, and synthesizes information from the web on behalf of a user.
The AI Agent Content Specification defines three layers — discovery (llms.txt, agents.json, sitemap, robots.txt), parsing (frontmatter, semantic HTML, JSON-LD, agent.md), and attribution (ai.txt, canonical URLs) — that together let AI agents reliably find, understand, and cite web content. Compliance is verified through the checklist at the end of this page.
TL;DR
Make every page agent-ready by ensuring it is discoverable (present in llms.txt and sitemap.xml, allowed for major bots in robots.txt, optionally exposed via agents.json), parseable (full ~30-field frontmatter, semantic headings, JSON-LD, and where relevant a paired agent.md for tool-use surfaces), and attributable (canonical URL plus an ai.txt policy declaring source name and citation format). The compliance checklist at the bottom of this page is the single source of truth.
For the broader strategy, see the AI Agents pillar.
Specification overview
| Layer | Purpose | Standards |
|---|---|---|
| Discovery | AI agents find your content | llms.txt, agents.json, sitemap.xml, robots.txt |
| Parsing | AI agents understand your content | Frontmatter, semantic HTML, JSON-LD, agent.md |
| Attribution | AI agents cite your content | ai.txt, source metadata, canonical URLs |
Layer 1: Discovery
llms.txt
Every site should provide a /llms.txt file — a Markdown index that tells AI agents what your site contains and how it is organized. The format was proposed by Jeremy Howard and is documented at llmstxt.org.
Required elements:
- Site name (H1 heading)
- Site description (blockquote)
- Content index (links with descriptions)
- Section organization (H2 headings)
Minimal example:
# Acme
> Acme is a payments platform. This index lists the canonical references AI agents should consult.
## Core concepts
- Payments overview: How card and bank transfers move through Acme.
- Webhooks reference: Event types, retry policy, and signature verification.
## API
- Authentication: API key formats and header conventions.
- Errors: Error envelope and code taxonomy.
Full specification: How to Create llms.txt.
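The four required elements can be checked mechanically before deployment. The sketch below is one possible lint, not a reference validator: the function name `check_llms_txt`, the regular expressions, and the `acme.example` URL in the sample are all assumptions made for illustration.

```python
import re


def check_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means all four elements were found."""
    lines = [line.rstrip() for line in text.splitlines()]
    problems = []
    # Site name: exactly one H1 ("# " followed by text; "## " does not match).
    if sum(1 for line in lines if re.match(r"# \S", line)) != 1:
        problems.append("expected exactly one H1 site name")
    # Site description: at least one blockquote line.
    if not any(line.startswith("> ") for line in lines):
        problems.append("missing blockquote site description")
    # Section organization: at least one H2.
    if not any(re.match(r"## \S", line) for line in lines):
        problems.append("missing H2 section headings")
    # Content index: at least one Markdown link in a list item.
    if not any(re.match(r"- \[[^\]]+\]\([^)]+\)", line) for line in lines):
        problems.append("missing link entries in the content index")
    return problems


# Hypothetical sample file; the URL is a placeholder.
SAMPLE = """# Acme

> Acme is a payments platform.

## Core concepts

- [Payments overview](https://acme.example/docs/payments): How transfers move through Acme.
"""

print(check_llms_txt(SAMPLE))
```

An empty list means the file has the expected shape; it says nothing about whether the links resolve or the descriptions are accurate.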
agents.json (proposed)
agents.json is an emerging convention for declaring agent-actionable surfaces — APIs, tools, and structured tasks — at a well-known location (/.well-known/agents.json). It complements llms.txt by exposing capabilities, not just content.
Minimal example:
{
  "schema_version": "v1",
  "name_for_model": "acme_payments",
  "description_for_model": "Read and act on Acme payment data.",
  "auth": { "type": "oauth", "authorization_url": "https://acme.com/oauth" },
  "tools": [
    {
      "name": "create_invoice",
      "description": "Create an invoice for a customer.",
      "endpoint": "https://api.acme.com/v1/invoices",
      "method": "POST"
    }
  ],
  "contact_email": "[email protected]"
}
Adoption is partial; treat it as forward-compatible metadata, not a hard requirement.
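Because agents.json is still a proposal, there is no authoritative schema to validate against. The following sketch only checks that a file has the same shape as the minimal example above; the required-key lists and the HTTPS rule are assumptions drawn from that example, not from a published standard.

```python
import json

# Keys taken from the minimal example above; treating them as required is an assumption.
REQUIRED_TOP_LEVEL = ("schema_version", "name_for_model", "description_for_model", "tools")
REQUIRED_TOOL_KEYS = ("name", "description", "endpoint", "method")


def check_agents_json(raw: str) -> list[str]:
    """Return a list of shape problems found in an agents.json document."""
    doc = json.loads(raw)
    problems = [f"missing top-level key: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    for index, tool in enumerate(doc.get("tools", [])):
        for key in REQUIRED_TOOL_KEYS:
            if key not in tool:
                problems.append(f"tools[{index}] missing key: {key}")
        # Tool endpoints are expected to be absolute HTTPS URLs in this sketch.
        if not str(tool.get("endpoint", "")).startswith("https://"):
            problems.append(f"tools[{index}] endpoint is not HTTPS")
    return problems
```

Running it in CI against `/.well-known/agents.json` catches accidental key deletions without committing to any particular version of the convention.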
Sitemap for AI
Standard XML sitemaps help AI crawlers discover content. Enhance with:
- <lastmod> for freshness signals
- <changefreq> to indicate update patterns
- <priority> to highlight key pages
A separate sitemap-ai.xml can list only canonical, citation-ready pages — useful when your site mixes marketing and reference content.
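One way to produce such a file is to filter pages by their frontmatter and emit a standard sitemap. This is a minimal sketch using only the standard library; the function name `build_ai_sitemap` and the choice to filter on `citation_readiness == "reviewed"` are assumptions, and the input dictionaries stand in for whatever your build system exposes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_ai_sitemap(pages: list[dict]) -> str:
    """Build sitemap XML that lists only pages marked as reviewed."""
    # Register the sitemap namespace as the default so tags serialize without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        if page.get("citation_readiness") != "reviewed":
            continue  # Skip drafts and pages without a readiness flag.
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
    return ET.tostring(urlset, encoding="unicode")


xml = build_ai_sitemap([
    {"loc": "https://example.com/a", "lastmod": "2026-05-01", "citation_readiness": "reviewed"},
    {"loc": "https://example.com/b", "lastmod": "2026-05-01", "citation_readiness": "draft"},
])
print(xml)
```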
robots.txt for AI crawlers
Explicitly allow major agent crawlers; disallow only when you have a specific reason:
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Google-Extended
Allow: /
Crawler user-agents change. Verify at OpenAI bot docs, Anthropic crawler docs, and Google crawler docs.
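Before deploying a robots.txt change, it can help to confirm which agent crawlers the file actually admits. The sketch below uses the standard library's `urllib.robotparser`; the helper name `blocked_bots`, the bot list, and the sample rules are illustrative assumptions, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Subset of the user agents listed above; extend as needed.
AGENT_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]


def blocked_bots(robots_txt: str, url: str, bots: list[str] = AGENT_BOTS) -> list[str]:
    """Return the bots that the given robots.txt text would block from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


# Hypothetical policy: two bots have their own groups, everything else is disallowed.
SAMPLE_ROBOTS = """User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Disallow: /private/

User-agent: *
Disallow: /
"""

print(blocked_bots(SAMPLE_ROBOTS, "https://yoursite.com/page", ["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

In this sample, PerplexityBot falls through to the wildcard group, which is the kind of accidental block the check is meant to surface.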
Layer 2: Parsing
Frontmatter metadata schema
Every content page should include structured frontmatter. The schema below is the canonical 30-field shape; deviate only where a field truly does not apply.
---
# Identity
title: "Page Title"
slug: "url-slug"
section: "section-name"
canonical_url: "https://example.com/section/url-slug"
status: "published"
# Knowledge
canonical_concept_id: "unique-concept-identifier"
knowledge_domain: "domain-name"
concept_type: "core-concept|sub-concept|technique|tool|standard|metric"
entities: ["Primary Entity"]
aliases: ["alt name"]
related_concepts: ["related-id"]
# Taxonomy
content_type: "guide|reference|comparison|definition|specification|checklist|tutorial|framework"
primary_audience: "developer|seo-specialist|content-strategist|founder|marketer"
secondary_audiences: ["..."]
reader_modes: ["human", "ai-agent"]
difficulty: "beginner|intermediate|advanced"
ai_platforms: ["chatgpt", "perplexity", "claude", "gemini"]
# SEO
description: "120-160 char description."
focus_keyword: "primary keyword"
secondary_keywords: ["k1", "k2"]
# AI readiness
canonical_question: "What is X?"
llm_summary: "2-sentence factual summary."
citation_readiness: "reviewed|draft"
# Lifecycle
published_at: "YYYY-MM-DD"
updated_at: "YYYY-MM-DD"
last_reviewed_at: "YYYY-MM-DD"
review_cycle_days: 90
version: "1.0"
# Relations
series: "series-id"
series_order: 1
related_articles: ["section/slug"]
# I18n + authorship
lang: "en"
translations: []
author: "Author Name"
reviewed_by: null
---
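A schema this large tends to drift unless it is checked at build time. The sketch below validates an already-parsed frontmatter dictionary (parsing YAML is left to whatever library your site generator uses). Which fields count as required is an assumption made here for illustration; the enum values, date format, and description length come from the schema above.

```python
from datetime import date

# Allowed values copied from the schema above.
ENUMS = {
    "difficulty": {"beginner", "intermediate", "advanced"},
    "citation_readiness": {"reviewed", "draft"},
    "concept_type": {"core-concept", "sub-concept", "technique", "tool", "standard", "metric"},
}
DATE_FIELDS = ("published_at", "updated_at", "last_reviewed_at")
# Assumption: a small core of fields is treated as mandatory in this sketch.
REQUIRED = ("title", "slug", "canonical_url", "description", "llm_summary", *DATE_FIELDS)


def check_frontmatter(meta: dict) -> list[str]:
    """Return a list of problems found in a parsed frontmatter mapping."""
    problems = [f"missing field: {key}" for key in REQUIRED if not meta.get(key)]
    for key, allowed in ENUMS.items():
        if key in meta and meta[key] not in allowed:
            problems.append(f"{key} has unexpected value: {meta[key]!r}")
    for key in DATE_FIELDS:
        value = meta.get(key)
        if value:
            try:
                date.fromisoformat(str(value))  # Loose ISO date check.
            except ValueError:
                problems.append(f"{key} is not YYYY-MM-DD: {value!r}")
    description = meta.get("description", "")
    if description and not 120 <= len(description) <= 160:
        problems.append(f"description is {len(description)} chars, expected 120-160")
    return problems
```

Returning a list rather than raising on the first failure lets a build report every problem on a page in one pass.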
agent.md (tool-use surface)
For pages that document an API, CLI, or other tool, pair the human-readable page with a sibling agent.md file at the same path. agent.md strips marketing prose and gives an agent the deterministic signature it needs.
Minimal example:
# acme.payments.create_invoice
## Signature
POST /v1/invoices
Content-Type: application/json
Authorization: Bearer
## Input
- customer_id (string, required): Acme customer ID, format cus_.
- amount_cents (integer, required): Positive integer.
- currency (string, required): ISO 4217 code.
## Output
- invoice_id (string): Created invoice ID, format inv_.
- status (string): One of open, paid, void.
## Errors
- 400 invalid_currency — currency not in ISO 4217.
- 402 insufficient_funds — customer balance below amount_cents.
## Idempotency
Pass Idempotency-Key header. Same key returns the original response.
Cursor, Claude Code, and similar coding agents are the primary consumers of agent.md-style surfaces today. Repository-root files (AGENTS.md, CLAUDE.md) follow the same pattern for code-context use.
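The value of the agent.md layout is that its parameter lines are regular enough to parse without a language model. As a rough illustration, the sketch below extracts parameters from lines shaped like the Input and Output entries above. The regular expression and the `parse_params` name are assumptions; agent.md has no formal grammar, so treat this as one possible reading of the example.

```python
import re

# Matches lines like "- customer_id (string, required): Acme customer ID."
PARAM = re.compile(r"^- (?P<name>\w+) \((?P<type>\w+)(?P<req>, required)?\): (?P<doc>.+)$")


def parse_params(section: str) -> list[dict]:
    """Extract parameter descriptions from an agent.md Input or Output section."""
    params = []
    for line in section.splitlines():
        match = PARAM.match(line.strip())
        if match:
            params.append({
                "name": match["name"],
                "type": match["type"],
                "required": match["req"] is not None,
                "doc": match["doc"],
            })
    return params


section = "- customer_id (string, required): Acme customer ID.\n- note (string): Optional memo."
print(parse_params(section))
```

The `note` parameter in the usage line is hypothetical and only shows how an optional field is reported.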
Content body structure
Answer-first pattern
The first section directly answers the page's core question:
# [Title as Question or Topic]
> [Direct answer in 1-2 sentences. Complete and self-contained.]
[2-3 sentence expanded summary.]
## TL;DR
[Snippet-ready 2-3 sentence summary.]
Heading hierarchy
- H1: page title (exactly one)
- H2: major sections
- H3: sub-sections within H2
- H4: rare; avoid deeper nesting
Required structural elements
Every agent-ready page must include:
1. One H1 matching the frontmatter title.
2. One AI summary blockquote immediately after the H1.
3. A TL;DR section (## TL;DR) with a 2-3 sentence snippet.
4. At least one extractable definition, table, or step list in the body.
5. A FAQ section with 3-8 question-answer pairs phrased as natural questions.
6. A canonical URL in frontmatter and in <link rel="canonical">.
7. At least one JSON-LD block describing the primary entity.
Pages missing any of items 1-7 are non-compliant and should not be marked citation_readiness: reviewed.
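Several of these requirements can be linted directly from the Markdown source. The sketch below covers only the H1, summary blockquote, TL;DR, and FAQ checks; the function name and messages are assumptions. It is deliberately naive and will misread `#` lines inside fenced code blocks as headings, so strip code blocks first if your pages contain them.

```python
import re


def check_structure(markdown: str) -> list[str]:
    """Check a Markdown page for the H1, summary blockquote, TL;DR, and FAQ rules."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    problems = []

    h1_positions = [i for i, line in enumerate(lines) if re.match(r"# \S", line)]
    if len(h1_positions) != 1:
        problems.append("page must have exactly one H1")
    else:
        # The first non-empty line after the H1 should be the summary blockquote.
        following = [line for line in lines[h1_positions[0] + 1:] if line.strip()]
        if not following or not following[0].startswith(">"):
            problems.append("AI summary blockquote must follow the H1")

    if "## TL;DR" not in lines:
        problems.append("missing ## TL;DR section")

    # Count H3 questions that appear under an H2 titled "FAQ".
    in_faq, questions = False, 0
    for line in lines:
        if line.startswith("## "):
            in_faq = line[3:].strip().lower() == "faq"
        elif in_faq and line.startswith("### ") and line.endswith("?"):
            questions += 1
    if not 3 <= questions <= 8:
        problems.append(f"FAQ has {questions} questions, expected 3-8")
    return problems
```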
Extractable patterns
Definition:
[Term] is [complete definition in one sentence].
[Optional second sentence on significance.]
Comparison table:
| Dimension | Option A | Option B |
|---|---|---|
| Aspect 1 | Value | Value |
Step-by-step:
1. **Step name** — Description.
2. **Step name** — Description.
FAQ pair:
### Question in natural language?
[Direct answer. No preamble.]
JSON-LD structured data
Every page should include at least one JSON-LD block.
Article:
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Page Title",
  "description": "Page description",
  "author": { "@type": "Organization", "name": "Site Name" },
  "datePublished": "2025-01-01",
  "dateModified": "2026-05-01"
}
FAQ:
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Question text",
      "acceptedAnswer": { "@type": "Answer", "text": "Answer text" }
    }
  ]
}
Full reference: Structured Data for AI Search.
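Generating JSON-LD from the same frontmatter that drives the page keeps the two from diverging. The sketch below maps frontmatter fields onto the TechArticle shape shown above; the function names are assumptions, and the field mapping (for example `published_at` to `datePublished`) is one reasonable reading of the schema rather than a prescribed one.

```python
import json


def article_jsonld(meta: dict, site_name: str) -> dict:
    """Map frontmatter fields onto a schema.org TechArticle object."""
    return {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": meta["title"],
        "description": meta["description"],
        "author": {"@type": "Organization", "name": site_name},
        "datePublished": meta["published_at"],
        "dateModified": meta["updated_at"],
    }


def render_script_tag(data: dict) -> str:
    """Serialize JSON-LD into a script tag for the page head."""
    # Escape "</" so text containing "</script>" cannot close the tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


meta = {
    "title": "Page Title",
    "description": "Page description",
    "published_at": "2025-01-01",
    "updated_at": "2026-05-01",
}
print(render_script_tag(article_jsonld(meta, "Site Name")))
```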
Layer 3: Attribution
ai.txt
Define how AI agents should attribute your content:
# AI Agent Access Policy
User-agent: *
Allow: /
Attribution-required: yes
Source-name: Your Site Name
Source-url: https://yoursite.com
Citation-format: "[Title] — [Source-name] (Source-url/path)"
Full specification: ai.txt Starter Template.
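To see how an agent might apply this policy, the sketch below parses the key-value lines and fills in the citation template. ai.txt has no settled grammar, so the parsing rules (case-insensitive keys, `#` comments, quoted values) and the way the template placeholders are substituted are assumptions made for illustration.

```python
def parse_ai_txt(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict with lowercase keys."""
    policy = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # Skip blanks, comments, and malformed lines.
        key, value = line.split(":", 1)
        policy[key.strip().lower()] = value.strip().strip('"')
    return policy


def format_citation(policy: dict[str, str], title: str, path: str) -> str:
    """Fill the Citation-format template for one page."""
    page_url = policy["source-url"].rstrip("/") + path
    return (
        policy["citation-format"]
        .replace("[Title]", title)
        .replace("[Source-name]", policy["source-name"])
        .replace("Source-url/path", page_url)
    )


# The template mirrors the example above; \u2014 is the dash used in that template.
SAMPLE_AI_TXT = """# AI Agent Access Policy
User-agent: *
Allow: /
Attribution-required: yes
Source-name: Your Site Name
Source-url: https://yoursite.com
Citation-format: "[Title] \u2014 [Source-name] (Source-url/path)"
"""

policy = parse_ai_txt(SAMPLE_AI_TXT)
print(format_citation(policy, "Webhooks reference", "/docs/webhooks"))
```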
Canonical URLs
Every page must have exactly one canonical URL:
<link rel="canonical" href="https://yoursite.com/section/slug" />
Agents should use this URL for attribution regardless of how they discovered the page. Mirror sites, syndicated copies, and AMP variants must point back to the canonical.
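The "exactly one" rule is easy to break when templates and plugins both inject a canonical tag. A minimal check, using only the standard library's `html.parser`, might look like the sketch below; the class and function names are assumptions.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def canonical_url(html: str) -> str:
    """Return the page's canonical URL, or raise if there is not exactly one."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(finder.hrefs) != 1:
        raise ValueError(f"expected exactly one canonical link, found {len(finder.hrefs)}")
    return finder.hrefs[0]


page = '<head><link rel="canonical" href="https://yoursite.com/section/slug" /></head>'
print(canonical_url(page))
```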
Source metadata
Mark organization-level information once with Organization schema (logo, sameAs, contact) so agents can resolve "who owns this content" without parsing every page individually.
Implementation patterns by platform
Different agents weight signals differently. The patterns below summarize observed behavior; verify against each platform's current docs.
ChatGPT (OpenAI) — Search and Atlas browser
- Honors robots.txt for GPTBot, OAI-SearchBot, and ChatGPT-User.
- Strongly favors pages with a clear <title>, meta description, and Article / TechArticle JSON-LD.
- Atlas (browser agent) consumes the rendered DOM, so client-side-only content is parseable but slower than server-rendered HTML.
- May use Source-name from ai.txt when present, though this is not documented behavior; otherwise attribution falls back to the domain.
Perplexity — Answer engine and Comet browser
- Crawls aggressively via PerplexityBot; honors per-page noindex and robots.txt.
- Quotes short extractive snippets and links back; benefits most from concise answer-first openers and FAQ sections.
- Comet (browser agent) follows links the user is on; well-structured headings let it summarize without re-fetching.
Claude (Anthropic) — Chat, Computer Use, Claude Code
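- Anthropic's crawler documentation currently lists ClaudeBot, Claude-User, and Claude-SearchBot; Claude-Web appears in older robots.txt guidance and may be a legacy token. All respect robots.txt.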
- Computer Use parses page screenshots plus accessibility trees — semantic HTML (proper headings, labeled inputs, alt text) materially improves agent reliability.
- Claude Code reads repo-local agent.md / AGENTS.md / CLAUDE.md files; mirror your public spec into the repo for code-context use.
Gemini (Google) — Search, AI Overviews, Gemini Live
- Crawls with Googlebot for search; Google-Extended is a robots.txt control token (not a separate crawler) that governs use of content for Gemini training and grounding.
- Heavily weights structured data: Article, FAQPage, HowTo, Product, and BreadcrumbList.
- AI Overviews favor pages already ranking organically, so traditional SEO hygiene compounds with agent readiness.
Cursor and other coding agents
- Read AGENTS.md, agent.md, and project-root README files first.
- Prefer deterministic input/output examples and explicit error taxonomies over prose.
- Pages that document libraries should publish a sibling llms-full.txt containing the full Markdown body for offline indexing.
Validation
Validate compliance with the same tools agents (or their pipelines) use:
- Schema.org Validator
- Google Rich Results Test
- llms.txt validator reference implementations
- curl -A "GPTBot" -I https://yoursite.com/page to confirm crawler access
- curl -A "ClaudeBot" -I and curl -A "PerplexityBot" -I for the other major bots
- A dry-run "ask the chatbot" check: paste the URL into ChatGPT, Perplexity, and Claude and verify the summary matches the canonical content
Compliance checklist
Discovery
- [ ] /llms.txt exists and is current
- [ ] /.well-known/agents.json published if the site exposes tools/APIs
- [ ] sitemap.xml includes all content pages with lastmod
- [ ] robots.txt allows the major agent crawlers listed above
- [ ] /ai.txt defines access policy
Parsing
- [ ] Every page has full frontmatter (~30-field schema)
- [ ] Every page has answer-first opening
- [ ] Every page has a single AI summary blockquote and a TL;DR section
- [ ] Every page has at least one JSON-LD block describing the primary entity
- [ ] Heading hierarchy is semantic (one H1, then H2 → H3)
- [ ] Tables, lists, and code blocks use proper Markdown / HTML markup
- [ ] Tool / API pages have a sibling agent.md
Attribution
- [ ] <link rel="canonical"> is set on every page
- [ ] ai.txt specifies attribution requirements and citation format
- [ ] Author / Organization metadata is included
- [ ] published_at and updated_at are accurate
Common mistakes
- Mixing legacy and current frontmatter keys. Drop date_published, date_updated, ai_summary, and schema_type; use published_at, updated_at, llm_summary, and concept_type consistently.
- Two or more "AI summary" blocks per page. Dilutes which sentence agents extract; keep exactly one immediately after the H1.
- Leaving JSON-LD as the only structured signal. Agents cross-check JSON-LD against the rendered HTML; mismatches cause structured data to be ignored.
- Disallowing all bots in robots.txt "to be safe". This blocks citations and grounding; allow the major agent UAs explicitly and use per-path rules for sensitive sections.
- Treating llms.txt as a marketing brochure. It is an index; keep it terse and link-heavy.
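The legacy-key mistake above lends itself to a one-time migration. The sketch below renames the four legacy keys on a parsed frontmatter dictionary. Pairing the old and new names in the order they are listed is an assumption, as is the rule that an existing current key wins over its legacy counterpart; legacy `schema_type` values may also need manual review, since they may not match the `concept_type` vocabulary.

```python
# Assumed pairing, following the order in which the keys are listed above.
LEGACY_KEYS = {
    "date_published": "published_at",
    "date_updated": "updated_at",
    "ai_summary": "llm_summary",
    "schema_type": "concept_type",
}


def migrate_frontmatter(meta: dict) -> dict:
    """Return a copy of meta with legacy keys renamed to their current names."""
    # Keep every key that is not legacy.
    migrated = {key: value for key, value in meta.items() if key not in LEGACY_KEYS}
    for old, new in LEGACY_KEYS.items():
        # Copy the legacy value only when the current key is absent.
        if old in meta and new not in migrated:
            migrated[new] = meta[old]
    return migrated


old = {"title": "T", "date_published": "2025-01-01", "ai_summary": "S", "llm_summary": "Current"}
print(migrate_frontmatter(old))
```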
FAQ
Is this specification an official standard?
No. It is a practical specification based on observable AI system behavior and emerging community conventions. JSON-LD, semantic HTML, and sitemaps are well-established standards; llms.txt, ai.txt, and agents.json are proposals with growing adoption.
Do all AI agents follow these conventions?
Not uniformly. JSON-LD and structured HTML are recognized by every major AI system. llms.txt and ai.txt are emerging — major models do not yet officially commit to consuming them, but they are low-cost to publish and forward-compatible.
How often should I update my compliance?
Review quarterly. Crawler user agents, structured-data types, and emerging conventions shift fast. Core HTML and schema are stable, but discovery and attribution mechanisms continue to evolve.
Is the frontmatter schema required for HTML-only pages?
The frontmatter schema is the canonical metadata source, and it can be expressed equivalently in HTML tags or JSON-LD. The exact transport matters less than completeness and accuracy. Frameworks and static-site generators like Next.js, Astro, and Hugo make YAML frontmatter the easiest path; for hand-authored HTML, mirror the same fields into JSON-LD Article and Organization blocks.
What is the bare-minimum subset?
If you can only do four things: (1) one canonical URL per page, (2) JSON-LD for the primary entity, (3) llms.txt listing your top pages, (4) robots.txt allowing GPTBot, ClaudeBot, and PerplexityBot. This subset captures most of the citation upside.
How does this differ from traditional SEO?
Traditional SEO optimizes for ranking in a SERP that a human reads. Agent content optimization additionally optimizes for extraction (a non-human consumer copying a sentence) and attribution (that consumer linking back). The two overlap heavily — a well-structured SEO page is already most of the way to agent-ready — but agent readiness adds explicit machine-readable layers (frontmatter, JSON-LD, llms.txt, ai.txt).
Do I need a separate agent.md for every page?
No. Pair agent.md only with pages that document an actionable surface: APIs, CLIs, SDKs, configuration files. Pure narrative or conceptual pages do not need one — the standard frontmatter and JSON-LD are sufficient.
Will following this spec guarantee citations?
No. It maximizes eligibility — agents still rank by topical authority, freshness, and source reputation. Treat the spec as removing avoidable failure modes, not as a ranking lever.