AI Agent Content Specification
This specification defines how to structure web content for autonomous AI agents: crawlers, chatbots, research assistants, and other AI systems that discover, parse, and synthesize information from the web.
🤖 AI SUMMARY
The AI Agent Content Specification defines three layers: (1) discovery: how AI agents find your content via llms.txt, sitemaps, and robots.txt; (2) parsing: how agents extract information through frontmatter metadata, heading structure, and structured data; and (3) attribution: how agents should cite and link back to your content via ai.txt policies and schema markup.
Specification Overview
| Layer | Purpose | Standards |
|---|---|---|
| Discovery | AI agents find your content | llms.txt, sitemap.xml, robots.txt |
| Parsing | AI agents understand your content | Frontmatter, HTML structure, JSON-LD |
| Attribution | AI agents cite your content | ai.txt, source metadata, canonical URLs |
Layer 1: Discovery
llms.txt
Every site should provide a /llms.txt file, a machine-readable index that tells AI agents what your site contains and how it's organized.
Required elements:
- Site name (H1 heading)
- Site description (blockquote)
- Content index (links with descriptions)
- Section organization (H2 headings)
Full specification: llms.txt Reference
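As a rough illustration, the four required elements listed above can be checked mechanically. The sketch below is a minimal checker under stated assumptions: the function name `check_llms_txt`, the sample file, and the exact link pattern (`- [Title](url): description`) are hypothetical choices made for this example, not part of any formal llms.txt grammar.

```python
import re


def check_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means all four elements were found."""
    lines = [ln.rstrip() for ln in text.splitlines()]
    problems = []

    # Site name: exactly one H1 heading.
    h1_count = sum(1 for ln in lines if re.match(r"^# \S", ln))
    if h1_count != 1:
        problems.append(f"expected exactly one H1 site name, found {h1_count}")

    # Site description: at least one blockquote line.
    if not any(ln.startswith("> ") for ln in lines):
        problems.append("missing blockquote site description")

    # Section organization: at least one H2 heading.
    if not any(re.match(r"^## \S", ln) for ln in lines):
        problems.append("missing H2 section headings")

    # Content index: assumed entry shape is "- [Title](url): description".
    if not any(re.match(r"^- \[[^\]]+\]\([^)]+\): \S", ln) for ln in lines):
        problems.append("missing content index links with descriptions")

    return problems


SAMPLE = """# Example Site
> Guides and references for structuring content for AI agents.

## Guides
- [Getting Started](https://example.com/guides/start): Setup walkthrough
"""

print(check_llms_txt(SAMPLE))
```

A check like this only confirms that the elements are present; whether the index is current still needs a human or a build-time comparison against the site's page list.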
Sitemap for AI
Standard XML sitemaps help AI crawlers discover content. Enhance with:
- <lastmod> dates for freshness signals
- <changefreq> to indicate update patterns
- <priority> to highlight key pages
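The three sitemap elements above can be emitted with the Python standard library. This is a sketch rather than a full sitemap generator: `build_sitemap` and the page dictionaries are hypothetical names for this example, and the URL is a placeholder.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[dict]) -> str:
    """Serialize page dicts (loc plus optional lastmod/changefreq/priority) to XML."""
    ET.register_namespace("", NS)  # emit the sitemap namespace as the default xmlns
    urlset = ET.Element(f"{{{NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{NS}}}url")
        # Keep the element order used by the sitemap protocol.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, f"{{{NS}}}{tag}").text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    {
        "loc": "https://example.com/section/slug",
        "lastmod": "2025-04-01",
        "changefreq": "monthly",
        "priority": 0.8,
    }
])
print(xml_text)
```

Some consumers, Google Search among them, document that they ignore changefreq and priority, so lastmod is the field most worth keeping accurate.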
robots.txt for AI Crawlers
Explicitly allow AI crawler access:
User-agent: GPTBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Applebot-Extended
Allow: /
Layer 2: Parsing
Frontmatter Metadata Schema
Every content page should include structured frontmatter metadata:
---
# Identity
title: "Page Title"
description: "One-paragraph description"
slug: "url-slug"
section: "section-name"
# Knowledge
canonical_concept_id: "unique-concept-identifier"
content_type: "definition|guide|comparison|reference|tutorial|framework|checklist"
difficulty: "beginner|intermediate|advanced"
knowledge_graph_domains: ["domain1", "domain2"]
# Taxonomy
tags: ["tag1", "tag2", "tag3"]
related_articles:
- section/slug
- section/slug
# AI Readiness
ai_summary: "2-3 sentence summary optimized for AI extraction"
schema_type: "TechArticle|FAQPage|HowTo"
# Lifecycle
author: "Author Name"
date_published: "YYYY-MM-DD"
date_updated: "YYYY-MM-DD"
---
Content Body Structure
Answer-First Pattern
The first section must directly answer the page's core question:
# [Title as Question or Topic]
[Direct answer in 1-2 sentences. Complete and self-contained.]
> **🤖 AI SUMMARY**
> [2-3 sentence expanded summary for AI extraction]
## [First Major Section]
...
Heading Hierarchy
- H1: Page title (exactly one)
- H2: Major sections
- H3: Sub-sections within H2
- H4: Rarely used; avoid deeper nesting
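The hierarchy rules above lend themselves to an automated lint. The following sketch assumes the page source is Markdown with ATX-style (`#`) headings; the function name `check_headings` and the specific messages are illustrative only.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def check_headings(markdown: str) -> list[str]:
    """Report heading-hierarchy problems in a Markdown document."""
    levels = []
    in_fence = False
    for line in markdown.splitlines():
        # Skip fenced code blocks so template headings inside them are ignored.
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6}) \S", line)
        if match:
            levels.append(len(match.group(1)))

    problems = []
    if levels.count(1) != 1:
        problems.append(f"expected exactly one H1, found {levels.count(1)}")
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"H{prev} followed by H{cur} skips a level")
    if any(level > 4 for level in levels):
        problems.append("heading deeper than H4")
    return problems


print(check_headings("# Title\n## Section\n### Detail\n## Next\n"))
print(check_headings("# Title\n### Detail\n# Second title\n"))
```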
Extractable Content Patterns
Definition block:
[Term] is [complete definition in one sentence].
[Optional second sentence expanding on significance].
Comparison table:
| Dimension | Option A | Option B |
|-----------|----------|----------|
| Aspect 1 | Value | Value |
| Aspect 2 | Value | Value |
Step-by-step:
1. **Step name**: Description of what to do
2. **Step name**: Description of what to do
FAQ pair:
### [Question in natural language]?
[Direct answer. No preamble.]
JSON-LD Structured Data
Every page should include at least one JSON-LD block:
For articles:
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Page Title",
  "description": "Page description",
  "author": {"@type": "Organization", "name": "Site Name"},
  "datePublished": "2025-01-01",
  "dateModified": "2025-04-01"
}
For FAQ content:
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Question text",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Answer text"
      }
    }
  ]
}
Full reference: Structured Data for AI Search
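Because the article block mirrors fields already present in the frontmatter schema, it can be generated rather than hand-written. The sketch below assumes the frontmatter has already been parsed into a dictionary; `article_jsonld` and the mapping from frontmatter keys to schema.org properties are choices made for this example, not something the specification prescribes.

```python
import json


def article_jsonld(meta: dict, site_name: str) -> str:
    """Build a TechArticle JSON-LD script tag from parsed frontmatter."""
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": meta["title"],
        "description": meta["description"],
        "author": {"@type": "Organization", "name": site_name},
        "datePublished": meta["date_published"],
        "dateModified": meta["date_updated"],
    }
    # Escape "</" so a value containing "</script>" cannot close the tag early;
    # "\/" is a valid JSON escape and decodes back to "/".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = article_jsonld(
    {
        "title": "Page Title",
        "description": "Page description",
        "date_published": "2025-01-01",
        "date_updated": "2025-04-01",
    },
    site_name="Site Name",
)
print(snippet)
```

Generating the block from the same source as the frontmatter keeps the dates and titles from drifting apart between the two.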
Layer 3: Attribution
ai.txt
Define how AI agents should attribute your content:
# AI Agent Access Policy
User-agent: *
Allow: /
Attribution-required: yes
Source-name: Your Site Name
Source-url: https://yoursite.com
Full specification: ai.txt Reference
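To show how an agent might read the policy above, here is a minimal parser sketch. It assumes ai.txt consists of `Key: value` lines and `#` comments exactly as in the example, and that the file holds a single policy group; `parse_ai_txt` is a hypothetical helper, and ai.txt itself has no settled grammar to validate against.

```python
def parse_ai_txt(text: str) -> dict[str, str]:
    """Parse 'Key: value' directives into a dict with lower-cased keys."""
    policy = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # ignore lines that are not directives
        policy[key.strip().lower()] = value.strip().strip('"')
    return policy


SAMPLE_POLICY = """# AI Agent Access Policy
User-agent: *
Allow: /
Attribution-required: yes
Source-name: Your Site Name
Source-url: https://yoursite.com
"""

policy = parse_ai_txt(SAMPLE_POLICY)
print(policy)
```

Splitting on the first colon only is what keeps the `https://` in Source-url intact.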
Canonical URLs
Every page must have exactly one canonical URL:
<link rel="canonical" href="https://yoursite.com/section/slug" />
AI agents should use this URL for attribution regardless of how they discovered the page.
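The "exactly one canonical URL" rule can be verified against rendered HTML with the standard library parser. In this sketch, `CanonicalFinder` and `canonical_url` are hypothetical names, and raising an error on zero or multiple canonical links is one possible policy among several.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def canonical_url(html: str) -> str:
    """Return the page's canonical URL, or raise if there is not exactly one."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(finder.hrefs) != 1:
        raise ValueError(f"expected exactly one canonical link, found {len(finder.hrefs)}")
    return finder.hrefs[0]


PAGE = '<html><head><link rel="canonical" href="https://yoursite.com/section/slug" /></head></html>'
print(canonical_url(PAGE))
```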
Citation Format Preference
Include preferred citation format in ai.txt:
Citation-format: "[Title] - [Source-name] (Source-url/path)"
Compliance Checklist
Use this checklist to verify AI agent readiness:
Discovery
- [ ] /llms.txt exists and is current
- [ ] sitemap.xml includes all content pages
- [ ] robots.txt allows AI crawlers
- [ ] /ai.txt defines access policy
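The robots.txt item in this list can be checked with Python's `urllib.robotparser`. The sketch below works on robots.txt text that has already been fetched; `blocked_crawlers` is a hypothetical helper, and the crawler list simply repeats the user agents named in this specification.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Applebot-Extended"]


def blocked_crawlers(robots_txt: str, url: str = "https://yoursite.com/") -> list[str]:
    """Return the AI crawlers that the given robots.txt text blocks from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]


ROBOTS = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_crawlers(ROBOTS))
```

An empty result means none of the listed crawlers is blocked for that URL; it does not prove that each one is named explicitly in the file.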
Parsing
- [ ] Every page has frontmatter with required fields
- [ ] Every page has answer-first opening
- [ ] Every page has AI summary block
- [ ] Every page has JSON-LD structured data
- [ ] Heading hierarchy is semantic (H1 → H2 → H3)
- [ ] Tables use proper HTML markup
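The frontmatter item in this list can be enforced at build time. The sketch below validates an already-parsed frontmatter dictionary against the enumerated values from the schema. Note the assumption: the specification does not say which fields are required, so the `REQUIRED` tuple here is a guess for illustration, and `validate_frontmatter` is a hypothetical name.

```python
from datetime import date

# Assumption: the spec does not list required fields; this set is illustrative.
REQUIRED = ("title", "description", "slug", "section", "ai_summary",
            "date_published", "date_updated")

# Allowed values taken from the frontmatter schema.
ALLOWED = {
    "content_type": {"definition", "guide", "comparison", "reference",
                     "tutorial", "framework", "checklist"},
    "difficulty": {"beginner", "intermediate", "advanced"},
    "schema_type": {"TechArticle", "FAQPage", "HowTo"},
}


def validate_frontmatter(meta: dict) -> list[str]:
    """Return a list of problems found in a parsed frontmatter dict."""
    problems = [f"missing field: {key}" for key in REQUIRED if not meta.get(key)]
    for key, allowed in ALLOWED.items():
        if key in meta and meta[key] not in allowed:
            problems.append(f"{key}: unexpected value {meta[key]!r}")
    for key in ("date_published", "date_updated"):
        value = meta.get(key)
        if value:
            try:
                date.fromisoformat(str(value))
            except ValueError:
                problems.append(f"{key}: not a valid ISO date")
    return problems


GOOD = {
    "title": "Page Title", "description": "One-paragraph description",
    "slug": "url-slug", "section": "section-name",
    "ai_summary": "Short summary.", "content_type": "guide",
    "difficulty": "beginner", "schema_type": "TechArticle",
    "date_published": "2025-01-01", "date_updated": "2025-04-01",
}
print(validate_frontmatter(GOOD))
```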
Attribution
- [ ] Canonical URLs are set on all pages
- [ ] ai.txt specifies attribution requirements
- [ ] Author metadata is included
- [ ] Publication and update dates are accurate
FAQ
Is this specification an official standard?
No. This is a practical specification based on observable AI system behavior and emerging community standards. As AI content consumption matures, formal standards may emerge. This spec is designed to work with current AI systems.
Do all AI agents follow these conventions?
Not uniformly. llms.txt and ai.txt are emerging conventions with growing adoption. JSON-LD and structured HTML are well-established web standards that major AI systems can generally parse. Frontmatter metadata is primarily used by content management systems and static site generators.
How often should I update my compliance?
Review quarterly. AI agent capabilities evolve rapidly, and new conventions emerge regularly. The core HTML and schema standards are stable, but discovery and attribution mechanisms may change.