AI Agent Content Specification

This specification defines how to structure web content for autonomous AI agents — crawlers, chatbots, research assistants, and other AI systems that discover, parse, and synthesize information from the web.

🤖 AI SUMMARY

The AI Agent Content Specification defines three layers: (1) discovery — how AI agents find your content via llms.txt, sitemaps, and robots.txt, (2) parsing — how agents extract information through frontmatter metadata, heading structure, and structured data, and (3) attribution — how agents should cite and link back to your content via ai.txt policies and schema markup.

Specification Overview

Layer	Purpose	Standards
Discovery	AI agents find your content	llms.txt, sitemap.xml, robots.txt
Parsing	AI agents understand your content	Frontmatter, HTML structure, JSON-LD
Attribution	AI agents cite your content	ai.txt, source metadata, canonical URLs

Layer 1: Discovery

llms.txt

Every site should provide a /llms.txt file — a machine-readable index that tells AI agents what your site contains and how it's organized.

Required elements:

Site name (H1 heading)
Site description (blockquote)
Content index (links with descriptions)
Section organization (H2 headings)

Full specification: llms.txt Reference
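As a rough illustration of how the four required elements fit together, the sketch below renders an llms.txt body from plain data. The `Link` and `Section` types, the `render_llms_txt` helper, and the example.com site are hypothetical names introduced here; the link line format follows the common llms.txt convention of a Markdown link followed by a colon and a description.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    title: str
    url: str
    description: str


@dataclass
class Section:
    heading: str
    links: list[Link] = field(default_factory=list)


def render_llms_txt(site_name: str, description: str, sections: list[Section]) -> str:
    """Render site name (H1), description (blockquote), and H2 sections of links."""
    lines = [f"# {site_name}", "", f"> {description}", ""]
    for section in sections:
        lines.append(f"## {section.heading}")
        lines.append("")
        for link in section.links:
            lines.append(f"- [{link.title}]({link.url}): {link.description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site data, for illustration only.
example = render_llms_txt(
    "Example Docs",
    "Guides and references for the Example product.",
    [
        Section(
            "Guides",
            [Link("Getting Started", "https://example.com/guides/start", "Install and run the first example")],
        )
    ],
)
print(example)
```

Generating the file from the same data that drives site navigation is one way to keep the "exists and is current" requirement from drifting.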

Sitemap for AI

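Standard XML sitemaps help AI crawlers discover content. Enhance each <url> entry with:

<lastmod> dates for freshness signals (keep them accurate; this is the most widely honored of the three)
<changefreq> to indicate update patterns (a hint only; some crawlers, Google's among them, ignore it)
<priority> to highlight key pages (also a hint that some crawlers ignore)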
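A minimal sketch of producing such a sitemap with the Python standard library follows. The `build_sitemap` helper and the page dictionaries are assumptions made for illustration; only the element names and the sitemap namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[dict]) -> str:
    """Build a sitemap string; each page dict needs 'loc' and may carry the optional hints."""
    ET.register_namespace("", SITEMAP_NS)  # emit the namespace as the default xmlns
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(page[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page list, for illustration only.
print(
    build_sitemap(
        [
            {
                "loc": "https://example.com/guides/start",
                "lastmod": "2025-04-01",
                "changefreq": "monthly",
                "priority": 0.8,
            }
        ]
    )
)
```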

robots.txt for AI Crawlers

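Explicitly allow AI crawler access. Applebot-Extended is a usage-control token rather than a crawler: it governs whether pages already fetched by Applebot may be used for training Apple's models. The rules below allow all four agents: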

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Applebot-Extended
Allow: /
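One way to confirm that a robots.txt file really admits these agents is to run it through Python's standard `urllib.robotparser`. The `blocked_agents` helper and the example.com URL are hypothetical; the parser applies its own matching rules, which may differ in detail from any particular crawler's.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Applebot-Extended"]


def blocked_agents(robots_txt: str, url: str, agents: list[str] = AI_AGENTS) -> list[str]:
    """Return the agents that the given robots.txt text would stop from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Applebot-Extended
Allow: /
"""

# An empty list means none of the listed agents is blocked for this URL.
print(blocked_agents(robots_txt, "https://example.com/section/slug"))
```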

Layer 2: Parsing

Frontmatter Metadata Schema

Every content page should include structured frontmatter metadata:

---
# Identity
title: "Page Title"
description: "One-paragraph description"
slug: "url-slug"
section: "section-name"

# Knowledge
canonical_concept_id: "unique-concept-identifier"
content_type: "definition|guide|comparison|reference|tutorial|framework|checklist"
difficulty: "beginner|intermediate|advanced"
knowledge_graph_domains: ["domain1", "domain2"]

# Taxonomy
tags: ["tag1", "tag2", "tag3"]
related_articles:
  - section/slug
  - section/slug

# AI Readiness
ai_summary: "2-3 sentence summary optimized for AI extraction"
schema_type: "TechArticle|FAQPage|HowTo"

# Lifecycle
author: "Author Name"
date_published: "YYYY-MM-DD"
date_updated: "YYYY-MM-DD"
---
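The schema above can be checked mechanically once the frontmatter has been parsed into a dictionary (YAML parsing itself is left out to keep the sketch standard-library only). The `validate_frontmatter` helper is hypothetical, and the `REQUIRED` tuple is an assumption: the schema lists the fields but does not say which are mandatory.

```python
from datetime import date

CONTENT_TYPES = ("definition", "guide", "comparison", "reference", "tutorial", "framework", "checklist")
DIFFICULTIES = ("beginner", "intermediate", "advanced")
SCHEMA_TYPES = ("TechArticle", "FAQPage", "HowTo")
# Assumption: the source does not state which fields are mandatory; adjust to taste.
REQUIRED = ("title", "description", "slug", "section", "content_type", "ai_summary", "date_published")


def validate_frontmatter(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = [f"missing field: {name}" for name in REQUIRED if not meta.get(name)]
    enumerations = (
        ("content_type", CONTENT_TYPES),
        ("difficulty", DIFFICULTIES),
        ("schema_type", SCHEMA_TYPES),
    )
    for name, allowed in enumerations:
        value = meta.get(name)
        if value and value not in allowed:
            problems.append(f"invalid {name}: {value!r}")
    for name in ("date_published", "date_updated"):
        value = meta.get(name)
        if value:
            try:
                date.fromisoformat(str(value))
            except ValueError:
                problems.append(f"invalid {name}: {value!r} (expected YYYY-MM-DD)")
    return problems


# Hypothetical page metadata, for illustration only.
page = {
    "title": "What Is llms.txt?",
    "description": "A short definition of llms.txt.",
    "slug": "what-is-llms-txt",
    "section": "reference",
    "content_type": "definition",
    "difficulty": "beginner",
    "ai_summary": "llms.txt is a Markdown index of a site for AI agents.",
    "schema_type": "TechArticle",
    "date_published": "2025-01-01",
    "date_updated": "2025-04-01",
}
print(validate_frontmatter(page))
```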

Content Body Structure

Answer-First Pattern

The first section must directly answer the page's core question:

# [Title as Question or Topic]

[Direct answer in 1-2 sentences. Complete and self-contained.]

> **🤖 AI SUMMARY**
> [2-3 sentence expanded summary for AI extraction]

## [First Major Section]
...

Heading Hierarchy

H1: Page title (exactly one)
H2: Major sections
H3: Sub-sections within H2
H4: Rarely used; avoid deeper nesting
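The hierarchy rules lend themselves to an automated lint. The sketch below assumes pages are authored in Markdown with ATX (`#`) headings; `check_headings` is a hypothetical helper, and it ignores Setext headings and only skips backtick-fenced code, so treat it as a starting point.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")


def check_headings(markdown: str) -> list[str]:
    """Report hierarchy problems: H1 count, skipped levels, nesting deeper than H4."""
    levels = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not treat '#' comments in code as headings
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            levels.append(len(match.group(1)))

    problems = []
    if levels.count(1) != 1:
        problems.append(f"expected exactly one H1, found {levels.count(1)}")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"H{previous} is followed by H{current}, skipping a level")
    if any(level > 4 for level in levels):
        problems.append("headings deeper than H4 found")
    return problems


print(check_headings("# Title\n\nAnswer.\n\n## Section\n\n### Detail\n"))
```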

Extractable Content Patterns

Definition block:

[Term] is [complete definition in one sentence].
[Optional second sentence expanding on significance].

Comparison table:

| Dimension | Option A | Option B |
|-----------|----------|----------|
| Aspect 1  | Value    | Value    |
| Aspect 2  | Value    | Value    |

Step-by-step:

1. **Step name** — Description of what to do
2. **Step name** — Description of what to do

FAQ pair:

### [Question in natural language]?

[Direct answer. No preamble.]

JSON-LD Structured Data

Every page should include at least one JSON-LD block:

For articles:

{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Page Title",
  "description": "Page description",
  "author": {"@type": "Organization", "name": "Site Name"},
  "datePublished": "2025-01-01",
  "dateModified": "2025-04-01"
}
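If pages already carry the frontmatter described earlier, the article block can be derived from it rather than maintained by hand. This is a sketch under that assumption: `article_json_ld` and its `site_name` parameter are hypothetical, and the field mapping (for example `date_updated` to `dateModified`) is one plausible choice, not something the specification prescribes.

```python
import json


def article_json_ld(meta: dict, site_name: str) -> str:
    """Wrap a TechArticle JSON-LD object, built from frontmatter, in a script tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": meta["title"],
        "description": meta["description"],
        "author": {"@type": "Organization", "name": site_name},
        "datePublished": meta["date_published"],
        # Fall back to the publication date when the page has never been updated.
        "dateModified": meta.get("date_updated", meta["date_published"]),
    }
    # Escape "</" so a stray "</script>" inside a value cannot end the tag early;
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(
    article_json_ld(
        {
            "title": "Page Title",
            "description": "Page description",
            "date_published": "2025-01-01",
            "date_updated": "2025-04-01",
        },
        "Site Name",
    )
)
```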

For FAQ content:

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Question text",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Answer text"
      }
    }
  ]
}

Full reference: Structured Data for AI Search
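The FAQ pair pattern and the FAQPage block describe the same data, so one can be generated from the other. The sketch below assumes Markdown source where each question is an H3 ending in a question mark; `extract_faq_pairs` and `faq_json_ld` are hypothetical helpers, and the heading detection is deliberately naive (any line starting with `#` ends the current answer).

```python
import json


def extract_faq_pairs(markdown: str) -> list[tuple[str, str]]:
    """Collect (question, answer) pairs from '### Question?' headings and the text below them."""
    pairs = []
    question = None
    answer_lines = []
    for line in markdown.splitlines() + ["#"]:  # sentinel heading flushes the last pair
        if line.startswith("#"):
            if question and answer_lines:
                pairs.append((question, " ".join(answer_lines)))
            is_question = line.startswith("### ") and line.rstrip().endswith("?")
            question = line[4:].strip() if is_question else None
            answer_lines = []
        elif question and line.strip():
            answer_lines.append(line.strip())
    return pairs


def faq_json_ld(pairs: list[tuple[str, str]]) -> dict:
    """Build a schema.org FAQPage object from question and answer pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


source = "## FAQ\n\n### Is this official?\n\nNo. It is a community convention.\n"
print(json.dumps(faq_json_ld(extract_faq_pairs(source)), indent=2))
```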

Layer 3: Attribution

ai.txt

Define how AI agents should attribute your content:

# AI Agent Access Policy
User-agent: *
Allow: /
Attribution-required: yes
Source-name: Your Site Name
Source-url: https://yoursite.com

Full specification: ai.txt Reference
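To show how an agent might read such a policy, here is a minimal sketch that parses the `Key: value` lines shown above into a dictionary. The `parse_ai_txt` and `attribution_required` helpers are hypothetical, and the parsing rules (case-insensitive keys, `#` comment lines, one flat policy with no per-agent groups) are assumptions, since ai.txt has no settled grammar.

```python
def parse_ai_txt(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict with lower-cased keys; later lines win."""
    policy: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a key/value line
        policy[key.strip().lower()] = value.strip().strip('"')
    return policy


def attribution_required(policy: dict[str, str]) -> bool:
    """Assumption: only the literal value 'yes' (any case) switches attribution on."""
    return policy.get("attribution-required", "").lower() == "yes"


ai_txt = """\
# AI Agent Access Policy
User-agent: *
Allow: /
Attribution-required: yes
Source-name: Your Site Name
Source-url: https://yoursite.com
"""

policy = parse_ai_txt(ai_txt)
print(policy["source-name"], policy["source-url"], attribution_required(policy))
```

Splitting on the first colon only keeps URL values such as `https://yoursite.com` intact.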

Canonical URLs

Every page must have exactly one canonical URL:

<link rel="canonical" href="https://yoursite.com/section/slug" />

AI agents should use this URL for attribution regardless of how they discovered the page.
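The "exactly one canonical URL" rule can be verified against rendered HTML with the standard-library parser. In this sketch, `CanonicalFinder` and `canonical_url` are hypothetical names; returning `None` when a page declares no canonical link, or several conflicting ones, is a design choice made here so that both failure modes surface the same way.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def canonical_url(html: str) -> Optional[str]:
    """Return the page's canonical URL, or None if it has none or conflicting ones."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(set(finder.canonicals)) != 1:
        return None
    return finder.canonicals[0]


page = '<head><link rel="canonical" href="https://yoursite.com/section/slug" /></head>'
print(canonical_url(page))
```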

Citation Format Preference

Include preferred citation format in ai.txt:

Citation-format: "[Title] - [Source-name] ([Source-url]/[path])"
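As an illustration of what an agent honoring this preference might emit, the sketch below fills in the title, source name, and URL. The `format_citation` helper and its parameters are hypothetical; how a real agent interprets the placeholder syntax is not defined by any standard.

```python
from urllib.parse import urljoin


def format_citation(title: str, source_name: str, source_url: str, path: str) -> str:
    """Render 'Title - Source name (source-url/path)' from its parts."""
    # Normalize slashes so the base URL and path join with exactly one "/".
    link = urljoin(source_url.rstrip("/") + "/", path.lstrip("/"))
    return f"{title} - {source_name} ({link})"


print(format_citation("Page Title", "Your Site Name", "https://yoursite.com", "/section/slug"))
```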

Compliance Checklist

Use this checklist to verify AI agent readiness:

Discovery

[ ] /llms.txt exists and is current
[ ] sitemap.xml includes all content pages
[ ] robots.txt allows AI crawlers
[ ] /ai.txt defines access policy

Parsing

[ ] Every page has frontmatter with required fields
[ ] Every page has answer-first opening
[ ] Every page has AI summary block
[ ] Every page has JSON-LD structured data
[ ] Heading hierarchy is semantic (H1 → H2 → H3)
[ ] Tables use proper HTML markup

Attribution

[ ] Canonical URLs are set on all pages
[ ] ai.txt specifies attribution requirements
[ ] Author metadata is included
[ ] Publication and update dates are accurate

FAQ

Is this specification an official standard?

No. This is a practical specification based on observable AI system behavior and emerging community standards. As AI content consumption matures, formal standards may emerge. This spec is designed to work with current AI systems.

Do all AI agents follow these conventions?

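Not uniformly. llms.txt and ai.txt are emerging conventions with growing but uneven adoption, and no agent is obliged to honor them. JSON-LD and semantic HTML are well-established web standards that crawlers and parsers widely support. Frontmatter metadata is primarily used by content management systems and static site generators, so agents typically see it only when it is published, for example in raw Markdown sources or when its values are carried into meta tags and JSON-LD.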

How often should I update my compliance?

Review quarterly. AI agent capabilities evolve rapidly, and new conventions emerge regularly. The core HTML and schema standards are stable, but discovery and attribution mechanisms may change.