.well-known/ai-summary Endpoint Spec for AI Discovery
The /.well-known/ai-summary endpoint is a single deterministic URL on a domain that returns a JSON document describing the site's canonical AI summary, content inventory, citation policy, and crawl preferences. It builds on RFC 8615 well-known URIs, complements llms.txt, and uses Schema.org Dataset vocabulary so AI crawlers can discover and cite content efficiently.
TL;DR
AI crawlers and answer engines have no canonical way to discover what a site wants them to know. robots.txt is too coarse, llms.txt is markdown-only and unstable, sitemaps are URL-only, and Schema.org markup is scattered across pages. This spec proposes a single endpoint at /.well-known/ai-summary that returns a JSON document with: a canonical site summary, the list of citable knowledge entries with stable URLs, a citation policy (attribution required, license, freshness), and crawl preferences for AI agents. The endpoint reuses RFC 8615 well-known URI conventions, links out to existing llms.txt and sitemap.xml, and emits valid Schema.org Dataset / DataCatalog so it is consumable by both bespoke AI crawlers and conventional structured-data extractors.
Definition
/.well-known/ai-summary is a proposed well-known URI suffix (per RFC 8615) that, when appended to an origin (https://example.com/.well-known/ai-summary), returns a JSON document describing the origin's AI-discovery surface.
The endpoint is discovery-oriented, not delivery-oriented. It does not return the full content of the site; it returns a manifest that points to where citable content lives, what it is about, and under what terms it can be cited. The full content remains at its existing URLs and continues to be served with whatever schema, llms.txt, and sitemap layers the site already publishes. Think of /.well-known/ai-summary as the index for AI crawlers in the same way sitemap.xml is the index for traditional search crawlers.
The spec is intentionally narrow. It does not replace llms.txt, Schema.org markup, sitemaps, or robots.txt. It complements them by giving AI agents a single deterministic location to start from.
Why this matters
AI crawlers behave differently from traditional search crawlers. They tend to follow a small number of seeded URLs, retrieve a few pages per session, and rely heavily on summary signals to decide what to ingest more deeply. Three problems result.
Discovery cost. Without a manifest, AI crawlers either over-fetch (wasting bandwidth on both sides) or under-fetch (missing the canonical content the publisher would want cited). A single endpoint that lists the citable knowledge entries reduces both errors.
Citation accuracy. When AI assistants cite a site, they often pick whichever URL was retrieved during the session, not the canonical reference URL. Publishing a canonical summary and a stable knowledge inventory at a deterministic location lets AI agents prefer canonical URLs over arbitrary intermediate ones.
Policy expression. Publishers increasingly want to express AI-specific policy: attribution requirements, content license, freshness expectations, and per-agent crawl preferences. There is no good place to express this today. A .well-known/ai-summary endpoint is the natural home.
Specification
URI and method
The endpoint MUST be available at /.well-known/ai-summary on the origin and MUST respond to GET with 200 OK. It SHOULD respond to HEAD with the same headers as GET minus the body. The endpoint MUST NOT require authentication.
Content type
The response MUST have Content-Type: application/json; charset=utf-8. Implementations MAY also offer application/ld+json as an alternative representation via content negotiation.
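For concreteness, here is a minimal sketch of serving the endpoint from a fetch-style edge runtime (Cloudflare Workers, Deno, Bun, and similar). The manifest object is a placeholder for whatever your build pipeline produces; nothing below is normative beyond the path, methods, and Content-Type.

// A minimal sketch of serving the endpoint from a fetch-style edge
// runtime. The `manifest` object is a placeholder.
const manifest = {
  "@context": "https://schema.org",
  "@type": "DataCatalog",
  // ...remaining REQUIRED keys described in the next section
};

export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname !== "/.well-known/ai-summary") {
      return new Response("Not Found", { status: 404 });
    }
    // No authentication, per the spec.
    const headers = { "Content-Type": "application/json; charset=utf-8" };
    if (request.method === "HEAD") {
      // Same headers as GET, minus the body.
      return new Response(null, { status: 200, headers });
    }
    if (request.method !== "GET") {
      return new Response("Method Not Allowed", { status: 405 });
    }
    return new Response(JSON.stringify(manifest), { status: 200, headers });
  },
};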
Response document
The response body is a JSON object with the following top-level keys; a TypeScript sketch of the full shape follows the list.
- @context (REQUIRED) — MUST be "https://schema.org" so the document validates as Schema.org JSON-LD.
- @type (REQUIRED) — MUST be "DataCatalog".
- name (REQUIRED) — the site's canonical name.
- url (REQUIRED) — the canonical site URL.
- description (REQUIRED) — a 1-3 sentence canonical summary that AI agents may quote verbatim.
- dateModified (REQUIRED) — ISO-8601 timestamp of the manifest's last update.
- publisher (REQUIRED) — a Schema.org Organization object with at minimum @type, name, and url.
- dataset (REQUIRED) — an array of Dataset objects, one per citable knowledge entry. Each Dataset SHOULD include @type, name, url (the canonical page URL), description, dateModified, keywords, and license.
- aiCrawlPolicy (REQUIRED) — an object describing crawl preferences (see below).
- citationPolicy (REQUIRED) — an object describing citation requirements (see below).
- links (RECOMMENDED) — an object linking to companion files: llmsTxt, sitemap, robotsTxt, humansTxt.
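The shape above can be captured in types. The following TypeScript sketch is illustrative, not normative; it front-loads the two policy objects defined in the next sections so the interfaces are self-contained.

// Illustrative types for the manifest shape described above.
type PolicyValue = "allow" | "disallow" | "summarize-only" | "attribution-required";

interface PathRule {
  path: string;          // e.g. "/private/*"
  policy: PolicyValue;
  comment?: string;
}

interface AiCrawlPolicy {
  defaultPolicy?: PolicyValue;
  agents?: Record<string, { policy: PolicyValue; rateLimit?: string; paths?: PathRule[] }>;
  paths?: PathRule[];
}

interface CitationPolicy {
  attribution?: "required" | "recommended" | "not-required";
  license?: string;               // SPDX identifier or URL
  canonicalUrlRequired?: boolean;
  quoteLength?: number | "unlimited";
  freshness?: string;             // ISO-8601 duration, e.g. "P7D"
}

interface DatasetEntry {
  "@type": "Dataset";
  name: string;
  url: string;                    // canonical page URL
  description?: string;
  dateModified?: string;
  keywords?: string[];
  license?: string;
}

interface AiSummaryManifest {
  "@context": "https://schema.org";
  "@type": "DataCatalog";
  name: string;
  url: string;
  description: string;            // 1-3 sentence canonical summary
  dateModified: string;           // ISO-8601 timestamp
  publisher: { "@type": "Organization"; name: string; url: string };
  dataset: DatasetEntry[];
  aiCrawlPolicy: AiCrawlPolicy;
  citationPolicy: CitationPolicy;
  links?: { llmsTxt?: string; sitemap?: string; robotsTxt?: string; humansTxt?: string };
  version?: string;               // default "1.0"
}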
aiCrawlPolicy object
The aiCrawlPolicy object SHOULD include:
- defaultPolicy — one of allow, disallow, summarize-only, attribution-required.
- agents — a map from user-agent token to a per-agent policy object (policy, rateLimit, paths).
- paths — an array of path-level overrides, each with path, policy, and optional comment.
The spec does not enforce policy; it expresses it. AI agents are expected to honor aiCrawlPolicy in the same way they honor robots.txt, with the understanding that compliance is voluntary.
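To make the intended semantics concrete, here is a consumer-side sketch of resolving the effective policy for one request, reusing the types from the sketch above. The spec does not define path-matching or precedence rules, so the "*" suffix glob and the override order below (per-agent paths, then site-wide paths, then per-agent policy, then site default) are assumptions, not normative behavior.

// Consumer-side sketch: resolve the effective policy for one request.
// ASSUMPTION: glob matching and precedence order are illustrative only.
function matchesPath(pattern: string, path: string): boolean {
  return pattern.endsWith("*")
    ? path.startsWith(pattern.slice(0, -1))
    : path === pattern;
}

function resolvePolicy(crawl: AiCrawlPolicy, agentToken: string, path: string): PolicyValue {
  const agent = crawl.agents?.[agentToken];
  for (const rule of agent?.paths ?? []) {
    if (matchesPath(rule.path, path)) return rule.policy;
  }
  for (const rule of crawl.paths ?? []) {
    if (matchesPath(rule.path, path)) return rule.policy;
  }
  return agent?.policy ?? crawl.defaultPolicy ?? "allow";
}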
citationPolicy object
The citationPolicy object SHOULD include the following (a consumer-side sketch follows the list):
- attribution — one of required, recommended, not-required.
- license — a SPDX identifier or a URL to the site's license.
- canonicalUrlRequired — boolean; if true, citing AI agents should reference dataset[].url rather than retrieved intermediate URLs.
- quoteLength — maximum quoted character count per citation, or "unlimited".
- freshness — expected refresh interval, expressed as an ISO-8601 duration (e.g., P7D).
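The sketch below applies these fields when emitting a citation, again reusing the types above. The Citation shape is hypothetical, and a real agent would likely remap a non-canonical URL to the matching dataset[].url rather than reject it outright.

// Consumer-side sketch: apply citationPolicy before emitting a citation.
interface Citation {
  url: string;
  quote: string;
  attributionText?: string;
}

function applyCitationPolicy(
  policy: CitationPolicy,
  citation: Citation,
  canonicalUrls: Set<string>, // built from dataset[].url
): Citation {
  const out = { ...citation };
  if (policy.canonicalUrlRequired && !canonicalUrls.has(out.url)) {
    throw new Error(`non-canonical citation URL: ${out.url}`);
  }
  if (typeof policy.quoteLength === "number" && out.quote.length > policy.quoteLength) {
    out.quote = out.quote.slice(0, policy.quoteLength); // cap quoted text
  }
  if (policy.attribution === "required" && !out.attributionText) {
    throw new Error("publisher requires attribution");
  }
  return out;
}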
Caching
The endpoint SHOULD be cacheable via standard HTTP caching headers (Cache-Control, ETag). A Cache-Control: max-age=3600 is a reasonable default. Publishers whose manifests change frequently SHOULD lower max-age accordingly.
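A sketch of this caching behavior in the same fetch-style runtime as earlier; the SHA-256 ETag via Web Crypto is one reasonable choice, not a mandate.

// Caching sketch: strong ETag plus Cache-Control, with conditional GET.
async function cachedManifestResponse(request: Request, body: string): Promise<Response> {
  const bytes = new TextEncoder().encode(body);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const etag = `"${[...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("")}"`;
  const headers = {
    "Content-Type": "application/json; charset=utf-8",
    "Cache-Control": "max-age=3600", // lower this if the manifest churns
    "ETag": etag,
  };
  // Conditional GET: the client already holds the current version.
  if (request.headers.get("If-None-Match") === etag) {
    return new Response(null, { status: 304, headers });
  }
  return new Response(body, { status: 200, headers });
}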
Validation
A compliant manifest MUST validate as Schema.org JSON-LD using the Schema.org Validator and MUST be valid JSON per RFC 8259. Implementations SHOULD also surface a version field at the top level for forward compatibility (default: "1.0").
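A deploy-time lint along these lines catches most breakage early. The checks below mirror the REQUIRED keys from the response-document section; they are a convenience, not a substitute for running the Schema.org Validator against the deployed URL.

// Deploy-time lint sketch: well-formed JSON plus REQUIRED top-level keys.
const REQUIRED_KEYS = [
  "@context", "@type", "name", "url", "description",
  "dateModified", "publisher", "dataset", "aiCrawlPolicy", "citationPolicy",
];

function lintManifest(raw: string): string[] {
  let doc: unknown;
  try {
    doc = JSON.parse(raw); // RFC 8259 well-formedness
  } catch (e) {
    return [`not valid JSON: ${e}`];
  }
  if (typeof doc !== "object" || doc === null || Array.isArray(doc)) {
    return ["manifest must be a JSON object"];
  }
  const obj = doc as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of REQUIRED_KEYS) {
    if (!(key in obj)) errors.push(`missing REQUIRED key: ${key}`);
  }
  if (obj["@context"] !== "https://schema.org") errors.push('@context must be "https://schema.org"');
  if (obj["@type"] !== "DataCatalog") errors.push('@type must be "DataCatalog"');
  if (!Array.isArray(obj["dataset"])) errors.push("dataset must be an array");
  return errors;
}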
Example response (abridged)
{
  "@context": "https://schema.org",
  "@type": "DataCatalog",
  "name": "Example Docs",
  "url": "https://example.com/",
  "description": "Example Docs is the official documentation site for the Example platform, covering installation, configuration, and reference for Example versions 1.x through 4.x.",
  "dateModified": "2026-05-04T00:00:00Z",
  "publisher": {
    "@type": "Organization",
    "name": "Example, Inc.",
    "url": "https://example.com/"
  },
  "dataset": [
    {
      "@type": "Dataset",
      "name": "Installation Guide",
      "url": "https://example.com/docs/install",
      "description": "Canonical installation instructions for Example on macOS, Linux, and Windows.",
      "dateModified": "2026-05-01T00:00:00Z",
      "keywords": ["install", "setup", "getting-started"],
      "license": "CC-BY-4.0"
    }
  ],
  "aiCrawlPolicy": {
    "defaultPolicy": "attribution-required",
    "agents": {
      "GPTBot": { "policy": "allow", "rateLimit": "60/min" }
    },
    "paths": [
      { "path": "/private/*", "policy": "disallow" }
    ]
  },
  "citationPolicy": {
    "attribution": "required",
    "license": "CC-BY-4.0",
    "canonicalUrlRequired": true,
    "quoteLength": 500,
    "freshness": "P7D"
  },
  "links": {
    "llmsTxt": "https://example.com/llms.txt",
    "sitemap": "https://example.com/sitemap.xml",
    "robotsTxt": "https://example.com/robots.txt"
  },
  "version": "1.0"
}
Practical application
An implementation rollout for an existing site:
- Generate the manifest from the same source of truth that produces the sitemap. Each indexable canonical page becomes one Dataset entry (see the build-step sketch after this list).
- Author the canonical summary (description at the top level) by hand. This is the citation-quotable text that defines the site; it is worth the time to get it right.
- Wire crawl and citation policy to whatever your existing legal and SEO teams have agreed to. Use attribution-required as a sensible default for documentation and editorial sites.
- Serve the endpoint with normal HTTP caching. A small worker, edge function, or static file refreshed on deploy is sufficient; the endpoint does not need to be dynamic.
- Cross-link from llms.txt so existing AI agents that hit llms.txt first can find the richer manifest.
- Validate with the Schema.org Validator on every deploy. A broken manifest is worse than no manifest.
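As flagged in the first step, here is a build-step sketch that derives the manifest from the same page records that feed the sitemap, reusing AiSummaryManifest from the types sketch above. The Page shape and the hard-coded values (site name, license, URLs) are illustrative assumptions.

// Build-step sketch: map sitemap-source page records to the manifest.
interface Page {
  title: string;
  url: string;
  summary: string;
  updatedAt: string;  // ISO-8601
  keywords: string[];
  citable: boolean;   // editorial flag: dataset[] entry, not just a sitemap URL
}

function buildManifest(pages: Page[], now: string): AiSummaryManifest {
  return {
    "@context": "https://schema.org",
    "@type": "DataCatalog",
    name: "Example Docs",
    url: "https://example.com/",
    description: "Hand-authored canonical summary goes here.",
    dateModified: now,
    publisher: { "@type": "Organization", name: "Example, Inc.", url: "https://example.com/" },
    dataset: pages.filter((p) => p.citable).map((p) => ({
      "@type": "Dataset" as const,
      name: p.title,
      url: p.url,
      description: p.summary,
      dateModified: p.updatedAt,
      keywords: p.keywords,
      license: "CC-BY-4.0",
    })),
    aiCrawlPolicy: { defaultPolicy: "attribution-required" },
    citationPolicy: { attribution: "required", canonicalUrlRequired: true },
    links: { llmsTxt: "https://example.com/llms.txt", sitemap: "https://example.com/sitemap.xml" },
    version: "1.0",
  };
}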
Common mistakes
Treating it as a replacement for llms.txt. It complements llms.txt; both should exist and link to each other.
Listing every URL on the site. The dataset array is for citable knowledge entries, not for the full sitemap. Use links.sitemap for the full URL inventory.
Skipping the canonical summary. The top-level description is the highest-leverage field; AI assistants will quote it directly.
Not validating as Schema.org JSON-LD. A manifest that does not validate is silently ignored by structured-data extractors.
Stale dateModified. AI agents use dateModified to decide whether to re-fetch. A manifest that lies about freshness gets cached longer than intended.
FAQ
Q: How is this different from llms.txt?
llms.txt is a markdown file that summarizes the site for LLMs. /.well-known/ai-summary is a JSON-LD endpoint that adds machine-validatable structure, per-dataset entries, and crawl/citation policy. They complement each other: llms.txt is the human-readable summary, ai-summary is the machine-readable manifest.
Q: Why use the .well-known prefix?
RFC 8615 defines .well-known as the standard location for site-wide metadata served at deterministic paths. Using it is the conventional way to introduce new endpoints without colliding with site-specific URL space.
Q: Do AI crawlers actually fetch this endpoint today?
Adoption is partial and growing. Early adopters and AI agents that follow ecosystem conventions (llms.txt, well-known URIs) will fetch it; older crawlers will not. The cost of publishing the endpoint is small enough that it is worth doing in advance of universal support.
Q: Does it replace robots.txt?
No. robots.txt remains the authoritative directive for traditional search crawlers and is referenced by aiCrawlPolicy. /.well-known/ai-summary adds AI-specific signals on top.
Q: What if my site changes too often to keep dataset[] current?
Keep the top-level summary, policy, and links current; let the dataset array be regenerated from the same build pipeline that produces your sitemap. Most sites can do this on every deploy with no manual work.
Q: Should I sign the manifest?
For most sites, no. For high-stakes deployments (financial, medical, government) where AI agents must verify provenance, an optional signature field with a JWS over the manifest body is a reasonable extension. The base spec leaves signing out of scope.
Q: Does this work for SPAs and JavaScript-rendered sites?
Yes, because the endpoint is server-rendered JSON, independent of the rest of the site's rendering strategy. SPAs especially benefit because their sitemap and per-page schema may be incomplete; the manifest gives AI agents a reliable index regardless.