Auth-Gated Content Disclosure Specification for AI Crawlers
Auth-gated content disclosure is a layered specification using schema.org isAccessibleForFree+hasPart, dedicated summary endpoints, and llms.txt gated annotations to let AI crawlers cite paywalled or login-protected content without exposing the full body. The pattern enables citation visibility while preserving subscription value.
TL;DR
Auth-gated content (paywalls, login-walls, enterprise SSO) is invisible to AI crawlers by default, which means it cannot be cited. The standard disclosure stack pairs three layers: schema.org CreativeWork markup with isAccessibleForFree: false and a hasPart block describing free vs gated sections; a public summary endpoint (for example /api/summary/{article-id}) that returns a short factual abstract of the gated body; and llms.txt gated annotations that point AI crawlers at the summary URL rather than the gated article.
Why a disclosure specification
Publishers, B2B SaaS docs, and enterprise knowledge bases all face the same dilemma: make content available enough for AI engines to cite it, but not so available that subscription value evaporates. Three trends make a formal spec necessary now:
- AI Overviews and ChatGPT/Perplexity citations increasingly drive qualified clicks for publishers that participate, and outright invisibility for those that do not.
- AI crawlers do not all log in. They will follow public schema, public summary URLs, and public llms.txt entries; they will not bypass auth.
- Misimplementation creates real harm. A naive paywall hides the page from indexes and Overviews; an over-eager structured-data dump leaks the full article to anyone reading the JSON-LD (SEO For Google News, Best Practices for Paywalls).
Google has published structured-data guidance for paywalled content (Google Search Central, Subscription and Paywalled Content), but no comparable reference exists for the broader AI-crawler ecosystem (GPTBot, ClaudeBot, PerplexityBot, Google-Extended). This specification fills that gap.
Three access models
Disclosure strategy must match access model. The three patterns publishers actually run:
| Model | Access requirement | Typical use case | Disclosure strategy |
|---|---|---|---|
| Paywall | Subscription or one-off payment | News publishers, premium analysis, courses | isAccessibleForFree: false + hasPart + summary endpoint + lead-in/metering optional |
| Login-wall | Free account required | Community sites, freemium SaaS docs, gated lead-gen | isAccessibleForFree: false + summary endpoint; llms.txt may safely include free-tier docs |
| Enterprise SSO | Workforce identity (SAML, OIDC, SCIM) | Internal KBs, customer-only docs, partner portals | No public disclosure of body; only metadata + tenant-scoped summary endpoint behind allowlist |
The enterprise SSO case is the most often misimplemented. Teams either expose full content via robots.txt-allowed staging URLs (training-set leak risk) or block AI crawlers entirely (zero citation visibility). The right answer is metadata-only disclosure plus tenant-scoped summaries.
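The three models and their strategies can be captured as a small configuration lookup in code. A sketch (the field names and string values are our own shorthand for the table above, not part of any spec):

```python
# Disclosure strategy per access model, mirroring the table above.
# Keys and values are illustrative shorthand, not a standard vocabulary.
DISCLOSURE = {
    "paywall": {
        "schema": {"isAccessibleForFree": False, "hasPart": True},
        "summary_endpoint": "public",
        "llms_txt": "summary-url",
    },
    "login-wall": {
        "schema": {"isAccessibleForFree": False, "hasPart": True},
        "summary_endpoint": "public",
        "llms_txt": "summary-or-article-url",
    },
    "enterprise-sso": {
        "schema": "metadata-only",  # never serialize the body into JSON-LD
        "summary_endpoint": "tenant-scoped-allowlisted",
        "llms_txt": None,           # private docs stay out of llms.txt
    },
}

def strategy_for(model: str) -> dict:
    """Return the disclosure strategy for a given access model."""
    try:
        return DISCLOSURE[model]
    except KeyError:
        raise ValueError(f"unknown access model: {model}")
```

Driving publish-time checks from one table like this keeps the three models from drifting apart as the site evolves.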
Pattern 1: schema.org isAccessibleForFree and hasPart
Schema.org defines isAccessibleForFree as a boolean indicating whether a CreativeWork (or its hasPart subset) is freely available. Google's paywall guidance specifies a precise pattern: mark the parent CreativeWork with isAccessibleForFree: false, then enumerate hasPart blocks for free and gated sections so crawlers can extract the free portion without the gated portion.
Minimum compliant JSON-LD:
```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "@id": "https://example.com/article/q1-earnings-2026",
  "headline": "Q1 Earnings Analysis",
  "datePublished": "2026-04-30T09:00:00-04:00",
  "isAccessibleForFree": false,
  "hasPart": [
    {
      "@type": "WebPageElement",
      "isAccessibleForFree": true,
      "cssSelector": ".article-summary"
    },
    {
      "@type": "WebPageElement",
      "isAccessibleForFree": false,
      "cssSelector": ".paywall"
    }
  ]
}
```

Critical correctness rules:
- cssSelector (or xpath) must match exactly one parent element wrapping the gated section.
- articleBody must NOT contain the full body when isAccessibleForFree: false is set on the parent. Doing so violates Google's anti-cloaking guidance and creates a structured-data leak (SE Roundtable analysis).
- The free section must be substantive (>50 words) and self-contained, not a teaser stub.
- Validate with the Schema.org validator and the Rich Results Test before deploying.
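These rules are mechanically checkable before deploy. A minimal lint sketch in Python (the error strings are ours; the checks mirror the rules above):

```python
def lint_paywall_jsonld(doc: dict) -> list[str]:
    """Return a list of violations of the paywall markup rules."""
    errors = []
    if doc.get("isAccessibleForFree") is False:
        # Rule: never serialize the gated body into structured data.
        if "articleBody" in doc:
            errors.append("articleBody present on a gated CreativeWork")
        parts = doc.get("hasPart", [])
        if not parts:
            errors.append("isAccessibleForFree:false without hasPart sections")
        for part in parts:
            # Rule: every section needs a selector a crawler can resolve.
            if not (part.get("cssSelector") or part.get("xpath")):
                errors.append("hasPart entry missing cssSelector/xpath")
        # Rule: at least one substantive free section must be declared.
        if not any(p.get("isAccessibleForFree") is True for p in parts):
            errors.append("no free hasPart section declared")
    return errors
```

Run it in CI against the JSON-LD of every gated article; a non-empty result blocks the deploy.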
Pattern 2: Summary endpoint design
A summary endpoint is a public URL that returns a short factual abstract of the gated content. AI crawlers can fetch and cite it; the gated body is never exposed. This pattern is what makes citation possible for purely-gated content (no lead-in, no metering).
Endpoint contract
- URL shape. Predictable per article: /api/summary/
or / / /summary. - Response. JSON or HTML containing: title, canonical URL of the full article, publication date, author, 60-120 word factual summary, key entities, optional bullet list of facts.
- Status. 200 OK for available; 404 if the article does not exist; 410 Gone if removed. Never 401/403 — the summary is intentionally public.
- Caching. Send Cache-Control: public, max-age=3600 plus a strong ETag derived from summary content. See HTTP Cache Headers for AI Crawlers for the full recipe.
- Linking. The full-article HTML should reference the summary endpoint via a `<link rel="alternate">` element in the document head.
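The contract's status and caching logic can be sketched framework-free, deriving the strong ETag from a hash of the serialized summary (handle_get is a hypothetical handler, not a library API):

```python
import hashlib
import json

def build_summary_response(summary: dict):
    """Serialize a summary payload with the caching headers the
    contract calls for: public cache plus a content-derived ETag."""
    body = json.dumps(summary, sort_keys=True).encode("utf-8")
    # Strong ETag from the content itself, so it changes exactly
    # when the summary changes.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "Content-Type": "application/json",
        "Cache-Control": "public, max-age=3600",
        "ETag": etag,
    }
    return body, headers

def handle_get(summary, if_none_match=None):
    """Status logic per the contract: 200 if available, 304 on an
    ETag match, 404 if the article does not exist. Never 401/403."""
    if summary is None:
        return 404, b"", {}
    body, headers = build_summary_response(summary)
    if if_none_match == headers["ETag"]:
        return 304, b"", {"ETag": headers["ETag"]}
    return 200, body, headers
```

A 410 Gone branch for removed articles would hang off the same lookup; it is omitted here for brevity.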
Summary content rules
The summary is what gets cited. Treat it as the canonical AI-facing answer:
- Lead with a one-sentence factual claim (extractable as a featured-snippet answer).
- Add 3-5 sentences of supporting detail (entities, numbers, dates).
- Include a clear call-out of what the full article adds ("the full analysis details methodology, vendor selection, and Q3 forecasts").
- Never include the conclusions or unique numbers that drive subscription value. The summary establishes the existence of the analysis; the article delivers the analysis.
- Keep length 60-180 words. Shorter is uncitable; longer dilutes.
Sample summary response
```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Q1 2026 Earnings Analysis: Tech Sector Margins",
  "abstract": "Across 18 large-cap technology companies, Q1 2026 operating margins averaged 23.4%, down 1.8 percentage points from Q4 2025. Cloud infrastructure providers retained margin while consumer hardware compressed. Full analysis (subscriber) covers per-company drivers, FX exposure, and forward guidance.",
  "url": "https://example.com/article/q1-earnings-2026",
  "datePublished": "2026-04-30T09:00:00-04:00",
  "author": {"@type": "Person", "name": "Jane Analyst"},
  "isAccessibleForFree": true,
  "isPartOf": {
    "@type": "Article",
    "@id": "https://example.com/article/q1-earnings-2026",
    "isAccessibleForFree": false
  }
}
```

Pattern 3: llms.txt with gated annotations
llms.txt, proposed by Jeremy Howard in September 2024, is a Markdown-format file at the site root that lists URLs LLMs should consult. The base spec does not formalize gated content; the community convention below is increasingly common.
```markdown
# Example Inc.

> Premium business analysis. Subscriber-only articles disclose summaries publicly.

## Free reference

- [Methodology](https://example.com/methodology): How we score companies
- [Glossary](https://example.com/glossary): Terms used in analysis

## Gated (summary-only)

- [Q1 2026 Earnings Summary](https://example.com/api/summary/q1-earnings-2026): Subscriber article — public summary endpoint
- [Cloud Vendor Margin 2026 Summary](https://example.com/api/summary/cloud-vendor-margin-2026): Subscriber article — public summary endpoint
```
Key conventions:
- The Gated (summary-only) section explicitly links the summary endpoint, never the gated full URL.
- Each entry's description names the gating model ("Subscriber", "Login required", "Customer-only").
- A separate llms-full.txt MAY include the full free-tier text but MUST NOT include gated bodies.
- Validate that robots.txt allows AI crawlers to read both llms.txt and the linked summary URLs.
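The last convention can be verified with the standard library's robots.txt parser. A sketch, checking each declared AI bot against your own robots.txt text:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def check_ai_access(robots_txt: str, paths: list) -> list:
    """Return human-readable descriptions of every (bot, path)
    combination the robots.txt would block."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    blocked = []
    for bot in AI_BOTS:
        for path in paths:
            if not rp.can_fetch(bot, path):
                blocked.append(f"{bot} blocked on {path}")
    return blocked
```

Run it against the live robots.txt with /llms.txt plus every summary URL it links; any non-empty result means the disclosure chain is broken at discovery time.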
Google flexible sampling vs full disclosure
Google's flexible sampling guidance defines two models that interact with disclosure:
- Metering. Allow N free articles per user per period; show paywall after the limit.
- Lead-in. Show the first ~10-30% of every article publicly; gate the rest.
Disclosure strategy by sampling model:
| Sampling model | Schema disclosure | Summary endpoint | llms.txt |
|---|---|---|---|
| Pure paywall (no sampling) | Required — isAccessibleForFree: false + hasPart | Strongly recommended; primary citation surface | List summary endpoint |
| Lead-in | Required — mark gated section in hasPart | Optional; lead-in often sufficient | List article URL |
| Metering | Required on the parent CreativeWork | Optional; AI bots are not signed-in users so they always hit the meter limit | List article URL |
| Hybrid | Required, with both patterns marked | Recommended for highest-value articles | List both |
Google's own warning: restricting sampled content reduces what crawlers can evaluate, and even modest reductions can degrade ranking. Lower bound: do not drop the lead-in below a substantive paragraph (~150-300 words).
Partial-content fingerprints
Once a summary endpoint is public, you need a way to detect if the full body has leaked into AI training corpora or live retrieval responses. The pattern: embed semantic fingerprints in the gated body and probe AI engines for them.
- Embed 3-5 stable, distinctive phrases per article in the gated body. Examples: branded turn-of-phrase, named framework, specific stat with unusual unit.
- Maintain a registry mapping article ID → fingerprints.
- Run weekly probes: ask each major AI engine for the article's fingerprints ("Is the phrase '...' associated with a 2026 earnings analysis?").
- If the full body's fingerprints appear in answers without authentication, you have a leak. Investigate: is isAccessibleForFree: false actually serving the gated body in articleBody? Is robots.txt allowing GPTBot/ClaudeBot to read the full HTML?
This is a detective control, not a preventive one. Combine with rigorous server-side enforcement.
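The registry and the match step can be sketched as follows (the phrases below are invented examples; the probing of each engine is API-specific and omitted):

```python
# Fingerprint registry: article ID -> distinctive phrases that appear
# only in the gated body. These phrases are invented examples.
REGISTRY = {
    "q1-earnings-2026": [
        "margin compression waterfall",
        "23.4% blended operating margin",
        "the FX headwind ledger",
    ],
}

def leaked_fingerprints(article_id: str, engine_answer: str) -> list:
    """Return registry phrases appearing verbatim in an AI answer.

    Any hit from an unauthenticated probe means the gated body has
    leaked into training data or live retrieval.
    """
    answer = engine_answer.lower()
    return [p for p in REGISTRY.get(article_id, []) if p.lower() in answer]
```

The weekly job sends each phrase to each major engine and routes the answers through this function, alerting on any non-empty result.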
Audit logging requirements
Disclosure compliance requires being able to prove what AI crawlers saw. Minimum log fields per AI bot request:
- Timestamp (ISO 8601)
- User-agent (full string)
- Source IP and reverse-DNS / verified-bot status
- Requested URL
- Response status code
- Response bytes returned
- Whether the response included gated body content (boolean derived from a content gate at egress)
Retain logs ≥ 90 days for forensics. For high-value content, ship logs to immutable storage (object lock, WORM) so they cannot be tampered with before a subsequent investigation.
Verifying bot identity is critical. Major AI vendors publish IP allowlists (OpenAI, Anthropic, Perplexity); treat any User-Agent string as untrusted unless the source IP appears in the vendor's published list.
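A sketch of the verified-bot check plus the minimum log record (the CIDR ranges below are placeholders; real ranges come from each vendor's published allowlist and must be refreshed regularly):

```python
import ipaddress

# Placeholder ranges for illustration only; fetch the real allowlists
# from each vendor's published endpoint.
BOT_ALLOWLISTS = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}

def is_verified_bot(user_agent: str, source_ip: str) -> bool:
    """Trust a User-Agent claim only when the source IP falls inside
    the claimed vendor's published range."""
    for bot, cidrs in BOT_ALLOWLISTS.items():
        if bot in user_agent:
            ip = ipaddress.ip_address(source_ip)
            return any(ip in ipaddress.ip_network(c) for c in cidrs)
    return False  # unknown bot name: never trusted

def audit_record(ts, ua, ip, url, status, nbytes, gated_body_served):
    """One log record with the minimum fields listed above."""
    return {
        "timestamp": ts,            # ISO 8601
        "user_agent": ua,
        "source_ip": ip,
        "verified_bot": is_verified_bot(ua, ip),
        "url": url,
        "status": status,
        "bytes": nbytes,
        "gated_body_served": gated_body_served,
    }
```

Emitting `verified_bot` at log time, rather than reconstructing it later, keeps the forensic record self-contained even after allowlists change.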
Comparison: paywall vs login-wall vs enterprise SSO
| Dimension | Paywall | Login-wall | Enterprise SSO |
|---|---|---|---|
| Who can read body | Paying subscribers | Any user with free account | Workforce identity holders |
| Public schema disclosure | Yes (isAccessibleForFree: false + hasPart) | Yes | Metadata only (no body in JSON-LD) |
| Public summary endpoint | Strongly recommended | Recommended | Tenant-scoped only; allowlist by IP |
| llms.txt entry | Yes — link to summary | Yes — link to article or summary | Out of scope; private docs do not appear |
| Robots.txt for AI | Allow on summary; allow on free hasPart; disallow on gated paths if separable | Allow | Disallow on internal hosts; consider VPC-only DNS |
| Audit logging | 90 days minimum | 90 days minimum | 365 days; immutable storage |
| Citation visibility | High (with summary endpoint) | High | Low by design |
| Training-set leak risk | Medium (structured-data leakage) | Medium (over-eager llms.txt) | Low if config correct; high if misconfigured |
Common failure modes
- Full body in articleBody while isAccessibleForFree: false. Anyone reading the JSON-LD has the gated content. Fix: never serialize gated body into structured data.
- Summary endpoint protected by auth. Defeats the entire pattern. Summary is intentionally public; only the full article is gated.
- llms.txt lists gated full URL instead of summary URL. AI crawler hits a 401/403, freshness signal degrades, no citation. Fix: list summary endpoint exclusively in the gated section.
- No Cache-Control on summary endpoint. Each AI bot revisit is a full DB query. Fix: short max-age plus strong ETag.
- Identical summary across articles. AI engines deduplicate; only one canonical citation emerges. Fix: hand-write each summary; do not template.
- Robots.txt blocks AI crawlers globally. Removes citation visibility entirely — the inverse of disclosure. Fix: allow on summary paths and free hasPart paths; disallow only specifically gated paths.
- No audit log of AI bot visits. Cannot prove what was disclosed when. Fix: log per request, retain 90+ days.
- No fingerprinting registry. Cannot detect leakage. Fix: embed and probe weekly.
Implementation checklist
- Inventory gated content by access model (paywall / login-wall / enterprise SSO).
- For each model, decide on lead-in, metering, or pure-gate strategy.
- Add isAccessibleForFree + hasPart JSON-LD to all gated articles.
- Build a public summary endpoint at a predictable path; populate factual 60-180 word summaries.
- Add a `<link rel="alternate">` from full article to summary endpoint.
- Update llms.txt with gated annotations and summary URLs.
- Audit robots.txt to confirm AI crawlers can read summaries and free sections.
- Configure caching headers per the HTTP cache header reference.
- Build the fingerprint registry and the weekly probe job.
- Stand up audit logging with 90+ day retention.
- Validate end-to-end with a freshness probe after each article ships.
- Re-audit quarterly; AI crawler behavior and Google guidance both evolve.
Decision flow
```mermaid
flowchart TD
    A["Gated article ready to publish"] --> B{"Access model?"}
    B -->|Paywall| C["Schema: isAccessibleForFree=false + hasPart"]
    B -->|Login-wall| D["Schema: isAccessibleForFree=false"]
    B -->|Enterprise SSO| E["Metadata only; no body in JSON-LD"]
    C --> F["Build public summary endpoint"]
    D --> F
    E --> G["Tenant-scoped summary; allowlist IP"]
    F --> H["Add llms.txt entry pointing at summary"]
    G --> I["Audit logging; immutable storage"]
    H --> I
    I --> J["Embed fingerprints; weekly probe"]
```

FAQ
Q: Will AI crawlers actually fetch summary endpoints?
A: Yes if they are publicly accessible, returned with 200 OK, served quickly, and linked from llms.txt and the parent article via rel=alternate. Major declared AI bots (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) follow standard HTTP discovery patterns; an unlinked summary endpoint at an obscure URL will not be found.
Q: Is structured-data disclosure of paywalled content a leak?
A: Not when implemented correctly. Google has explicitly stated the paywall structured-data approach is not leaky when articleBody excludes the gated portion (SE Roundtable). The leak only occurs when teams put the full body in articleBody while marking the parent isAccessibleForFree: false.
Q: How does this differ from C2PA Content Credentials?
A: C2PA provides cryptographic provenance metadata about origin, edits, and AI involvement. It does not gate content access; it documents content history. The two specifications complement each other — disclose access state via this spec, document provenance via C2PA.
Q: Should enterprise SSO content appear in any public AI surface?
A: Generally no. The benefit (citation visibility) does not justify the risk (data leak). Exceptions: a customer-success knowledge base where summary-level citations help prospects find documentation, with explicit legal review.
Q: What about content behind hard paywalls with no preview at all?
A: A summary endpoint is still appropriate. The summary is not a preview of the article body; it is a factual abstract sufficient for AI engines to know the article exists and what it covers. The article itself remains fully gated.
Q: How often should the disclosure spec be revalidated?
A: Quarterly. Google's flexible sampling and paywall guidance update without notice; AI vendors revise crawler docs; the llms.txt ecosystem is still maturing. A quarterly re-audit catches drift before it costs citations.
Q: Does llms.txt Gated annotation block AI training?
A: No, on its own. llms.txt is a discovery and curation file, not an access-control mechanism. Pair it with robots.txt rules and ai.txt opt-outs for training control. The Gated annotation is a community signal, not a ratified spec directive.
Related Articles
What Is GEO? Generative Engine Optimization Defined
GEO (Generative Engine Optimization) is the practice of structuring content so AI search engines retrieve, understand, synthesize, and cite it in generated answers.
llms.txt Reference: Specification, Format, and Examples
llms.txt is a proposed root-level Markdown file that gives LLMs a curated, machine-readable index of a site. Reference for spec, format, and adoption.
robots.txt for AI Crawlers
How to configure robots.txt to control AI crawlers — GPTBot, PerplexityBot, Google-Extended, ClaudeBot, Applebot-Extended, and the rest — across training and retrieval use cases.