Auth-Gated Content Disclosure Specification for AI Crawlers
Auth-gated content disclosure is a layered specification using schema.org isAccessibleForFree+hasPart, dedicated summary endpoints, and llms.txt gated annotations to let AI crawlers cite paywalled or login-protected content without exposing the full body. The pattern enables citation visibility while preserving subscription value.
TL;DR
Auth-gated content (paywalls, login-walls, enterprise SSO) is invisible to AI crawlers by default, which means it cannot be cited. The standard disclosure stack pairs three layers: schema.org CreativeWork markup with isAccessibleForFree: false and a hasPart block describing free vs gated sections; a public summary endpoint (for example /api/summary/{article-id}) that returns a short factual abstract of the gated body; and llms.txt gated annotations that point AI crawlers at the summary URL rather than the gated article.
Why a disclosure specification
Publishers, B2B SaaS docs, and enterprise knowledge bases all face the same dilemma: make content available enough for AI engines to cite it, but not so available that subscription value evaporates. Three trends make a formal spec necessary now:
- AI Overviews and ChatGPT/Perplexity citations increasingly drive qualified clicks for publishers that participate, and outright invisibility for those that do not.
- AI crawlers do not all log in. They will follow public schema, public summary URLs, and public llms.txt entries; they will not bypass auth.
- Misimplementation creates real harm. A naive paywall hides the page from indexes and Overviews; an over-eager structured-data dump leaks the full article to anyone reading the JSON-LD (SEO For Google News, Best Practices for Paywalls).
Google has published structured-data guidance for paywalled content (Google Search Central, Subscription and Paywalled Content), but no comparable reference exists for the broader AI-crawler ecosystem (GPTBot, ClaudeBot, PerplexityBot, Google-Extended). This specification fills that gap.
Three access models
Disclosure strategy must match access model. The three patterns publishers actually run:
| Model | Access requirement | Typical use case | Disclosure strategy |
|---|---|---|---|
| Paywall | Subscription or one-off payment | News publishers, premium analysis, courses | isAccessibleForFree: false + hasPart + summary endpoint + lead-in/metering optional |
| Login-wall | Free account required | Community sites, freemium SaaS docs, gated lead-gen | isAccessibleForFree: false + summary endpoint; llms.txt may safely include free-tier docs |
| Enterprise SSO | Workforce identity (SAML, OIDC, SCIM) | Internal KBs, customer-only docs, partner portals | No public disclosure of body; only metadata + tenant-scoped summary endpoint behind allowlist |
The enterprise SSO case is the most often misimplemented. Teams either expose full content via robots.txt-allowed staging URLs (training-set leak risk) or block AI crawlers entirely (zero citation visibility). The right answer is metadata-only disclosure plus tenant-scoped summaries.
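The three models and their strategies can be captured as a small configuration lookup in code. A sketch (the field names and string values are our own shorthand for the table above, not part of any spec):

```python
# Disclosure strategy per access model, mirroring the table above.
# Keys and values are illustrative shorthand, not a standard vocabulary.
DISCLOSURE = {
    "paywall": {
        "schema": {"isAccessibleForFree": False, "hasPart": True},
        "summary_endpoint": "public",
        "llms_txt": "summary-url",
    },
    "login-wall": {
        "schema": {"isAccessibleForFree": False, "hasPart": True},
        "summary_endpoint": "public",
        "llms_txt": "summary-or-article-url",
    },
    "enterprise-sso": {
        "schema": "metadata-only",  # never serialize the body into JSON-LD
        "summary_endpoint": "tenant-scoped-allowlisted",
        "llms_txt": None,           # private docs stay out of llms.txt
    },
}

def strategy_for(model: str) -> dict:
    """Return the disclosure strategy for a given access model."""
    try:
        return DISCLOSURE[model]
    except KeyError:
        raise ValueError(f"unknown access model: {model}")
```

Driving publish-time checks from one table like this keeps the three models from drifting apart as the site evolves.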
Pattern 1: schema.org isAccessibleForFree and hasPart
Schema.org defines isAccessibleForFree as a boolean indicating whether a CreativeWork (or its hasPart subset) is freely available. Google's paywall guidance specifies a precise pattern: mark the parent CreativeWork with isAccessibleForFree: false, then enumerate hasPart blocks for free and gated sections so crawlers can extract the free portion without the gated portion.
Minimum compliant JSON-LD:
```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "@id": "https://example.com/article/q1-earnings-2026",
  "headline": "Q1 Earnings Analysis",
  "datePublished": "2026-04-30T09:00:00-04:00",
  "isAccessibleForFree": false,
  "hasPart": [
    {
      "@type": "WebPageElement",
      "isAccessibleForFree": true,
      "cssSelector": ".article-summary"
    },
    {
      "@type": "WebPageElement",
      "isAccessibleForFree": false,
      "cssSelector": ".paywall"
    }
  ]
}
```

Critical correctness rules:
- cssSelector (or xpath) must match exactly one parent element wrapping the gated section.
- articleBody must NOT contain the full body when isAccessibleForFree: false is set on the parent. Doing so violates Google's anti-cloaking guidance and creates a structured-data leak (SE Roundtable analysis).
- The free section must be substantive (>50 words) and self-contained, not a teaser stub.
- Validate with the Schema.org validator and the Rich Results Test before deploying.
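These rules are mechanically checkable before deploy. A minimal lint sketch in Python (the error strings are ours; the checks mirror the rules above):

```python
def lint_paywall_jsonld(doc: dict) -> list[str]:
    """Return a list of violations of the paywall markup rules."""
    errors = []
    if doc.get("isAccessibleForFree") is False:
        # Rule: never serialize the gated body into structured data.
        if "articleBody" in doc:
            errors.append("articleBody present on a gated CreativeWork")
        parts = doc.get("hasPart", [])
        if not parts:
            errors.append("isAccessibleForFree:false without hasPart sections")
        for part in parts:
            # Rule: every section needs a selector a crawler can resolve.
            if not (part.get("cssSelector") or part.get("xpath")):
                errors.append("hasPart entry missing cssSelector/xpath")
        # Rule: at least one substantive free section must be declared.
        if not any(p.get("isAccessibleForFree") is True for p in parts):
            errors.append("no free hasPart section declared")
    return errors
```

Run it in CI against the JSON-LD of every gated article; a non-empty result blocks the deploy.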
Pattern 2: Summary endpoint design
A summary endpoint is a public URL that returns a short factual abstract of the gated content. AI crawlers can fetch and cite it; the gated body is never exposed. This pattern is what makes citation possible for purely-gated content (no lead-in, no metering).
Endpoint contract
- URL shape. Predictable per article: /api/summary/
or / / /summary. - Response. JSON or HTML containing: title, canonical URL of the full article, publication date, author, 60-120 word factual summary, key entities, optional bullet list of facts.
- Status. 200 OK for available; 404 if the article does not exist; 410 Gone if removed. Never 401/403 — the summary is intentionally public.
- Caching. Send Cache-Control: public, max-age=3600 plus a strong ETag derived from summary content. See HTTP Cache Headers for AI Crawlers for the full recipe.
- Linking. The full-article HTML should reference the summary endpoint via a `<link rel="alternate">` element in the document head.
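The contract's status and caching logic can be sketched framework-free, deriving the strong ETag from a hash of the serialized summary (handle_get is a hypothetical handler, not a library API):

```python
import hashlib
import json

def build_summary_response(summary: dict):
    """Serialize a summary payload with the caching headers the
    contract calls for: public cache plus a content-derived ETag."""
    body = json.dumps(summary, sort_keys=True).encode("utf-8")
    # Strong ETag from the content itself, so it changes exactly
    # when the summary changes.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "Content-Type": "application/json",
        "Cache-Control": "public, max-age=3600",
        "ETag": etag,
    }
    return body, headers

def handle_get(summary, if_none_match=None):
    """Status logic per the contract: 200 if available, 304 on an
    ETag match, 404 if the article does not exist. Never 401/403."""
    if summary is None:
        return 404, b"", {}
    body, headers = build_summary_response(summary)
    if if_none_match == headers["ETag"]:
        return 304, b"", {"ETag": headers["ETag"]}
    return 200, body, headers
```

A 410 Gone branch for removed articles would hang off the same lookup; it is omitted here for brevity.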
Summary content rules
The summary is what gets cited. Treat it as the canonical AI-facing answer:
- Lead with a one-sentence factual claim (extractable as a featured-snippet answer).
- Add 3-5 sentences of supporting detail (entities, numbers, dates).
- Include a clear call-out of what the full article adds ("the full analysis details methodology, vendor selection, and Q3 forecasts").
- Never include the conclusions or unique numbers that drive subscription value. The summary establishes the existence of the analysis; the article delivers the analysis.
- Keep length 60-180 words. Shorter is uncitable; longer dilutes.
Sample summary response
```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Q1 2026 Earnings Analysis: Tech Sector Margins",
  "abstract": "Across 18 large-cap technology companies, Q1 2026 operating margins averaged 23.4%, down 1.8 percentage points from Q4 2025. Cloud infrastructure providers retained margin while consumer hardware compressed. Full analysis (subscriber) covers per-company drivers, FX exposure, and forward guidance.",
  "url": "https://example.com/article/q1-earnings-2026",
  "datePublished": "2026-04-30T09:00:00-04:00",
  "author": {"@type": "Person", "name": "Jane Analyst"},
  "isAccessibleForFree": true,
  "isPartOf": {
    "@type": "Article",
    "@id": "https://example.com/article/q1-earnings-2026",
    "isAccessibleForFree": false
  }
}
```

Pattern 3: llms.txt with gated annotations
llms.txt, proposed by Jeremy Howard in September 2024, is a Markdown-format file at the site root that lists URLs LLMs should consult. The base spec does not formalize gated content; the community convention below is increasingly common.
```markdown
# Example Inc.

> Premium business analysis. Subscriber-only articles disclose summaries publicly.

## Free reference

- [Methodology](https://example.com/methodology): How we score companies
- [Glossary](https://example.com/glossary): Terms used in analysis

## Gated (summary-only)

- [Q1 2026 Earnings Summary](https://example.com/api/summary/q1-earnings-2026): Subscriber article — public summary endpoint
- [Cloud Vendor Margin 2026 Summary](https://example.com/api/summary/cloud-vendor-margin-2026): Subscriber article — public summary endpoint
```
Key conventions:
- The Gated (summary-only) section explicitly links the summary endpoint, never the gated full URL.
- Each entry's description names the gating model ("Subscriber", "Login required", "Customer-only").
- A separate llms-full.txt MAY include the full free-tier text but MUST NOT include gated bodies.
- Validate that robots.txt allows AI crawlers to read both llms.txt and the linked summary URLs.
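The last convention can be verified with the standard library's robots.txt parser. A sketch, checking each declared AI bot against your own robots.txt text:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def check_ai_access(robots_txt: str, paths: list) -> list:
    """Return human-readable descriptions of every (bot, path)
    combination the robots.txt would block."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    blocked = []
    for bot in AI_BOTS:
        for path in paths:
            if not rp.can_fetch(bot, path):
                blocked.append(f"{bot} blocked on {path}")
    return blocked
```

Run it against the live robots.txt with /llms.txt plus every summary URL it links; any non-empty result means the disclosure chain is broken at discovery time.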
Google flexible sampling vs full disclosure
Google's flexible sampling guidance defines two models that interact with disclosure:
- Metering. Allow N free articles per user per period; show paywall after the limit.
- Lead-in. Show the first ~10-30% of every article publicly; gate the rest.
Disclosure strategy by sampling model:
| Sampling model | Schema disclosure | Summary endpoint | llms.txt |
|---|---|---|---|
| Pure paywall (no sampling) | Required — isAccessibleForFree: false + hasPart | Strongly recommended; primary citation surface | List summary endpoint |
| Lead-in | Required — mark gated section in hasPart | Optional; lead-in often sufficient | List article URL |
| Metering | Required on the parent CreativeWork | Optional; AI bots are not signed-in users so they always hit the meter limit | List article URL |
| Hybrid | Required, with both patterns marked | Recommended for highest-value articles | List both |
Google's own warning: restricting sampled content reduces what crawlers can evaluate, and even modest reductions can degrade ranking. Lower bound: do not drop the lead-in below a substantive paragraph (~150-300 words).
Partial-content fingerprints
Once a summary endpoint is public, you need a way to detect if the full body has leaked into AI training corpora or live retrieval responses. The pattern: embed semantic fingerprints in the gated body and probe AI engines for them.
- Embed 3-5 stable, distinctive phrases per article in the gated body. Examples: branded turn-of-phrase, named framework, specific stat with unusual unit.
- Maintain a registry mapping article ID → fingerprints.
- Run weekly probes: ask each major AI engine for the article's fingerprints ("Is the phrase '...' associated with a 2026 earnings analysis?").
- If the full body's fingerprints appear in answers without authentication, you have a leak. Investigate: is isAccessibleForFree: false actually serving the gated body in articleBody? Is robots.txt allowing GPTBot/ClaudeBot to read the full HTML?
This is a detective control, not a preventive one. Combine with rigorous server-side enforcement.
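The registry and the match step can be sketched as follows (the phrases below are invented examples; the probing of each engine is API-specific and omitted):

```python
# Fingerprint registry: article ID -> distinctive phrases that appear
# only in the gated body. These phrases are invented examples.
REGISTRY = {
    "q1-earnings-2026": [
        "margin compression waterfall",
        "23.4% blended operating margin",
        "the FX headwind ledger",
    ],
}

def leaked_fingerprints(article_id: str, engine_answer: str) -> list:
    """Return registry phrases appearing verbatim in an AI answer.

    Any hit from an unauthenticated probe means the gated body has
    leaked into training data or live retrieval.
    """
    answer = engine_answer.lower()
    return [p for p in REGISTRY.get(article_id, []) if p.lower() in answer]
```

The weekly job sends each phrase to each major engine and routes the answers through this function, alerting on any non-empty result.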
Audit logging requirements
Disclosure compliance requires being able to prove what AI crawlers saw. Minimum log fields per AI bot request:
- Timestamp (ISO 8601)
- User-agent (full string)
- Source IP and reverse-DNS / verified-bot status
- Requested URL
- Response status code
- Response bytes returned
- Whether the response included gated body content (boolean derived from a content gate at egress)
Retain logs ≥ 90 days for forensics. For high-value content, ship logs to immutable storage (object lock, WORM) so they cannot be tampered with before a subsequent investigation.
Verifying bot identity is critical. Major AI vendors publish IP allowlists (OpenAI, Anthropic, Perplexity); treat any User-Agent string as untrusted unless the source IP appears in the vendor's published list.
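A sketch of the verified-bot check plus the minimum log record (the CIDR ranges below are placeholders; real ranges come from each vendor's published allowlist and must be refreshed regularly):

```python
import ipaddress

# Placeholder ranges for illustration only; fetch the real allowlists
# from each vendor's published endpoint.
BOT_ALLOWLISTS = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}

def is_verified_bot(user_agent: str, source_ip: str) -> bool:
    """Trust a User-Agent claim only when the source IP falls inside
    the claimed vendor's published range."""
    for bot, cidrs in BOT_ALLOWLISTS.items():
        if bot in user_agent:
            ip = ipaddress.ip_address(source_ip)
            return any(ip in ipaddress.ip_network(c) for c in cidrs)
    return False  # unknown bot name: never trusted

def audit_record(ts, ua, ip, url, status, nbytes, gated_body_served):
    """One log record with the minimum fields listed above."""
    return {
        "timestamp": ts,            # ISO 8601
        "user_agent": ua,
        "source_ip": ip,
        "verified_bot": is_verified_bot(ua, ip),
        "url": url,
        "status": status,
        "bytes": nbytes,
        "gated_body_served": gated_body_served,
    }
```

Emitting `verified_bot` at log time, rather than reconstructing it later, keeps the forensic record self-contained even after allowlists change.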
Comparison: paywall vs login-wall vs enterprise SSO
| Dimension | Paywall | Login-wall | Enterprise SSO |
|---|---|---|---|
| Who can read body | Paying subscribers | Any user with free account | Workforce identity holders |
| Public schema disclosure | Yes (isAccessibleForFree: false + hasPart) | Yes | Metadata only (no body in JSON-LD) |
| Public summary endpoint | Strongly recommended | Recommended | Tenant-scoped only; allowlist by IP |
| llms.txt entry | Yes — link to summary | Yes — link to article or summary | Out of scope; private docs do not appear |
| Robots.txt for AI | Allow on summary; allow on free hasPart; disallow on gated paths if separable | Allow | Disallow on internal hosts; consider VPC-only DNS |
| Audit logging | 90 days minimum | 90 days minimum | 365 days; immutable storage |
| Citation visibility | High (with summary endpoint) | High | Low by design |
| Training-set leak risk | Medium (structured-data leakage) | Medium (over-eager llms.txt) | Low if config correct; high if misconfigured |
Common failure modes
- Full body in articleBody while isAccessibleForFree: false. Anyone reading the JSON-LD has the gated content. Fix: never serialize gated body into structured data.
- Summary endpoint protected by auth. Defeats the entire pattern. Summary is intentionally public; only the full article is gated.
- llms.txt lists gated full URL instead of summary URL. AI crawler hits a 401/403, freshness signal degrades, no citation. Fix: list summary endpoint exclusively in the gated section.
- No Cache-Control on summary endpoint. Each AI bot revisit is a full DB query. Fix: short max-age plus strong ETag.
- Identical summary across articles. AI engines deduplicate; only one canonical citation emerges. Fix: hand-write each summary; do not template.
- Robots.txt blocks AI crawlers globally. Removes citation visibility entirely — the inverse of disclosure. Fix: allow on summary paths and free hasPart paths; disallow only specifically gated paths.
- No audit log of AI bot visits. Cannot prove what was disclosed when. Fix: log per request, retain 90+ days.
- No fingerprinting registry. Cannot detect leakage. Fix: embed and probe weekly.
Implementation checklist
- Inventory gated content by access model (paywall / login-wall / enterprise SSO).
- For each model, decide on lead-in, metering, or pure-gate strategy.
- Add isAccessibleForFree + hasPart JSON-LD to all gated articles.
- Build a public summary endpoint at a predictable path; populate factual 60-180 word summaries.
- Add a `<link rel="alternate">` from full article to summary endpoint.
- Update llms.txt with gated annotations and summary URLs.
- Audit robots.txt to confirm AI crawlers can read summaries and free sections.
- Configure caching headers per the HTTP cache header reference.
- Build the fingerprint registry and the weekly probe job.
- Stand up audit logging with 90+ day retention.
- Validate end-to-end with a freshness probe after each article ships.
- Re-audit quarterly; AI crawler behavior and Google guidance both evolve.
Decision flow
```mermaid
flowchart TD
    A["Gated article ready to publish"] --> B{"Access model?"}
    B -->|Paywall| C["Schema: isAccessibleForFree=false + hasPart"]
    B -->|Login-wall| D["Schema: isAccessibleForFree=false"]
    B -->|Enterprise SSO| E["Metadata only; no body in JSON-LD"]
    C --> F["Build public summary endpoint"]
    D --> F
    E --> G["Tenant-scoped summary; allowlist IP"]
    F --> H["Add llms.txt entry pointing at summary"]
    G --> I["Audit logging; immutable storage"]
    H --> I
    I --> J["Embed fingerprints; weekly probe"]
```

FAQ
Q: Will AI crawlers actually fetch summary endpoints?
A: Yes if they are publicly accessible, returned with 200 OK, served quickly, and linked from llms.txt and the parent article via rel=alternate. Major declared AI bots (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) follow standard HTTP discovery patterns; an unlinked summary endpoint at an obscure URL will not be found.
Q: Is structured-data disclosure of paywalled content a leak?
A: Not when implemented correctly. Google has explicitly stated the paywall structured-data approach is not leaky when articleBody excludes the gated portion (SE Roundtable). The leak only occurs when teams put the full body in articleBody while marking the parent isAccessibleForFree: false.
Q: How does this differ from C2PA Content Credentials?
A: C2PA provides cryptographic provenance metadata about origin, edits, and AI involvement. It does not gate content access; it documents content history. The two specifications complement each other — disclose access state via this spec, document provenance via C2PA.
Q: Should enterprise SSO content appear in any public AI surface?
A: Generally no. The benefit (citation visibility) does not justify the risk (data leak). Exceptions: a customer-success knowledge base where summary-level citations help prospects find documentation, with explicit legal review.
Q: What about content behind hard paywalls with no preview at all?
A: A summary endpoint is still appropriate. The summary is not a preview of the article body; it is a factual abstract sufficient for AI engines to know the article exists and what it covers. The article itself remains fully gated.
Q: How often should the disclosure spec be revalidated?
A: Quarterly. Google's flexible sampling and paywall guidance update without notice; AI vendors revise crawler docs; the llms.txt ecosystem is still maturing. A quarterly re-audit catches drift before it costs citations.
Q: Does llms.txt Gated annotation block AI training?
A: No, on its own. llms.txt is a discovery and curation file, not an access-control mechanism. Pair it with robots.txt rules and ai.txt opt-outs for training control. The Gated annotation is a community signal, not a ratified spec directive.
Related Articles
What Is GEO? Generative Engine Optimization Defined
GEO (Generative Engine Optimization) is the practice of structuring content so AI search engines retrieve, understand, synthesize, and cite it in generated answers.
llms.txt Reference: Specification, Format, and Examples
llms.txt is a proposed root-level Markdown file that gives LLMs a curated, machine-readable index of a site. Reference for spec, format, and adoption.
robots.txt for AI Crawlers
How to configure robots.txt to control AI crawlers — GPTBot, PerplexityBot, Google-Extended, ClaudeBot, Applebot-Extended, and the rest — across training and retrieval use cases.