Geodocs.dev

llms.txt generator: requirements, output format, and validation checklist


An llms.txt generator produces a Markdown file at /llms.txt that lists a site's most important pages for LLMs, using a fixed structure: H1 title, blockquote summary, and H2 sections with annotated links. A correct generator must respect the published format from llmstxt.org, emit absolute URLs, and pass a small validation checklist before publishing.

TL;DR: An llms.txt generator's job is narrow but strict — crawl your site, pick canonical pages, and emit a Markdown file at /llms.txt (and optionally /llms-full.txt) that follows the format on llmstxt.org. Use this page as the spec your tool — or your evaluation of someone else's tool — must satisfy.

What an llms.txt generator does

A generator is a focused utility, not a CMS. Given a site URL (and sometimes a sitemap), it must:

  1. Discover candidate pages — typically by reading sitemap.xml, the homepage, or a configured allowlist.
  2. Filter to canonical content — exclude duplicates, paginated archives, search results, and noindex pages.
  3. Extract minimal metadata per page — title, short description, and absolute URL.
  4. Emit a single Markdown file — /llms.txt (concise) and optionally /llms-full.txt (full text).
  5. Validate before writing — fail loudly if structure or URLs are invalid.
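The five steps above can be sketched as a small pipeline. This is a minimal illustration, not any particular tool's API; the names and exclusion rules are invented:

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str          # absolute, canonical URL
    title: str
    description: str  # one short sentence

def generate_llms_txt(base_url: str, candidates: list[Page],
                      excluded_prefixes: tuple[str, ...] = ()) -> str:
    """Filter candidate pages and emit llms.txt link lines, failing loudly on bad input."""
    # Step 2: filter to canonical, on-site content.
    pages = [p for p in candidates
             if p.url.startswith(base_url)
             and not any(p.url.startswith(base_url + pre) for pre in excluded_prefixes)]
    # Step 5: validate before writing -- absolute URLs only.
    for p in pages:
        if not p.url.startswith(("http://", "https://")):
            raise ValueError(f"relative URL not allowed: {p.url}")
    # Step 4: emit Markdown list items in the llms.txt link format.
    return "\n".join(f"- [{p.title}]({p.url}): {p.description}" for p in pages)
```

Real generators add discovery (step 1) and metadata extraction (step 3) in front of this; the shape of the output stage stays the same.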

The emerging convention is documented at llmstxt.org and implemented by tools such as Mintlify's generator and Firecrawl's /llmstxt endpoint, which both produce concise and full variants of the file.

Required inputs

  • Base URL — root origin (https://example.com). Required.
  • Discovery source — sitemap.xml URL, a list of seed URLs, or a directory of Markdown.
  • Include / exclude rules — glob or regex patterns for path filtering.
  • Section grouping — mapping from URL prefix or category to a section heading.
  • Title and summary overrides — optional manual title and blockquote summary.
  • Mode — concise for llms.txt, full for llms-full.txt, or both.

Output format (the part that's strict)

The output must be valid CommonMark and follow this exact skeleton, per the llmstxt.org proposal:

# {Project / site title}

> {One-sentence summary of the site, in a blockquote}

{Optional short paragraph of additional context}

## {Section heading, e.g. "Docs"}

- [{Page title}](https://example.com/page): {brief description}

## Optional

- [{Lower-priority page}](https://example.com/extra): {brief description}
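A minimal concrete file that follows this skeleton might look like the following; the project name, URLs, and descriptions are invented for illustration:

```markdown
# Example Project

> Example Project is a hypothetical docs site used here to illustrate the llms.txt format.

Everything below links to canonical documentation pages.

## Docs

- [Quickstart](https://example.com/docs/quickstart): Install and run in five minutes.
- [API reference](https://example.com/docs/api): Complete endpoint reference.

## Optional

- [Changelog](https://example.com/changelog): Release history.
```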

Hard rules a compliant generator must enforce:

  • The first line is an H1 with the project or site name.
  • The H1 is followed by a blockquote (>) summary of one sentence.
  • All link targets are absolute URLs.
  • Each link is followed by a colon and a brief description on the same line.
  • Lower-priority links go under an ## Optional section so LLMs can skip them under context pressure.
  • File is served at /llms.txt from the site root with Content-Type: text/plain; charset=utf-8 or text/markdown.

For llms-full.txt, the same header is followed by the full Markdown body of each listed page, separated by horizontal rules.
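Assembling llms-full.txt can be as simple as concatenating the header and each page's Markdown body with the horizontal-rule separator described above; the function name is illustrative:

```python
def build_llms_full(header: str, page_bodies: list[str]) -> str:
    """Join the llms.txt header and full page bodies, separated by horizontal rules."""
    parts = [header.rstrip()] + [body.strip() for body in page_bodies]
    return "\n\n---\n\n".join(parts) + "\n"

full = build_llms_full("# Example\n\n> A demo site.",
                       ["# Quickstart\n\nInstall it.", "# API\n\nEndpoints."])
```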

URL and content rules

  • Use canonical URLs only. If a page declares a canonical link (<link rel="canonical">), prefer that target.
  • Strip tracking parameters (utm_*, gclid, fbclid, session IDs).
  • Resolve relative links against the base URL.
  • Skip pages with noindex, X-Robots-Tag: noindex, or those blocked in robots.txt for general crawlers.
  • Skip .pdf, .zip, and other non-text assets unless the section explicitly hosts downloadable references.
  • Trim descriptions to a single sentence (≈140 characters) to keep the file readable.
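The URL rules above can be enforced with a small normalizer built on the standard library. The tracking-parameter blocklist here is illustrative, mirroring the examples in this section:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid", "sessionid", "sid"}  # illustrative session-ID names

def normalize_url(href: str, base_url: str) -> str:
    """Resolve a link against the base URL and strip tracking parameters."""
    absolute = urljoin(base_url, href)  # resolve relative links
    scheme, netloc, path, query, frag = urlsplit(absolute)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((scheme, netloc, path, urlencode(kept), frag))
```

For example, normalize_url("/docs?utm_source=x&page=2", "https://example.com") yields https://example.com/docs?page=2.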

Validation checklist

Before writing the file, a generator should run these checks. Treat any failure as a build error.

  • [ ] First non-empty line is # followed by a non-empty title.
  • [ ] Second non-empty line is > followed by a one-sentence summary (≤ 200 characters).
  • [ ] At least one ## section is present.
  • [ ] Every section contains at least one list item.
  • [ ] Every list item matches the pattern - [title](absolute URL), followed by an optional : description on the same line.
  • [ ] All link URLs are absolute and resolve with 200 OK (HEAD request).
  • [ ] No URL is repeated across sections.
  • [ ] Total file size is under 100 KB (a soft cap; large sites should split into llms-full.txt).
  • [ ] Output is valid UTF-8.
  • [ ] Output is valid CommonMark (lints clean with a Markdown parser such as remark).
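Several of the structural checks above reduce to plain string and regex assertions. This sketch covers only the header, link-pattern, duplicate-URL, and size checks; the network (200 OK) and CommonMark-lint checks are omitted:

```python
import re

LINK_ITEM = re.compile(r"^- \[[^\]]+\]\(https?://[^)]+\)(: .+)?$")

def validate_llms_txt(text: str) -> list[str]:
    """Return a list of validation errors; an empty list means these checks pass."""
    errors = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not (lines[0].startswith("# ") and lines[0][2:].strip()):
        errors.append("first non-empty line must be an H1 title")
    if len(lines) < 2 or not lines[1].startswith("> ") or len(lines[1]) > 202:
        errors.append("second non-empty line must be a blockquote summary (<= 200 chars)")
    if not any(ln.startswith("## ") for ln in lines):
        errors.append("at least one ## section is required")
    for ln in lines:
        if ln.startswith("- ") and not LINK_ITEM.match(ln):
            errors.append(f"bad list item: {ln}")
    urls = re.findall(r"\]\((https?://[^)]+)\)", text)
    if len(urls) != len(set(urls)):
        errors.append("duplicate URL across sections")
    if len(text.encode("utf-8")) > 100_000:
        errors.append("file exceeds 100 KB soft cap")
    return errors
```

Treating a non-empty error list as a build failure gives the "fail loudly" behavior the checklist calls for.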

Common mistakes generators make

  • Using HTML instead of Markdown — breaks parsers that expect plain CommonMark.
  • Relative links — LLMs may resolve them against an unrelated base.
  • Including blocked pages — pages disallowed in robots.txt shouldn't appear in llms.txt.
  • Stuffing optional content into the main sections — push lower-priority links into ## Optional.
  • Skipping the blockquote summary — many parsers use it as the snippet.
  • Auto-generating bloated llms-full.txt — include only canonical pages, not every URL.

How llms.txt fits alongside robots.txt and ai.txt

llms.txt is one of three small files most teams now publish for AI crawlers:

  • robots.txt controls access — which crawlers can fetch which paths.
  • ai.txt (proposed) declares policy — what AI vendors may do with the content.
  • llms.txt provides a curated map of the highest-signal pages.

A generator's job is only the third file, but it should be aware of the other two to avoid listing pages that crawlers are blocked from fetching.
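Checking a candidate URL against robots.txt before listing it can be done with the standard library; GPTBot is used here only as an example user agent:

```python
from urllib.robotparser import RobotFileParser

def is_fetchable(url: str, robots_txt: str, user_agent: str = "GPTBot") -> bool:
    """Return True if robots.txt allows the given user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots = "User-agent: *\nDisallow: /private/\n"
print(is_fetchable("https://example.com/docs", robots))       # True
print(is_fetchable("https://example.com/private/x", robots))  # False
```

A generator would fetch the live robots.txt from the site root and drop any page that fails this check before it reaches llms.txt.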

FAQ

Q: Is llms.txt an official standard?

It's a community proposal published at llmstxt.org by Jeremy Howard in 2024. It is not an IETF or W3C standard, but a growing number of AI-friendly platforms publish and consume it.

Q: Where should a generator place the output file?

At the site root: https://example.com/llms.txt. Place llms-full.txt at the same root if you generate it.

Q: Do I need both llms.txt and llms-full.txt?

No. Start with llms.txt. Add llms-full.txt only if you want LLMs to ingest full page bodies and your site has fewer than ~50 canonical documents.

Q: How is llms.txt different from sitemap.xml?

A sitemap is an exhaustive index for search crawlers. llms.txt is a curated, opinionated short list of the pages you most want LLMs to read. The two complement each other.

Q: How often should the generator run?

On every deploy or at least daily. Treat it like a build artifact: regenerated, validated, and shipped with the rest of the site.

Related Articles

guide

Sitemap optimization for AI crawlers: rules, exclusions, and freshness signals

Optimize XML sitemaps for AI crawlers: URL selection rules, exclusions, lastmod freshness signals, and how to map your sitemap to llms.txt for higher cite-rate.

tutorial

Ahrefs for GEO: Content Gap Analysis and AI Visibility

Step-by-step Ahrefs for GEO tutorial: use Content Gap, Keywords Explorer, Brand Radar, AI Content Helper, and Site Audit to find AI search opportunities and ship cluster content.

checklist

AI Bot Log Analytics Tool Buyer's Checklist

Buyer's checklist for evaluating AI bot log analytics platforms that track GPTBot, ClaudeBot, and PerplexityBot crawl behavior across server logs.
