Sitemap-Index Patterns for AI Search
A sitemap-index file lists multiple sitemap files in a single XML document, letting large sites exceed the per-file 50,000-URL / 50 MB limit defined by Sitemap Protocol 0.9. For AI search, the index is the canonical entry point that AI crawlers (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) and traditional crawlers (Googlebot, Bingbot) use to discover URL inventory and prioritize re-fetches via accurate lastmod values.
TL;DR
For sites under 50,000 URLs, a single /sitemap.xml is sufficient. Above 50,000 URLs or 50 MB uncompressed, switch to a sitemap-index that references multiple sitemap files partitioned by content type (articles, products, images, videos, news) or by freshness (hot, warm, archive). Reference the index from robots.txt with a Sitemap: directive. Keep lastmod values accurate so crawlers can prioritize re-fetching only what actually changed.
Definition
A sitemap-index is an XML file conforming to the Sitemap Protocol 0.9 schema (http://www.sitemaps.org/schemas/sitemap/0.9) that lists individual sitemap URLs rather than page URLs. Its root element is <sitemapindex> (rather than <urlset>), and each child <sitemap> entry carries a <loc> plus an optional <lastmod>.
Minimum example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-articles.xml</loc>
    <lastmod>2026-05-03</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-05-02</lastmod>
  </sitemap>
</sitemapindex>
```
Hard limits (Sitemap Protocol 0.9)
| Limit | Value | Source |
|---|---|---|
| URLs per individual sitemap | 50,000 | sitemaps.org spec |
| Uncompressed file size per sitemap | 50 MB (52,428,800 bytes) | sitemaps.org spec |
| Sitemaps listed in one index | 50,000 | sitemaps.org spec |
| Indexes per Search Console property | 500 | Google docs |
| Compression | gzip allowed; uncompressed file must be ≤ 50 MB | sitemaps.org spec |
The spec does not formally allow a sitemap-index to reference another sitemap-index (nested indexes). Google enforces a single-level hierarchy, though some crawlers tolerate nesting. Stay flat to maximize compatibility.
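The chunk-and-index pattern above can be sketched in a few lines. This is a minimal illustration, not a production generator: the `sitemap-NNNN.xml` naming scheme and the in-memory URL list are assumptions for the example.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-sitemap ceiling from Sitemap Protocol 0.9

def write_sitemaps(urls, base="https://example.com"):
    """Split a URL inventory into <=50,000-URL sitemap files and
    return a flat (single-level) sitemap-index referencing them."""
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    index_entries = []
    for n, chunk in enumerate(chunks, start=1):
        name = f"sitemap-{n:04d}.xml"  # hypothetical naming scheme
        body = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in chunk)
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                    f"{body}\n</urlset>\n")
        index_entries.append(f"  <sitemap><loc>{base}/{name}</loc></sitemap>")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(index_entries) + "\n</sitemapindex>\n")
```

Note the index stays flat: every entry points at a sitemap of page URLs, never at another index.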
Partitioning strategies
For a site with more than 50,000 URLs, partition by one of the following dimensions:
By content type
Most common pattern. One sitemap per content category, all referenced by a single index.
/sitemap.xml (sitemap-index)
-> /sitemap-articles.xml
-> /sitemap-products.xml
-> /sitemap-categories.xml
-> /sitemap-images.xml
-> /sitemap-videos.xml
-> /sitemap-news.xml
Benefit: AI crawlers and SEO crawlers can prioritize the sitemaps that match their interest. PerplexityBot and OAI-SearchBot tend to fetch articles and news sub-sitemaps disproportionately because those are the highest-citation surfaces.
By freshness tier
Useful for sites with very large archives where most URLs rarely change.
/sitemap.xml (sitemap-index)
-> /sitemap-hot.xml (last 30 days, lastmod accurate)
-> /sitemap-warm.xml (last 12 months)
-> /sitemap-archive.xml (older, mostly static)
Benefit: AI crawlers can re-fetch only sitemap-hot.xml and skip the archive on routine crawls. Combine with strict lastmod accuracy on the hot tier so crawlers do not waste budget on stale URLs.
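The tier assignment is a simple date bucket. A minimal sketch, assuming the 30-day / 12-month cutoffs described above (your thresholds may differ):

```python
from datetime import date, timedelta

def freshness_tier(lastmod: date, today: date) -> str:
    """Bucket a URL into hot/warm/archive by content-modification age."""
    age = today - lastmod
    if age <= timedelta(days=30):
        return "hot"       # goes in sitemap-hot.xml
    if age <= timedelta(days=365):
        return "warm"      # goes in sitemap-warm.xml
    return "archive"       # goes in sitemap-archive.xml
```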
By region or locale
For multilingual or multi-regional sites:
/sitemap.xml (sitemap-index)
-> /sitemap-en.xml
-> /sitemap-de.xml
-> /sitemap-ja.xml
-> /sitemap-zh.xml
Use one sitemap per locale, and cross-reference language alternates with hreflang annotations inside each (see the hreflang annotations section below).
Hybrid (recommended for very large sites)
Combine content type and freshness:
/sitemap.xml
-> /sitemap-articles-hot.xml
-> /sitemap-articles-archive.xml
-> /sitemap-products-active.xml
-> /sitemap-products-discontinued.xml
-> /sitemap-images.xml
lastmod accuracy is the AI-search lever
The <lastmod> element is the primary signal crawlers use to decide which URLs need re-fetching, so its accuracy directly controls how much crawl budget lands on your fresh content.
Rules:
- Use ISO 8601 dates (YYYY-MM-DD) or full datetimes (YYYY-MM-DDThh:mm:ss+00:00).
- Update lastmod only when the page content materially changes. Cosmetic CSS or footer updates do not count.
- Do not regenerate lastmod for every URL on every build — this signals "everything changed" and reduces crawler trust.
- For sitemap-index entries, lastmod should reflect the most recent lastmod of any URL within that referenced sitemap.
- Compressed .xml.gz files preserve lastmod exactly; compression is invisible to consumers.
Better no lastmod at all than a fabricated one: crawlers that detect unreliable values learn to ignore the field for the whole site.
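The index-entry rule above (index lastmod = most recent lastmod inside the referenced sitemap, omitted when unknown) is easy to get wrong in generators; a minimal sketch:

```python
def index_lastmod(url_lastmods):
    """lastmod for a sitemap-index entry: the most recent lastmod of any
    URL in the referenced sitemap, or None (omit the element entirely)
    when no trustworthy timestamp exists."""
    known = [d for d in url_lastmods if d is not None]
    # ISO 8601 date strings sort lexicographically in chronological order
    return max(known) if known else None
```

Returning None (and emitting no <lastmod> element) follows the omit-rather-than-fake rule.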
AI crawler discovery flow
Most AI crawlers follow this order:
- Fetch /robots.txt.
- Read Sitemap: directives. If multiple are present, all are honored.
- Fetch the listed sitemap or sitemap-index.
- Walk the index, fetching each referenced sitemap.
- Compare each URL's lastmod to local cache; refetch changed pages.
- (Some crawlers) Cross-reference URLs against /llms.txt to identify curated entry points.
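Step 2 of the flow above, extracting the Sitemap: directives, can be sketched as a small parser. This mirrors how crawlers treat the directive: global, case-insensitive, and independent of User-agent groups.

```python
def sitemap_directives(robots_txt: str) -> list[str]:
    """Collect every Sitemap: directive from a robots.txt body,
    regardless of which User-agent group it appears in."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("sitemap:"):
            found.append(line.split(":", 1)[1].strip())
    return found
```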
Key behaviors:
- GPTBot / OAI-SearchBot: Honor sitemap directives; OpenAI documents 24-hour propagation for robots.txt and sitemap changes.
- PerplexityBot: Honors sitemaps; tends to recrawl frequently for fresh news content.
- ClaudeBot: Honors sitemaps; less aggressive recrawl cadence than GPTBot.
- Googlebot: Feeds AI Overviews from the same sitemap-driven index used for traditional Search.
- Bingbot: Powers Copilot and Microsoft AI Search; Bing's webmaster guidance explicitly calls for accurate lastmod values.
Specialized sub-sitemap types
The Sitemap Protocol supports specialized extensions, all of which can be referenced from the same sitemap-index:
News sitemap
Namespace: http://www.google.com/schemas/sitemap-news/0.9. Used for time-sensitive news content (last 48 hours of publication). Fields: <news:publication> (containing <news:name> and <news:language>), <news:publication_date>, and <news:title>.
Image sitemap
Namespace: http://www.google.com/schemas/sitemap-image/1.1. Adds <image:image> blocks (each with an <image:loc>) under a page's <url> entry so crawlers can discover images that are not otherwise linked.
Video sitemap
Namespace: http://www.google.com/schemas/sitemap-video/1.1. Adds <video:video> blocks carrying <video:thumbnail_loc>, <video:title>, <video:description>, and either <video:content_loc> or <video:player_loc>.
hreflang annotations
Use <xhtml:link rel="alternate" hreflang="..."> elements (namespace http://www.w3.org/1999/xhtml) inside each <url> entry to declare language and regional alternates. Each page's alternate set must list every sibling, including the page itself.
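A minimal sketch of emitting one <url> entry with its hreflang alternates (the function name and inputs are illustrative, not part of any library):

```python
from xml.sax.saxutils import escape, quoteattr

def url_entry_with_alternates(loc: str, alternates: dict[str, str]) -> str:
    """Render one <url> entry carrying xhtml:link hreflang alternates.
    The alternates dict must include the page's own locale (self-reference)."""
    links = "\n".join(
        f'    <xhtml:link rel="alternate" hreflang={quoteattr(lang)} href={quoteattr(href)}/>'
        for lang, href in alternates.items()
    )
    return f"  <url>\n    <loc>{escape(loc)}</loc>\n{links}\n  </url>"
```

The enclosing <urlset> must then declare xmlns:xhtml="http://www.w3.org/1999/xhtml".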
robots.txt integration
Always reference the sitemap-index from robots.txt. Sitemap directives are global (not bound to a User-agent group):
```
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
```
Multiple Sitemap: lines are valid and recommended when the news/image/video sub-sitemaps are not referenced from the main index.
Validation pipeline
- XML validation: Validate against the official XSD at https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd. Use xmllint or xmlstarlet in CI.
- URL reachability: Spot-check that a sample of URLs from each referenced sitemap returns HTTP 200. Broken sitemaps reduce crawler trust.
- lastmod sanity check: Verify lastmod is not all identical (signals lazy regeneration) and not all far in the past (signals abandonment).
- Submission: Submit the sitemap-index URL to Google Search Console and Bing Webmaster Tools. Both surface discovery errors.
- Crawler log monitoring: In CDN logs, verify that AI crawler user-agents (GPTBot, ClaudeBot, PerplexityBot) fetch the index regularly. Sudden drops indicate misconfiguration.
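A few of the checks above can run as a cheap pre-submission lint in CI. This is a sketch of structural checks only (it does not replace XSD validation with xmllint); the function name is illustrative:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def lint_index(xml_text: str) -> list[str]:
    """Cheap pre-submission checks on a sitemap-index document:
    entry count, per-entry <loc>, and the all-identical-lastmod smell."""
    problems = []
    root = ET.fromstring(xml_text)
    entries = root.findall(f"{NS}sitemap")
    if len(entries) > 50_000:
        problems.append("more than 50,000 sitemaps in one index")
    lastmods = []
    for e in entries:
        if e.find(f"{NS}loc") is None:
            problems.append("entry missing <loc>")
        lm = e.find(f"{NS}lastmod")
        if lm is not None:
            lastmods.append(lm.text)
    if len(lastmods) > 1 and len(set(lastmods)) == 1:
        problems.append("all lastmod values identical (lazy regeneration?)")
    return problems
```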
Anti-patterns
- Listing every URL in a single 200 MB sitemap. The 50 MB / 50,000 URL ceiling is a hard limit. Going over silently breaks indexing.
- Mixed nested indexes. A sitemap-index inside a sitemap-index is not portable. Stay flat.
- lastmod set to build time. This wastes crawl budget and erodes trust.
- Including noindex / 404 / redirected URLs. Sitemaps should list only canonical, indexable URLs.
- Forgetting the robots.txt Sitemap directive. Even with a well-formed sitemap-index, crawlers may not discover it without the directive.
- Using a sitemap to substitute for proper internal linking. AI search engines rely on both signals; a sitemap is not a substitute for in-content links.
- Compressing the sitemap-index incorrectly. gzip is allowed, but the limit applies to the payload: the decompressed file must be ≤ 50 MB. (The spec's ceiling was raised from 10 MB to 50 MB in 2016; it has always applied to the uncompressed size.)
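The compression pitfall above is checkable: measure the decompressed size, not the .gz file size. A minimal sketch:

```python
import gzip

MAX_UNCOMPRESSED = 50 * 1024 * 1024  # 52,428,800 bytes (spec ceiling)

def gzip_sitemap_within_limit(path: str) -> bool:
    """A .xml.gz sitemap complies only if the *decompressed* payload stays
    within the 50 MB spec ceiling; the compressed size is irrelevant."""
    size = 0
    with gzip.open(path, "rb") as f:
        # Stream in 1 MB chunks to avoid holding 50 MB in memory.
        while chunk := f.read(1 << 20):
            size += len(chunk)
    return size <= MAX_UNCOMPRESSED
```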
How to apply
- Audit URL count. Below 50,000, ship a single sitemap. Above, plan partitioning.
- Choose a partitioning dimension (content type, freshness, locale, or hybrid).
- Generate sitemap-index and sub-sitemaps from your CMS or static site generator. Most platforms (Next.js, Hugo, Sanity, WordPress with Yoast) emit sitemap-indexes by default.
- Confirm lastmod is sourced from a trustworthy content modification timestamp, not the build time.
- Reference the index in robots.txt and submit to Search Console and Bing Webmaster Tools.
- Monitor CDN logs for AI crawler fetch patterns over the next 7-14 days.
- Re-evaluate quarterly; URL counts grow, and partitioning may need to be re-balanced.
FAQ
Q: When do I need a sitemap-index instead of a single sitemap?
Whenever your site exceeds 50,000 URLs or 50 MB uncompressed in a single sitemap. The sitemap-index lets you reference multiple sub-sitemaps without merging them.
Q: Can I nest a sitemap-index inside another sitemap-index?
The Sitemap Protocol does not formally support it, and Google does not honor it. Some crawlers tolerate nesting, but the safe pattern is a single-level index.
Q: How accurate must lastmod values be?
They must reflect actual page content modification, not file regeneration. Bing has stated drifting or fake lastmod values reduce crawler trust. Better to omit lastmod than to fake it.
Q: Do AI crawlers honor sitemap files differently from Googlebot?
No. Major AI crawlers (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) honor Sitemap Protocol 0.9 the same way Googlebot does. Crawl cadence and recrawl logic differ, but discovery is identical.
Q: Should I include noindex pages in my sitemap?
No. Sitemaps should list only canonical, indexable URLs. Including noindex / 404 / redirected URLs reduces crawler trust and wastes budget.
Q: Where should the sitemap-index live?
At the site root, conventionally at /sitemap.xml. The path can vary, but referenced sub-sitemaps must be in the same or deeper directory and on the same host.
References
- sitemaps.org Protocol — verified 2026-05-03 — supports 50K URL / 50 MB limits and sitemap-index XML format. https://www.sitemaps.org/protocol.html
- Google, "Manage Your Sitemaps With Sitemap Index Files" — verified 2026-05-03 — supports 500 indexes per Search Console property and same-host requirement. https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps
- Bing Webmaster Blog, "Keeping Content Discoverable with Sitemaps in AI Powered Search" (July 2025) — verified 2026-05-03 — supports lastmod accuracy guidance. https://blogs.bing.com/webmaster/July-2025/Keeping-Content-Discoverable-with-Sitemaps-in-AI-Powered-Search
- Webmasters Stack Exchange, "Can a sitemap index contain other sitemap indexes?" — verified 2026-05-03 — supports flat-index recommendation. https://webmasters.stackexchange.com/questions/18243/can-a-sitemap-index-contain-other-sitemap-indexes
- Conductor Academy, "XML Sitemap: the ultimate reference guide" — verified 2026-05-03 — supports omit-rather-than-fake lastmod guidance. https://www.conductor.com/academy/xml-sitemap/
- Google, "Create a News Sitemap" — verified 2026-05-03 — supports news sitemap namespace. https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap
Related Articles
AI Crawler IP Allowlist Reference
Reference list of official AI crawler IP range endpoints, user agents, and reverse-DNS verification methods for GPTBot, ClaudeBot, PerplexityBot, Googlebot, and more.
How to Create llms.txt: Step-by-Step Tutorial for AI Search
Step-by-step tutorial for creating, deploying, and validating an llms.txt file so AI systems and LLMs can discover your site's most important content.
HTTP/2 vs HTTP/3 for AI Crawlers
HTTP/3 AI crawlers support is uneven: GPTBot and most AI bots still default to HTTP/2 over TCP. Compare protocols, fallback behavior, and CDN config.