Geodocs.dev

Sitemap-Index Patterns for AI Search


A sitemap-index file lists multiple sitemap files in a single XML document, letting large sites exceed the per-file 50,000-URL / 50 MB limit defined by Sitemap Protocol 0.9. For AI search, the index is the canonical entry point that AI crawlers (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) and traditional crawlers (Googlebot, Bingbot) use to discover URL inventory and prioritize re-fetches via accurate lastmod values.

TL;DR

For sites under 50,000 URLs, a single /sitemap.xml is sufficient. Above 50,000 URLs or 50 MB uncompressed, switch to a sitemap-index that references multiple sitemap files partitioned by content type (articles, products, images, videos, news) or by freshness (hot, warm, archive). Reference the index from robots.txt with a Sitemap: directive. Keep lastmod values accurate — AI crawlers use them to schedule incremental re-crawls, and Bing has explicitly warned that drifting lastmod values reduce trust in the file.

Definition

A sitemap-index is an XML file conforming to the Sitemap Protocol 0.9 schema (http://www.sitemaps.org/schemas/sitemap/0.9) that lists individual sitemap URLs rather than page URLs. Its root element is <sitemapindex> (instead of <urlset>), and each child is a <sitemap> entry containing <loc> and an optional <lastmod>.

Minimum example:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-articles.xml</loc>
    <lastmod>2026-05-03</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-05-02</lastmod>
  </sitemap>
</sitemapindex>

Hard limits (Sitemap Protocol 0.9)

Limit                                  Value                                Source
URLs per individual sitemap            50,000                               sitemaps.org spec
Uncompressed file size per sitemap     50 MB (52,428,800 bytes)             sitemaps.org spec
Sitemaps listed in one index           50,000                               sitemaps.org spec
Indexes per Search Console property    500                                  Google docs
Compression                            gzip allowed; uncompressed ≤ 50 MB   sitemaps.org spec

The spec does not formally allow a sitemap-index to reference another sitemap-index (nested indexes). Google enforces a single-level hierarchy, though some crawlers tolerate nesting. Stay flat to maximize compatibility.

Partitioning strategies

For a site with more than 50,000 URLs, partition by one of the following dimensions:

By content type

Most common pattern. One sitemap per content category, all referenced by a single index.

/sitemap.xml (sitemap-index)

-> /sitemap-articles.xml

-> /sitemap-products.xml

-> /sitemap-categories.xml

-> /sitemap-images.xml

-> /sitemap-videos.xml

-> /sitemap-news.xml

Benefit: AI crawlers and SEO crawlers can prioritize the sitemaps that match their interest. PerplexityBot and OAI-SearchBot tend to fetch articles and news sub-sitemaps disproportionately because those are the highest-citation surfaces.
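As a build-time sketch, the index above can be emitted from a list of per-type sitemap files. The write_index helper and the filenames are illustrative, not any CMS's actual API:

```python
from datetime import date
from xml.sax.saxutils import escape

def write_index(base_url, sitemaps):
    """Render a sitemap-index document from a list of
    (filename, lastmod) pairs, where lastmod is a datetime.date."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for filename, lastmod in sitemaps:
        lines.append("  <sitemap>")
        lines.append(f"    <loc>{escape(base_url + filename)}</loc>")
        lines.append(f"    <lastmod>{lastmod.isoformat()}</lastmod>")
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines)

index = write_index("https://example.com/", [
    ("sitemap-articles.xml", date(2026, 5, 3)),
    ("sitemap-products.xml", date(2026, 5, 2)),
])
```

Regenerating the index this way keeps each entry's lastmod tied to real data rather than the build timestamp.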

By freshness tier

Useful for sites with very large archives where most URLs rarely change.

/sitemap.xml (sitemap-index)

-> /sitemap-hot.xml (last 30 days, lastmod accurate)

-> /sitemap-warm.xml (last 12 months)

-> /sitemap-archive.xml (older, mostly static)

Benefit: AI crawlers can re-fetch only sitemap-hot.xml and skip the archive on routine crawls. Combine with strict lastmod accuracy on the hot tier so crawlers do not waste budget on stale URLs.
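The tiering rule above can be sketched as a classifier over lastmod dates. The thresholds come from the tier definitions in the text; the function name is illustrative:

```python
from datetime import date, timedelta

def freshness_tier(lastmod: date, today: date) -> str:
    """Assign a URL to the hot / warm / archive sitemap by the
    age of its last content modification (30 days and 12 months,
    matching the tier definitions above)."""
    age = today - lastmod
    if age <= timedelta(days=30):
        return "hot"
    if age <= timedelta(days=365):
        return "warm"
    return "archive"

tier = freshness_tier(date(2026, 4, 20), date(2026, 5, 3))  # recently edited page
```

A nightly job can re-bucket URLs with this rule so pages age out of sitemap-hot.xml automatically.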

By region or locale

For multilingual or multi-regional sites:

/sitemap.xml (sitemap-index)

-> /sitemap-en.xml

-> /sitemap-de.xml

-> /sitemap-ja.xml

-> /sitemap-zh.xml

Use <xhtml:link rel="alternate" hreflang="…"> entries inside each language sitemap to declare hreflang relationships. AI search engines use hreflang to surface language-appropriate citations.
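A minimal hreflang-annotated entry inside a language sitemap might look like this (the URLs are hypothetical; note the extra xhtml namespace declaration on the root element):

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/pricing</loc>
    <lastmod>2026-05-01</lastmod>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/pricing"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/preise"/>
  </url>
</urlset>
```

Each locale's sitemap should carry the full set of alternates, including a self-referencing link, so the relationships are reciprocal.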

Combine content type and freshness:

/sitemap.xml

-> /sitemap-articles-hot.xml

-> /sitemap-articles-archive.xml

-> /sitemap-products-active.xml

-> /sitemap-products-discontinued.xml

-> /sitemap-images.xml

lastmod accuracy is the AI-search lever

The <lastmod> element is the single highest-leverage tag for AI search. Crawlers use it to decide whether to refetch a URL. Bing has stated explicitly: do not set lastmod to the sitemap generation time; it must reflect the true last modification of page content.

Rules:

  • Use ISO 8601 dates (YYYY-MM-DD) or full datetimes (YYYY-MM-DDThh:mm:ss+00:00).
  • Update lastmod only when the page content materially changes. Cosmetic CSS or footer updates do not count.
  • Do not regenerate lastmod for every URL on every build — this signals "everything changed" and reduces crawler trust.
  • For sitemap-index entries, lastmod should reflect the most recent lastmod of any URL within that referenced sitemap.
  • Compressed .xml.gz files preserve lastmod exactly; compression is invisible to consumers.

Better no lastmod at all than an inaccurate one. Conductor's reference guidance is to leave it out rather than fake it.
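One way to keep lastmod honest is to store a content hash per page and bump the date only when the hash changes. The helper below is a sketch, not any CMS's actual behavior; hash only the main content, not templates or footers, so cosmetic changes don't count:

```python
import hashlib
from datetime import date

def updated_lastmod(body: str, prev_hash: str, prev_lastmod: date,
                    today: date) -> tuple[str, date]:
    """Return (hash, lastmod) for a page: keep the stored date when
    the content hash is unchanged, advance it only on a real edit."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if digest == prev_hash:
        return prev_hash, prev_lastmod   # unchanged: keep the old date
    return digest, today                 # changed: today is the new lastmod
```

Running this per page at build time satisfies both rules above: lastmod never equals build time, and it never regenerates for unchanged URLs.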

AI crawler discovery flow

Most AI crawlers follow this order:

  1. Fetch /robots.txt.
  2. Read Sitemap: directives. If multiple are present, all are honored.
  3. Fetch the listed sitemap or sitemap-index.
  4. Walk the index, fetching each referenced sitemap.
  5. Compare each URL's lastmod to local cache; refetch changed pages.
  6. (Some crawlers) Cross-reference URLs against /llms.txt to identify curated entry points.
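Steps 2 and 4 of the flow above can be sketched offline with the standard library (no network calls; the helper names are illustrative):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_directives(robots_txt: str) -> list[str]:
    """Step 2: Sitemap: directives are global, so collect every one
    regardless of User-agent grouping (key match is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

def index_members(index_xml: str) -> list[str]:
    """Step 4: walk a sitemap-index and return each referenced <loc>."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]
```

A real crawler would then fetch each member, compare per-URL lastmod against its cache (step 5), and refetch only changed pages.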

Key behaviors:

  • GPTBot / OAI-SearchBot: Honor sitemap directives; OpenAI documents 24-hour propagation for robots.txt and sitemap changes.
  • PerplexityBot: Honors sitemaps; tends to recrawl frequently for fresh news content.
  • ClaudeBot: Honors sitemaps; less aggressive recrawl cadence than GPTBot.
  • Googlebot: Feeds AI Overviews from the same sitemap-driven index used for traditional Search.
  • Bingbot: Powers Copilot, Microsoft AI Search; explicit guidance to use accurate lastmod.

Specialized sub-sitemap types

The Sitemap Protocol supports specialized extensions, all of which can be referenced from the same sitemap-index:

News sitemap

Namespace: http://www.google.com/schemas/sitemap-news/0.9. Used for time-sensitive news content (articles published within the last 48 hours). Fields: <news:publication>, <news:publication_date>, <news:title>.
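A minimal news-sitemap entry might look like this (the URL and publication name are hypothetical):

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/ai-crawlers-update</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2026-05-03T08:00:00+00:00</news:publication_date>
      <news:title>AI Crawlers Update</news:title>
    </news:news>
  </url>
</urlset>
```

Prune entries older than 48 hours on each regeneration; a news sitemap is a rolling window, not an archive.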

Image sitemap

Namespace: http://www.google.com/schemas/sitemap-image/1.1. Adds <image:image> blocks with <image:loc>, <image:title>, <image:caption>. Useful for image-heavy content where AI image search and multimodal models index media separately.

Video sitemap

Namespace: http://www.google.com/schemas/sitemap-video/1.1. Adds <video:video> blocks with thumbnail, duration, and publication date. Critical for video-first sites; AI search engines that surface video citations rely on this metadata.

hreflang annotations

Use <xhtml:link rel="alternate" hreflang="…"> annotations directly inside <url> entries to declare locale alternates. This saves a separate locale sitemap.

robots.txt integration

Always reference the sitemap-index from robots.txt. Sitemap directives are global (not bound to a User-agent group):

User-agent: *

Allow: /

Sitemap: https://example.com/sitemap.xml

Sitemap: https://example.com/sitemap-news.xml

Multiple Sitemap: lines are valid and recommended when the news/image/video sub-sitemaps are not referenced from the main index.

Validation pipeline

  1. XML validation: Validate against the official XSD at https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd. Use xmllint or xmlstarlet in CI.
  2. URL reachability: Spot-check that a sample of URLs from each referenced sitemap returns HTTP 200. Broken sitemaps reduce crawler trust.
  3. lastmod sanity check: Verify lastmod is not all identical (signals lazy regeneration) and not all far in the past (signals abandonment).
  4. Submission: Submit the sitemap-index URL to Google Search Console and Bing Webmaster Tools. Both surface discovery errors.
  5. Crawler log monitoring: In CDN logs, verify that AI crawler user-agents (GPTBot, ClaudeBot, PerplexityBot) fetch the index regularly. Sudden drops indicate misconfiguration.
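Step 3's sanity check can be sketched with the standard library. The lastmod_sanity helper and its two-year staleness threshold are illustrative choices, not a published rule:

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def lastmod_sanity(sitemap_xml: str, today: date, stale_days: int = 730):
    """Flag the two smells from step 3: every lastmod identical
    (lazy regeneration) or every lastmod older than ~2 years
    (abandonment). Returns a list of warning strings."""
    root = ET.fromstring(sitemap_xml)
    dates = [date.fromisoformat(e.text.strip()[:10])     # date part of date or datetime
             for e in root.findall(".//sm:lastmod", NS)]
    warnings = []
    if len(dates) > 1 and len(set(dates)) == 1:
        warnings.append("all lastmod values identical")
    if dates and all((today - d).days > stale_days for d in dates):
        warnings.append("all lastmod values stale")
    return warnings
```

Wiring this into CI alongside xmllint catches trust-eroding lastmod drift before crawlers see it.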

Anti-patterns

  • Listing every URL in a single 200 MB sitemap. The 50 MB / 50,000 URL ceiling is a hard limit. Going over silently breaks indexing.
  • Mixed nested indexes. A sitemap-index inside a sitemap-index is not portable. Stay flat.
  • lastmod set to build time. This wastes crawl budget and erodes trust.
  • Including noindex / 404 / redirected URLs. Sitemaps should list only canonical, indexable URLs.
  • Forgetting the robots.txt Sitemap directive. Even with a well-formed sitemap-index, crawlers may not discover it without the directive.
  • Using a sitemap to substitute for proper internal linking. AI search engines rely on both signals; a sitemap is not a substitute for in-content links.
  • Compressing the sitemap-index incorrectly. The uncompressed file must be ≤ 50 MB. Older guidance from Google suggested a 10 MB compressed cap; the modern guidance allows up to 50 MB compressed but the uncompressed file must still meet the spec.

How to apply

  1. Audit URL count. Below 50,000, ship a single sitemap. Above, plan partitioning.
  2. Choose a partitioning dimension (content type, freshness, locale, or hybrid).
  3. Generate sitemap-index and sub-sitemaps from your CMS or static site generator. Most platforms (Next.js, Hugo, Sanity, WordPress with Yoast) emit sitemap-indexes by default.
  4. Confirm <lastmod> is sourced from a trustworthy content modification timestamp — not the build time.
  5. Reference the index in robots.txt and submit to Search Console and Bing Webmaster Tools.
  6. Monitor CDN logs for AI crawler fetch patterns over the next 7-14 days.
  7. Re-evaluate quarterly; URL counts grow, and partitioning may need to be re-balanced.
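Step 1's decision can be reduced to arithmetic against the spec's 50,000-URL ceiling. plan_sitemaps is an illustrative helper, not a real tool:

```python
def plan_sitemaps(url_count: int, max_per_file: int = 50_000) -> dict:
    """Decide single sitemap vs. sitemap-index, and how many
    sub-sitemaps a flat, evenly-filled partition would need."""
    if url_count <= max_per_file:
        return {"layout": "single", "files": 1}
    files = -(-url_count // max_per_file)   # ceiling division
    return {"layout": "index", "files": files}
```

Note this only gives a floor on file count; partitioning by content type or freshness will usually produce more, smaller files, which is fine as long as each stays under the per-file limits.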

FAQ

Q: When do I need a sitemap-index instead of a single sitemap?

Whenever your site exceeds 50,000 URLs or 50 MB uncompressed in a single sitemap. The sitemap-index lets you reference multiple sub-sitemaps without merging them.
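As a sketch, splitting a flat URL inventory into spec-sized sub-sitemaps is a simple chunking step (chunk_urls and the URL pattern are illustrative):

```python
def chunk_urls(urls, max_per_file=50_000):
    """Split a flat URL list into sub-sitemap-sized chunks that
    respect the 50,000-URL ceiling; each chunk becomes one sitemap
    file referenced from the index."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]

chunks = chunk_urls([f"https://example.com/p/{n}" for n in range(120_000)])
```

Here 120,000 URLs yield three sub-sitemaps, which the index then references by filename.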

Q: Can I nest a sitemap-index inside another sitemap-index?

The Sitemap Protocol does not formally support it, and Google does not honor it. Some crawlers tolerate nesting, but the safe pattern is a single-level index.

Q: How accurate must lastmod values be?

They must reflect actual page content modification, not file regeneration. Bing has stated drifting or fake lastmod values reduce crawler trust. Better to omit lastmod than to fake it.

Q: Do AI crawlers honor sitemap files differently from Googlebot?

No. Major AI crawlers (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) honor Sitemap Protocol 0.9 the same way Googlebot does. Crawl cadence and recrawl logic differ, but discovery is identical.

Q: Should I include noindex pages in my sitemap?

No. Sitemaps should list only canonical, indexable URLs. Including noindex / 404 / redirected URLs reduces crawler trust and wastes budget.

Q: Where should the sitemap-index live?

At the site root, conventionally at /sitemap.xml. The path can vary, but referenced sub-sitemaps must be in the same or deeper directory and on the same host.

Sources

[1] sitemaps.org Protocol — verified 2026-05-03 — supports 50K URL / 50 MB limits and sitemap-index XML format. https://www.sitemaps.org/protocol.html

[2] Google, "Manage Your Sitemaps With Sitemap Index Files" — verified 2026-05-03 — supports 500 indexes per Search Console property and same-host requirement. https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps

[3] Bing Webmaster Blog, "Keeping Content Discoverable with Sitemaps in AI Powered Search" (July 2025) — verified 2026-05-03 — supports lastmod accuracy guidance. https://blogs.bing.com/webmaster/July-2025/Keeping-Content-Discoverable-with-Sitemaps-in-AI-Powered-Search

[4] Webmasters Stack Exchange, "Can a sitemap index contain other sitemap indexes?" — verified 2026-05-03 — supports flat-index recommendation. https://webmasters.stackexchange.com/questions/18243/can-a-sitemap-index-contain-other-sitemap-indexes

[5] Conductor Academy, "XML Sitemap: the ultimate reference guide" — verified 2026-05-03 — supports omit-rather-than-fake lastmod guidance. https://www.conductor.com/academy/xml-sitemap/

[6] Google, "Create a News Sitemap" — verified 2026-05-03 — supports news sitemap namespace. https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap

