Service Worker and AI Crawlers Spec: Cache, Routing, Bypass
Service workers can intercept fetches and serve stale or offline-shell responses that break AI crawler indexing. The spec defines bypass paths, cache strategies by request class, and the offline-fallback contract that preserves the rendered HTML crawlers need.
TL;DR
- A service worker that returns the offline shell (or a stale cached HTML) to a crawler will block indexing of that page.
- Match crawlers by user agent (Googlebot, GPTBot, ClaudeBot, etc.) and bypass the service worker for navigation requests from those agents.
- For non-navigation assets, network-first with short cache fallback is safer for crawlers than cache-first.
- Never return a generic offline.html in response to a crawler navigation; respond with the network result or fall through to the origin.
- Test crawler behavior in CI with a fake user agent and assert that the response body is the live page, not the shell.
Definition
A service worker is a JavaScript worker registered against an origin that intercepts network requests from pages on that origin and decides how to respond. Service workers exist primarily to enable Progressive Web App features: offline support, instant navigations from cache, and background sync. They are also the first thing in the response path for fetches initiated by browsers and any crawler that executes JavaScript and respects the worker registration.
The AI crawler interaction surface is where service-worker behavior collides with indexing. Search and AI crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot) either render JavaScript, and therefore activate the service worker, or do not, and never run it at all. The spec covers both cases and the bypass logic that makes either case produce indexable HTML.
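A minimal sketch of the two halves, with /sw.js as an illustrative worker path:

```js
// Page script: register a worker against this origin. Its scope
// defaults to the directory the worker file is served from.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js');
}

// sw.js: every in-scope request raises a fetch event. Calling
// event.respondWith() takes over the response; not calling it
// lets the request proceed to the network as usual.
self.addEventListener('fetch', (event) => {
  // strategy decisions go here
});
```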
Why this matters
A service worker that returns a cached or fallback response to a crawler can replace your full HTML with a generic offline shell. The crawler then indexes the shell instead of the page, and over time the page disappears from search and AI overview citations. This is one of the most damaging configuration errors in PWA-style sites.
The failure mode is silent. Pages look correct in production for users: the service worker serves fast cached responses and falls back to the network. But a rendering crawler sees those same cached responses, which may be stale, or the offline fallback, which is essentially empty. Search Console surfaces the indexing collapse only after a delay, often weeks into the decline.
A second motivation is parity between crawlers. Googlebot renders JavaScript and activates service workers; some smaller AI crawlers fetch raw HTML and never run the worker. On an app-shell PWA, the origin HTML those static crawlers receive is itself a near-empty shell, so a configuration that feeds Googlebot stale or shell content leaves every other crawler with nothing better.
How it works
The spec covers four mechanisms.
Crawler detection by user agent. Inside the service worker's fetch handler, check the requesting user agent against a list of known crawlers. Note that the User-Agent header is typically not readable from event.request.headers inside a worker; self.navigator.userAgent, which reports the user agent of the browser the worker runs in (the crawler's own string when a crawler renders the page), is the dependable source. The minimum set is Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot, and Amazonbot. Maintain a list rather than a regex; user-agent strings change.
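A minimal detection sketch under that assumption; the substring list is illustrative and should stay in sync with the allow-list you maintain:

```js
// sw.js: allow-list of crawler UA substrings (review quarterly).
const CRAWLER_SUBSTRINGS = [
  'googlebot', 'bingbot', 'gptbot', 'claudebot',
  'perplexitybot', 'applebot', 'amazonbot',
];

// self.navigator.userAgent reports the UA of the browser the worker
// runs in, which is the crawler's own UA when a crawler renders.
function isCrawler() {
  const ua = self.navigator.userAgent.toLowerCase();
  return CRAWLER_SUBSTRINGS.some((name) => ua.includes(name));
}
```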
Bypass for crawler navigation requests. When a crawler is detected and the request is a navigation (event.request.mode === 'navigate'), call event.respondWith(fetch(event.request)) to bypass the cache and serve the live network response. If the network fails, do not catch the failure into an offline response: let the rejected fetch propagate (or return without calling respondWith at all) so the browser reports a network error rather than serving an offline shell.
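A sketch of the bypass branch placed at the top of the fetch handler, reusing the isCrawler() helper above:

```js
self.addEventListener('fetch', (event) => {
  if (isCrawler() && event.request.mode === 'navigate') {
    // Crawler navigation: always go to the network. If fetch()
    // rejects, the rejection propagates and the browser surfaces
    // a network error instead of an offline shell.
    event.respondWith(fetch(event.request));
    return;
  }
  // ...user-facing cache strategies follow here.
});
```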
Cache strategies by request class. For non-navigation requests (CSS, JS, images), use the strategies in the table:
| Request class | Strategy for users | Strategy for crawlers |
|---|---|---|
| HTML navigation | Network-first with cache fallback | Network-only (bypass SW) |
| App shell JS | Cache-first | Network-first |
| Critical CSS | Cache-first | Network-first |
| Images | Stale-while-revalidate | Network-only |
| API responses | Network-first with TTL cache | Network-only |
| Offline fallback page | offline.html | Never serve to crawlers |
The principle: crawlers should always see what the origin currently serves, not what was cached at install time.
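A sketch of the user-facing column of the table, network-first for HTML navigations and stale-while-revalidate for images; the fetch handler would route to these helpers after the crawler bypass, cache names are illustrative, and error handling is deliberately minimal:

```js
const PAGE_CACHE = 'pages-v1';   // illustrative cache names
const IMAGE_CACHE = 'images-v1';

// Network-first: try the origin, cache the result, fall back to the
// last cached copy only when the network fails.
async function networkFirst(request) {
  const cache = await caches.open(PAGE_CACHE);
  try {
    const fresh = await fetch(request);
    cache.put(request, fresh.clone()); // fire-and-forget update
    return fresh;
  } catch (err) {
    const cached = await cache.match(request);
    if (cached) return cached;
    throw err; // nothing cached: let the browser report the failure
  }
}

// Stale-while-revalidate: answer from cache immediately, refresh the
// entry in the background so the next request gets the newer copy.
// In a real handler, pass the refresh promise to event.waitUntil().
async function staleWhileRevalidate(request) {
  const cache = await caches.open(IMAGE_CACHE);
  const cached = await cache.match(request);
  const refresh = fetch(request).then((fresh) => {
    cache.put(request, fresh.clone());
    return fresh;
  });
  return cached || refresh;
}
```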
Offline-fallback contract. The traditional PWA offline page ("You're offline") must never be returned to a crawler. Either bypass to network or return a 503 with Retry-After. A 503 is a recoverable signal; the offline shell is an indexable replacement that will overwrite the real page in the index.
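Where you prefer the explicit 503 over letting the fetch rejection propagate, a minimal sketch inside the crawler branch of the fetch handler:

```js
// Crawler navigation with explicit 503 fallback: a recoverable
// signal the engine will retry, never an indexable offline shell.
event.respondWith(
  fetch(event.request).catch(
    () =>
      new Response(null, {
        status: 503,
        headers: { 'Retry-After': '120' },
      })
  )
);
```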
Practical application
Five-step adoption process:
- Audit the current service worker. List every fetch handler branch and identify which can return non-network content. Each is a candidate failure point for crawlers.
- Add a crawler-bypass branch as the first check. Match against the user-agent allow-list and bypass to fetch(event.request) when matched. Place this branch first so subsequent strategies cannot override it.
- Replace cache-first with network-first for HTML navigations even for users. This costs a small amount of perceived performance but eliminates the most common stale-content indexing risk.
- Replace the offline fallback with a 503. When the network fails for a crawler, respond with new Response(null, { status: 503, headers: { 'Retry-After': '120' } }). Search engines understand 503 and will retry; they do not understand offline shells.
- Add CI tests with crawler user agents. A simple Playwright test that loads the homepage with User-Agent: Googlebot/2.1 and asserts the response body contains your real page content rather than the shell catches almost every regression; a sketch follows this list. Run on every deploy.
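A minimal Playwright sketch of step 5; the base URL, UA string, and asserted strings are placeholders for your own:

```js
import { test, expect } from '@playwright/test';

const BASE_URL = 'https://staging.example.com/'; // placeholder

test('crawler navigation gets live HTML, not the shell', async ({ browser }) => {
  // Spoofed UA: the worker's navigator.userAgent will report this string.
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  });
  const page = await context.newPage();

  // First visit registers the worker; wait for activation, then reload
  // so the worker actually controls the navigation under test.
  await page.goto(BASE_URL);
  await page.evaluate(() => navigator.serviceWorker.ready);
  const response = await page.reload();

  expect(response.status()).toBe(200);
  const body = await response.text();
  expect(body).toContain('Your real page heading'); // placeholder assertion
  expect(body).not.toContain("You're offline");     // illustrative shell marker
  await context.close();
});
```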
Common mistakes
Returning the offline shell to a crawler is the most damaging mistake. The shell is empty content that overwrites the real page in the search index. The fix is the bypass-or-503 rule from the spec.
Using cache-first for HTML is the second mistake. Cache-first means the service worker returns whatever was cached at install time, which may be days or weeks old. Crawlers index the stale version, and any updates to the live page are invisible until the cache expires.
A third mistake is matching crawlers with a too-narrow regex. "Googlebot" alone misses "Googlebot-Image", "Googlebot-News", and AI crawlers entirely. Maintain an explicit allow-list rather than a regex; review it quarterly.
The final mistake is registering the service worker on the homepage and never testing crawler behavior. Once registered, a worker controls every subsequent in-scope navigation on the origin, so a bad worker poisons every page until a fixed version ships and clients pick it up.
FAQ
Q: Should I disable the service worker for crawlers entirely?
For most sites, yes — bypass the service worker for any request from a recognized crawler user agent. The performance benefit of caching does not apply to crawlers (they cache themselves at the engine level), and the indexing risk of stale or offline content is real. The only case where you might keep the worker is when you need to inject crawler-specific structured data, but even then the safer pattern is to render that data server-side instead.
Q: What user agents should I match?
At minimum: Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot, Amazonbot, DuckDuckBot, YandexBot, Baiduspider. Maintain the list in a config file the service worker imports, and review it quarterly because new AI crawlers appear regularly. Match by case-insensitive substring; full regex is unnecessary and brittle.
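One way to structure that, assuming a classic (non-module) worker; the file name is illustrative:

```js
// crawler-allowlist.js: the shared config file (name is illustrative).
// Lower-case substrings, matched case-insensitively against the UA.
self.CRAWLER_SUBSTRINGS = [
  'googlebot', 'bingbot', 'gptbot', 'claudebot', 'perplexitybot',
  'applebot', 'amazonbot', 'duckduckbot', 'yandexbot', 'baiduspider',
];

// sw.js: a classic worker pulls the list in with importScripts;
// a module worker (type: 'module') would use a static import instead.
importScripts('/crawler-allowlist.js');
```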
Q: Can I serve a different (lighter) page to crawlers?
No. Cloaking — serving substantively different content to crawlers vs users — violates Google's guidelines and risks de-indexing. The service worker bypass should serve the same network response a regular user receives, just without the cache layer. The only acceptable difference is that the crawler sees a fresher version, never a different version.
Q: How does this interact with prerendering or SSR?
If you use server-side rendering or a prerender service, the service worker should not intercept the navigation at all for crawlers. The prerender output is what should reach the crawler. Run the bypass branch first, before any cache lookup or render-on-demand logic; the SSR response will then be returned directly.
Q: How do I test the crawler path before deploying?
Three layers. Unit-test the detection and fetch-handler logic with each crawler user agent in your allow-list and assert the handler takes the network-bypass path (sketched below). End-to-end test with Playwright using a crawler user-agent string and assert the rendered HTML contains live content. Finally, canary the new worker in production on a small percentage of traffic and watch Search Console for 503/404/empty-content spikes before full rollout.
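A sketch of the unit-test layer, using a variant of the detection helper that takes the UA as a parameter so it runs outside a worker global (Vitest shown; Jest is near-identical, and the UA strings are abbreviated):

```js
// detection.test.js
import { describe, it, expect } from 'vitest';

const CRAWLER_SUBSTRINGS = ['googlebot', 'gptbot', 'claudebot']; // trimmed for the example

function isCrawlerUA(ua) {
  const lower = ua.toLowerCase();
  return CRAWLER_SUBSTRINGS.some((name) => lower.includes(name));
}

describe('crawler detection', () => {
  it.each([
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'GPTBot/1.0 (+https://openai.com/gptbot)',
    'Mozilla/5.0 (compatible; ClaudeBot/1.0)',
  ])('matches %s', (ua) => {
    expect(isCrawlerUA(ua)).toBe(true);
  });

  it('does not match a regular browser UA', () => {
    expect(isCrawlerUA('Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0')).toBe(false);
  });
});
```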