AI Search Referrer Attribution Reference Specification
This specification defines how analytics systems should detect, label, and attribute referral traffic from AI search engines. It enumerates the HTTP Referer hosts each major LLM surface emits, the UTM conventions analytics teams should standardize on, and a deterministic decision tree for reconciling sessions that arrive without a referrer header.
TL;DR. AI search clients pass referrer data inconsistently: Perplexity reliably sends perplexity.ai, ChatGPT sometimes sends chatgpt.com or chat.openai.com, Gemini sometimes sends gemini.google.com, and Google AI Overviews pass google.com indistinguishably from organic. Reliable attribution requires (1) a host allow-list, (2) a standardized UTM scheme on every link you control, and (3) a GA4 custom channel group that catches the long tail. Treat the result as a conservative lower bound — copy-paste citations are intrinsically unattributable.
This document is a normative reference. It defines field shapes, host patterns, and decision rules. For step-by-step setup, see the linked tutorials under Related Articles.
1. Scope and definitions
AI search referrer attribution is the process of identifying inbound web sessions whose origin was an AI-mediated answer surface (chat assistants, answer engines, AI browsers, or AI Overviews) and assigning them to a canonical channel and source.
| Term | Definition |
|---|---|
| AI search surface | Any user-facing product whose primary output is a synthesized answer with optional source citations (ChatGPT, Perplexity, Gemini, Copilot, Claude, You.com, Brave Leo). |
| AI browser | A browser whose default new-tab or address-bar surface is an LLM (ChatGPT Atlas, Perplexity Comet, Arc Search). |
| Click-through | A user-initiated navigation from a citation chip, link, or button inside an AI surface to a destination URL. |
| Citation impression | The act of an AI surface displaying a source link without the user clicking it. Not directly attributable from referrer headers alone. |
| Dark funnel | Sessions influenced by an AI surface but recorded as Direct, Organic, or (not set) due to missing referrer or UTM data. |
Out of scope: training-data attribution, crawler/bot identification (covered in the crawler user-agent reference), and conversion modeling.
2. Why AI traffic is hard to attribute
Three wire-level facts drive the entire spec:
- HTTP Referer is optional and often stripped. AI clients that open links in an in-app webview, native app, or sandbox frequently omit the header or rewrite it. When this happens, GA4 records the session as Direct / (not set).
- AI Overviews pass google.com. Clicks from a Google AI Overview citation present an HTTP Referer identical to a normal organic SERP click. There is currently no public referrer field that distinguishes them.
- Copy-paste is invisible. When a user reads a Perplexity or ChatGPT answer, copies the destination URL, and pastes it into a new tab, no referrer is sent. This is structurally unattributable.
The result: referrer-only attribution is necessary but not sufficient. A complete implementation layers referrer detection, UTM conventions, and channel grouping.
3. Normative referrer host registry
Attribution systems MUST treat the following hosts as AI-search referrers when they appear in document.referrer or the GA4 session_source dimension. Hosts are listed as registrable domains; subdomains MUST match.
3.1 First-party AI assistants
| Surface | Referrer hosts | Reliability | Notes |
|---|---|---|---|
| ChatGPT (web) | chatgpt.com, chat.openai.com | Medium | Sent for clickable citations on chatgpt.com; suppressed when opened via in-app link handlers. |
| ChatGPT Atlas browser | (not set) or chatgpt.com | Low | Internal webviews strip the Referer; treat as Direct fallback. |
| Perplexity (web) | perplexity.ai, www.perplexity.ai | High | Most reliable AI source; consistently passes referrer on citation clicks. |
| Perplexity Comet | perplexity.ai or (not set) | Medium | Behavior depends on whether the user clicks a citation versus types a URL. |
| Google Gemini | gemini.google.com | Medium | Sent for explicit citation clicks; native Android Assistant surface often omits. |
| Microsoft Copilot | copilot.microsoft.com, bing.com | Medium | Bing-Copilot blended sessions sometimes attribute to bing.com. |
| Claude (Anthropic) | claude.ai | Low | Most external links open in a new tab without referrer. |
| You.com | you.com | Medium | Reliable when the user clicks the citation card. |
| Brave Leo | (not set) | Very low | In-browser sidebar; no referrer in current builds. |
| Meta AI | meta.ai | Low | Inconsistent across surfaces. |
3.2 AI Overviews and SGE
Clicks from Google AI Overviews carry google.com (or country variants) as the referrer, identical to organic SERP clicks. Implementations MUST NOT classify google.com referrals as AI-attributed. AI Overview attribution requires a separate measurement model (branded-search lift, view-through, or first-party citation telemetry) and is out of scope for referrer-based detection.
3.3 Aggregators that proxy AI surfaces
Aggregator hosts (poe.com, phind.com, kagi.com, huggingface.co/chat, t3.chat) MUST be treated as AI-search referrers when present.
4. UTM convention (RECOMMENDED)
For every link an organization controls (citations seeded into prompts, plugin output, structured data, RAG sources), apply a deterministic UTM scheme so attribution survives referrer loss.
```
utm_source   = <surface slug from §3.1>      // e.g. perplexity, chatgpt; lowercase, hyphenated
utm_medium   = ai_search                     // fixed literal
utm_campaign = <campaign identifier>
utm_content  = <content or placement variant>
utm_term     = <opaque prompt-intent hash>   // surfaces downstream as ai_prompt_intent_hash (§7)
```
Rules:
- utm_medium MUST be the literal string ai_search. This single token is the join key for channel grouping.
- utm_source values MUST be lowercase, hyphenated, and drawn from the controlled vocabulary in §3.1.
- Implementations MUST preserve UTMs through canonical redirects and CDN edge rules.
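The §4 scheme can be applied mechanically. The sketch below builds a UTM-tagged URL under those rules; the function name `tag_for_ai_surface` is illustrative, not part of any analytics API, and existing query parameters are preserved rather than clobbered.

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl

def tag_for_ai_surface(url: str, source: str, campaign: str,
                       content: str = "", term: str = "") -> str:
    """Append the §4 UTM scheme; utm_medium is always the fixed literal ai_search."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # keep any existing query parameters
    query.update({
        "utm_source": source,        # controlled vocabulary from §3.1, e.g. "perplexity"
        "utm_medium": "ai_search",   # fixed literal: the join key for channel grouping
        "utm_campaign": campaign,
    })
    if content:
        query["utm_content"] = content
    if term:
        query["utm_term"] = term     # opaque prompt-intent hash, per §7
    return urlunparse(parts._replace(query=urlencode(query)))
```

Because `utm_medium` is hard-coded, every URL produced this way survives referrer loss and lands in the §5 channel group deterministically.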
5. GA4 channel grouping ruleset
The RECOMMENDED GA4 custom channel group, ordered by precedence, is:
- AI Search — Paid surface — session_medium matches ai_search AND session_campaign contains paid.
- AI Search — Cited — session_source matches the regex below OR session_medium equals ai_search.
- AI Search — Suspected (dark funnel) — session_source is (direct) AND landing page is in the AI-cited URL set AND time-of-day or geographic anomaly score exceeds threshold (heuristic; mark with inferred=true).
- Organic Search — fall-through.
Reference regex for rule 2:

```regex
^(chatgpt\.com|chat\.openai\.com|perplexity\.ai|www\.perplexity\.ai|gemini\.google\.com|copilot\.microsoft\.com|claude\.ai|you\.com|poe\.com|phind\.com|kagi\.com|meta\.ai|t3\.chat)$
```
Implementations MUST place AI Search rules above Organic Search and Referral so that bing.com Copilot sessions are not absorbed into Organic.
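Rule 2 of the channel group can be validated offline before deploying it to GA4. This is a minimal sketch of that rule, assuming session_source and session_medium have already been extracted; the `is_ai_cited` name is illustrative.

```python
import re

# The reference regex from rule 2, verbatim
AI_SOURCE_RE = re.compile(
    r"^(chatgpt\.com|chat\.openai\.com|perplexity\.ai|www\.perplexity\.ai"
    r"|gemini\.google\.com|copilot\.microsoft\.com|claude\.ai|you\.com"
    r"|poe\.com|phind\.com|kagi\.com|meta\.ai|t3\.chat)$"
)

def is_ai_cited(session_source: str, session_medium: str) -> bool:
    """Rule 2 of §5: source matches the registry regex OR medium equals ai_search."""
    return bool(AI_SOURCE_RE.match(session_source)) or session_medium == "ai_search"
```

Note that `bing.com` is intentionally outside the regex: blended Bing/Copilot sessions are only captured when they carry `utm_medium=ai_search`, consistent with §9's disambiguation guidance.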
6. Detection decision tree
For each inbound session, evaluate in order and stop at the first match:
- If utm_medium = ai_search → assign AI Search / <utm_source value>.
- Else if document.referrer host matches the regex in §5 → assign AI Search / <referrer host>.
- Else if document.referrer host is google.com AND the landing page is flagged as AI-Overview-cited in your monitoring tool → assign AI Overviews (inferred).
- Else if the referrer is empty AND the client_id is new AND the landing path is in the AI-cited URL set within a 7-day citation freshness window → assign AI Search (inferred) with inferred=true.
- Else fall through to standard channel grouping.
Rules 3 and 4 are inferred and MUST be flagged so downstream models can apply confidence weighting.
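The decision tree maps directly to a first-match classifier. In this sketch, `landing_flags` stands in for a citation-monitoring feed (an assumption of this example, not a GA4 field), and the returned tuple carries the inferred flag required for rules 3 and 4.

```python
# Registry hosts from §5 (subset shown; extend with the full table)
AI_HOSTS = {
    "chatgpt.com", "chat.openai.com", "perplexity.ai", "www.perplexity.ai",
    "gemini.google.com", "copilot.microsoft.com", "claude.ai", "you.com",
    "poe.com", "phind.com", "kagi.com", "meta.ai", "t3.chat",
}

def classify_session(utm_medium: str, referrer_host: str,
                     is_new_client: bool, landing_flags: dict) -> tuple:
    """Sketch of the §6 tree. Returns (channel, attribution_method, inferred).
    Evaluates in order and stops at the first match."""
    if utm_medium == "ai_search":                                   # rule 1
        return ("AI Search", "utm", False)
    if referrer_host in AI_HOSTS:                                   # rule 2
        return ("AI Search", "referrer", False)
    if referrer_host == "google.com" and landing_flags.get("overview_cited"):
        return ("AI Overviews", "inferred_overview", True)          # rule 3
    if not referrer_host and is_new_client and landing_flags.get("cited_within_7d"):
        return ("AI Search", "inferred_dark", True)                 # rule 4
    return ("Standard", None, False)                                # fall-through
```

The third tuple element is the MUST-flag from §6, so downstream models can apply confidence weighting without re-deriving how a session was classified.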
7. Field reference
| Field | Type | Required | Description |
|---|---|---|---|
| ai_surface | enum (§3.1) | yes | The AI product attributed to the session. |
| ai_attribution_method | enum: utm, referrer, inferred_overview, inferred_dark | yes | How the attribution was derived. |
| ai_confidence | float 0.0–1.0 | yes | 1.0 for utm, 0.8 for referrer, 0.5 for inferred_overview, 0.3 for inferred_dark. |
| ai_citation_url | string | optional | Destination URL recorded in the AI surface, when known. |
| ai_prompt_intent_hash | string | optional | Opaque hash derived from utm_term, for cohorting. |
| ai_first_seen_at | ISO-8601 | yes | First session timestamp on this client_id from any AI surface. |
Downstream attribution models SHOULD multiply pipeline credit by ai_confidence to avoid over-counting inferred sessions.
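The weighting rule can be sketched in a few lines. The `pipeline_value` key and function name below are assumptions of this example; the confidence constants come straight from the §7 field reference.

```python
# ai_confidence values as defined in §7
AI_CONFIDENCE = {
    "utm": 1.0,               # deterministic: the link carried the §4 UTM scheme
    "referrer": 0.8,          # host matched the §3 registry
    "inferred_overview": 0.5, # rule 3 of §6 (heuristic)
    "inferred_dark": 0.3,     # rule 4 of §6 (heuristic)
}

def weighted_pipeline_credit(sessions: list) -> float:
    """Sum pipeline value weighted by ai_confidence so inferred sessions
    are not counted at par with deterministic UTM sessions."""
    return sum(s["pipeline_value"] * AI_CONFIDENCE[s["ai_attribution_method"]]
               for s in sessions)
```

For example, one UTM session and one dark-funnel session worth 100 each yield 130 weighted credit, not 200, which is the over-counting the FAQ warns about.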
8. Conformance levels
- Level 1 — Detect. Implements §3 host registry and §5 channel grouping. Sufficient for executive dashboards.
- Level 2 — Tag. Adds §4 UTM conventions on all controlled surfaces. Sufficient for content-level ROI reporting.
- Level 3 — Reconcile. Adds §6 inferred-attribution rules with confidence flags and a citation freshness index. Required for pipeline-level attribution.
Compliance claims MUST cite the level achieved.
9. Misconceptions
- "GA4 has built-in AI channels." It does not. The default channel group routes most AI surfaces into Referral or Direct. A custom channel group is required.
- "bing.com is always Copilot." It is not. Bing organic and Copilot share the host. Use UTMs or page-path heuristics to disambiguate.
- "Perplexity referrers are 100% reliable." They are the most reliable, not perfect. Mobile app and Comet sessions can still strip the referrer.
- "AI Overviews can be isolated from referrer alone." They cannot. A separate measurement model is required.
10. FAQ
Q: Does Perplexity send a referrer header?
Yes. Perplexity is currently the most reliable AI source for referrer-based attribution; clicks from citation chips on perplexity.ai consistently include the Referer header. Native app and AI-browser sessions are less reliable.
Q: Why does ChatGPT traffic show up as Direct in GA4?
Because many ChatGPT click paths (mobile app, Atlas browser, in-app webviews) suppress the Referer header. To recover those sessions, tag any links you control with utm_medium=ai_search and add a GA4 custom channel group as in §5.
Q: Can I attribute clicks from Google AI Overviews?
Not from referrer headers — AI Overview clicks pass google.com exactly like organic. Use a separate measurement model: branded-search lift, citation monitoring tools, or first-party telemetry that detects AI-Overview-driven landing pages.
Q: What utm_medium value should I standardize on?
Use the literal string ai_search. A single, consistent token makes channel grouping, BigQuery joins, and cross-platform reporting deterministic.
Q: How should I weight inferred AI traffic?
Multiply pipeline credit by the ai_confidence value defined in §7. Inferred dark-funnel sessions (0.3) should not be combined with deterministic UTM sessions (1.0) without weighting, or you will overstate AI impact.
Related Articles
AI Answer Length Patterns: Word and Token Targets per Engine in 2026
Reference for AI answer lengths in 2026 — word and token targets for ChatGPT, Perplexity, and Google AI Overviews so writers format extractable answers.
AI Citation Confidence Scoring Framework: Predicting Source Inclusion Likelihood
AI citation confidence scoring framework: a predictive model that scores how likely generative engines are to cite a source based on retrieval, grounding, and trust signals.
AI Citation Format Specification by Engine: How ChatGPT, Perplexity, Gemini, and Claude Render Sources in 2026
Reference specification of how ChatGPT, Perplexity, Gemini, and Claude render source citations in 2026, with format patterns, anchor text, and rendering rules.