AI Search Referrer Attribution Reference Specification
This specification defines how analytics systems should detect, label, and attribute referral traffic from AI search engines. It enumerates the HTTP Referer hosts each major LLM surface emits, the UTM conventions analytics teams should standardize on, and a deterministic decision tree for reconciling sessions that arrive without a referrer header.
TL;DR. AI search clients pass referrer data inconsistently: Perplexity reliably sends perplexity.ai, ChatGPT sometimes sends chatgpt.com or chat.openai.com, Gemini sometimes sends gemini.google.com, and Google AI Overviews pass google.com indistinguishably from organic. Reliable attribution requires (1) a host allow-list, (2) a standardized UTM scheme on every link you control, and (3) a GA4 custom channel group that catches the long tail. Treat the result as a conservative lower bound — copy-paste citations are intrinsically unattributable.
This document is a normative reference. It defines field shapes, host patterns, and decision rules. For step-by-step setup, see the linked tutorials under Related Articles.
1. Scope and definitions
AI search referrer attribution is the process of identifying inbound web sessions whose origin was an AI-mediated answer surface (chat assistants, answer engines, AI browsers, or AI Overviews) and assigning them to a canonical channel and source.
| Term | Definition |
|---|---|
| AI search surface | Any user-facing product whose primary output is a synthesized answer with optional source citations (ChatGPT, Perplexity, Gemini, Copilot, Claude, You.com, Brave Leo). |
| AI browser | A browser whose default new-tab or address-bar surface is an LLM (ChatGPT Atlas, Perplexity Comet, Arc Search). |
| Click-through | A user-initiated navigation from a citation chip, link, or button inside an AI surface to a destination URL. |
| Citation impression | The act of an AI surface displaying a source link without the user clicking it. Not directly attributable from referrer headers alone. |
| Dark funnel | Sessions influenced by an AI surface but recorded as Direct, Organic, or (not set) due to missing referrer or UTM data. |
Out of scope: training-data attribution, crawler/bot identification (covered in the crawler user-agent reference), and conversion modeling.
2. Why AI traffic is hard to attribute
Three wire-level facts drive the entire spec:
- HTTP Referer is optional and often stripped. AI clients that open links in an in-app webview, native app, or sandbox frequently omit the header or rewrite it. When this happens, GA4 records the session as Direct / (not set).
- AI Overviews pass google.com. Clicks from a Google AI Overview citation present an HTTP Referer identical to a normal organic SERP click. There is currently no public referrer field that distinguishes them.
- Copy-paste is invisible. When a user reads a Perplexity or ChatGPT answer, copies the destination URL, and pastes it into a new tab, no referrer is sent. This is structurally unattributable.
The result: referrer-only attribution is necessary but not sufficient. A complete implementation layers referrer detection, UTM conventions, and channel grouping.
3. Normative referrer host registry
Attribution systems MUST treat the following hosts as AI-search referrers when they appear in document.referrer or the GA4 session_source dimension. Hosts are listed as registrable domains; subdomains MUST match.
3.1 First-party AI assistants
| Surface | Referrer hosts | Reliability | Notes |
|---|---|---|---|
| ChatGPT (web) | chatgpt.com, chat.openai.com | Medium | Sent for clickable citations on chatgpt.com; suppressed when opened via in-app link handlers. |
| ChatGPT Atlas browser | (not set) or chatgpt.com | Low | Internal webviews strip the Referer; treat as Direct fallback. |
| Perplexity (web) | perplexity.ai, www.perplexity.ai | High | Most reliable AI source; consistently passes referrer on citation clicks. |
| Perplexity Comet | perplexity.ai or (not set) | Medium | Behavior depends on whether the user clicks a citation versus types a URL. |
| Google Gemini | gemini.google.com | Medium | Sent for explicit citation clicks; native Android Assistant surface often omits. |
| Microsoft Copilot | copilot.microsoft.com, bing.com | Medium | Bing-Copilot blended sessions sometimes attribute to bing.com. |
| Claude (Anthropic) | claude.ai | Low | Most external links open in a new tab without referrer. |
| You.com | you.com | Medium | Reliable when the user clicks the citation card. |
| Brave Leo | (not set) | Very low | In-browser sidebar; no referrer in current builds. |
| Meta AI | meta.ai | Low | Inconsistent across surfaces. |
3.2 AI Overviews and SGE
Clicks from Google AI Overviews carry google.com (or country variants) as the referrer, identical to organic SERP clicks. Implementations MUST NOT classify google.com referrals as AI-attributed. AI Overview attribution requires a separate measurement model (branded-search lift, view-through, or first-party citation telemetry) and is out of scope for referrer-based detection.
3.3 Aggregators that proxy AI surfaces
Aggregator hosts (poe.com, phind.com, kagi.com, huggingface.co/chat, t3.chat) MUST be treated as AI-search referrers when present.
4. UTM convention (RECOMMENDED)
For every link an organization controls (citations seeded into prompts, plugin output, structured data, RAG sources), apply a deterministic UTM scheme so attribution survives referrer loss.
```
utm_source   = <surface slug from §3.1>      // e.g. perplexity, chatgpt; lowercase, hyphenated
utm_medium   = ai_search                     // fixed literal
utm_campaign = <campaign identifier>
utm_content  = <content or placement variant>
utm_term     = <opaque prompt-intent hash>   // surfaces downstream as ai_prompt_intent_hash (§7)
```
Rules:
- utm_medium MUST be the literal string ai_search. This single token is the join key for channel grouping.
- utm_source values MUST be lowercase, hyphenated, and drawn from the controlled vocabulary in §3.1.
- Implementations MUST preserve UTMs through canonical redirects and CDN edge rules.
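The §4 scheme can be applied mechanically. The sketch below builds a UTM-tagged URL under those rules; the function name `tag_for_ai_surface` is illustrative, not part of any analytics API, and existing query parameters are preserved rather than clobbered.

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl

def tag_for_ai_surface(url: str, source: str, campaign: str,
                       content: str = "", term: str = "") -> str:
    """Append the §4 UTM scheme; utm_medium is always the fixed literal ai_search."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # keep any existing query parameters
    query.update({
        "utm_source": source,        # controlled vocabulary from §3.1, e.g. "perplexity"
        "utm_medium": "ai_search",   # fixed literal: the join key for channel grouping
        "utm_campaign": campaign,
    })
    if content:
        query["utm_content"] = content
    if term:
        query["utm_term"] = term     # opaque prompt-intent hash, per §7
    return urlunparse(parts._replace(query=urlencode(query)))
```

Because `utm_medium` is hard-coded, every URL produced this way survives referrer loss and lands in the §5 channel group deterministically.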
5. GA4 channel grouping ruleset
The RECOMMENDED GA4 custom channel group, ordered by precedence, is:
- AI Search — Paid surface — session_medium matches ai_search AND session_campaign contains paid.
- AI Search — Cited — session_source matches the regex below OR session_medium equals ai_search.
- AI Search — Suspected (dark funnel) — session_source is (direct) AND landing page is in the AI-cited URL set AND time-of-day or geographic anomaly score exceeds threshold (heuristic; mark with inferred=true).
- Organic Search — fall-through.
Reference regex for rule 2:

```regex
^(chatgpt\.com|chat\.openai\.com|perplexity\.ai|www\.perplexity\.ai|gemini\.google\.com|copilot\.microsoft\.com|claude\.ai|you\.com|poe\.com|phind\.com|kagi\.com|meta\.ai|t3\.chat)$
```
Implementations MUST place AI Search rules above Organic Search and Referral so that bing.com Copilot sessions are not absorbed into Organic.
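Rule 2 of the channel group can be validated offline before deploying it to GA4. This is a minimal sketch of that rule, assuming session_source and session_medium have already been extracted; the `is_ai_cited` name is illustrative.

```python
import re

# The reference regex from rule 2, verbatim
AI_SOURCE_RE = re.compile(
    r"^(chatgpt\.com|chat\.openai\.com|perplexity\.ai|www\.perplexity\.ai"
    r"|gemini\.google\.com|copilot\.microsoft\.com|claude\.ai|you\.com"
    r"|poe\.com|phind\.com|kagi\.com|meta\.ai|t3\.chat)$"
)

def is_ai_cited(session_source: str, session_medium: str) -> bool:
    """Rule 2 of §5: source matches the registry regex OR medium equals ai_search."""
    return bool(AI_SOURCE_RE.match(session_source)) or session_medium == "ai_search"
```

Note that `bing.com` is intentionally outside the regex: blended Bing/Copilot sessions are only captured when they carry `utm_medium=ai_search`, consistent with §9's disambiguation guidance.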
6. Detection decision tree
For each inbound session, evaluate in order and stop at the first match:
- If utm_medium = ai_search → assign AI Search / <utm_source value>.
- Else if document.referrer host matches the regex in §5 → assign AI Search / <referrer host>.
- Else if document.referrer host is google.com AND the landing page is flagged as AI-Overview-cited in your monitoring tool → assign AI Overviews (inferred).
- Else if the referrer is empty AND the client_id is new AND the landing path is in the AI-cited URL set within a 7-day citation freshness window → assign AI Search (inferred) with inferred=true.
- Else fall through to standard channel grouping.
Rules 3 and 4 are inferred and MUST be flagged so downstream models can apply confidence weighting.
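The decision tree maps directly to a first-match classifier. In this sketch, `landing_flags` stands in for a citation-monitoring feed (an assumption of this example, not a GA4 field), and the returned tuple carries the inferred flag required for rules 3 and 4.

```python
# Registry hosts from §5 (subset shown; extend with the full table)
AI_HOSTS = {
    "chatgpt.com", "chat.openai.com", "perplexity.ai", "www.perplexity.ai",
    "gemini.google.com", "copilot.microsoft.com", "claude.ai", "you.com",
    "poe.com", "phind.com", "kagi.com", "meta.ai", "t3.chat",
}

def classify_session(utm_medium: str, referrer_host: str,
                     is_new_client: bool, landing_flags: dict) -> tuple:
    """Sketch of the §6 tree. Returns (channel, attribution_method, inferred).
    Evaluates in order and stops at the first match."""
    if utm_medium == "ai_search":                                   # rule 1
        return ("AI Search", "utm", False)
    if referrer_host in AI_HOSTS:                                   # rule 2
        return ("AI Search", "referrer", False)
    if referrer_host == "google.com" and landing_flags.get("overview_cited"):
        return ("AI Overviews", "inferred_overview", True)          # rule 3
    if not referrer_host and is_new_client and landing_flags.get("cited_within_7d"):
        return ("AI Search", "inferred_dark", True)                 # rule 4
    return ("Standard", None, False)                                # fall-through
```

The third tuple element is the MUST-flag from §6, so downstream models can apply confidence weighting without re-deriving how a session was classified.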
7. Field reference
| Field | Type | Required | Description |
|---|---|---|---|
| ai_surface | enum (§3.1) | yes | The AI product attributed to the session. |
| ai_attribution_method | enum: utm, referrer, inferred_overview, inferred_dark | yes | How the attribution was derived. |
| ai_confidence | float 0.0–1.0 | yes | 1.0 for utm, 0.8 for referrer, 0.5 for inferred_overview, 0.3 for inferred_dark. |
| ai_citation_url | string | optional | Destination URL recorded in the AI surface, when known. |
| ai_prompt_intent_hash | string | optional | Opaque hash derived from utm_term, for cohorting. |
| ai_first_seen_at | ISO-8601 | yes | First session timestamp on this client_id from any AI surface. |
Downstream attribution models SHOULD multiply pipeline credit by ai_confidence to avoid over-counting inferred sessions.
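The weighting rule can be sketched in a few lines. The `pipeline_value` key and function name below are assumptions of this example; the confidence constants come straight from the §7 field reference.

```python
# ai_confidence values as defined in §7
AI_CONFIDENCE = {
    "utm": 1.0,               # deterministic: the link carried the §4 UTM scheme
    "referrer": 0.8,          # host matched the §3 registry
    "inferred_overview": 0.5, # rule 3 of §6 (heuristic)
    "inferred_dark": 0.3,     # rule 4 of §6 (heuristic)
}

def weighted_pipeline_credit(sessions: list) -> float:
    """Sum pipeline value weighted by ai_confidence so inferred sessions
    are not counted at par with deterministic UTM sessions."""
    return sum(s["pipeline_value"] * AI_CONFIDENCE[s["ai_attribution_method"]]
               for s in sessions)
```

For example, one UTM session and one dark-funnel session worth 100 each yield 130 weighted credit, not 200, which is the over-counting the FAQ warns about.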
8. Conformance levels
- Level 1 — Detect. Implements §3 host registry and §5 channel grouping. Sufficient for executive dashboards.
- Level 2 — Tag. Adds §4 UTM conventions on all controlled surfaces. Sufficient for content-level ROI reporting.
- Level 3 — Reconcile. Adds §6 inferred-attribution rules with confidence flags and a citation freshness index. Required for pipeline-level attribution.
Compliance claims MUST cite the level achieved.
9. Misconceptions
- "GA4 has built-in AI channels." It does not. The default channel group routes most AI surfaces into Referral or Direct. A custom channel group is required.
- "bing.com is always Copilot." It is not. Bing organic and Copilot share the host. Use UTMs or page-path heuristics to disambiguate.
- "Perplexity referrers are 100% reliable." They are the most reliable, not perfect. Mobile app and Comet sessions can still strip the referrer.
- "AI Overviews can be isolated from referrer alone." They cannot. A separate measurement model is required.
10. FAQ
Q: Does Perplexity send a referrer header?
Yes. Perplexity is currently the most reliable AI source for referrer-based attribution; clicks from citation chips on perplexity.ai consistently include the Referer header. Native app and AI-browser sessions are less reliable.
Q: Why does ChatGPT traffic show up as Direct in GA4?
Because many ChatGPT click paths (mobile app, Atlas browser, in-app webviews) suppress the Referer header. To recover those sessions, tag any links you control with utm_medium=ai_search and add a GA4 custom channel group as in §5.
Q: Can I attribute clicks from Google AI Overviews?
Not from referrer headers — AI Overview clicks pass google.com exactly like organic. Use a separate measurement model: branded-search lift, citation monitoring tools, or first-party telemetry that detects AI-Overview-driven landing pages.
Q: What utm_medium value should I standardize on?
Use the literal string ai_search. A single, consistent token makes channel grouping, BigQuery joins, and cross-platform reporting deterministic.
Q: How should I weight inferred AI traffic?
Multiply pipeline credit by the ai_confidence value defined in §7. Inferred dark-funnel sessions (0.3) should not be combined with deterministic UTM sessions (1.0) without weighting, or you will overstate AI impact.
Related Articles
AI Answer Length Patterns: Word and Token Targets per Engine in 2026
Reference for AI answer lengths in 2026 — word and token targets for ChatGPT, Perplexity, and Google AI Overviews so writers format extractable answers.
AI Citation Confidence Scoring Framework: Predicting Source Inclusion Likelihood
AI citation confidence scoring framework: a predictive model that scores how likely generative engines are to cite a source based on retrieval, grounding, and trust signals.
AI Citation Format Specification by Engine: How ChatGPT, Perplexity, Gemini, and Claude Render Sources in 2026
Reference specification of how ChatGPT, Perplexity, Gemini, and Claude render source citations in 2026, with format patterns, anchor text, and rendering rules.