AI Search KPIs: Define, Calculate, and Report (Dashboard Spec)
This spec defines the eight KPIs every GEO/AEO program should report, with explicit formulas, sampling rules, and a dashboard layout. The goal is to make AI search measurement comparable across teams and platforms, and to make non-determinism (same prompt, different answer) a managed variable rather than a silent source of error.
TL;DR
The minimum reportable AI search dashboard tracks eight KPIs: citation rate, share of voice (mention + citation), share of answer, query coverage, sentiment polarity, AI-referred traffic, AI-influenced pipeline, and platform overlap. Each KPI is defined by an unambiguous numerator/denominator, a fixed prompt set, a sampling cadence (n ≥ 30 per prompt per platform per week), and a 95% confidence interval. Dashboards report per-platform first and aggregate second, because ChatGPT and Perplexity overlap on only ~11% of cited domains.
Why this spec exists
Ask three teams "what is your AI citation rate?" and you will get three incompatible answers. One is counting any prompt where the brand was mentioned; another is counting only prompts where the brand URL was the cited source; a third is averaging across ChatGPT, Perplexity, and Google AI Overviews despite very different sourcing behaviors. Without a shared spec, AI search KPIs collapse into vibe-based reporting.
This document fixes that. Every KPI here has a numerator, denominator, prompt set definition, sampling cadence, and confidence-interval procedure.
Definitions used throughout
- Prompt set (P): a fixed list of natural-language prompts representing your category and buyer journey. Treat the prompt set as a versioned artifact; freeze it before each reporting cycle.
- Run (r): one execution of a prompt against one platform.
- Response (R): the model's full answer to a run, including any cited sources.
- Mention: brand name appears in R.text.
- Citation: brand domain or specific URL appears in R.sources.
- Sample size (n): number of runs per prompt per platform per reporting window. Default n = 30.
- Reporting window: the time window over which runs are aggregated. Default = weekly.
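These definitions map cleanly onto a stored run record. A minimal sketch, assuming a Python tracker; the field names and helper functions are illustrative, not a required schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Run:
    """One execution of a prompt against one platform (a 'run')."""
    prompt_id: str           # stable id from the versioned prompt set P
    prompt_set_version: str  # e.g. a git tag; freeze before each cycle
    platform: str            # "chatgpt", "perplexity", "google_ai_overviews", ...
    timestamp: datetime
    response_text: str       # R.text, stored verbatim so KPIs can be recomputed
    sources: list[str] = field(default_factory=list)  # R.sources (cited URLs)

def has_mention(run: Run, brand_name: str) -> bool:
    """Mention: the brand name appears anywhere in the response text."""
    return brand_name.lower() in run.response_text.lower()

def has_citation(run: Run, brand_domains: set[str]) -> bool:
    """Citation: any brand-owned domain appears among the cited sources."""
    return any(d in url for url in run.sources for d in brand_domains)
```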
The eight core KPIs
1. Citation rate
Definition. Share of runs in which any brand-owned URL appears in R.sources.
Formula.
citation_rate = runs_with_brand_citation / runs_with_any_citation
Per-platform reference rates. Track against published baselines: ChatGPT cites sources in ~87% of runs, Google AI Overviews in ~85%, Google AI Mode in ~76%, and Perplexity cites 3-4 of ~10 retrieved pages per query. The denominator is "runs where the platform did cite something", not all runs, so platforms with different base citation behaviors stay comparable.
Reporting. Per platform; weekly; with 95% Wilson confidence interval.
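A minimal computation sketch, reusing the has_citation helper from the record sketch above and assuming the normalized denominator; the Wilson interval formula is standard, everything else is illustrative:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (z = 1.96)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def citation_rate(runs, brand_domains):
    """citation_rate = runs_with_brand_citation / runs_with_any_citation."""
    cited_anything = [r for r in runs if r.sources]
    brand_cited = sum(1 for r in cited_anything if has_citation(r, brand_domains))
    rate = brand_cited / len(cited_anything) if cited_anything else 0.0
    return rate, wilson_interval(brand_cited, len(cited_anything))
```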
2. Mention rate (share-of-mention floor)
Definition. Share of runs where the brand name appears in the response text, regardless of citation.
Formula.
mention_rate = runs_with_brand_mention / total_runs
Useful even when the platform does not cite (e.g., ChatGPT runs with citations disabled). Track alongside citation rate; mention without citation is a brand-recognition signal but not a traffic signal.
3. Share of voice (SoV) — mention-based
Definition. Your brand's mentions divided by total mentions of all tracked brands (yours plus the competitive set) across the same response set.
Formula.
sov_mentions = sum_R(brand_mentions_in_R) / sum_R(total_brand_mentions_in_R)
Aggregate over a fixed competitive set (5-10 named competitors). If a response mentions five brands and yours is one of them, your share of that response is 20%; the KPI pools mention counts across the full response set rather than averaging those per-response shares.
4. Share of voice (SoV) — citation-based
Definition. Your brand's citations divided by total citations of all tracked brands across the same response set.
Formula.
sov_citations = sum_R(brand_citations_in_R) / sum_R(total_brand_citations_in_R)
Report mention-based and citation-based SoV side-by-side. They diverge whenever a brand is named without being the source (e.g., "Notion is a popular workspace tool" with citation to a third-party review).
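Both SoV variants can share one code path. A sketch under simplified counting rules (substring matching for mentions, domain matching for citations); the helper names and competitive-set handling are assumptions:

```python
def sov(responses, brands, count_fn, target):
    """
    Pooled share of voice for `target` over a fixed competitive set `brands`.
    count_fn(response, brand) returns how many mentions (or citations) of that
    brand appear in one response, so the mention-based and citation-based
    variants differ only in the counting function passed in.
    """
    own = sum(count_fn(r, target) for r in responses)
    total = sum(count_fn(r, b) for r in responses for b in brands)
    return own / total if total else 0.0

def mention_count(run, brand_name):
    """Mention-based counting: occurrences of the brand name in R.text."""
    return run.response_text.lower().count(brand_name.lower())

def citation_count(run, brand_domain):
    """Citation-based counting: cited URLs on the brand's domain."""
    return sum(1 for url in run.sources if brand_domain in url)

# Mention-based: sov(weekly_runs, ["YourBrand", "CompetitorA"], mention_count, "YourBrand")
# Citation-based: pass domains instead of names and use citation_count.
```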
5. Share of answer (SoA)
Definition. Share of the answer text attributable to your brand's content, measured by counting characters in sentences whose primary citation is a brand-owned URL.
Formula.
share_of_answer = sum_R(chars_in_brand_attributed_sentences) / sum_R(total_chars_in_R)
This is the most expensive KPI to compute (requires sentence-level attribution) but the most decision-useful: it shows how much of the substance of an answer your content drove, not just whether you appeared.
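A minimal sketch of the character-count ratio, assuming you can pair each sentence with a primary citation; the naive splitter below is purely illustrative (real attribution would use the platform's inline citation markers or an attribution model):

```python
import re

def share_of_answer(responses, brand_domains, sentence_citations):
    """share_of_answer = chars in brand-attributed sentences / total chars."""
    brand_chars = total_chars = 0
    for r in responses:
        for sentence, url in sentence_citations(r):
            total_chars += len(sentence)
            if url and any(d in url for d in brand_domains):
                brand_chars += len(sentence)
    return brand_chars / total_chars if total_chars else 0.0

def naive_sentence_citations(run):
    """Illustrative only: split on sentence boundaries and pair every
    sentence with the first cited source. Replace with real attribution."""
    sentences = re.split(r"(?<=[.!?])\s+", run.response_text)
    primary = run.sources[0] if run.sources else None
    return [(s, primary) for s in sentences if s]
```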
6. Query coverage
Definition. Share of the prompt set for which the brand earns at least one citation across any tracked platform within the reporting window.
Formula.
query_coverage = prompts_with_at_least_one_citation / |P|
Query coverage is the cleanest single number to put on an executive slide because it is bounded 0-100% and resilient to per-platform noise.
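A short sketch, reusing the has_citation helper defined earlier; prompt_set is the list of prompt ids in the frozen set P:

```python
def query_coverage(runs, prompt_set, brand_domains):
    """Share of prompts with at least one brand citation on any platform
    within the reporting window."""
    covered = {r.prompt_id for r in runs if has_citation(r, brand_domains)}
    return len(covered & set(prompt_set)) / len(prompt_set) if prompt_set else 0.0
```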
7. Sentiment polarity
Definition. Mean sentiment score (-1 to +1) of sentences that mention or cite the brand, measured by an LLM classifier with a fixed rubric.
Formula.
sentiment = mean(sentiment_score(sentence_i)) for sentences mentioning brand
Report with sample size and standard error. Treat any movement smaller than 0.1 as noise unless n is large.
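A small summary sketch for the reporting side, assuming the per-sentence scores come from your LLM classifier and fixed rubric:

```python
from statistics import mean, stdev

def sentiment_summary(scores):
    """
    Mean sentiment of brand-mentioning sentences (-1 to +1), with standard
    error and n, so week-over-week movement can be judged against noise.
    """
    n = len(scores)
    if n < 2:
        return {"mean": scores[0] if scores else None, "se": None, "n": n}
    return {"mean": mean(scores), "se": stdev(scores) / n**0.5, "n": n}
```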
8. Platform overlap
Definition. Jaccard overlap of cited brand-relevant domains between any two platforms.
Formula.
overlap(A, B) = |cited_domains(A) ∩ cited_domains(B)| / |cited_domains(A) ∪ cited_domains(B)|
Independent research finds ChatGPT ∩ Perplexity overlap is roughly 11% of cited domains. If your internal overlap looks much higher, your prompt set is probably too narrow or your tooling is collapsing platforms.
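The Jaccard computation is straightforward once cited domains are extracted per platform; a sketch using the Run record from earlier:

```python
from urllib.parse import urlparse

def cited_domains(runs, platform: str) -> set[str]:
    """All domains cited by one platform across the window's runs."""
    return {urlparse(url).netloc for r in runs if r.platform == platform
            for url in r.sources}

def platform_overlap(domains_a: set[str], domains_b: set[str]) -> float:
    """Jaccard overlap of cited domains between two platforms."""
    union = domains_a | domains_b
    return len(domains_a & domains_b) / len(union) if union else 0.0
```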
Bonus KPIs (track when downstream attribution exists)
- AI-referred traffic. Sessions where the referrer is a known AI engine domain. Tag in analytics with a ?utm_source=ai_ convention or use the platform's official referrer (a referrer-classification sketch follows this list).
- AI-influenced pipeline. Pipeline value of opportunities whose first-touch or assisted-touch was an AI-referred session. Requires CRM integration.
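A minimal referrer-classification sketch for the AI-referred traffic KPI. The domain list is illustrative and must be maintained by you; engines change or add referrer domains over time:

```python
from urllib.parse import urlparse

# Illustrative list -- maintain your own; engines change domains over time.
AI_REFERRER_DOMAINS = {"chatgpt.com", "chat.openai.com", "perplexity.ai",
                       "copilot.microsoft.com", "gemini.google.com"}

def is_ai_referred(referrer_url: str) -> bool:
    """Classify a session as AI-referred by its referrer domain."""
    host = urlparse(referrer_url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in AI_REFERRER_DOMAINS)
```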
Sampling and statistical discipline
AI responses are non-deterministic: the same prompt run minutes apart can yield different brands, citations, and answer structure. Treat every KPI as a sample, not a measurement.
- Default sample size: n = 30 runs per prompt per platform per reporting window. n = 30 is the lowest defensible size for reporting a 95% confidence interval on a proportion.
- Stagger runs across the reporting window (e.g., 30 runs spread over 7 days at varied times) rather than running all 30 in one batch; a scheduling sketch follows this list.
- Use Wilson score intervals for citation/mention rates, not normal approximations — they are reliable at small n and proportions near 0% or 100%.
- Freeze the prompt set at the start of each reporting cycle; version it. Changing prompts mid-cycle invalidates trend comparisons.
- Log everything. Store the full response text and source list for each run. Without raw runs you cannot recompute KPIs after a definitional change.
- Separate platforms in every chart. Aggregate "AI search" numbers hide that platforms cite very different domains.
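A minimal scheduling sketch for the staggering rule above, assuming runs are dispatched from a simple job queue; the function name and signature are illustrative:

```python
import random
from datetime import datetime, timedelta

def stagger_runs(window_start: datetime, days: int = 7, n: int = 30, seed=None):
    """
    Spread n runs across the reporting window at varied times instead of
    firing them in one batch, so non-determinism is sampled across the week.
    Returns the scheduled timestamps in chronological order.
    """
    rng = random.Random(seed)
    window_seconds = days * 24 * 3600
    offsets = sorted(rng.uniform(0, window_seconds) for _ in range(n))
    return [window_start + timedelta(seconds=s) for s in offsets]
```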
Dashboard layout
A reportable dashboard has five panels. Build them in this order.
Panel 1 — Headline tile
- Query coverage (single number, with WoW delta).
- Citation rate, all platforms blended, with confidence interval.
- AI-referred sessions and pipeline (when available).
Panel 2 — Per-platform breakdown
A small-multiple chart with one row per platform (ChatGPT, Perplexity, Google AI Overviews, Gemini, Copilot, Claude). Columns: citation rate, mention rate, SoV (citations), sentiment.
Panel 3 — Competitive set
Mention-based and citation-based SoV against the named competitive set, weekly trend.
Panel 4 — Prompt-level diagnostics
A table of every prompt with: citation rate, top-cited domain, your rank among cited brands, top competing brand. This is the panel content owners use.
Panel 5 — Coverage gaps and overlap
- Prompts with zero citations across any platform (action: write or update content).
- Platform-overlap matrix (Jaccard) for transparency about how independent the platforms are.
Reporting cadence
- Weekly: all eight core KPIs, per-platform, with confidence intervals.
- Monthly: rollup with prompt-level diagnostics, content-owner action items, and an updated competitive set.
- Quarterly: prompt-set refresh; competitive-set refresh; sentiment rubric calibration; KPI definition review.
Implementation checklist
- [ ] Versioned prompt set committed to a repo.
- [ ] Tracker that runs n = 30 per prompt per platform per week (build or buy).
- [ ] Raw responses stored with timestamp, platform, prompt id, sources, and full text.
- [ ] KPI computation library with explicit numerator/denominator code paths.
- [ ] Wilson confidence intervals on every rate metric.
- [ ] Per-platform charts (no blended single-number aggregates unless an aggregate line and per-platform lines are shown together).
- [ ] Competitive set (5-10 brands) frozen and versioned.
- [ ] Sentiment classifier with a published rubric and inter-rater calibration.
- [ ] Analytics integration for AI-referred traffic and (if possible) pipeline.
- [ ] Dashboard reviewed weekly; spec reviewed quarterly.
Common reporting mistakes
- Aggregating across platforms without per-platform breakouts. Hides the ~89% of cited domains that are platform-specific.
- Reporting rates without confidence intervals. Week-over-week noise is mistaken for real movement.
- Counting mentions and citations together. They are different signals; track each.
- Using a single run per prompt. Non-determinism makes single-run measurement actively misleading.
- Changing the prompt set mid-cycle. Trend lines become uninterpretable.
- Confusing share of voice with share of answer. SoV says you appear; SoA says you drove the answer.
FAQ
Q: How many prompts should the prompt set contain?
For a focused B2B category, 50-150 prompts is enough to balance coverage and cost. Cover the buyer journey: top-of-funnel category questions, mid-funnel comparisons, bottom-of-funnel branded and procurement questions. Refresh quarterly.
Q: Why n = 30 runs per prompt per platform per week?
Thirty is the smallest sample size at which a Wilson 95% confidence interval on a proportion is narrow enough to detect realistic week-over-week movement (±10pp). Smaller samples produce intervals wider than the movement you are trying to detect, making trend reads unreliable.
Q: Should I trust a single "AI visibility score" from a tool?
Only as a directional signal. Vendors compute single scores using different prompt sets, weighting schemes, and platform mixes. For accountable reporting, recompute the eight KPIs above on your own raw runs.
Q: How do I compare citation rate across platforms with different sourcing behaviors?
Normalize: report citation rate as runs_with_brand_citation / runs_with_any_citation per platform. Then aggregate using equal weights or your platform-traffic mix. Never average raw rates across platforms with very different base citation behaviors.
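A minimal aggregation sketch under those rules, assuming each per-platform rate is already normalized; the equal-weight default and the weights argument are illustrative:

```python
def blended_citation_rate(per_platform_rates, weights=None):
    """
    Aggregate normalized per-platform citation rates with explicit weights
    (equal weights by default, or your platform-traffic mix). Each input rate
    should already be runs_with_brand_citation / runs_with_any_citation.
    """
    platforms = list(per_platform_rates)
    if weights is None:
        weights = {p: 1 / len(platforms) for p in platforms}
    total_w = sum(weights[p] for p in platforms)
    return sum(per_platform_rates[p] * weights[p] for p in platforms) / total_w
```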
Q: What is the difference between share of voice and share of answer?
Share of voice counts whether your brand is mentioned/cited; share of answer measures how much of the actual answer text your content drove. SoV is a presence metric; SoA is a substance metric. SoA is more expensive to compute but more directly tied to buyer perception.
Q: How quickly will I see KPI movement after a content change?
Index inclusion typically lags publication by 1-4 weeks for AI engines. Expect at least one full reporting cycle (4 weeks) before attributing KPI movement to a specific content change, and require statistically significant movement before claiming impact.
Related Articles
How to audit AI Overviews visibility (Google): checklist + metrics
Step-by-step checklist to audit your brand's visibility in Google AI Overviews: build a query set, capture SERPs, score citations and mentions, and report before/after metrics.
Tools for AI Visibility Tracking: What to Measure and How to Choose
How to choose an AI visibility tracking tool: the metrics that matter (citation rate, share-of-voice, query coverage), buyer profiles, and how to read the data to drive GEO/AEO content decisions.
Hallucination triage: a playbook for fixing incorrect AI answers fast
Step-by-step hallucination triage playbook to capture queries, classify failure modes, update content and evidence, and re-verify AI citations fast.