AI Search KPIs: Define, Calculate, and Report (Dashboard Spec)
This spec defines the eight KPIs every GEO/AEO program should report, with explicit formulas, sampling rules, and a dashboard layout. The goal is to make AI search measurement comparable across teams and platforms, and to make non-determinism (same prompt, different answer) a managed variable rather than a silent source of error.
TL;DR
The minimum reportable AI search dashboard tracks eight KPIs: citation rate, share of voice (mention + citation), share of answer, query coverage, sentiment polarity, AI-referred traffic, AI-influenced pipeline, and platform overlap. Each KPI is defined by an unambiguous numerator/denominator, a fixed prompt set, a sampling cadence (n ≥ 30 per prompt per platform per week), and a 95% confidence interval. Dashboards report per-platform first and aggregate second, because ChatGPT and Perplexity overlap on only ~11% of cited domains.
Why this spec exists
Ask three teams "what is your AI citation rate?" and you will get three incompatible answers. One is counting any prompt where the brand was mentioned; another is counting only prompts where the brand URL was the cited source; a third is averaging across ChatGPT, Perplexity, and Google AI Overviews despite very different sourcing behaviors. Without a shared spec, AI search KPIs collapse into vibe-based reporting.
This document fixes that. Every KPI here has a numerator, denominator, prompt set definition, sampling cadence, and confidence-interval procedure.
Definitions used throughout
- Prompt set (P): a fixed list of natural-language prompts representing your category and buyer journey. Treat the prompt set as a versioned artifact; freeze it before each reporting cycle.
- Run (r): one execution of a prompt against one platform.
- Response (R): the model's full answer to a run, including any cited sources.
- Mention: brand name appears in R.text.
- Citation: brand domain or specific URL appears in R.sources.
- Sample size (n): number of runs per prompt per platform per reporting window. Default n = 30.
- Reporting window: the time window over which runs are aggregated. Default = weekly.
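These definitions map cleanly onto a stored run record. A minimal sketch, assuming a Python tracker; the field names and helper functions are illustrative, not a required schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Run:
    """One execution of a prompt against one platform (a 'run')."""
    prompt_id: str           # stable id from the versioned prompt set P
    prompt_set_version: str  # e.g. a git tag; freeze before each cycle
    platform: str            # "chatgpt", "perplexity", "google_ai_overviews", ...
    timestamp: datetime
    response_text: str       # R.text, stored verbatim so KPIs can be recomputed
    sources: list[str] = field(default_factory=list)  # R.sources (cited URLs)

def has_mention(run: Run, brand_name: str) -> bool:
    """Mention: the brand name appears anywhere in the response text."""
    return brand_name.lower() in run.response_text.lower()

def has_citation(run: Run, brand_domains: set[str]) -> bool:
    """Citation: any brand-owned domain appears among the cited sources."""
    return any(d in url for url in run.sources for d in brand_domains)
```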
The eight core KPIs
1. Citation rate
Definition. Share of runs in which any brand-owned URL appears in R.sources.
Formula.
citation_rate = runs_with_brand_citation / runs_with_any_citation
Per-platform reference rates. Track against published baselines: ChatGPT cites sources in ~87% of runs, Google AI Overviews in ~85%, Google AI Mode in ~76%, and Perplexity cites 3-4 of ~10 retrieved pages per query. The denominator is "runs where the platform did cite something", not all runs, so platforms with different base citation behaviors stay comparable.
Reporting. Per platform; weekly; with 95% Wilson confidence interval.
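A minimal computation sketch, reusing the has_citation helper from the record sketch above and assuming the normalized denominator; the Wilson interval formula is standard, everything else is illustrative:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (z = 1.96)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def citation_rate(runs, brand_domains):
    """citation_rate = runs_with_brand_citation / runs_with_any_citation."""
    cited_anything = [r for r in runs if r.sources]
    brand_cited = sum(1 for r in cited_anything if has_citation(r, brand_domains))
    rate = brand_cited / len(cited_anything) if cited_anything else 0.0
    return rate, wilson_interval(brand_cited, len(cited_anything))
```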
2. Mention rate (share-of-mention floor)
Definition. Share of runs where the brand name appears in the response text, regardless of citation.
Formula.
mention_rate = runs_with_brand_mention / total_runs
Useful even when the platform does not cite (e.g., ChatGPT runs with citations disabled). Track alongside citation rate; mention without citation is a brand-recognition signal but not a traffic signal.
3. Share of voice (SoV) — mention-based
Definition. Your brand's mentions divided by total mentions of all tracked brands (yours plus the competitive set) across the same response set.
Formula.
sov_mentions = sum_R(brand_mentions_in_R) / sum_R(total_brand_mentions_in_R)
Aggregate over a fixed competitive set (5-10 named competitors). If a response mentions five brands and yours is one of them, your share of that response is 20%; the KPI pools mention counts across the full response set rather than averaging those per-response shares.
4. Share of voice (SoV) — citation-based
Definition. Your brand's citations divided by total citations of all tracked brands across the same response set.
Formula.
sov_citations = sum_R(brand_citations_in_R) / sum_R(total_brand_citations_in_R)
Report mention-based and citation-based SoV side-by-side. They diverge whenever a brand is named without being the source (e.g., "Notion is a popular workspace tool" with citation to a third-party review).
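Both SoV variants can share one code path. A sketch under simplified counting rules (substring matching for mentions, domain matching for citations); the helper names and competitive-set handling are assumptions:

```python
def sov(responses, brands, count_fn, target):
    """
    Pooled share of voice for `target` over a fixed competitive set `brands`.
    count_fn(response, brand) returns how many mentions (or citations) of that
    brand appear in one response, so the mention-based and citation-based
    variants differ only in the counting function passed in.
    """
    own = sum(count_fn(r, target) for r in responses)
    total = sum(count_fn(r, b) for r in responses for b in brands)
    return own / total if total else 0.0

def mention_count(run, brand_name):
    """Mention-based counting: occurrences of the brand name in R.text."""
    return run.response_text.lower().count(brand_name.lower())

def citation_count(run, brand_domain):
    """Citation-based counting: cited URLs on the brand's domain."""
    return sum(1 for url in run.sources if brand_domain in url)

# Mention-based: sov(weekly_runs, ["YourBrand", "CompetitorA"], mention_count, "YourBrand")
# Citation-based: pass domains instead of names and use citation_count.
```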
5. Share of answer (SoA)
Definition. Share of the answer text attributable to your brand's content, measured by counting characters in sentences whose primary citation is a brand-owned URL.
Formula.
share_of_answer = sum_R(chars_in_brand_attributed_sentences) / sum_R(total_chars_in_R)
This is the most expensive KPI to compute (requires sentence-level attribution) but the most decision-useful: it shows how much of the substance of an answer your content drove, not just whether you appeared.
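A minimal sketch of the character-count ratio, assuming you can pair each sentence with a primary citation; the naive splitter below is purely illustrative (real attribution would use the platform's inline citation markers or an attribution model):

```python
import re

def share_of_answer(responses, brand_domains, sentence_citations):
    """share_of_answer = chars in brand-attributed sentences / total chars."""
    brand_chars = total_chars = 0
    for r in responses:
        for sentence, url in sentence_citations(r):
            total_chars += len(sentence)
            if url and any(d in url for d in brand_domains):
                brand_chars += len(sentence)
    return brand_chars / total_chars if total_chars else 0.0

def naive_sentence_citations(run):
    """Illustrative only: split on sentence boundaries and pair every
    sentence with the first cited source. Replace with real attribution."""
    sentences = re.split(r"(?<=[.!?])\s+", run.response_text)
    primary = run.sources[0] if run.sources else None
    return [(s, primary) for s in sentences if s]
```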
6. Query coverage
Definition. Share of the prompt set for which the brand earns at least one citation across any tracked platform within the reporting window.
Formula.
query_coverage = prompts_with_at_least_one_citation / |P|
Query coverage is the cleanest single number to put on an executive slide because it is bounded 0-100% and resilient to per-platform noise.
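A short sketch, reusing the has_citation helper defined earlier; prompt_set is the list of prompt ids in the frozen set P:

```python
def query_coverage(runs, prompt_set, brand_domains):
    """Share of prompts with at least one brand citation on any platform
    within the reporting window."""
    covered = {r.prompt_id for r in runs if has_citation(r, brand_domains)}
    return len(covered & set(prompt_set)) / len(prompt_set) if prompt_set else 0.0
```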
7. Sentiment polarity
Definition. Mean sentiment score (-1 to +1) of sentences that mention or cite the brand, measured by an LLM classifier with a fixed rubric.
Formula.
sentiment = mean(sentiment_score(sentence_i)) for sentences mentioning brand
Report with sample size and standard error. Treat any movement smaller than 0.1 as noise unless n is large.
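A small summary sketch for the reporting side, assuming the per-sentence scores come from your LLM classifier and fixed rubric:

```python
from statistics import mean, stdev

def sentiment_summary(scores):
    """
    Mean sentiment of brand-mentioning sentences (-1 to +1), with standard
    error and n, so week-over-week movement can be judged against noise.
    """
    n = len(scores)
    if n < 2:
        return {"mean": scores[0] if scores else None, "se": None, "n": n}
    return {"mean": mean(scores), "se": stdev(scores) / n**0.5, "n": n}
```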
8. Platform overlap
Definition. Jaccard overlap of cited brand-relevant domains between any two platforms.
Formula.
overlap(A, B) = |cited_domains(A) ∩ cited_domains(B)| / |cited_domains(A) ∪ cited_domains(B)|
Independent research finds ChatGPT ∩ Perplexity overlap is roughly 11% of cited domains. If your internal overlap looks much higher, your prompt set is probably too narrow or your tooling is collapsing platforms.
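The Jaccard computation is straightforward once cited domains are extracted per platform; a sketch using the Run record from earlier:

```python
from urllib.parse import urlparse

def cited_domains(runs, platform: str) -> set[str]:
    """All domains cited by one platform across the window's runs."""
    return {urlparse(url).netloc for r in runs if r.platform == platform
            for url in r.sources}

def platform_overlap(domains_a: set[str], domains_b: set[str]) -> float:
    """Jaccard overlap of cited domains between two platforms."""
    union = domains_a | domains_b
    return len(domains_a & domains_b) / len(union) if union else 0.0
```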
Bonus KPIs (track when downstream attribution exists)
- AI-referred traffic. Sessions where the referrer is a known AI engine domain. Tag in analytics with a ?utm_source=ai_ convention or use the platform's official referrer (a referrer-classification sketch follows this list).
- AI-influenced pipeline. Pipeline value of opportunities whose first-touch or assisted-touch was an AI-referred session. Requires CRM integration.
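A minimal referrer-classification sketch for the AI-referred traffic KPI. The domain list is illustrative and must be maintained by you; engines change or add referrer domains over time:

```python
from urllib.parse import urlparse

# Illustrative list -- maintain your own; engines change domains over time.
AI_REFERRER_DOMAINS = {"chatgpt.com", "chat.openai.com", "perplexity.ai",
                       "copilot.microsoft.com", "gemini.google.com"}

def is_ai_referred(referrer_url: str) -> bool:
    """Classify a session as AI-referred by its referrer domain."""
    host = urlparse(referrer_url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in AI_REFERRER_DOMAINS)
```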
Sampling and statistical discipline
AI responses are non-deterministic: the same prompt run minutes apart can yield different brands, citations, and answer structure. Treat every KPI as a sample, not a measurement.
- Default sample size: n = 30 runs per prompt per platform per reporting window. n = 30 is the lowest defensible size for reporting a 95% confidence interval on a proportion.
- Stagger runs across the reporting window (e.g., 30 runs spread over 7 days at varied times) rather than running all 30 in one batch; a scheduling sketch follows this list.
- Use Wilson score intervals for citation/mention rates, not normal approximations — they are reliable at small n and proportions near 0% or 100%.
- Freeze the prompt set at the start of each reporting cycle; version it. Changing prompts mid-cycle invalidates trend comparisons.
- Log everything. Store the full response text and source list for each run. Without raw runs you cannot recompute KPIs after a definitional change.
- Separate platforms in every chart. Aggregate "AI search" numbers hide that platforms cite very different domains.
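A minimal scheduling sketch for the staggering rule above, assuming runs are dispatched from a simple job queue; the function name and signature are illustrative:

```python
import random
from datetime import datetime, timedelta

def stagger_runs(window_start: datetime, days: int = 7, n: int = 30, seed=None):
    """
    Spread n runs across the reporting window at varied times instead of
    firing them in one batch, so non-determinism is sampled across the week.
    Returns the scheduled timestamps in chronological order.
    """
    rng = random.Random(seed)
    window_seconds = days * 24 * 3600
    offsets = sorted(rng.uniform(0, window_seconds) for _ in range(n))
    return [window_start + timedelta(seconds=s) for s in offsets]
```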
Dashboard layout
A reportable dashboard has five panels. Build them in this order.
Panel 1 — Headline tile
- Query coverage (single number, with WoW delta).
- Citation rate, all platforms blended, with confidence interval.
- AI-referred sessions and pipeline (when available).
Panel 2 — Per-platform breakdown
A small-multiple chart with one row per platform (ChatGPT, Perplexity, Google AI Overviews, Gemini, Copilot, Claude). Columns: citation rate, mention rate, SoV (citations), sentiment.
Panel 3 — Competitive set
Mention-based and citation-based SoV against the named competitive set, weekly trend.
Panel 4 — Prompt-level diagnostics
A table of every prompt with: citation rate, top-cited domain, your rank among cited brands, top competing brand. This is the panel content owners use.
Panel 5 — Coverage gaps and overlap
- Prompts with zero citations across any platform (action: write or update content).
- Platform-overlap matrix (Jaccard) for transparency about how independent the platforms are.
Reporting cadence
- Weekly: all eight core KPIs, per-platform, with confidence intervals.
- Monthly: rollup with prompt-level diagnostics, content-owner action items, and an updated competitive set.
- Quarterly: prompt-set refresh; competitive-set refresh; sentiment rubric calibration; KPI definition review.
Implementation checklist
- [ ] Versioned prompt set committed to a repo.
- [ ] Tracker that runs n = 30 per prompt per platform per week (build or buy).
- [ ] Raw responses stored with timestamp, platform, prompt id, sources, and full text.
- [ ] KPI computation library with explicit numerator/denominator code paths.
- [ ] Wilson confidence intervals on every rate metric.
- [ ] Per-platform charts (no blended single-number aggregates unless an aggregate line and per-platform lines are shown together).
- [ ] Competitive set (5-10 brands) frozen and versioned.
- [ ] Sentiment classifier with a published rubric and inter-rater calibration.
- [ ] Analytics integration for AI-referred traffic and (if possible) pipeline.
- [ ] Dashboard reviewed weekly; spec reviewed quarterly.
Common reporting mistakes
- Aggregating across platforms without per-platform breakouts. Hides the ~89% of cited domains that are platform-specific.
- Reporting rates without confidence intervals. Week-over-week noise is mistaken for real movement.
- Counting mentions and citations together. They are different signals; track each.
- Using a single run per prompt. Non-determinism makes single-run measurement actively misleading.
- Changing the prompt set mid-cycle. Trend lines become uninterpretable.
- Confusing share of voice with share of answer. SoV says you appear; SoA says you drove the answer.
FAQ
Q: How many prompts should the prompt set contain?
For a focused B2B category, 50-150 prompts is enough to balance coverage and cost. Cover the buyer journey: top-of-funnel category questions, mid-funnel comparisons, bottom-of-funnel branded and procurement questions. Refresh quarterly.
Q: Why n = 30 runs per prompt per platform per week?
Thirty is the smallest sample size at which a Wilson 95% confidence interval on a proportion is narrow enough to detect realistic week-over-week movement (±10pp). Smaller samples produce intervals wider than the movement you are trying to detect, making trend reads unreliable.
Q: Should I trust a single "AI visibility score" from a tool?
Only as a directional signal. Vendors compute single scores using different prompt sets, weighting schemes, and platform mixes. For accountable reporting, recompute the eight KPIs above on your own raw runs.
Q: How do I compare citation rate across platforms with different sourcing behaviors?
Normalize: report citation rate as runs_with_brand_citation / runs_with_any_citation per platform. Then aggregate using equal weights or your platform-traffic mix. Never average raw rates across platforms with very different base citation behaviors.
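A minimal aggregation sketch under those rules, assuming each per-platform rate is already normalized; the equal-weight default and the weights argument are illustrative:

```python
def blended_citation_rate(per_platform_rates, weights=None):
    """
    Aggregate normalized per-platform citation rates with explicit weights
    (equal weights by default, or your platform-traffic mix). Each input rate
    should already be runs_with_brand_citation / runs_with_any_citation.
    """
    platforms = list(per_platform_rates)
    if weights is None:
        weights = {p: 1 / len(platforms) for p in platforms}
    total_w = sum(weights[p] for p in platforms)
    return sum(per_platform_rates[p] * weights[p] for p in platforms) / total_w
```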
Q: What is the difference between share of voice and share of answer?
Share of voice counts whether your brand is mentioned/cited; share of answer measures how much of the actual answer text your content drove. SoV is a presence metric; SoA is a substance metric. SoA is more expensive to compute but more directly tied to buyer perception.
Q: How quickly will I see KPI movement after a content change?
Index inclusion typically lags publication by 1-4 weeks for AI engines. Expect at least one full reporting cycle (4 weeks) before attributing KPI movement to a specific content change, and require statistically significant movement before claiming impact.
Related Articles
How to audit AI Overviews visibility (Google): checklist + metrics
Step-by-step checklist to audit your brand's visibility in Google AI Overviews: build a query set, capture SERPs, score citations and mentions, and report before/after metrics.
Tools for AI Visibility Tracking: What to Measure and How to Choose
How to choose an AI visibility tracking tool: the metrics that matter (citation rate, share-of-voice, query coverage), buyer profiles, and how to read the data to drive GEO/AEO content decisions.
Hallucination triage: a playbook for fixing incorrect AI answers fast
Step-by-step hallucination triage playbook to capture queries, classify failure modes, update content and evidence, and re-verify AI citations fast.