GEO Citation Share-of-Voice Measurement
AI citation share-of-voice measurement requires a deliberately constructed prompt set, per-engine sampling on a fixed cadence, query-volume weighting, a stable competitor cohort, and confidence-interval reporting. Without all five, the resulting percentage is directional at best and misleading at worst.
TL;DR
Share of voice is straightforward to define and surprisingly easy to compute badly. A defensible AI SOV measurement decomposes into five components: a prompt set built from the category, not from your brand; per-engine sampling because no two engines agree; query-volume weighting so your top-intent queries dominate the metric; a stable competitor cohort so the denominator does not move; and confidence intervals so weekly noise does not get reported as quarterly trend.
Why this needs its own framework
Most public guidance treats AI SOV as a one-line formula — your mentions divided by total category mentions. That formula is correct and useless. It hides every methodological decision that determines whether the resulting number means anything: which prompts, on which engines, sampled how often, against which competitors, with what level of statistical noise. The point of this framework is to make those decisions explicit so two analysts can reproduce the same number from the same inputs.
This framework deliberately complements the broader AI search share of voice overview by going methodology-deep where the overview goes audience-wide.
Component 1: Prompt-set construction
Build the prompt set from the category, not from your brand's keyword list. A research-grade prompt set must be externally valid and representative of the market, not tailored to maximize your brand's apparent share.
Sourcing
- Customer-discovery transcripts and sales calls ("how did you find us?")
- Reddit, Hacker News, and category subreddit threads from the past 12 months
- People-Also-Ask harvests for the category's top informational queries
- Competitor onboarding flows and demo-request forms
- Internal-search logs from your own and partner sites
Sizing
- Starter bank: 50-80 prompts (directional reporting, weekly cadence).
- Stable bank: 150-300 prompts for most B2B SaaS categories.
- Mid-market or multi-segment: 250-500 prompts with cohort tags.
The wrong answer is "the ten queries we remember." Below ~50 prompts the metric does not stabilize.
Buckets
Tag every prompt with intent (informational | navigational | commercial | transactional), query_class (branded | non-branded | competitor-branded | ambiguous, per the tracking spec), and cohort (e.g. "core ICP / north america"). All later reporting slices on these tags.
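As a sketch, a tagged prompt record might look like the following. Only intent, query_class, and cohort come from the spec above; the other field names (prompt_id, source) are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    """One entry in the prompt bank; the tags drive all later report slicing."""
    prompt_id: str
    text: str
    intent: str       # informational | navigational | commercial | transactional
    query_class: str  # branded | non-branded | competitor-branded | ambiguous
    cohort: str       # e.g. "core ICP / north america"
    source: str       # illustrative extra field: where the prompt was sourced

bank = [
    Prompt(prompt_id="p-001",
           text="best crm for small manufacturing teams",
           intent="commercial",
           query_class="non-branded",
           cohort="core ICP / north america",
           source="sales transcript"),
]
```

Keeping the source on each record also makes the ICP weighting described in Component 3 straightforward.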
Component 2: Per-engine sampling
Different engines retrieve from different indexes; cross-engine citation overlap on a given prompt is often low. Sample every prompt against every tracked engine on the same cadence. Do not extrapolate from one engine to another.
Minimum viable engine list: ChatGPT, Perplexity, Claude, Google AI Overviews, Gemini. Add any vertical engine your customers explicitly use (e.g. Cursor or GitHub Copilot for devtools).
Per-engine sampling rules (a run-record sketch follows the list):
- Replicate runs. Run each prompt at least three times per cadence period. LLM responses are stochastic; a single run is not a measurement.
- Pin model versions. Engines update underneath you. Record model_version on every run so version transitions show up as visible discontinuities, not silent drift.
- Region stratification. If the brand has material non-US share, sample with explicit region tags.
- Authentication parity. Logged-in vs. logged-out responses can differ. Pick one mode and hold it constant.
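To make these rules concrete, here is a minimal sketch of what a single sampled run might record, assuming the prompt records from Component 1. The run_prompt callable is hypothetical, standing in for whatever client actually queries each engine, and the field names are illustrative rather than a prescribed schema:

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timezone

ENGINES = ["chatgpt", "perplexity", "claude", "google_ai_overviews", "gemini"]
REPLICATES = 3  # floor per cadence period; a single run is not a measurement

@dataclass
class RunRecord:
    prompt_id: str
    engine: str
    replicate: int
    model_version: str  # recorded so version transitions show up as discontinuities
    region: str         # explicit region tag, e.g. "us"
    auth_mode: str      # logged-in vs. logged-out, held constant across the bank
    response_text: str
    sampled_at: str

def sample_bank(bank, run_prompt, region="us", auth_mode="logged_out"):
    """run_prompt(engine, text, region, auth_mode) is a hypothetical client
    call returning (response_text, model_version); swap in your own."""
    records = []
    for prompt, engine, rep in itertools.product(bank, ENGINES, range(REPLICATES)):
        text, version = run_prompt(engine, prompt.text, region, auth_mode)
        records.append(RunRecord(prompt.prompt_id, engine, rep, version, region,
                                 auth_mode, text,
                                 datetime.now(timezone.utc).isoformat()))
    return records
```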
Component 3: Weighting
Unweighted SOV treats every prompt as equally important, which is rarely true.
Apply at least one of the following weights and document the choice (a minimal weighting sketch follows the list):
- Query-volume weight. Weight each prompt by an external volume estimate (Google Keyword Planner, Semrush, Ahrefs). A prompt run by 10,000 buyers a month should not equal a prompt run by 50.
- Intent weight. Up-weight commercial and transactional prompts when SOV is reported to revenue stakeholders; up-weight informational prompts when reported to brand-marketing stakeholders.
- ICP weight. Up-weight prompts produced by ICP-confirmed sources (sales transcripts) relative to scraped community threads.
Mixing weights is acceptable; mixing weights without documenting them is not.
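A minimal sketch of query-volume weighting, assuming each prompt already carries an external volume estimate and a per-prompt citation rate averaged over its replicate runs (the field names are illustrative):

```python
def weighted_sov(rows):
    """rows: dicts with 'volume' (external monthly volume estimate) and
    'brand_cited' (share of replicate runs citing the brand for that prompt,
    0.0-1.0). Returns volume-weighted SOV for one engine and period."""
    total_weight = sum(r["volume"] for r in rows)
    if total_weight == 0:
        return 0.0
    return sum(r["volume"] * r["brand_cited"] for r in rows) / total_weight

# Why weighting matters: the high-volume prompt should dominate the metric.
rows = [
    {"volume": 10_000, "brand_cited": 0.10},
    {"volume": 50,     "brand_cited": 1.00},
]
print(round(weighted_sov(rows), 3))  # 0.104 -- an unweighted mean would say 0.55
```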
Component 4: Competitor cohort
Define the competitor cohort once and freeze it for the reporting period. Cohort drift is the single largest source of unexplained SOV change.
Cohort rules:
- 4-8 named competitors covering the major substitution paths in the category.
- Track cohort membership versioning (cohort_version), and treat any change as a hard discontinuity in trend reporting.
- Run the same prompt set against the whole cohort, not a competitor-specific subset. Denominator stability is what makes SOV comparable across periods.
- Track an "other" bucket for any cited brand outside the cohort. A growing "other" share is a signal that the cohort is incomplete.
Report mention-based SOV and citation-based SOV as separate metrics. They diverge meaningfully, and combining them destroys signal.
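A minimal sketch of how the cohort freeze, the "other" bucket, and the mention/citation split might be wired together. Brand extraction from responses is assumed to happen upstream, and the brand names and event lists below are illustrative:

```python
from collections import Counter

COHORT = {"your_brand", "competitor_a", "competitor_b", "competitor_c"}  # frozen for the period
COHORT_VERSION = "2025-Q3-v1"  # illustrative; bump on any membership change

def sov_by_brand(brand_events):
    """brand_events: one brand name per mention (or per citation, depending on
    which metric is being computed). Out-of-cohort brands collapse into
    'other'; a growing 'other' share signals an incomplete cohort."""
    counts = Counter(b if b in COHORT else "other" for b in brand_events)
    total = sum(counts.values())
    return {brand: n / total for brand, n in counts.items()} if total else {}

# Keep the two metrics separate; never merge them into one number.
mention_sov  = sov_by_brand(["your_brand", "competitor_a", "some_startup", "your_brand"])
citation_sov = sov_by_brand(["competitor_a", "your_brand"])
```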
Component 5: Cadence and confidence intervals
Cadence
- Daily: top 50 prompts per cohort. Sufficient for volatility detection.
- Weekly: full stable bank. Sufficient for reporting.
- Monthly: deep dive against the long tail (250+ prompts) for QBR-grade analysis.
Confidence intervals
For each reported SOV value, compute a 95% confidence interval from the replicate runs and the per-engine sample size. Report the interval, not just the point estimate. A weekly SOV of 14% ± 3% reads very differently from 14% with no interval, and prevents stakeholders from over-reading a one-point swing.
For the weekly stable bank with three replicate runs across five engines and 200 prompts, expect SOV intervals in the ±1-3pp range; tighter intervals require more replicates or a larger bank.
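One way to produce these intervals, sketched below, is a nonparametric bootstrap over prompts: resample the prompt bank with replacement and recompute weighted SOV each time. This is one reasonable choice, not the only valid one; a normal approximation or a Wilson interval over run-level counts would also work.

```python
import random

def bootstrap_ci(per_prompt_sov, weights=None, n_boot=2000, alpha=0.05, seed=0):
    """per_prompt_sov: per-prompt citation rates (0.0-1.0), each already averaged
    over replicate runs for one engine. Resamples prompts with replacement and
    returns a (lower, upper) 95% interval around the weighted SOV estimate."""
    rng = random.Random(seed)
    n = len(per_prompt_sov)
    weights = weights or [1.0] * n
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(weights[i] * per_prompt_sov[i] for i in idx) /
                     sum(weights[i] for i in idx))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Report the result as, e.g., 14% (95% CI 11-17%) rather than a bare point estimate.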
Reporting layout
At minimum, every SOV report should expose the following (a payload sketch appears after the list):
- Overall SOV with confidence interval and prior-period delta.
- SOV by engine.
- SOV by query class (branded / non-branded / competitor-branded / ambiguous).
- SOV by intent bucket.
- Mention-based SOV vs. citation-based SOV side by side.
- Cohort and prompt-set version pinned at the top.
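As a sketch, the minimum report payload implied by this list might look like the following; every key name and number is an illustrative placeholder, not real data:

```python
# Illustrative payload only; every number below is a placeholder.
report = {
    "prompt_set_version": "2025-Q3-v2",   # pinned at the top
    "cohort_version": "2025-Q3-v1",
    "overall": {"sov": 0.14, "ci95": [0.11, 0.17], "prior_period_delta": 0.01},
    "by_engine": {"chatgpt": 0.18, "perplexity": 0.09, "claude": 0.12,
                  "google_ai_overviews": 0.15, "gemini": 0.13},
    "by_query_class": {"branded": 0.61, "non-branded": 0.08,
                       "competitor-branded": 0.04, "ambiguous": 0.11},
    "by_intent": {"informational": 0.12, "navigational": 0.22,
                  "commercial": 0.17, "transactional": 0.15},
    "mention_sov": 0.19,
    "citation_sov": 0.14,
}
```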
Common methodological mistakes
- Building the prompt set from a brand's existing keyword tracker. Inflates branded share, hides organic discovery weakness.
- Single-run sampling. Mistakes stochastic noise for signal.
- Letting model versions drift silently. Trend lines become uninterpretable.
- Reporting a single SOV percentage averaged across engines. Hides the cross-engine divergence that is the most actionable insight.
- Adding competitors mid-quarter without versioning the cohort. Period-over-period comparisons become invalid.
- Reporting point estimates without confidence intervals. Weekly noise gets escalated as a trend.
FAQ
Q: How small can the prompt bank be before SOV is unreliable?
Below ~50 prompts the metric does not stabilize across replicates. Treat anything smaller as anecdotal.
Q: Should we report SOV as a single number across engines?
No. Report per-engine SOV with cross-engine deltas. The single-number view masks the most useful insight — which engines you are winning and which you are losing.
Q: Mention-based vs. citation-based SOV — which one matters?
Both, for different audiences. Citation-based SOV correlates more cleanly with downstream traffic and trust signals. Mention-based SOV correlates more cleanly with brand-conversation signals and is closer to traditional PR SOV. Report both and let the audience pick.
Q: How often should the prompt bank be refreshed?
Quarterly. Bump the prompt_set_version at every refresh and treat the change as a discontinuity in trend reporting.
Q: What is the right number of replicate runs?
Three per cadence period is the practical floor. Five is better when reporting confidence intervals to executives. Beyond five, additional replicates show diminishing returns relative to expanding the prompt bank.
Related Articles
AI Platform Citation Mix Strategy
Portfolio framework for AI platform citation mix: allocate GEO effort across ChatGPT, Perplexity, Gemini, Claude, and Copilot by source bias.
AI Search Internal Linking Strategy
Internal linking patterns that help AI crawlers map entity relationships, propagate authority, and lift citation rates across your knowledge base.
AI search ranking signals: what likely matters (and how to test)
What likely matters for AI search ranking in 2026 — retrieval, authority, freshness, and structure — plus a reproducible way to test each signal instead of guessing.