Speakable Schema Specification for AI Voice Search
Speakable is a schema.org property that marks specific sections of an article or web page as best suited for text-to-speech playback by voice assistants and AI answer engines. It is officially supported in beta by Google for Article and WebPage types, with section targeting via cssSelector, xpath, or URL id references.
TL;DR
Speakable schema flags the parts of a page that voice assistants should read aloud. Implement it with JSON-LD on Article or WebPage types, target the right blocks with cssSelector (preferred) or xpath, and keep each speakable block under roughly 30 seconds of spoken audio. Google Assistant remains the largest production consumer (still in beta, U.S. English news content), while AI engines increasingly use the same markup as a hint for which paragraphs to extract as voice or summary citations.
Definition
The speakable property is a schema.org vocabulary term that identifies sections of an Article or WebPage that are particularly suited for audio rendering through text-to-speech (TTS). It is canonically defined on the schema.org property page (https://schema.org/speakable) and operationalized by Google as the Speakable structured data feature, currently in beta. The property accepts a SpeakableSpecification value, which exposes three content-locator strategies: cssSelector, xpath, and id-value URL references that point at fragments within the same document. The property can be repeated, so multiple disjoint regions of a page can be flagged independently.
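The three locator strategies can be sketched side by side in a single annotation. The selectors and the fragment URL below are illustrative placeholders, and mixing all three strategies on one page is shown here only for comparison; in practice a page typically uses one:

```json
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "url": "https://example.com/page",
  "speakable": [
    { "@type": "SpeakableSpecification", "cssSelector": ".article-summary" },
    { "@type": "SpeakableSpecification", "xpath": "/html/head/title" },
    "https://example.com/page#key-points"
  ]
}
```

Note that the id-value strategy is expressed as a plain URL value of speakable itself, while cssSelector and xpath live inside a SpeakableSpecification object.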
Speakable does not change ranking in classic search results. Its function is to declare an extraction contract: when a voice or AI engine asks the page "which parts of you are appropriate to read aloud," the markup gives a machine-readable answer instead of forcing the engine to guess from headings or paragraph order.
Why this matters
Voice and AI answer surfaces are still growing their share of informational queries. Google Assistant uses Speakable markup to select up to three news articles and read marked sections back to users on smart speakers and Android devices. AI Overviews, Perplexity, and ChatGPT Search do not require Speakable, but their extractors behave more reliably on pages that already mark their answer-shaped content. Marking a section as speakable concentrates extractor attention on the highest-quality summary text instead of on boilerplate or navigation.
For publishers, the practical upside is twofold: a small but real share of voice impressions on Google Assistant, and cleaner snippet selection across AI engines that read the same vocabulary. The trade-off is asymmetric: Speakable adds no risk to traditional rankings, costs only a few lines of JSON-LD per page, and creates a stable answer-shaped surface that newer engines can reuse without changing their crawlers.
How it works
A Speakable specification is a child object on an Article or WebPage JSON-LD entity. Its core fields are:
| Field | Type | Required | Purpose |
|---|---|---|---|
| @type | Text | Yes | Always SpeakableSpecification. |
| cssSelector | Text | One of selector/xpath/url | CSS selector targeting one or more elements within the page DOM. |
| xpath | Text | One of selector/xpath/url | XPath 1.0 expression targeting elements within the page DOM. |
| url | URL | One of selector/xpath/url | URL with a fragment (#id) that resolves to an element in the same document. Per schema.org, this is supplied as a direct URL value of speakable rather than as a field inside SpeakableSpecification. |
At least one of cssSelector, xpath, or url is required. The schema.org definition allows the property to be repeated, so a page can declare multiple speakable regions—for example a headline summary plus a key-points list—each with its own selector.
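The repeated-property pattern described above might look like the following sketch. The class names are illustrative and must match the page's actual markup:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example headline",
  "speakable": [
    {
      "@type": "SpeakableSpecification",
      "cssSelector": ".article-summary"
    },
    {
      "@type": "SpeakableSpecification",
      "cssSelector": ".key-points li"
    }
  ]
}
```

Each specification resolves independently, so an engine can read the summary, the key-points list, or both.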
Voice and AI engines that consume Speakable typically perform four steps:
- Parse the page's JSON-LD and locate any speakable property on the top-level Article or WebPage node.
- Resolve each SpeakableSpecification to one or more DOM nodes using the selector strategy.
- Extract the text content of those nodes, normalize whitespace, and trim away inline navigation or interactive elements.
- Render the extracted text through TTS (Google Assistant) or hand it to a downstream summarizer (AI answer engines).
Google's beta guidance recommends keeping speakable text concise—under roughly 20 to 30 seconds of spoken audio per section—and pointing selectors at content that stands alone without the surrounding article. Sections that depend on a chart, table, or earlier paragraph for meaning will produce confusing audio output and should not be flagged.
Practical application
A minimal implementation on a news article looks like this when serialized as JSON-LD inside a script element with type="application/ld+json" in the page head.
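A possible minimal version, with placeholder values for the headline, URL, and selectors:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example: City Council Approves New Transit Plan",
  "url": "https://example.com/news/transit-plan",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".article-headline", ".article-summary"]
  }
}
```

Here cssSelector takes an array, so a single specification can flag both the headline and the summary block without repeating the speakable property.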