Multimodal Schema Markup for AI Search: Image, Video, and Audio Optimization Spec
Multimodal schema markup for AI search uses VideoObject, ImageObject, and AudioObject JSON-LD with required transcripts, alt text, chapter markers, and entity links so retrievers index media as first-class citation candidates alongside text.
TL;DR: Generative engines retrieve images, video, and audio when the markup gives them three things: a textual representation (transcript, alt text, caption), an entity binding, and timing or location metadata. This spec defines the required and recommended schema fields per media type, plus the validation gates that keep multimodal pages citation-eligible.
Why text-only schema is insufficient
Most schema guidance documents Article, FAQPage, and ClaimReview. AI engines now index multimodal content directly: Gemini and the vision-capable ChatGPT models read images, video frames, and audio. Without machine-readable descriptions, the media exists for human users but not for retrievers. Pages that bury media behind unlabeled `img` tags and unannotated embeds forfeit those citation opportunities.
The gap addressed here is end-to-end: which fields each media type requires, what to write in transcripts, how to bind media to entities, and how to validate before publish.
Conformance levels
- Level 1 (minimum): All media has alt text, captions, and a name/description schema field.
- Level 2 (recommended): Transcripts for video and audio; chapter markers; entity bindings via about and mentions.
- Level 3 (advanced): Frame-level descriptions for video, structured data tied to ClaimReview where the media supports a verifiable claim, and parallel hreflang for multilingual transcripts.
A page that ships at Level 2 across all media is materially more citation-eligible than one at Level 1.
ImageObject
Required fields:
- @type: ImageObject
- contentUrl
- name
- description (factual, declarative; reused by image search snippets)
Recommended fields:
- caption
- creditText and creator (Person or Organization)
- copyrightNotice and license
- representativeOfPage: true for the canonical hero image
- width, height, encodingFormat
- about linking to the primary entity
- mentions for secondary entities
Alt text is not part of JSON-LD; it lives on the `img` tag in the HTML. Make alt text and description describe the image differently: alt text optimizes for accessibility (concrete, brief), description optimizes for retrieval (entity-rich, factual).
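Assembled, a Level 2 ImageObject looks like the following sketch. All URLs, names, and the entity are illustrative placeholders, not real resources:

```json
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/images/heat-pump-cutaway.jpg",
  "name": "Heat pump cutaway diagram",
  "description": "Cutaway diagram of an air-source heat pump showing the compressor, reversing valve, and indoor and outdoor coils.",
  "caption": "How an air-source heat pump moves heat in both directions.",
  "creditText": "Example Media",
  "creator": { "@type": "Organization", "name": "Example Media" },
  "copyrightNotice": "© 2026 Example Media",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "representativeOfPage": true,
  "width": 1600,
  "height": 900,
  "encodingFormat": "image/jpeg",
  "about": {
    "@type": "Thing",
    "name": "Heat pump",
    "sameAs": "https://en.wikipedia.org/wiki/Heat_pump"
  }
}
```

Note that the description names the parts visible in the image; a matching alt attribute would be shorter, something like "Cutaway of an air-source heat pump."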
VideoObject
Required fields:
- @type: VideoObject
- name
- description
- thumbnailUrl
- uploadDate
- contentUrl or embedUrl
- duration in ISO 8601 format (e.g., PT5M30S)
Recommended fields:
- transcript (string or URL to a MediaObject with the transcript)
- hasPart array of Clip objects with startOffset, endOffset, and name for chapter markers
- about, mentions, keywords
- inLanguage and translation references
- publisher Organization with sameAs
Transcripts are the highest-leverage field for AI retrieval. A well-segmented transcript with speaker labels and timestamps lets retrievers locate the cited passage and lets engines quote it accurately.
Chapter markers using the Clip pattern produce direct timestamp citations in engines that support them. Each Clip should carry an entity binding via about so a chapter about a specific concept becomes a retrievable answer.
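A Level 2 VideoObject combining a linked transcript with Clip chapter markers might look like this sketch. URLs, titles, and timings are placeholders; startOffset and endOffset are in seconds:

```json
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Heat pump installation walkthrough",
  "description": "Step-by-step installation of an air-source heat pump, from site survey to commissioning.",
  "thumbnailUrl": "https://example.com/thumbs/hp-install.jpg",
  "uploadDate": "2026-01-15",
  "contentUrl": "https://example.com/video/hp-install.mp4",
  "duration": "PT12M30S",
  "transcript": "https://example.com/transcripts/hp-install-en.txt",
  "inLanguage": "en",
  "publisher": {
    "@type": "Organization",
    "name": "Example Media",
    "sameAs": "https://example.com/about"
  },
  "hasPart": [
    {
      "@type": "Clip",
      "name": "Site survey",
      "startOffset": 0,
      "endOffset": 140,
      "url": "https://example.com/video/hp-install?t=0",
      "about": { "@type": "Thing", "name": "Site survey" }
    },
    {
      "@type": "Clip",
      "name": "Refrigerant line routing",
      "startOffset": 140,
      "endOffset": 420,
      "url": "https://example.com/video/hp-install?t=140"
    }
  ]
}
```

Each Clip's url should deep-link to the timestamp so an engine that surfaces the chapter can send users directly to it.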
AudioObject
Required fields:
- @type: AudioObject
- name
- description
- contentUrl
- uploadDate
- duration
Recommended fields:
- transcript (linked MediaObject is preferred; inline string acceptable for short clips)
- hasPart for episode segments using Clip
- inLanguage
- creator, publisher
For podcasts, also publish the parent PodcastSeries with episode references. Keep the canonical episode URL stable.
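For a podcast, one pattern is to nest the AudioObject inside a PodcastEpisode that references its PodcastSeries. Every identifier below is a placeholder:

```json
{
  "@context": "https://schema.org",
  "@type": "PodcastEpisode",
  "url": "https://example.com/podcast/ep-12",
  "name": "Episode 12: Grid-aware heat pumps",
  "episodeNumber": 12,
  "audio": {
    "@type": "AudioObject",
    "name": "Episode 12: Grid-aware heat pumps",
    "description": "Interview on demand-response programs for residential heat pumps.",
    "contentUrl": "https://example.com/audio/ep-12.mp3",
    "uploadDate": "2026-02-01",
    "duration": "PT38M20S",
    "transcript": "https://example.com/transcripts/ep-12-en.txt",
    "inLanguage": "en"
  },
  "partOfSeries": {
    "@type": "PodcastSeries",
    "name": "Example Energy Podcast",
    "url": "https://example.com/podcast"
  }
}
```

The episode url doubles as the stable canonical URL the section above calls for.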
Transcript guidance
Transcripts are content. Write them like content:
- Segment by speaker and topic, not by 30-second windows.
- Include timestamps at segment boundaries.
- Run the transcript through the same QA pass as written articles: extractable phrasing, factual claims with sources, no filler.
- For multilingual content, publish parallel transcripts and link them via hreflang and translationOfWork.
A polished transcript is the difference between a video that is technically indexable and a video that is citation-eligible.
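One way to link parallel transcripts, sketched with placeholder URLs: publish each language as its own MediaObject and point each translation back at the original via translationOfWork:

```json
{
  "@context": "https://schema.org",
  "@type": "MediaObject",
  "@id": "https://example.com/transcripts/ep-12-es",
  "contentUrl": "https://example.com/transcripts/ep-12-es.txt",
  "encodingFormat": "text/plain",
  "inLanguage": "es",
  "translationOfWork": {
    "@id": "https://example.com/transcripts/ep-12-en"
  }
}
```

The hreflang pairing lives in the page head or HTTP headers as usual; the JSON-LD link makes the same relationship visible to retrievers that read structured data.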
Entity bindings
Every media object should bind to at least one entity:
- about for the primary subject.
- mentions for secondary subjects.
- subjectOf when the media documents a specific named work or event.
Entities should resolve to a Thing with sameAs references to authoritative sources (Wikidata, the entity's official page, a registry). Without entity bindings, the media is retrievable on text alone, which underuses the schema layer.
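A minimal binding pattern, with illustrative entity names and URLs:

```json
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/images/heat-pump-cutaway.jpg",
  "about": {
    "@type": "Thing",
    "name": "Heat pump",
    "sameAs": [
      "https://en.wikipedia.org/wiki/Heat_pump"
    ]
  },
  "mentions": [
    { "@type": "Thing", "name": "Refrigerant" }
  ]
}
```

In practice, the sameAs array should list every authoritative reference you can verify for the entity, not just one.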
Validation gates
Before publish, run:
- Schema validator. Google's Rich Results Test plus the Schema Markup Validator for fields Google does not check.
- Transcript completeness. All video and audio over 60 seconds must have a transcript.
- Alt text presence. All `img` tags have non-empty alt text.
- Entity coverage. Each media object has at least one about binding.
- Stable URLs. contentUrl and embedUrl are canonical; CDN URLs without versioning are rejected.
Gate failures should block publish, not warn.
Implementation pitfalls
- Stuffing keywords into alt text. Engines penalize this; write descriptive alt text in flat declarative voice.
- Reusing the same description across media. Each media object's description should be specific to that media.
- Skipping transcripts on short videos. Even 30-second clips benefit from a transcript for retrieval.
- Using embedUrl only. Where possible, also publish contentUrl to a stable host so retrievers can access the source directly.
- Forgetting representativeOfPage. The hero image is your most-cited image; mark it explicitly.
FAQ
Q: Do AI engines actually consume schema in 2026?
Yes. Google AI Overviews, Gemini, and ChatGPT search use structured data signals during retrieval and citation selection. Schema is not the only signal, but it materially improves citation eligibility for media-heavy pages.
Q: Should I duplicate alt text into the JSON-LD description?
No. Write alt text for accessibility (brief, concrete) and description for retrieval (entity-rich, factual). Engines read both for different purposes.
Q: Are auto-generated transcripts good enough?
As a starting point, yes. Citation-grade transcripts are reviewed and corrected by a human, especially for proper nouns, numbers, and technical terms.
Q: Do Clip chapter markers produce direct timestamp citations?
In engines that support them (notably Google), yes. Markers also let editors track which chapters earn citations and where to invest in transcript polish.
Q: Can ImageObject improve text article citations even without an image-heavy page?
Yes. Marking the hero image with representativeOfPage and a strong description helps multimodal retrievers reinforce the page's entity binding, which lifts the article overall.
Related Articles
ClaimReview Schema for AI Trust: Specification and Implementation
Specification for ClaimReview schema applied to AI trust: structure, required fields, valid values, and patterns for non-fact-check publishers.
Knowledge Graph Markup for AI Search: A schema.org Pattern Specification
Knowledge graph markup for AI search: a schema.org pattern specification linking entities, relationships, and citations to win generative engine trust.