Dataset Schema for AI Citations
schema.org/Dataset is the canonical JSON-LD type for a published collection of data — research data, government statistics, machine-learning training corpora, benchmark datasets. It powers Google Dataset Search and is the structured-data backbone for AI engines that need to cite numbers, percentages, and trends. Required: name, description, license. Strongly recommended: distribution (with DataDownload and encodingFormat), creator, citation, variableMeasured, temporalCoverage, spatialCoverage, and a stable identifier like a DOI. FAIR-aligned datasets earn dramatically higher AI citation rates because they are findable, accessible, interoperable, and reusable.
TL;DR
Mark up dataset landing pages with JSON-LD @type: Dataset. Required: name, description (50-5000 characters), license (use canonical Creative Commons URI). Strongly recommended: distribution (one or more DataDownload objects with contentUrl + encodingFormat), creator, citation, identifier (DOI preferred), variableMeasured, temporalCoverage, spatialCoverage, keywords, isAccessibleForFree, version, dateModified. Validate with the Schema.org Validator and Google Rich Results Test. Submit dataset URLs to Google Dataset Search for fastest discovery.
Definition
Dataset is a Schema.org type for "a body of structured information describing some topic(s) of interest." Examples include CSV files, scientific corpora, government statistics, satellite imagery archives, machine-learning benchmark datasets, financial time series, and clinical trial data. Each Dataset object describes the dataset itself; downloadable manifestations are described separately as DataDownload objects nested under distribution. A DataCatalog is a higher-level container of multiple datasets (use it on catalog index pages, not on individual dataset pages).
Google Dataset Search uses Dataset markup to populate a dedicated dataset-search vertical at toolbox.google.com/datasetsearch. AI engines (ChatGPT, Claude, Perplexity, Google AI Mode) reuse the same vocabulary to ground numerical claims, attribute statistics, and cite sources for data-driven answers.
Why Dataset schema matters for AI search
- Numerical grounding. AI engines strongly prefer structured data sources when citing numerical claims because numbers are easy to misattribute. A dataset with proper Dataset markup, license, and DOI is preferentially cited over a blog post containing the same numbers.
- Findability inside dataset verticals. Google Dataset Search is a separate index. Without Dataset markup, your data is invisible to it (and to the AI engines that use it as a feed source).
- License-aware AI summarization. AI engines increasingly respect license terms when reproducing data. A dataset with a clear license field (especially CC-BY or CC0) is more likely to be summarized in full; restrictive or missing licenses cause AI engines to elide details or skip citation.
- FAIR alignment. Datasets that follow FAIR principles (Findable, Accessible, Interoperable, Reusable) produce richer structured data and earn correspondingly higher AI citation rates.
Industry studies suggest LLMs powered by knowledge graphs achieve significantly higher accuracy than those relying solely on unstructured text — a data.world analysis claimed up to 3x. Treat the specific multiplier as directional; the underlying mechanism (structured-data grounding lifts factual accuracy) is well established.
Required and recommended properties
Google's required and recommended properties for the Dataset rich result:
| Property | Type | Status | Description |
|---|---|---|---|
| name | Text | Required | Short title of the dataset. |
| description | Text | Required | 50-5000 characters. Google truncates after 5000. |
| license | URL or CreativeWork | Required (effective) | Canonical license URI (e.g., https://creativecommons.org/publicdomain/zero/1.0/). Use the canonical, language-neutral URI. |
| url | URL | Recommended | Canonical landing page URL. |
| sameAs | URL | Recommended | DOI page, Wikidata, dataset portals. |
| identifier | URL, Text, or PropertyValue | Recommended | DOI preferred. Use PropertyValue for typed identifiers. |
| creator | Person or Organization | Recommended | Use @id to link to a canonical Organization or Person entity. |
| publisher | Organization | Recommended | Distinct from creator if applicable. |
| funder | Person or Organization | Recommended | For grant-funded datasets. |
| citation | CreativeWork or Text | Recommended | Recommended citation string (BibTeX, APA, or scholarly reference). |
| version | Text or Number | Recommended | Semantic version (1.2.0) or release tag. |
| dateModified | Date | Recommended | Most recent modification date. |
| datePublished | Date | Recommended | First publication date. |
| distribution | DataDownload | Recommended | One or more DataDownload objects per format. |
| variableMeasured | Text or PropertyValue | Recommended | What the dataset measures (e.g., "sea surface temperature"). |
| measurementTechnique | Text or URL | Recommended | How variables were measured. |
| temporalCoverage | Text | Recommended | ISO 8601 interval or range (e.g., 2020-01-01/2025-12-31). |
| spatialCoverage | Place | Recommended | Geographic coverage; use Place with geo for precision. |
| keywords | Text | Recommended | Comma-separated topic terms or controlled vocabulary terms. |
| isAccessibleForFree | Boolean | Recommended | True for open data. |
| includedInDataCatalog | DataCatalog | Optional | Parent catalog. |
| issn | Text | Optional | For serial datasets. |
DataDownload sub-properties
| Property | Type | Status | Description |
|---|---|---|---|
| @type | "DataDownload" | Required | |
| contentUrl | URL | Required | Direct download URL. |
| encodingFormat | Text | Required (effective) | MIME type (text/csv, application/json, application/x-parquet, application/zip). |
| name | Text | Recommended | Human-readable name for this distribution. |
| description | Text | Optional | E.g., "Compressed CSV bundle, 4.2 GB". |
| contentSize | Text | Recommended | E.g., "4.2 GB" or "42000000" (bytes). |
| dateModified | Date | Recommended | Distribution-specific modification date. |
Canonical example: open research dataset
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Dataset",
"@id": "https://example.org/datasets/global-sst-2020-2025#dataset",
"name": "Global Sea Surface Temperature Daily Composites, 2020-2025",
"description": "Daily composites of global sea surface temperature derived from MODIS Aqua and VIIRS sensors, gridded at 0.05° resolution. Covers the period 2020-01-01 through 2025-12-31. Includes quality flags, cloud masks, and per-pixel uncertainty estimates. Suitable for climate research, marine biology, and operational oceanography.",
"url": "https://example.org/datasets/global-sst-2020-2025",
"sameAs": "https://doi.org/10.5281/zenodo.example12345",
"identifier": {
"@type": "PropertyValue",
"propertyID": "DOI",
"value": "10.5281/zenodo.example12345"
},
"keywords": [
"sea surface temperature",
"climate",
"oceanography",
"remote sensing",
"MODIS",
"VIIRS"
],
"license": "https://creativecommons.org/licenses/by/4.0/",
"isAccessibleForFree": true,
"creator": {
"@type": "Organization",
"@id": "https://example.org/#organization",
"name": "Example Climate Research Institute",
"url": "https://example.org"
},
"funder": {
"@type": "Organization",
"name": "National Science Foundation",
"identifier": "https://ror.org/021nxhr62"
},
"version": "2.1.0",
"datePublished": "2026-04-01",
"dateModified": "2026-05-01",
"temporalCoverage": "2020-01-01/2025-12-31",
"spatialCoverage": {
"@type": "Place",
"name": "Global ocean",
"geo": {
"@type": "GeoShape",
"box": "-90 -180 90 180"
}
},
"variableMeasured": [
{
"@type": "PropertyValue",
"name": "sea_surface_temperature",
"unitText": "degrees Celsius"
},
{
"@type": "PropertyValue",
"name": "quality_flag",
"description": "0=good, 1=marginal, 2=cloudy, 3=invalid"
}
],
"measurementTechnique": "Satellite remote sensing (thermal infrared)",
"citation": "Example Climate Research Institute (2026). Global Sea Surface Temperature Daily Composites, 2020-2025 (Version 2.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.example12345",
"distribution": [
{
"@type": "DataDownload",
"name": "NetCDF, daily files (per-day)",
"contentUrl": "https://example.org/datasets/global-sst-2020-2025/netcdf/",
"encodingFormat": "application/x-netcdf",
"contentSize": "1.4 TB total"
},
{
"@type": "DataDownload",
"name": "Parquet, monthly aggregates",
"contentUrl": "https://example.org/datasets/global-sst-2020-2025/parquet/monthly.parquet",
"encodingFormat": "application/x-parquet",
"contentSize": "42 GB"
},
{
"@type": "DataDownload",
"name": "CSV, regional summaries",
"contentUrl": "https://example.org/datasets/global-sst-2020-2025/csv/regional-summary.csv",
"encodingFormat": "text/csv",
"contentSize": "180 MB"
}
]
}
</script>
Canonical example: machine-learning benchmark
ML benchmarks (MMLU, GPQA, HumanEval, etc.) deserve full Dataset markup so AI engines can cite them when answering evaluation-related questions.
{
"@context": "https://schema.org",
"@type": "Dataset",
"name": "Example Reasoning Benchmark v1.0",
"description": "5,000 expert-validated multi-step reasoning questions across mathematics, physics, and computer science. Designed to evaluate large language model performance on graduate-level problems.",
"url": "https://example.org/benchmarks/reasoning-v1",
"identifier": {
"@type": "PropertyValue",
"propertyID": "DOI",
"value": "10.5281/zenodo.example99999"
},
"keywords": ["LLM benchmark", "reasoning", "evaluation", "NLP"],
"license": "https://creativecommons.org/licenses/by-sa/4.0/",
"isAccessibleForFree": true,
"creator": {
"@type": "Organization",
"name": "Example AI Research Lab"
},
"version": "1.0",
"datePublished": "2026-03-15",
"variableMeasured": [
{"@type": "PropertyValue", "name": "accuracy", "unitText": "percent"},
{"@type": "PropertyValue", "name": "reasoning_steps", "unitText": "count"}
],
"distribution": [
{
"@type": "DataDownload",
"contentUrl": "https://huggingface.co/datasets/example/reasoning-v1",
"encodingFormat": "application/json"
},
{
"@type": "DataDownload",
"contentUrl": "https://github.com/example/reasoning-v1",
"encodingFormat": "text/x-python"
}
],
"citation": "Example AI Research Lab (2026). Example Reasoning Benchmark v1.0. arXiv:2603.example."
}
License selection
Use canonical, language-neutral license URIs. Common choices:
- https://creativecommons.org/publicdomain/zero/1.0/ — CC0 (public domain). Maximum AI citation reuse.
- https://creativecommons.org/licenses/by/4.0/ — CC-BY 4.0 (attribution required). Standard for open research data.
- https://creativecommons.org/licenses/by-sa/4.0/ — CC-BY-SA 4.0 (share-alike). Common for ML benchmarks.
- https://opendatacommons.org/licenses/odbl/1-0/ — ODbL (database-specific share-alike).
- https://www.gnu.org/licenses/gpl-3.0.html — GPLv3 (code-style copyleft).
Use the canonical, non-localized URI. AI engines and Google Dataset Search treat localized variants (e.g., .../by/4.0/deed.de) as separate licenses, which fragments your dataset's attribution graph.
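To keep every page on the canonical form, a small normalizer can strip localized deed/legalcode suffixes before the JSON-LD is emitted. A minimal sketch in Python; `canonical_cc_license` is our own illustrative helper, not part of any library:

```python
from urllib.parse import urlparse


def canonical_cc_license(uri: str) -> str:
    """Return the canonical, language-neutral form of a Creative Commons URI.

    Localized variants like .../by/4.0/deed.de or .../zero/1.0/legalcode
    are collapsed to the canonical license URI; non-CC URIs pass through.
    """
    parsed = urlparse(uri)
    if parsed.netloc != "creativecommons.org":
        return uri  # not a CC URI; leave untouched
    parts = [p for p in parsed.path.split("/") if p]
    # Drop trailing localized segments such as "deed.de" or "legalcode.fr"
    if parts and (parts[-1].startswith("deed") or parts[-1].startswith("legalcode")):
        parts = parts[:-1]
    return "https://creativecommons.org/" + "/".join(parts) + "/"


canonical_cc_license("https://creativecommons.org/licenses/by/4.0/deed.de")
```

Running the normalizer at template-render time guarantees that every dataset page, regardless of the visitor's locale, emits the same license URI.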
variableMeasured patterns
Simple text
"variableMeasured": "Atmospheric CO2 concentration"
PropertyValue (preferred for AI grounding)
"variableMeasured": [
{
"@type": "PropertyValue",
"name": "co2_ppm",
"unitText": "parts per million",
"description": "Monthly mean atmospheric CO2, dry-air mole fraction"
},
{
"@type": "PropertyValue",
"name": "station_id",
"description": "NOAA GML station identifier"
}
]
FAIR alignment
FAIR (Findable, Accessible, Interoperable, Reusable) principles align directly with Dataset schema properties:
- Findable → identifier (DOI), name, description, keywords, presence in Google Dataset Search.
- Accessible → distribution.contentUrl, isAccessibleForFree, working downloads.
- Interoperable → encodingFormat (open formats), variableMeasured with controlled vocabularies, measurementTechnique.
- Reusable → license (open), creator, citation, version, clear provenance.
AI engines disproportionately cite FAIR-aligned datasets because the structured data answers the four questions a citation engine asks: "can I find it, can I access it, can I parse it, am I allowed to reuse it?"
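The FAIR-to-property mapping above can be turned into a quick self-audit. A rough sketch, assuming the mapping used in this article; `FAIR_MAP` and `fair_report` are illustrative names, not an official FAIR scorer:

```python
# Map each FAIR pillar to the Dataset JSON-LD properties that back it
# (per the alignment described in this article).
FAIR_MAP = {
    "Findable": ["identifier", "name", "description", "keywords"],
    "Accessible": ["distribution", "isAccessibleForFree"],
    "Interoperable": ["variableMeasured", "measurementTechnique"],
    "Reusable": ["license", "creator", "citation", "version"],
}


def fair_report(dataset: dict) -> dict:
    """For each pillar, list the mapped properties missing from the JSON-LD."""
    return {
        pillar: [prop for prop in props if prop not in dataset]
        for pillar, props in FAIR_MAP.items()
    }


fair_report({
    "name": "Example Dataset",
    "description": "...",
    "license": "https://creativecommons.org/licenses/by/4.0/",
})
```

An empty list under every pillar means the markup covers all four questions a citation engine asks.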
DCAT and dataset metadata standards
Dataset is intentionally a lightweight wrapper. For richer metadata, align with DCAT (Data Catalog Vocabulary) on the catalog side and reference DataCite, ISO 19115, or DDI on the dataset side. Schema.org Dataset can carry DCAT-equivalent fields via additionalType or by mirroring properties.
Validation pipeline
- Schema.org Validator — syntax check.
- Google Rich Results Test — confirms dataset eligibility.
- Google Dataset Search submission — add canonical dataset URLs to your sitemap; Google Dataset Search will discover them.
- License URI canonicalization — confirm you are using the language-neutral canonical URI.
- DOI resolution check — confirm DOI resolves to the correct landing page.
- AI smoke test — query Perplexity, ChatGPT, and Google AI Overviews for representative numerical questions your dataset answers. Track whether your dataset is cited.
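Before reaching for the external validators, a local pre-check can catch the hard requirements. A hedged sketch; `lint_dataset` is a hypothetical helper whose rules mirror this article's required and effectively-required properties, not Google's actual validator:

```python
def lint_dataset(ds: dict) -> list[str]:
    """Flag violations of the required / effectively-required Dataset properties."""
    errors = []
    # Hard requirements: name, description, license
    for prop in ("name", "description", "license"):
        if prop not in ds:
            errors.append(f"missing required property: {prop}")
    # Description length window (Google truncates after 5000 characters)
    desc = ds.get("description", "")
    if desc and not (50 <= len(desc) <= 5000):
        errors.append("description should be 50-5000 characters")
    # Each DataDownload needs contentUrl + encodingFormat
    for i, dist in enumerate(ds.get("distribution", [])):
        for prop in ("contentUrl", "encodingFormat"):
            if prop not in dist:
                errors.append(f"distribution[{i}] missing {prop}")
    if not ds.get("distribution"):
        errors.append("no distribution: deprioritized in Google Dataset Search")
    return errors
```

Wire this into CI so a dataset page cannot ship with a missing license or a distribution that lacks an encodingFormat.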
Common mistakes
- Description below 50 characters or above 5000. Google truncates after 5000. Keep descriptions informative.
- License as a free-text string instead of a URL. Always use canonical URI.
- Localized license URI. Use language-neutral form.
- No distribution. Datasets without downloadable distributions are deprioritized in Google Dataset Search.
- DataDownload missing contentUrl or encodingFormat. Both are effectively required.
- DOI in url instead of identifier. Use the structured identifier PropertyValue.
- Dataset markup on a catalog page. Use DataCatalog for catalogs; Dataset for individual datasets.
- Stale dateModified. Update when data changes.
- No version. Versioning is required for reproducible AI citations.
- variableMeasured as a single text blob. Prefer PropertyValue array with unit and description.
How to apply
- For each dataset, create a canonical landing page.
- Mint a DOI (Zenodo, DataCite, Crossref) and use it as identifier.
- Draft Dataset JSON-LD with name, description (50-5000 chars), canonical license URI.
- Add creator, publisher, version, datePublished, dateModified.
- Add one DataDownload per format under distribution with contentUrl + encodingFormat.
- Add variableMeasured, measurementTechnique, temporalCoverage, spatialCoverage.
- Validate with Schema.org Validator + Google Rich Results Test.
- Submit canonical URL via sitemap; verify appearance in Google Dataset Search within 14 days.
- Update dateModified and version whenever data changes; never silently update without bumping version.
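The steps above can be sketched end to end: assemble the Dataset JSON-LD from plain metadata and emit the `<script>` tag for the landing page. A minimal sketch; `dataset_jsonld` is an illustrative helper and all field values are placeholders:

```python
import json


def dataset_jsonld(meta: dict) -> str:
    """Wrap dataset metadata in @context/@type and render the script tag."""
    doc = {"@context": "https://schema.org", "@type": "Dataset", **meta}
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2, ensure_ascii=False)
        + "\n</script>"
    )


tag = dataset_jsonld({
    "name": "Example Dataset",
    "description": "At least fifty characters of meaningful description text goes here.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": {"@type": "PropertyValue", "propertyID": "DOI",
                   "value": "10.5281/zenodo.example12345"},
    "version": "1.0.0",
    "distribution": [{
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data.csv",
        "encodingFormat": "text/csv",
    }],
})
```

Generating the tag from one metadata source keeps the JSON-LD, the citation string, and the download links from drifting apart across releases.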
FAQ
Q: What are the required properties for Dataset?
name, description, and license. In practice distribution (with DataDownload) and identifier (DOI) are effectively required for Google Dataset Search visibility and AI citations.
Q: Should I mint a DOI?
Yes for any research, statistical, or benchmark dataset. DOIs are the most stable identifier and are the AI-citation primitive of choice. Use Zenodo (free), DataCite, or Crossref.
Q: What encodingFormat values should I use?
Use MIME types: text/csv, application/json, application/x-parquet, application/x-netcdf, application/zip. For software-distributed datasets (Python packages, Hugging Face), use text/x-python or the relevant MIME.
Q: How do I describe time coverage for a streaming/live dataset?
Use temporalCoverage with an open-ended ISO 8601 interval ending in ..: "2020-01-01/..". Update dateModified regularly so AI engines know the data is live.
Q: Can I publish multiple distributions of the same data?
Yes — in fact, you should. Provide CSV, Parquet, and JSON if your data warrants it. Each DataDownload has its own contentUrl and encodingFormat.
Q: How does FAIR relate to Dataset schema?
FAIR principles map onto Dataset properties almost one-for-one. Implementing the recommended properties operationalizes FAIR for both human and AI consumers.
Q: Should benchmark datasets use Dataset schema?
Yes. ML benchmarks like MMLU, GPQA, and HumanEval should publish Dataset markup so AI engines can correctly cite them when answering evaluation questions. Include a citation field with the canonical paper reference.
- Schema.org, "Dataset" type — verified 2026-05-03 — supports canonical type definition and variableMeasured guidance. https://schema.org/Dataset
- Google Search Central, "Dataset (Dataset, DataCatalog, DataDownload) structured data" — verified 2026-05-03 — supports required-property list, 5000-character limit, Google Dataset Search integration. https://developers.google.com/search/docs/appearance/structured-data/dataset
- NDE (Netherlands Digital Heritage), "Requirements for Datasets" — verified 2026-05-03 — supports canonical license URI requirement. https://docs.nde.nl/requirements-datasets/
- W3C, "Data Catalog Vocabulary (DCAT) Version 3" — verified 2026-05-03 — supports DCAT alignment for catalog metadata. https://www.w3.org/TR/vocab-dcat-3/
- Carnegie Mellon University Library Guides, "Making Your LLM Dataset FAIR" — verified 2026-05-03 — supports FAIR principles and AI dataset alignment. https://guides.library.cmu.edu/researchdatamanagement/FAIR_llmdatasets
- AILabsAudit, "Schema Markup for AI Search: 7 JSON-LD That Boost Citations" — verified 2026-05-03 — supports data.world 3x knowledge-graph accuracy claim (treat as directional). https://ailabsaudit.com/blog/en/schema-markup-ai-visibility-guide
Related Articles
Event Schema for AI Search
Schema.org Event JSON-LD spec for AI search: required name/startDate/location, virtual and hybrid events, eventStatus, performer linkage, and AI citation patterns.
LocalBusiness Schema for AI Citations
LocalBusiness JSON-LD spec for AI citations: required NAP fields, openingHoursSpecification, geo, sub-types, sameAs, and AI local-intent citation patterns.
Service Schema for AI Search
Schema.org Service spec for AI search citations: required JSON-LD properties, areaServed, provider, serviceType, hoursAvailable, and entity-linking patterns.