Dataset Schema for AI Citations
schema.org/Dataset is the canonical JSON-LD type for a published collection of data — research data, government statistics, machine-learning training corpora, benchmark datasets. It powers Google Dataset Search and is the structured-data backbone for AI engines that need to cite numbers, percentages, and trends. Required: name, description, license. Strongly recommended: distribution (with DataDownload and encodingFormat), creator, citation, variableMeasured, temporalCoverage, spatialCoverage, and a stable identifier like a DOI. FAIR-aligned datasets earn dramatically higher AI citation rates because they are findable, accessible, interoperable, and reusable.
TL;DR
Mark up dataset landing pages with JSON-LD @type: Dataset. Required: name, description (50-5000 characters), license (use canonical Creative Commons URI). Strongly recommended: distribution (one or more DataDownload objects with contentUrl + encodingFormat), creator, citation, identifier (DOI preferred), variableMeasured, temporalCoverage, spatialCoverage, keywords, isAccessibleForFree, version, dateModified. Validate with the Schema.org Validator and Google Rich Results Test. Submit dataset URLs to Google Dataset Search for fastest discovery.
Definition
Dataset is a Schema.org type for "a body of structured information describing some topic(s) of interest." Examples include CSV files, scientific corpora, government statistics, satellite imagery archives, machine-learning benchmark datasets, financial time series, and clinical trial data. Each Dataset object describes the dataset itself; downloadable manifestations are described separately as DataDownload objects nested under distribution. A DataCatalog is a higher-level container of multiple datasets (use it on catalog index pages, not on individual dataset pages).
Google Dataset Search uses Dataset markup to populate a dedicated dataset-search vertical at toolbox.google.com/datasetsearch. AI engines (ChatGPT, Claude, Perplexity, Google AI Mode) reuse the same vocabulary to ground numerical claims, attribute statistics, and cite sources for data-driven answers.
Why Dataset schema matters for AI search
- Numerical grounding. AI engines strongly prefer structured data sources when citing numerical claims because numbers are easy to misattribute. A dataset with proper Dataset markup, license, and DOI is preferentially cited over a blog post containing the same numbers.
- Findability inside dataset verticals. Google Dataset Search is a separate index. Without Dataset markup, your data is invisible to it (and to the AI engines that use it as a feed source).
- License-aware AI summarization. AI engines increasingly respect license terms when reproducing data. A dataset with a clear license field (especially CC-BY or CC0) is more likely to be summarized in full; restrictive or missing licenses cause AI engines to elide details or skip citation.
- FAIR alignment. Datasets that follow FAIR principles (Findable, Accessible, Interoperable, Reusable) produce richer structured data and earn correspondingly higher AI citation rates.
Industry studies suggest LLMs powered by knowledge graphs achieve significantly higher accuracy than those relying solely on unstructured text — a data.world analysis claimed up to 3x. Treat the specific multiplier as directional; the underlying mechanism (structured-data grounding lifts factual accuracy) is well established.
Required and recommended properties
Google's required and recommended properties for the Dataset rich result:
| Property | Type | Status | Description |
|---|---|---|---|
| name | Text | Required | Short title of the dataset. |
| description | Text | Required | 50-5000 characters. Google truncates after 5000. |
| license | URL or CreativeWork | Required (effective) | Canonical license URI (e.g., https://creativecommons.org/publicdomain/zero/1.0/). Use the canonical, language-neutral URI. |
| url | URL | Recommended | Canonical landing page URL. |
| sameAs | URL | Recommended | DOI page, Wikidata, dataset portals. |
| identifier | URL, Text, or PropertyValue | Recommended | DOI preferred. Use PropertyValue for typed identifiers. |
| creator | Person or Organization | Recommended | Use @id to link to a canonical Organization or Person entity. |
| publisher | Organization | Recommended | Distinct from creator if applicable. |
| funder | Person or Organization | Recommended | For grant-funded datasets. |
| citation | CreativeWork or Text | Recommended | Recommended citation string (BibTeX, APA, or scholarly reference). |
| version | Text or Number | Recommended | Semantic version (1.2.0) or release tag. |
| dateModified | Date | Recommended | Most recent modification date. |
| datePublished | Date | Recommended | First publication date. |
| distribution | DataDownload | Recommended | One or more DataDownload objects per format. |
| variableMeasured | Text or PropertyValue | Recommended | What the dataset measures (e.g., "sea surface temperature"). |
| measurementTechnique | Text or URL | Recommended | How variables were measured. |
| temporalCoverage | Text | Recommended | ISO 8601 interval or range (e.g., 2020-01-01/2025-12-31). |
| spatialCoverage | Place | Recommended | Geographic coverage; use Place with geo for precision. |
| keywords | Text | Recommended | Comma-separated topic terms or controlled vocabulary terms. |
| isAccessibleForFree | Boolean | Recommended | True for open data. |
| includedInDataCatalog | DataCatalog | Optional | Parent catalog. |
| issn | Text | Optional | For serial datasets. |
DataDownload sub-properties
| Property | Type | Status | Description |
|---|---|---|---|
| @type | "DataDownload" | Required | |
| contentUrl | URL | Required | Direct download URL. |
| encodingFormat | Text | Required (effective) | MIME type (text/csv, application/json, application/x-parquet, application/zip). |
| name | Text | Recommended | Human-readable name for this distribution. |
| description | Text | Optional | E.g., "Compressed CSV bundle, 4.2 GB". |
| contentSize | Text | Recommended | E.g., "4.2 GB" or "42000000" (bytes). |
| dateModified | Date | Recommended | Distribution-specific modification date. |
Canonical example: open research dataset
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Dataset",
"@id": "https://example.org/datasets/global-sst-2020-2025#dataset",
"name": "Global Sea Surface Temperature Daily Composites, 2020-2025",
"description": "Daily composites of global sea surface temperature derived from MODIS Aqua and VIIRS sensors, gridded at 0.05° resolution. Covers the period 2020-01-01 through 2025-12-31. Includes quality flags, cloud masks, and per-pixel uncertainty estimates. Suitable for climate research, marine biology, and operational oceanography.",
"url": "https://example.org/datasets/global-sst-2020-2025",
"sameAs": "https://doi.org/10.5281/zenodo.example12345",
"identifier": {
"@type": "PropertyValue",
"propertyID": "DOI",
"value": "10.5281/zenodo.example12345"
},
"keywords": [
"sea surface temperature",
"climate",
"oceanography",
"remote sensing",
"MODIS",
"VIIRS"
],
"license": "https://creativecommons.org/licenses/by/4.0/",
"isAccessibleForFree": true,
"creator": {
"@type": "Organization",
"@id": "https://example.org/#organization",
"name": "Example Climate Research Institute",
"url": "https://example.org"
},
"funder": {
"@type": "Organization",
"name": "National Science Foundation",
"identifier": "https://ror.org/021nxhr62"
},
"version": "2.1.0",
"datePublished": "2026-04-01",
"dateModified": "2026-05-01",
"temporalCoverage": "2020-01-01/2025-12-31",
"spatialCoverage": {
"@type": "Place",
"name": "Global ocean",
"geo": {
"@type": "GeoShape",
"box": "-90 -180 90 180"
}
},
"variableMeasured": [
{
"@type": "PropertyValue",
"name": "sea_surface_temperature",
"unitText": "degrees Celsius"
},
{
"@type": "PropertyValue",
"name": "quality_flag",
"description": "0=good, 1=marginal, 2=cloudy, 3=invalid"
}
],
"measurementTechnique": "Satellite remote sensing (thermal infrared)",
"citation": "Example Climate Research Institute (2026). Global Sea Surface Temperature Daily Composites, 2020-2025 (Version 2.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.example12345",
"distribution": [
{
"@type": "DataDownload",
"name": "NetCDF, daily files (per-day)",
"contentUrl": "https://example.org/datasets/global-sst-2020-2025/netcdf/",
"encodingFormat": "application/x-netcdf",
"contentSize": "1.4 TB total"
},
{
"@type": "DataDownload",
"name": "Parquet, monthly aggregates",
"contentUrl": "https://example.org/datasets/global-sst-2020-2025/parquet/monthly.parquet",
"encodingFormat": "application/x-parquet",
"contentSize": "42 GB"
},
{
"@type": "DataDownload",
"name": "CSV, regional summaries",
"contentUrl": "https://example.org/datasets/global-sst-2020-2025/csv/regional-summary.csv",
"encodingFormat": "text/csv",
"contentSize": "180 MB"
}
]
}
</script>
Canonical example: machine-learning benchmark
ML benchmarks (MMLU, GPQA, HumanEval, etc.) deserve full Dataset markup so AI engines can cite them when answering evaluation-related questions.
{
"@context": "https://schema.org",
"@type": "Dataset",
"name": "Example Reasoning Benchmark v1.0",
"description": "5,000 expert-validated multi-step reasoning questions across mathematics, physics, and computer science. Designed to evaluate large language model performance on graduate-level problems.",
"url": "https://example.org/benchmarks/reasoning-v1",
"identifier": {
"@type": "PropertyValue",
"propertyID": "DOI",
"value": "10.5281/zenodo.example99999"
},
"keywords": ["LLM benchmark", "reasoning", "evaluation", "NLP"],
"license": "https://creativecommons.org/licenses/by-sa/4.0/",
"isAccessibleForFree": true,
"creator": {
"@type": "Organization",
"name": "Example AI Research Lab"
},
"version": "1.0",
"datePublished": "2026-03-15",
"variableMeasured": [
{"@type": "PropertyValue", "name": "accuracy", "unitText": "percent"},
{"@type": "PropertyValue", "name": "reasoning_steps", "unitText": "count"}
],
"distribution": [
{
"@type": "DataDownload",
"contentUrl": "https://huggingface.co/datasets/example/reasoning-v1",
"encodingFormat": "application/json"
},
{
"@type": "DataDownload",
"contentUrl": "https://github.com/example/reasoning-v1",
"encodingFormat": "text/x-python"
}
],
"citation": "Example AI Research Lab (2026). Example Reasoning Benchmark v1.0. arXiv:2603.example."
}
License selection
Use canonical, language-neutral license URIs. Common choices:
- https://creativecommons.org/publicdomain/zero/1.0/ — CC0 (public domain). Maximum AI citation reuse.
- https://creativecommons.org/licenses/by/4.0/ — CC-BY 4.0 (attribution required). Standard for open research data.
- https://creativecommons.org/licenses/by-sa/4.0/ — CC-BY-SA 4.0 (share-alike). Common for ML benchmarks.
- https://opendatacommons.org/licenses/odbl/1-0/ — ODbL (database-specific share-alike).
- https://www.gnu.org/licenses/gpl-3.0.html — GPLv3 (code-style copyleft).
Use the canonical, non-localized URI. AI engines and Google Dataset Search treat localized variants (e.g., .../by/4.0/deed.de) as separate licenses, which fragments your dataset's attribution graph.
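To keep every page on the canonical form, a small normalizer can strip localized deed/legalcode suffixes before the JSON-LD is emitted. A minimal sketch in Python; `canonical_cc_license` is our own illustrative helper, not part of any library:

```python
from urllib.parse import urlparse


def canonical_cc_license(uri: str) -> str:
    """Return the canonical, language-neutral form of a Creative Commons URI.

    Localized variants like .../by/4.0/deed.de or .../zero/1.0/legalcode
    are collapsed to the canonical license URI; non-CC URIs pass through.
    """
    parsed = urlparse(uri)
    if parsed.netloc != "creativecommons.org":
        return uri  # not a CC URI; leave untouched
    parts = [p for p in parsed.path.split("/") if p]
    # Drop trailing localized segments such as "deed.de" or "legalcode.fr"
    if parts and (parts[-1].startswith("deed") or parts[-1].startswith("legalcode")):
        parts = parts[:-1]
    return "https://creativecommons.org/" + "/".join(parts) + "/"


canonical_cc_license("https://creativecommons.org/licenses/by/4.0/deed.de")
```

Running the normalizer at template-render time guarantees that every dataset page, regardless of the visitor's locale, emits the same license URI.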
variableMeasured patterns
Simple text
"variableMeasured": "Atmospheric CO2 concentration"
PropertyValue (preferred for AI grounding)
"variableMeasured": [
{
"@type": "PropertyValue",
"name": "co2_ppm",
"unitText": "parts per million",
"description": "Monthly mean atmospheric CO2, dry-air mole fraction"
},
{
"@type": "PropertyValue",
"name": "station_id",
"description": "NOAA GML station identifier"
}
]
FAIR alignment
FAIR (Findable, Accessible, Interoperable, Reusable) principles align directly with Dataset schema properties:
- Findable → identifier (DOI), name, description, keywords, presence in Google Dataset Search.
- Accessible → distribution.contentUrl, isAccessibleForFree, working downloads.
- Interoperable → encodingFormat (open formats), variableMeasured with controlled vocabularies, measurementTechnique.
- Reusable → license (open), creator, citation, version, clear provenance.
AI engines disproportionately cite FAIR-aligned datasets because the structured data answers the four questions a citation engine asks: "can I find it, can I access it, can I parse it, am I allowed to reuse it?"
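The FAIR-to-property mapping above can be turned into a quick self-audit. A rough sketch, assuming the mapping used in this article; `FAIR_MAP` and `fair_report` are illustrative names, not an official FAIR scorer:

```python
# Map each FAIR pillar to the Dataset JSON-LD properties that back it
# (per the alignment described in this article).
FAIR_MAP = {
    "Findable": ["identifier", "name", "description", "keywords"],
    "Accessible": ["distribution", "isAccessibleForFree"],
    "Interoperable": ["variableMeasured", "measurementTechnique"],
    "Reusable": ["license", "creator", "citation", "version"],
}


def fair_report(dataset: dict) -> dict:
    """For each pillar, list the mapped properties missing from the JSON-LD."""
    return {
        pillar: [prop for prop in props if prop not in dataset]
        for pillar, props in FAIR_MAP.items()
    }


fair_report({
    "name": "Example Dataset",
    "description": "...",
    "license": "https://creativecommons.org/licenses/by/4.0/",
})
```

An empty list under every pillar means the markup covers all four questions a citation engine asks.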
DCAT and dataset metadata standards
Dataset is intentionally a lightweight wrapper. For richer metadata, align with DCAT (Data Catalog Vocabulary) on the catalog side and reference DataCite, ISO 19115, or DDI on the dataset side. Schema.org Dataset can carry DCAT-equivalent fields via additionalType or by mirroring properties.
Validation pipeline
- Schema.org Validator — syntax check.
- Google Rich Results Test — confirms dataset eligibility.
- Google Dataset Search submission — add canonical dataset URLs to your sitemap; Google Dataset Search will discover them.
- License URI canonicalization — confirm you are using the language-neutral canonical URI.
- DOI resolution check — confirm DOI resolves to the correct landing page.
- AI smoke test — query Perplexity, ChatGPT, and Google AI Overviews for representative numerical questions your dataset answers. Track whether your dataset is cited.
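Before reaching for the external validators, a local pre-check can catch the hard requirements. A hedged sketch; `lint_dataset` is a hypothetical helper whose rules mirror this article's required and effectively-required properties, not Google's actual validator:

```python
def lint_dataset(ds: dict) -> list[str]:
    """Flag violations of the required / effectively-required Dataset properties."""
    errors = []
    # Hard requirements: name, description, license
    for prop in ("name", "description", "license"):
        if prop not in ds:
            errors.append(f"missing required property: {prop}")
    # Description length window (Google truncates after 5000 characters)
    desc = ds.get("description", "")
    if desc and not (50 <= len(desc) <= 5000):
        errors.append("description should be 50-5000 characters")
    # Each DataDownload needs contentUrl + encodingFormat
    for i, dist in enumerate(ds.get("distribution", [])):
        for prop in ("contentUrl", "encodingFormat"):
            if prop not in dist:
                errors.append(f"distribution[{i}] missing {prop}")
    if not ds.get("distribution"):
        errors.append("no distribution: deprioritized in Google Dataset Search")
    return errors
```

Wire this into CI so a dataset page cannot ship with a missing license or a distribution that lacks an encodingFormat.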
Common mistakes
- Description below 50 characters or above 5000. Google truncates after 5000. Keep descriptions informative.
- License as a free-text string instead of a URL. Always use canonical URI.
- Localized license URI. Use language-neutral form.
- No distribution. Datasets without downloadable distributions are deprioritized in Google Dataset Search.
- DataDownload missing contentUrl or encodingFormat. Both are effectively required.
- DOI in url instead of identifier. Use the structured identifier PropertyValue.
- Dataset markup on a catalog page. Use DataCatalog for catalogs; Dataset for individual datasets.
- Stale dateModified. Update when data changes.
- No version. Versioning is required for reproducible AI citations.
- variableMeasured as a single text blob. Prefer PropertyValue array with unit and description.
How to apply
- For each dataset, create a canonical landing page.
- Mint a DOI (Zenodo, DataCite, Crossref) and use it as identifier.
- Draft Dataset JSON-LD with name, description (50-5000 chars), canonical license URI.
- Add creator, publisher, version, datePublished, dateModified.
- Add one DataDownload per format under distribution with contentUrl + encodingFormat.
- Add variableMeasured, measurementTechnique, temporalCoverage, spatialCoverage.
- Validate with Schema.org Validator + Google Rich Results Test.
- Submit canonical URL via sitemap; verify appearance in Google Dataset Search within 14 days.
- Update dateModified and version whenever data changes; never silently update without bumping version.
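The steps above can be sketched end to end: assemble the Dataset JSON-LD from plain metadata and emit the `<script>` tag for the landing page. A minimal sketch; `dataset_jsonld` is an illustrative helper and all field values are placeholders:

```python
import json


def dataset_jsonld(meta: dict) -> str:
    """Wrap dataset metadata in @context/@type and render the script tag."""
    doc = {"@context": "https://schema.org", "@type": "Dataset", **meta}
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2, ensure_ascii=False)
        + "\n</script>"
    )


tag = dataset_jsonld({
    "name": "Example Dataset",
    "description": "At least fifty characters of meaningful description text goes here.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": {"@type": "PropertyValue", "propertyID": "DOI",
                   "value": "10.5281/zenodo.example12345"},
    "version": "1.0.0",
    "distribution": [{
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data.csv",
        "encodingFormat": "text/csv",
    }],
})
```

Generating the tag from one metadata source keeps the JSON-LD, the citation string, and the download links from drifting apart across releases.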
FAQ
Q: What are the required properties for Dataset?
name, description, and license. In practice distribution (with DataDownload) and identifier (DOI) are effectively required for Google Dataset Search visibility and AI citations.
Q: Should I mint a DOI?
Yes for any research, statistical, or benchmark dataset. DOIs are the most stable identifier and are the AI-citation primitive of choice. Use Zenodo (free), DataCite, or Crossref.
Q: What encodingFormat values should I use?
Use MIME types: text/csv, application/json, application/x-parquet, application/x-netcdf, application/zip. For software-distributed datasets (Python packages, Hugging Face), use text/x-python or the relevant MIME.
Q: How do I describe time coverage for a streaming/live dataset?
Use temporalCoverage with an open-ended ISO 8601 interval ending in ..: "2020-01-01/..". Update dateModified regularly so AI engines know the data is live.
Q: Can I publish multiple distributions of the same data?
Yes — in fact, you should. Provide CSV, Parquet, and JSON if your data warrants it. Each DataDownload has its own contentUrl and encodingFormat.
Q: How does FAIR relate to Dataset schema?
FAIR principles map onto Dataset properties almost one-for-one. Implementing the recommended properties operationalizes FAIR for both human and AI consumers.
Q: Should benchmark datasets use Dataset schema?
Yes. ML benchmarks like MMLU, GPQA, and HumanEval should publish Dataset markup so AI engines can correctly cite them when answering evaluation questions. Include a citation field with the canonical paper reference.
- Schema.org, "Dataset" type — verified 2026-05-03 — supports canonical type definition and variableMeasured guidance. https://schema.org/Dataset
- Google Search Central, "Dataset (Dataset, DataCatalog, DataDownload) structured data" — verified 2026-05-03 — supports required-property list, 5000-character limit, Google Dataset Search integration. https://developers.google.com/search/docs/appearance/structured-data/dataset
- NDE (Netherlands Digital Heritage), "Requirements for Datasets" — verified 2026-05-03 — supports canonical license URI requirement. https://docs.nde.nl/requirements-datasets/
- W3C, "Data Catalog Vocabulary (DCAT) Version 3" — verified 2026-05-03 — supports DCAT alignment for catalog metadata. https://www.w3.org/TR/vocab-dcat-3/
- Carnegie Mellon University Library Guides, "Making Your LLM Dataset FAIR" — verified 2026-05-03 — supports FAIR principles and AI dataset alignment. https://guides.library.cmu.edu/researchdatamanagement/FAIR_llmdatasets
- AILabsAudit, "Schema Markup for AI Search: 7 JSON-LD That Boost Citations" — verified 2026-05-03 — supports data.world 3x knowledge-graph accuracy claim (treat as directional). https://ailabsaudit.com/blog/en/schema-markup-ai-visibility-guide
Related Articles
Event Schema for AI Search
Schema.org Event JSON-LD spec for AI search: required name/startDate/location, virtual and hybrid events, eventStatus, performer linkage, and AI citation patterns.
LocalBusiness Schema for AI Citations
LocalBusiness JSON-LD spec for AI citations: required NAP fields, openingHoursSpecification, geo, sub-types, sameAs, and AI local-intent citation patterns.
Service Schema for AI Search
Schema.org Service spec for AI search citations: required JSON-LD properties, areaServed, provider, serviceType, hoursAvailable, and entity-linking patterns.