Geodocs.dev

Verified Agent Identity for Citation Trust: A Specification for Authenticated AI Crawlers


Verified agent identity is a publisher-side specification using HTTP Message Signatures and registry lookups to authenticate AI crawlers, separate trusted bots from spoofers, and feed citation-trust signals to generative engines.

TL;DR: User-Agent strings are spoofable. Verified agent identity replaces them with cryptographic signatures the publisher can validate against a public registry. Implement four primitives: a signed request handshake, a registry lookup, a trust label exposed to the rendering pipeline, and an audit log. The result is an authenticated crawler population whose visits become a clean trust signal for generative engines that consume your content.

Why User-Agent strings are not enough

Publishers traditionally identify AI crawlers by their User-Agent string. The string is plain text, easy to copy, and impossible to verify. A scraper claiming to be GPTBot looks identical at the network layer to the real GPTBot. Logs, allowlists, and analytics built on User-Agent are therefore unreliable and produce noisy citation telemetry.

Verified agent identity replaces self-declared identity with cryptographic identity. The crawler signs each request with a private key. The publisher verifies the signature against a public key listed in a registry. Failures are dropped or rate-limited; successes are recorded as authenticated traffic.
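The sign-then-verify flow above can be sketched with the standard library alone. Note the hedge: a real crawler signs with an asymmetric key (for example Ed25519) and the publisher verifies with the public key from the registry; HMAC is used here only as a stdlib stand-in so the covered-string and verify path can be shown without third-party crypto libraries. The covered string loosely mirrors the RFC 9421 signature base; names and values are illustrative.

```python
import hashlib
import hmac

# Stand-in key material. In the real protocol the crawler holds a private
# key and the publisher fetches the matching public key from the registry.
key = b"shared-demo-key"

def signature_base(method: str, path: str, host: str, date: str) -> bytes:
    """Build the covered string the signature protects (RFC 9421 style)."""
    return (
        f'"@request-target": {method.lower()} {path}\n'
        f'"host": {host}\n'
        f'"date": {date}'
    ).encode()

def sign(base: bytes) -> str:
    """Crawler side: sign the covered string."""
    return hmac.new(key, base, hashlib.sha256).hexdigest()

def verify(base: bytes, sig: str) -> bool:
    """Publisher side: constant-time comparison against a fresh signature."""
    return hmac.compare_digest(sign(base), sig)

base = signature_base("GET", "/article", "publisher.example",
                      "Tue, 01 Jul 2025 00:00:00 GMT")
sig = sign(base)
```

Because the signature covers the request target, host, and date, a spoofer who copies only the User-Agent string cannot replay it against a different path or at a later time.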

Specification overview

A conformant verified-agent-identity implementation consists of four components.

Component 1: Signed request handshake

The crawler signs each HTTP request using the IETF HTTP Message Signatures standard. Required headers in the signature input:

  • (request-target)
  • host
  • date
  • digest (for any request body)
  • signature-input
  • signature

The keyid parameter inside signature-input identifies the signer and points the verifier to the registry record.

Component 2: Registry lookup

The verifier resolves keyid against a registry. Two registry models exist in production:

  • Vendor-published JWKS. The vendor (for example, an LLM provider) hosts a JSON Web Key Set at a stable URL. The publisher caches it with a short TTL.
  • Federated registry. A neutral registry (Web Bot Auth, Know Your Agent) lists vetted vendors and their keys. The publisher trusts the registry's allowlist and revocation feed.

A conformant publisher MUST refresh the registry within 24 hours and MUST honor revocation entries immediately.
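The lookup itself is a cached fetch with a short TTL. A minimal sketch of the JWKS model, with a hypothetical registry URL and a fetch hook injected so the cache behavior is testable (in production the TTL would sit alongside revocation-feed monitoring):

```python
import json
import time
import urllib.request

# Hypothetical registry endpoint; real vendors publish their own JWKS URLs.
REGISTRY_URLS = {"vendor-example": "https://vendor.example/.well-known/jwks.json"}
CACHE_TTL_SECONDS = 300  # minutes, not days, so revocations land quickly

_cache: dict[str, tuple[float, dict]] = {}

def get_registry_keys(registry: str, fetch=None) -> dict:
    """Return the cached JWKS for a registry, refreshing when the TTL expires."""
    fetch = fetch or (lambda url: json.load(urllib.request.urlopen(url)))
    now = time.monotonic()
    cached = _cache.get(registry)
    if cached and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]  # fresh enough; serve from cache
    jwks = fetch(REGISTRY_URLS[registry])
    _cache[registry] = (now, jwks)
    return jwks
```

The same shape works for a federated registry: only the URL and the revocation handling differ.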

Component 3: Trust label

Validated requests are tagged with a trust label that travels through the request lifecycle. The label includes:

  • agent_id (canonical name from registry)
  • trust_tier (vendor, partner, unverified)
  • verified_at (timestamp)

The rendering layer reads the label to decide what content to serve. Authenticated trusted crawlers MAY receive the same content as humans plus extended structured data. Unauthenticated traffic claiming a known User-Agent receives the unauthenticated tier.
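The label fields above map naturally onto a small immutable record attached to the request context. A sketch, assuming the three fields named in the spec; the class itself is illustrative, not a wire format any vendor mandates:

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TrustLabel:
    agent_id: str    # canonical name from the registry
    trust_tier: str  # "vendor", "partner", or "unverified"
    verified_at: str # ISO 8601 timestamp of signature verification

def label_verified_request(agent_id: str, tier: str) -> TrustLabel:
    """Attach a trust label to a request that passed signature verification."""
    return TrustLabel(
        agent_id=agent_id,
        trust_tier=tier,
        verified_at=datetime.now(timezone.utc).isoformat(),
    )

label = label_verified_request("GPTBot", "vendor")
```

Freezing the dataclass keeps downstream layers (rendering, analytics, billing) from mutating the verification result after the fact.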

Component 4: Audit log

Every verification result is logged with the request line, key id, registry source, signature outcome, and trust label. The log is the publisher's source of truth for citation telemetry and for incident response when a vendor key is rotated or compromised.
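A structured, append-only line per verification keeps the log machine-parseable for both telemetry and disputes. A minimal sketch using JSON lines; the field names follow the list above, and the sample values are hypothetical:

```python
import json

def audit_record(request_line: str, key_id: str, registry: str,
                 outcome: str, label: dict) -> str:
    """Serialize one verification result as a single JSON log line."""
    return json.dumps({
        "request_line": request_line,
        "key_id": key_id,
        "registry_source": registry,
        "signature_outcome": outcome,  # e.g. "valid", "invalid", "no-signature"
        "trust_label": label,
    }, sort_keys=True)

line = audit_record(
    "GET /article HTTP/1.1",
    "vendor-example-2025",
    "vendor-jwks",
    "valid",
    {"agent_id": "GPTBot", "trust_tier": "vendor"},
)
```

Sorted keys make the lines diff-stable, which matters when you later need to prove exactly what a crawler retrieved on a given date.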

Implementation profile

A minimum implementation:

  1. Pin the registries you will trust (vendor JWKS plus one federated registry).
  2. Stand up an edge verification function that runs before content is served.
  3. Cache registry responses; honor revocation feeds within minutes.
  4. Serve a 401 with WWW-Authenticate: Signature for invalid signatures from User-Agents that claim a registered identity.
  5. Tag valid requests with the trust label and emit telemetry to the audit log.
  6. Expose a public /.well-known/agent-identity document listing accepted registries, supported algorithms, and a contact for vendor onboarding.
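Step 6's discovery document has no standardized schema yet, so the field names below are an assumption chosen to match the items the step lists (accepted registries, supported algorithms, onboarding contact); the URLs and address are placeholders:

```python
import json

# Illustrative /.well-known/agent-identity document. Field names are an
# assumption; no standard schema exists for this file at time of writing.
agent_identity_doc = {
    "version": "1",
    "accepted_registries": [
        "https://vendor.example/.well-known/jwks.json",
        "https://registry.example/known-agents",
    ],
    "supported_algorithms": ["ed25519", "ecdsa-p256-sha256"],
    "onboarding_contact": "agent-onboarding@publisher.example",
}

document_body = json.dumps(agent_identity_doc, indent=2)
```

Serving this as static JSON gives vendors a single place to learn which registries and algorithms your edge will accept before they attempt onboarding.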

Trust tiers

Publishers benefit from a tiered policy rather than a binary allow/deny:

  • Tier A (vendor verified). Cryptographic signature validates against vendor JWKS. Highest trust; full content; minimal rate limit.
  • Tier B (federated verified). Validates against federated registry. Standard rate limit.
  • Tier C (claimed but unverified). User-Agent matches a known pattern but no signature. Reduced content; aggressive rate limit; logged for review.
  • Tier D (unknown). No claim, no signature. Default robots policy applies.

The tier is part of the trust label and is consumed by analytics, billing for licensed crawls, and the GEO content layer.

How verified identity becomes a GEO signal

Generative engines reward sources that survive trust filtering. When a vendor's authenticated crawler visits a page and successfully retrieves canonical content, the vendor's downstream retrieval index becomes more confident in the source. Three concrete effects:

  • Citation persistence. Authenticated visits reduce the probability the page is dropped during index refresh.
  • Canonical reconciliation. Vendors that authenticate can reconcile redirect chains and canonical tags more accurately.
  • Licensed content access. Tier A crawlers can be granted access to paywalled or licensed content with explicit terms, captured in the audit log.

Common implementation pitfalls

  • Trusting only User-Agent for known vendors. Continue to support User-Agent for backwards compatibility, but downgrade unsigned traffic to Tier C.
  • Long registry cache TTLs. A compromised key cached for a week is a compromised week. Cache for minutes, not days, with revocation feed monitoring.
  • Skipping the audit log. Without a log, you cannot prove the crawler accessed the canonical version on the date in question, which weakens any dispute over miscitation.
  • Hard-blocking unverified traffic. Some legitimate vendors are still onboarding signing. Use rate limits and content tiers, not full blocks, while the ecosystem matures.
  • Mixing identity verification with rate-limiting policy. Keep them separate; identity tells you who is asking, rate policy tells you how often they may ask.

FAQ

Q: Is verified agent identity the same as Web Bot Auth?

Web Bot Auth is one of the federated registry implementations of verified agent identity. The broader specification covers the handshake, registry lookup, trust label, and audit log regardless of which registry is used.

Q: Do I need to support every vendor's signing scheme?

Support HTTP Message Signatures as the baseline. Vendors that ship a JWKS at a stable URL plug into a single verification path. Federated registries normalize the rest.

Q: What status code should I return for invalid signatures?

Return 401 Unauthorized with WWW-Authenticate: Signature when the User-Agent claims a registered identity but the signature fails. Return the unauthenticated tier (not 401) for traffic that does not claim verified identity at all.

Q: Will signing slow my edge?

Verification is cheap once registry keys are cached. The hot path is signature validation, which adds sub-millisecond overhead at the edge. The cold path (registry refresh) runs out-of-band.

Q: What about agentic browsers (ChatGPT Atlas, Comet) that act on behalf of a user?

Those are a different category. They claim user delegation, not vendor identity. A separate spec (delegated agent identity) is emerging; pair it with verified vendor identity for full coverage.

