Browser Agent Crawl Etiquette: A Specification for Polite Autonomous AI Browsing

This specification defines five pillars of polite browser-agent behavior — identity, pacing, respect, provenance, and observability — so autonomous AI agents like ChatGPT Atlas, Perplexity Comet, and Claude for Chrome can browse on a user's behalf without exhausting publisher rate limits or eroding citation trust.

TL;DR

  • Identify: send a parseable, documented User-Agent string AND a signed AgentID header per the IETF AgentID draft protocol.
  • Pace: ≤ 1 request per second per origin, with a session-level burst budget no greater than a typical human reader.
  • Respect: honor robots.txt, the proposed /.well-known/agents.txt, and HTTP 429 / Retry-After semantics.
  • Provenance: include a delegation chain (Acting-For: user-id and OAuth evidence) so publishers can authorize per-user, not per-agent-class.
  • Observability: log every fetch, expose an agent feedback URL, and surface the audit trail to both user and publisher.

1. Scope and audience

This specification applies to browser agents — AI systems that drive a real browser session (Chromium, WebKit, or Gecko) to read, click, fill forms, or extract content on behalf of a human user. Examples include ChatGPT Atlas, Perplexity Comet, Claude for Chrome, Microsoft Copilot Vision agent mode, and Gemini agent flows.

This spec does not cover:

  • Server-side training crawlers (e.g. GPTBot, ClaudeBot, PerplexityBot) — these are governed by robots.txt rules tracked by the ai-robots-txt project.
  • Headless backend scrapers using rotating residential IPs — these violate this spec by definition.
  • API-based retrieval (e.g. web_search tool calls that hit a search API rather than a publisher origin).

2. Why a separate etiquette is needed

Browser agents differ from traditional crawlers in three operationally significant ways:

  1. They share the user's network identity. Most browser agents inherit the user's residential IP, cookies, and Chrome user-agent string, making them indistinguishable from the human user via legacy detection tools.
  2. They operate at human-session pace, but at machine consistency. Behavioral fingerprints show near-zero millisecond gaps between events and out-of-order execution sequences — a human browses unevenly; an agent does not.
  3. They are litigated. In November 2025, Amazon sued Perplexity, alleging that its browser agent accessed Amazon's systems without identifying itself via User-Agent headers, the first major case to make agent identification operationally urgent.

The result: publishers cannot deploy CAPTCHA or IP-block defenses against agents without harming the legitimate users whose network identity those agents share. The only durable equilibrium is mutual etiquette: agents identify and throttle themselves, and publishers offer a structured channel for negotiating access.

3. Pillar 1 — Identity

3.1 User-Agent header (REQUIRED)

A browser agent MUST append an agent suffix to the standard browser user-agent string. The suffix format is:

Mozilla/5.0 (...) Chrome/130.0.0.0 Safari/537.36 AgentClient/<vendor-product>/<version> (+<policy-url>)

Examples:

... Chrome/130.0.0.0 Safari/537.36 AgentClient/openai-atlas/1.4 (+https://openai.com/atlas/agent-policy)

... Chrome/130.0.0.0 Safari/537.36 AgentClient/perplexity-comet/0.9 (+https://perplexity.ai/comet/policy)

... Chrome/130.0.0.0 Safari/537.36 AgentClient/anthropic-claude-chrome/2.0 (+https://anthropic.com/claude-chrome/policy)

The +<policy-url> link MUST resolve to a stable, public page describing the agent's behavior, contact channel, and opt-out instructions. Vendors MUST NOT strip the AgentClient suffix to evade detection.

3.2 AgentID header (RECOMMENDED)

In addition to the User-Agent suffix, a browser agent SHOULD present a signed AgentID header per the IETF AgentID draft protocol. The header carries:

  • The agent's stable public-key identifier (agent_id).
  • A short-lived Agent Identity Token (AIT) signed by the agent's private key.
  • A delegation chain (evidence) referencing the OAuth grant that authorized the action (e.g. oauth2:token_exchange).
  • A unique jti claim to prevent replay.

Key material MUST be stored in hardware security modules or cloud secret managers in production; rotation SHOULD support a 72-hour overlap window.
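For illustration, here is a minimal sketch of constructing such a header, assuming a JWT-shaped AIT signed with ES256 via the PyJWT library; the claim names mirror the bullets above, but the draft's actual wire format may differ:

# Sketch: building a signed Agent Identity Token (AIT).
# Assumes a JWT-shaped token; the IETF draft's wire format may differ.
import time
import uuid

import jwt  # PyJWT
from cryptography.hazmat.primitives.asymmetric import ec

# In production this key lives in an HSM or cloud secret manager (see above).
private_key = ec.generate_private_key(ec.SECP256R1())

claims = {
    "agent_id": "did:example:atlas-agent",  # stable public-key identifier (illustrative)
    "evidence": "oauth2:token_exchange",    # delegation-chain reference
    "jti": str(uuid.uuid4()),               # unique ID to prevent replay
    "iat": int(time.time()),
    "exp": int(time.time()) + 300,          # short-lived: five minutes
}
ait = jwt.encode(claims, private_key, algorithm="ES256")
request_headers = {"AgentID": ait}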

3.3 Forbidden behaviors

  • MUST NOT spoof another browser's user-agent string.
  • MUST NOT rotate residential proxies to evade origin detection.
  • MUST NOT strip the Sec-CH-UA Client Hints when the underlying browser supports them.
  • MUST NOT present a different identity to bot-detection vendors than to publishers.

4. Pillar 2 — Pacing

4.1 Per-origin rate ceiling

A browser agent MUST NOT exceed:

  • 1 request per second to any single origin during normal operation.
  • 3 concurrent requests to any single origin during a multi-resource load.
  • 60 requests per minute sustained across any 5-minute window.

These ceilings approximate the well-established "human-speed" guidance of ≤ 6 page requests per minute used by polite traditional crawlers, adjusted upward to account for sub-resource fetches a real browser must make.
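For illustration, a sketch of a per-origin limiter enforcing these ceilings: a token bucket refilled at 1 request/second with a burst of 3, plus the 5-minute sustained window. The class and method names are illustrative, not part of the spec.

# Sketch: per-origin pacing per the ceilings in Section 4.1.
import time
from collections import defaultdict, deque

class OriginLimiter:
    def __init__(self, rps=1.0, burst=3, window_secs=300, window_max=300):
        # 60 requests/minute sustained over 5 minutes = 300 per window.
        self.rps, self.burst = rps, burst
        self.window_secs, self.window_max = window_secs, window_max
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)
        self.history = defaultdict(deque)

    def try_acquire(self, origin):
        now = time.monotonic()
        # Refill the bucket at `rps` tokens/second, capped at `burst`.
        self.tokens[origin] = min(
            self.burst, self.tokens[origin] + (now - self.last[origin]) * self.rps
        )
        self.last[origin] = now
        # Expire history entries older than the sustained-rate window.
        hist = self.history[origin]
        while hist and now - hist[0] > self.window_secs:
            hist.popleft()
        if self.tokens[origin] < 1 or len(hist) >= self.window_max:
            return False  # caller must wait and retry
        self.tokens[origin] -= 1
        hist.append(now)
        return True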

4.2 Session burst budget

A single agent task SHOULD NOT exceed 150 requests to one origin in a single user session without an explicit user-confirmation prompt. Tasks that legitimately need more (e.g. "export every page in this knowledge base") MUST prompt the user with the expected request count and request consent.

4.3 Backoff on errors

On any HTTP 429 or 5xx response (including 503), the agent MUST take the following steps, sketched in code after the list:

  1. Honor Retry-After if present (seconds or HTTP-date).
  2. In the absence of Retry-After, apply exponential backoff: 2s, 4s, 8s, 16s, capped at 60s.
  3. Halt the task entirely after 3 consecutive backoffs and surface the failure to the user.
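A sketch of this backoff policy, assuming a requests-style response object with a .headers mapping; the helper name is illustrative:

# Sketch: error backoff per Section 4.3.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_delay(response, attempt):
    """Seconds to wait before retrying, or None to halt the task."""
    if attempt >= 3:
        return None  # halt after 3 consecutive backoffs; surface failure to the user
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)  # seconds form
        except ValueError:
            when = parsedate_to_datetime(retry_after)  # HTTP-date form
            return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    return float(min(2 ** (attempt + 1), 60))  # 2s, 4s, 8s, 16s, capped at 60s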

4.4 No silent retries on 4xx

4xx responses other than 429 (e.g. 401, 403, 451) MUST NOT be auto-retried. The agent MUST report the response to the user verbatim.

5. Pillar 3 — Respect for publisher signals

5.1 robots.txt

Browser agents MUST respect robots.txt directives keyed to the AgentClient product name (e.g. User-agent: openai-atlas). Publishers MAY use the canonical wildcard User-agent: * to apply to all browser agents.

Note: robots.txt compliance is voluntary at a technical level, but publishers expect it as table stakes. Vendors who treat robots.txt as advisory MUST disclose this prominently in their public agent policy.
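Python's standard library expresses this check directly; a minimal sketch keyed to a hypothetical openai-atlas product token:

# Sketch: robots.txt check per Section 5.1, using the stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://publisher.example/robots.txt")
rp.read()  # fetch and parse the file

# Key the check to the AgentClient product name; the parser falls back to "*".
if not rp.can_fetch("openai-atlas", "https://publisher.example/article"):
    raise PermissionError("robots.txt disallows this fetch for openai-atlas")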

5.2 agents.txt (forward-compatible)

When present, browser agents SHOULD fetch /.well-known/agents.txt at the start of an origin session and respect any directives it contains. The emerging agents.txt proposal extends robots.txt with agent-specific scopes (read, action, payment), rate ceilings, and authentication endpoints.

Minimum supported directives:

/.well-known/agents.txt:

User-agent: *
Max-RPS: 0.5
Max-Session-Requests: 200
Auth-Endpoint: https://example.com/agents/auth
Feedback: agents@example.com
Disallow-Action: /checkout
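A tolerant, line-oriented parser is enough for this directive set; a sketch assuming the simple Key: value syntax shown above (the final proposal's grammar may differ):

# Sketch: parsing the minimum agents.txt directives from Section 5.2.
def parse_agents_txt(text):
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Disallow-Action":
            directives.setdefault(key, []).append(value)  # directive may repeat
        else:
            directives[key] = value
    return directives

# e.g. parse_agents_txt(body)["Max-RPS"] == "0.5"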

5.3 HTTP-level signals

The agent MUST honor:

  • Cache-Control: no-store — do not memoize the response.
  • X-Robots-Tag: noai or noimageai — exclude from any AI summarization.
  • Sec-Fetch-Site: cross-site mismatches — abort if the publisher signals a forbidden context.
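A sketch of the first two checks, assuming a requests-style response object; the returned flag names are illustrative:

# Sketch: honoring response-level signals from Section 5.3.
def apply_http_signals(response):
    cache = response.headers.get("Cache-Control", "").lower()
    robots = response.headers.get("X-Robots-Tag", "").lower()
    return {
        # Cache-Control: no-store -> do not memoize the response.
        "may_cache": "no-store" not in cache,
        # X-Robots-Tag: noai / noimageai -> exclude from AI summarization.
        "may_summarize": "noai" not in robots and "noimageai" not in robots,
    }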

6. Pillar 4 — Provenance and delegation

A browser agent operates on behalf of a user, not as an independent actor. The spec requires explicit provenance so that publishers can authorize per-user, not per-agent-class.

6.1 Acting-For header

Each request SHOULD include:

Acting-For: user-id=<opaque-id>; tenant=<tenant-id>; session=<session-id>

The user-id MUST be stable per (vendor, end-user) pair but MUST NOT be reversible to PII without OAuth-mediated consent.
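One way to meet "stable per (vendor, end-user) pair but not reversible to PII" is a keyed hash; a sketch with illustrative names (VENDOR_SECRET would live in a secret manager):

# Sketch: deriving a stable, non-reversible user-id for Acting-For.
import hashlib
import hmac

VENDOR_SECRET = b"rotate-me-in-a-secret-manager"  # illustrative placeholder

def acting_for(end_user, tenant, session):
    # HMAC keys the mapping to the vendor: stable per (vendor, end-user)
    # pair, but not reversible to PII without the vendor's secret.
    user_id = hmac.new(VENDOR_SECRET, end_user.encode(), hashlib.sha256).hexdigest()[:16]
    return f"Acting-For: user-id=u_{user_id}; tenant={tenant}; session={session}"

# acting_for("alice@example.com", "t_xyz", "s_1")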

6.2 Scope minimization

The agent MUST only request the scopes its current task requires. "Scope-creep" requests (e.g. asking for write access during a read-only task) MUST trigger user re-consent. This aligns with widely cited just-in-time access guidance for AI agent authorization.

6.3 Action attribution

For any action that mutates server state (POST, PUT, DELETE, agentic checkout), the agent MUST:

  • Surface the action to the user with the exact target URL and request body summary.
  • Wait for explicit confirmation per action, unless the user has pre-approved a recurring template.
  • Log the confirmation event with a verifiable timestamp.
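A sketch of the confirmation record this list implies; the field names are illustrative:

# Sketch: recording a per-action confirmation event per Section 6.3.
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class ActionConfirmation:
    method: str        # e.g. "POST"
    target_url: str    # exact URL surfaced to the user
    body_summary: str  # human-readable request-body summary
    confirmed: bool
    confirmed_at: str  # UTC ISO 8601; sign or anchor externally to make it verifiable

record = ActionConfirmation(
    method="POST",
    target_url="https://publisher.example/cart",
    body_summary="add item #123, qty 1",
    confirmed=True,
    confirmed_at=datetime.now(timezone.utc).isoformat(),
)
audit_line = json.dumps(asdict(record))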

7. Pillar 5 — Observability

7.1 Agent-side audit log

The agent runtime MUST maintain a structured per-task audit log including: timestamp, method, URL, status, bytes, latency, and the user-confirmation event (if any). The log MUST be exportable by the user.
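A sketch of one exportable audit-log entry in JSON Lines form, covering the required fields; the function and field names are illustrative:

# Sketch: structured per-task audit log per Section 7.1 (JSON Lines).
import json
from datetime import datetime, timezone

def log_fetch(log_file, method, url, status, num_bytes, latency_ms, confirmation=None):
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "method": method,
        "url": url,
        "status": status,
        "bytes": num_bytes,
        "latency_ms": latency_ms,
        "confirmation": confirmation,  # user-confirmation event, if any
    }
    log_file.write(json.dumps(entry) + "\n")  # one JSON object per line for easy export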

7.2 Publisher feedback channel

Vendors MUST publish a feedback URL or email in the User-Agent docs link. Feedback intake categories should include: rate-limit complaints, content-license violations, action-misuse claims, and identity-spoofing reports. SLA guidance: acknowledge within 5 business days.

Larger vendors SHOULD offer publishers a self-serve dashboard with: monthly request volume, top URLs, error rate, and an opt-out toggle that propagates within 24 hours.

8. Conformance levels

  • Level 1 - Identifiable. Pillars 1 and 2 implemented. Minimum bar for any vendor claiming "polite" behavior.
  • Level 2 - Cooperative. Adds Pillar 3 (robots.txt + 429 handling) and Pillar 5.1 (audit log).
  • Level 3 - Authoritative. Adds Pillar 4 (provenance) and Pillar 5.2/5.3 (publisher feedback + dashboards).
  • Level 4 - Negotiated. Implements agents.txt directives, signed AgentID, and per-publisher rate negotiation.

Vendors SHOULD publish their conformance level on the documented policy page.

9. Test vectors

Conformance probe

curl -A "Mozilla/5.0 (...) Chrome/130.0.0.0 Safari/537.36 AgentClient/test/0.1 (+https://example.test/policy)"

-H "Acting-For: user-id=u_abc; tenant=t_xyz; session=s_1"

https://publisher.example/.well-known/agents.txt

A conformant agent runtime MUST:

  1. Fetch /.well-known/agents.txt before issuing any second, non-trivial request to the origin.
  2. Pace subsequent requests according to the file's Max-RPS.
  3. Stop on 429 + Retry-After.
  4. Log all four steps.
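Tying the four steps together, a sketch of the probe flow using the requests library; the hard-coded Max-RPS stands in for the parser sketched in Section 5.2, and all names are illustrative:

# Sketch: the four conformance steps from Section 9, end to end.
import time

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (...) Chrome/130.0.0.0 Safari/537.36 "
                  "AgentClient/test/0.1 (+https://example.test/policy)",
    "Acting-For": "user-id=u_abc; tenant=t_xyz; session=s_1",
}

# Step 1: fetch agents.txt before any second, non-trivial request.
policy = requests.get("https://publisher.example/.well-known/agents.txt",
                      headers=HEADERS, timeout=10)
max_rps = 0.5  # Step 2: would come from parsing policy.text (Section 5.2 sketch)

for url in ("https://publisher.example/a", "https://publisher.example/b"):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    if resp.status_code == 429:
        break  # Step 3: stop on 429; honor Retry-After before any resumption
    time.sleep(1.0 / max_rps)  # pace to the published ceiling

# Step 4: log every step (see the Section 7.1 sketch).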

10. FAQ

Q: Does this spec apply to my AI search retrieval crawler?

No. This spec is for browser agents driving real browser sessions. Server-side retrieval crawlers should follow the publisher's robots.txt and the relevant model vendor's published crawler policy.

Q: My agent sometimes uses a hidden Chromium for sub-resource loads. Does it count?

Yes — if any phase of the agent task drives a real browser, the entire task MUST comply with this spec. Mixed flows do not exempt the user-facing portion.

Q: What about agents that solve CAPTCHAs?

Solving a CAPTCHA does not constitute consent from the publisher. An agent that solves a CAPTCHA on a page protected by robots.txt-disallow or agents.txt-disallow is non-conformant regardless of whether it succeeded.

Q: How does this relate to the AgentID IETF draft?

AgentID supplies the cryptographic identity layer (Pillar 1.2 and 4.1). This spec defines the operational behavior on top of that identity — pacing, respect, observability — that AgentID alone does not.

Q: Will publishers actually adopt agents.txt?

Adoption is early but accelerating. Cloudflare's AI Crawl Control already exposes structured directives, and the industry mindset is shifting from blocking bots to managing agents. Treat agents.txt support as forward-compatible.

11. Citations

[1] IETF, AgentID: An Identity Protocol for Autonomous AI Agents (draft-gudlab-agentid-protocol-00).
[2] ai-robots-txt, A list of AI agents and robots to block (GitHub).
[3] Stape, AI Browser Tracking: What Marketers and Analysts Should Know (Feb 2026).
[4] Human Security, AI Agent Detection: A Guide to Identifying Autonomous Traffic.
[5] Jones Walker LLP, NIST's AI Agent Standards Initiative (reporting Amazon v. Perplexity, November 2025).
[6] Stack Overflow community guidance on polite crawler request pacing.
[7] Cloudflare, robots.txt setting (Cloudflare bot solutions docs).
[8] M. Lanham, The robots.txt for AI Agents Is Coming (Medium, 2025).
[9] Oso, Best Practices of Authorizing AI Agents.
