Geodocs.dev

Agent Long-Running Job Documentation Specification



A long-running agent job is any tool call that cannot complete within a few seconds and is therefore modeled as an asynchronous operation. This specification defines how to document the kickoff, status, progress, cancellation, and timeout semantics so AI agents can call the tool deterministically.

TL;DR

Long-running tools must (1) return a job identifier from an async kickoff, (2) expose either a GET /jobs/{id} polling endpoint or a Server-Sent Events stream for progress, (3) declare a finite set of status states, and (4) document cancellation, idempotency keys, and timeout SLAs explicitly. Without these, AI agents will retry, hallucinate completion, or stall mid-plan.

Why this specification matters

Many tool failures inside agent runtimes happen on operations that take longer than a single request-response cycle. Synchronous HTTP tooling assumes a budget of a few seconds; AI agents executing real work (generating reports, running batch evaluations, syncing CRM data) routinely exceed it. When a tool blocks for 60 seconds and the agent gateway times out at 30, the agent receives an error, the work continues server-side, and the agent retries. The result is duplicate writes, runaway billing, and incoherent traces.

This document specifies the minimum contract a long-running tool must publish so an AI agent can plan around it. It applies to any tool exposed via Model Context Protocol (MCP), OpenAI function calling, Anthropic tool use, or a custom HTTP gateway.

Scope and applicability

This specification applies when at least one of the following is true:

  • A tool's p95 latency exceeds 5 seconds.
  • A tool depends on third-party APIs whose latency the caller cannot bound.
  • A tool may produce intermediate results useful to an agent before completion.
  • A tool's outcome is non-idempotent (money movement, email send, file write).

If none of these hold, a synchronous tool with explicit timeouts is acceptable.

Required components

1. Asynchronous kickoff

The kickoff endpoint MUST return immediately with a job identifier and a status URL. HTTP 202 Accepted (RFC 7231 §6.3.3) is the canonical response status, and the Location header points to the status resource.

Example request:

POST /v1/reports
Idempotency-Key: 4a8c1f7e-3d6a-4f3b-9a8e-1f9c7d2b4a8c
Prefer: respond-async
Content-Type: application/json

{ "report_type": "quarterly", "period": "2026-Q1" }

Example response:

HTTP/1.1 202 Accepted
Location: /v1/jobs/job_01HZX7Q
Content-Type: application/json

{
  "job_id": "job_01HZX7Q",
  "status": "queued",
  "status_url": "/v1/jobs/job_01HZX7Q",
  "estimated_duration_seconds": 90
}

The Prefer: respond-async header (RFC 7240) gives clients an explicit way to opt into async behavior on tools that can run either way. Idempotency keys prevent duplicate submissions when an agent retries the kickoff.
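As a rough illustration of the kickoff contract, the following Python sketch builds the 202 response shown in the example. The function name build_kickoff_response and the dictionary layout are assumptions made for illustration; they are not part of any framework or of this specification.

```python
def build_kickoff_response(job_id: str, estimated_duration_seconds: int) -> dict:
    """Describe a 202 Accepted kickoff response for a newly queued job.

    Hypothetical helper: the response is modeled as a plain dict so the
    sketch stays independent of any web framework.
    """
    status_url = f"/v1/jobs/{job_id}"
    return {
        "status_code": 202,  # Accepted: work is queued, not finished
        "headers": {
            "Location": status_url,  # points at the status resource
            "Content-Type": "application/json",
        },
        "body": {
            "job_id": job_id,
            "status": "queued",
            "status_url": status_url,
            "estimated_duration_seconds": estimated_duration_seconds,
        },
    }


response = build_kickoff_response("job_01HZX7Q", 90)
print(response["status_code"], response["headers"]["Location"])
```

The point of the sketch is that the job identifier, the Location header, and status_url all derive from one value, so they cannot drift apart.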

2. Status states

A tool MUST expose a finite, documented state machine. The recommended minimum set:

  • queued — accepted but not yet started.
  • running — actively processing.
  • succeeded — completed; result available.
  • failed — terminated with an error; error object MUST be present.
  • canceled — terminated by cancellation; partial results MAY be present.
  • timed_out — exceeded the documented timeout SLA.

States MUST be enumerated in the tool's schema so agents can branch deterministically. Avoid free-form status strings; agents cannot reliably interpret "almost done" or "processing your request".
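One way to make the state machine concrete is an enumeration with an explicit terminal set. The sketch below uses the six recommended states; the names JobStatus, TERMINAL_STATES, and is_terminal are illustrative assumptions.

```python
from enum import Enum


class JobStatus(str, Enum):
    """The recommended minimum status set from this section."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"
    TIMED_OUT = "timed_out"


# States after which the job will never change again.
TERMINAL_STATES = frozenset({
    JobStatus.SUCCEEDED,
    JobStatus.FAILED,
    JobStatus.CANCELED,
    JobStatus.TIMED_OUT,
})


def is_terminal(status: str) -> bool:
    """Return True if the status is terminal.

    Raises ValueError for any string outside the enum, so a free-form
    status such as "almost done" is rejected rather than guessed at.
    """
    return JobStatus(status) in TERMINAL_STATES


print(is_terminal("running"), is_terminal("succeeded"))
```

Rejecting unknown strings at the boundary is a design choice in this sketch: an agent that fails loudly on an undocumented state is easier to debug than one that keeps polling forever.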

3. Progress retrieval: polling vs streaming

Tool builders MUST document at least one of:

Polling. A GET /jobs/{id} endpoint that returns the current state, a progress percentage if known, and a server-suggested poll interval via Retry-After (RFC 7231 §7.1.3). Agents that respect Retry-After avoid hammering the gateway.
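A client-side polling loop that honors Retry-After might look like the sketch below. It assumes a caller-supplied fetch_status function (hypothetical) that performs the GET and returns the response headers and the parsed JSON body, and it handles only the delta-seconds form of Retry-After, not the HTTP-date form.

```python
import time
from typing import Callable, Mapping, Tuple

TERMINAL_STATES = {"succeeded", "failed", "canceled", "timed_out"}


def poll_until_terminal(
    fetch_status: Callable[[], Tuple[Mapping[str, str], dict]],
    budget_seconds: float,
    default_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a status resource until the job reaches a terminal state.

    fetch_status is a hypothetical callable returning (headers, body).
    Raises TimeoutError when the next wait would exceed the client budget.
    """
    deadline = clock() + budget_seconds
    while True:
        headers, body = fetch_status()
        if body["status"] in TERMINAL_STATES:
            return body
        # Use the server-suggested interval when it is given in seconds.
        raw = headers.get("Retry-After", "")
        interval = float(raw) if raw.isdigit() else default_interval
        if clock() + interval > deadline:
            raise TimeoutError("client budget exhausted before a terminal state")
        sleep(interval)
```

The sleep and clock parameters are injected only so the loop can be exercised without real waiting; in production the defaults apply.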

Server-Sent Events (SSE). A GET /jobs/{id}/events endpoint that streams text/event-stream updates. SSE is appropriate when intermediate output is useful to the agent (token streaming, partial results, log tailing). The MCP working group has documented async operation patterns that include resource subscriptions for progress notifications, allowing hosts to disconnect and reconnect while an operation runs.
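For readers unfamiliar with the text/event-stream wire format, here is a minimal parser sketch. It handles only the event and data fields plus comment lines, and ignores id and retry; the event name "progress" in the usage lines is a made-up example, not something this specification defines.

```python
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of text/event-stream lines.

    Simplified sketch: supports "event:" and "data:" fields and skips
    comment lines (those starting with ":"). A blank line ends an event.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # dispatch only when some data was collected
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is stripped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


stream = ["event: progress", 'data: {"percent": 40}', "", ": keep-alive", "data: done", ""]
for evt in parse_sse(stream):
    print(evt["event"], evt["data"])
```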

Choose polling when the agent can tolerate latency in receiving completion. Choose SSE when intermediate state changes the agent's plan. Document both if both are supported.

4. Cancellation

A DELETE /jobs/{id} or POST /jobs/{id}:cancel endpoint MUST exist for any tool whose work consumes billable resources, holds locks, or produces external side-effects. The response semantics:

  • 202 Accepted if cancellation has been initiated but the job is not yet in a terminal state.
  • 200 OK if the job is now canceled, including a repeated request against a job that was already canceled.
  • 409 Conflict if the job is already terminal (succeeded, failed, timed_out).

Cancellation MUST be idempotent. Agents will issue cancellations on user interrupt, plan revision, or budget exhaustion.
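The response codes above can be sketched as a small server-side decision function. The function name cancel, the can_stop_immediately flag, and the cancel_requested field are assumptions for illustration; a real server would also signal the worker.

```python
TERMINAL_NOT_CANCELED = {"succeeded", "failed", "timed_out"}


def cancel(job: dict, can_stop_immediately: bool) -> int:
    """Apply a cancel request to a job dict and return the HTTP status code.

    Hypothetical sketch: repeated calls return the same code for the same
    job state, which is what makes the operation idempotent.
    """
    status = job["status"]
    if status == "canceled":
        return 200  # repeat cancel on an already canceled job
    if status in TERMINAL_NOT_CANCELED:
        return 409  # nothing left to cancel
    if can_stop_immediately:
        job["status"] = "canceled"
        return 200
    job["cancel_requested"] = True  # worker will stop at its next checkpoint
    return 202


job = {"status": "running"}
print(cancel(job, can_stop_immediately=False), cancel(job, can_stop_immediately=False))
```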

5. Timeout SLA

Every long-running tool MUST publish:

  • A maximum job duration after which the server transitions the job to timed_out.
  • A guidance value for the agent's own client-side budget (typically 1.5-2× the server SLA).
  • The cleanup behavior on timeout (rollback, partial commit, no-op).

Without a published timeout, agents either give up too early (wasting completed work) or wait too long (blocking downstream steps).
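The three published values can be captured in a small record, from which the agent derives its own budget. This is a sketch under assumptions: the class name TimeoutPolicy, its field names, and the cleanup labels are illustrative, and the 300-second SLA is an invented figure.

```python
from dataclasses import dataclass

# Cleanup behaviors named in this section, spelled as identifiers.
CLEANUP_BEHAVIORS = ("rollback", "partial_commit", "no_op")


@dataclass(frozen=True)
class TimeoutPolicy:
    """Hypothetical record of a tool's published timeout contract."""
    max_duration_seconds: float      # server moves the job to timed_out after this
    client_budget_multiplier: float  # guidance value, typically 1.5 to 2
    cleanup: str                     # what the server does on timeout

    def __post_init__(self) -> None:
        if self.cleanup not in CLEANUP_BEHAVIORS:
            raise ValueError(f"unknown cleanup behavior: {self.cleanup}")

    @property
    def client_budget_seconds(self) -> float:
        """How long the agent should wait before giving up on its side."""
        return self.max_duration_seconds * self.client_budget_multiplier


policy = TimeoutPolicy(max_duration_seconds=300, client_budget_multiplier=1.5, cleanup="rollback")
print(policy.client_budget_seconds)  # 450.0
```

With a 300-second server SLA and a multiplier of 1.5, the agent waits up to 450 seconds, which leaves room for the server to record the timed_out transition before the agent abandons the job.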

6. Idempotency

Kickoff endpoints MUST honor an Idempotency-Key header (Stripe convention; IETF draft draft-ietf-httpapi-idempotency-key-header). Duplicate submissions with the same key MUST return the original job_id, not create a new job.
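The deduplication rule can be illustrated with an in-memory store. The class JobStore and its job-ID format are assumptions; a production service would need durable storage and key expiry, and this sketch does not address a reused key that arrives with a different payload.

```python
import itertools
from typing import Dict


class JobStore:
    """Hypothetical in-memory map from idempotency key to job ID."""

    def __init__(self) -> None:
        self._job_by_key: Dict[str, str] = {}
        self._counter = itertools.count(1)

    def kickoff(self, idempotency_key: str, payload: dict) -> str:
        """Return the job ID for this key, creating a job only the first time."""
        existing = self._job_by_key.get(idempotency_key)
        if existing is not None:
            return existing  # duplicate submission: same job, no new work
        job_id = f"job_{next(self._counter):04d}"
        self._job_by_key[idempotency_key] = job_id
        # A real service would enqueue `payload` for processing here.
        return job_id


store = JobStore()
first = store.kickoff("4a8c1f7e", {"report_type": "quarterly"})
retry = store.kickoff("4a8c1f7e", {"report_type": "quarterly"})
print(first == retry)  # True
```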

7. Result retrieval

Final results MUST be retrievable from the status resource for at least 24 hours after completion. Large results (over 1 MB) SHOULD be returned via a separate GET /jobs/{id}/result endpoint or a signed URL. The retention window matters because agents often fetch results late: asynchronously, after summarization steps, or after the user resumes a session.
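A sketch of how a server might apply the size and retention rules follows. It assumes 1 MB means 1,000,000 bytes of serialized JSON and uses a result_url field name that is not defined by this specification; both are illustrative choices.

```python
import json

MAX_INLINE_BYTES = 1_000_000       # assumption: 1 MB taken as decimal megabyte
RETENTION_SECONDS = 24 * 60 * 60   # minimum retention window; servers may keep longer


def build_success_body(job_id: str, result: dict) -> dict:
    """Inline small results; point to a separate result endpoint for large ones."""
    body = {"job_id": job_id, "status": "succeeded"}
    encoded = json.dumps(result).encode("utf-8")
    if len(encoded) > MAX_INLINE_BYTES:
        body["result_url"] = f"/jobs/{job_id}/result"  # hypothetical field name
    else:
        body["result"] = result
    return body


def is_still_retrievable(completed_at: float, now: float) -> bool:
    """True while the result is inside the minimum retention window."""
    return now - completed_at <= RETENTION_SECONDS


print(build_success_body("job_1", {"rows": 3}))
```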

Documentation requirements

A long-running tool's published documentation MUST include:

  1. Operation name and description in the tool schema.
  2. Kickoff parameters with types and required or optional flags.
  3. Status state enum with terminal vs non-terminal markers.
  4. Polling and/or SSE endpoint with example payloads.
  5. Cancellation semantics including side-effect behavior.
  6. Timeout SLA in seconds.
  7. Idempotency-key contract.
  8. Error taxonomy with retryable vs non-retryable classification.
  9. Rate limits for both kickoff and status endpoints.
  10. Example trace showing kickoff → poll → result.

Each item MUST appear in the tool description string the model reads, not only in the human-facing API reference. AI agents do not have an out-of-band channel to discover semantics.
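The checklist lends itself to an automated lint. The sketch below checks a tool descriptor for the ten items; every key name in REQUIRED_SECTIONS is a hypothetical mapping of the list above onto a dictionary, not a schema defined by MCP or any vendor.

```python
from typing import List

# Hypothetical key names, in the order of the checklist above.
# Item 1 (name and description) is split into two keys.
REQUIRED_SECTIONS = (
    "name",
    "description",
    "parameters",
    "status_states",
    "progress_endpoint",
    "cancellation",
    "timeout_sla_seconds",
    "idempotency",
    "errors",
    "rate_limits",
    "example_trace",
)


def missing_sections(descriptor: dict) -> List[str]:
    """Return the required keys that are absent or empty in a tool descriptor."""
    return [key for key in REQUIRED_SECTIONS if not descriptor.get(key)]


draft = {
    "name": "generate_report",
    "description": "Kicks off a quarterly report job.",
    "parameters": {"report_type": "string", "period": "string"},
    "status_states": ["queued", "running", "succeeded", "failed", "canceled", "timed_out"],
    "timeout_sla_seconds": 300,
}
print(missing_sections(draft))
```

Running a check like this in CI is one way to keep the model-facing description from drifting behind the human-facing reference.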

Common mistakes

  • Returning 200 OK from the kickoff with the result still pending. Agents treat 200 as success and never poll.
  • Free-form status strings. "Working on it" cannot be branched on.
  • Missing Retry-After. Agents fall back to their default poll rate, which is either too fast (triggering rate limits) or too slow (adding latency before completion is noticed).
  • Cancellation not documented. Agents cannot release resources on user interrupt.
  • Hidden timeout. Server kills the job at 5 minutes, but the documentation says nothing; agents wait indefinitely.
  • No idempotency key. Network retries cause duplicate writes.

FAQ

Q: Should every tool be async?

No. Tools that reliably complete in under 5 seconds and have idempotent semantics should remain synchronous. Async overhead is real: an extra round-trip for kickoff, polling load, state management. Reserve async for operations that genuinely require it.

Q: Polling or SSE — which should I default to?

Default to polling. It works with every HTTP toolchain, survives proxy timeouts, and is easier for agents to integrate. Add SSE when intermediate output materially changes the agent's behavior — token streaming, log tailing, multi-step pipelines where the agent can short-circuit on early signals.

Q: How do I document a tool with both sync and async modes?

Use the Prefer: respond-async header (RFC 7240). Document the default mode, the trigger header, and how the response shape changes. Agents reading the schema can then choose mode based on the user's plan budget.
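On the server side, choosing the mode comes down to checking the Prefer header for the respond-async token. The sketch below is a simplified parser: it assumes a comma-separated list of preferences, ignores quoted strings, and compares tokens case-insensitively; the function name wants_async is illustrative.

```python
from typing import Optional


def wants_async(prefer_header: Optional[str]) -> bool:
    """Return True if the Prefer header value includes respond-async.

    Simplified sketch: splits preferences on commas, drops any
    ";parameter" or "=value" suffix, and compares case-insensitively.
    """
    if not prefer_header:
        return False
    for preference in prefer_header.split(","):
        token = preference.split(";", 1)[0].split("=", 1)[0].strip().lower()
        if token == "respond-async":
            return True
    return False


print(wants_async("respond-async, wait=10"), wants_async("return=minimal"))
```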

Q: What happens if the agent loses the job ID?

The tool SHOULD provide a GET /jobs?...filter endpoint that lets the agent recover the job by user, session, or idempotency key. Without a recovery path, jobs whose IDs are lost become orphaned billable work.
