Geodocs.dev

Agent Streaming Output Documentation Spec: Events, Errors, Partial State


This specification defines how AI agent platforms should document streaming output so any consumer — UI, downstream agent, or evaluation harness — can deterministically interpret event types, partial state, error frames, and resume semantics. It maps to OpenAI Agents SDK, Anthropic Messages API, Vercel AI SDK, and AG-UI without prescribing a single transport.

TL;DR

Agent streaming output is more than tokens. A complete spec documents (1) a named event taxonomy spanning lifecycle, text, tool calls, state, and errors; (2) the shape of partial state at every emission; (3) error frames distinct from transport errors; and (4) cancellation, resume, and replay semantics. Without all four, consumers cannot build reliable UIs, multi-agent handoffs, or trajectory-replay evaluators.

Why streaming output needs a documentation contract

A streamed agent run is a sequence of typed events — not a string. UIs render token-by-token text, tool-call status, and progress; downstream agents consume tool_called and handoff_requested events to coordinate; evaluation harnesses replay the trajectory to score behavior. When the stream contract is implicit, every consumer reverse-engineers it, and changes silently break clients.

Modern agent runtimes already expose distinct event taxonomies. The OpenAI Agents SDK exposes RawResponseStreamEvent, RunItemStreamEvent (with names like tool_called, tool_output, message_output_created, handoff_requested, mcp_approval_requested), and AgentUpdatedStreamEvent. Anthropic's Messages API streams message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop, and ping. The AG-UI protocol normalizes streams across backends with five event categories — lifecycle, text, tool, state, and special — across roughly seventeen named events. Vercel's AI SDK exposes a fullStream of typed TextStreamPart parts including text deltas, tool calls, tool results, and errors. Each is correct for its runtime; documentation is what lets a consumer move between them.

1. Event taxonomy (REQUIRED)

Every streaming spec MUST publish a closed taxonomy of event names grouped into the following categories. Mark optional categories explicitly.

| Category  | Purpose                                            | Example names                                                      |
|-----------|----------------------------------------------------|--------------------------------------------------------------------|
| Lifecycle | Mark run boundaries and state transitions          | run_started, run_completed, run_failed, agent_updated              |
| Text      | Stream assistant text deltas and message boundaries | message_started, text_delta, message_completed                    |
| Tool      | Surface tool invocation, arguments, and results    | tool_called, tool_arguments_delta, tool_output, tool_failed        |
| State     | Synchronize shared agent or workflow state         | state_snapshot, state_delta, checkpoint_saved                      |
| Control   | Approvals, handoffs, cancellations                 | approval_requested, approval_resolved, handoff_requested, cancelled |
| Error     | Recoverable and terminal error frames              | recoverable_error, terminal_error                                  |

Documentation MUST list, for each event:

  • The exact event name and category.
  • Whether it is emitted at-most-once, exactly-once, or zero-or-more times per run.
  • The full payload schema (JSON Schema or equivalent) and required vs optional fields.
  • Ordering guarantees relative to other events (for example, tool_output always follows the matching tool_called).
  • Whether it carries a stable correlation id (run_id, message_id, tool_call_id, block_index).
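The per-event requirements above lend themselves to a machine-readable registry. Here is a minimal Python sketch of one such entry for a hypothetical tool_called event; every name and field (tool_called, run_id, tool_call_id, the cardinality labels) is illustrative, not any vendor's actual schema:

```python
# One entry in a hypothetical event registry. The payload schema is expressed
# as JSON Schema so consumers and evaluation harnesses can validate frames.
TOOL_CALLED_SPEC = {
    "name": "tool_called",
    "category": "tool",
    "cardinality": "zero-or-more",           # may fire many times per run
    "payload_schema": {
        "type": "object",
        "required": ["run_id", "tool_call_id", "tool_name"],
        "properties": {
            "run_id": {"type": "string"},
            "tool_call_id": {"type": "string"},
            "tool_name": {"type": "string"},
            "arguments": {"type": "object"},  # optional: may stream later as deltas
        },
    },
    "ordering": "precedes the tool_output with the same tool_call_id",
    "correlation_ids": ["run_id", "tool_call_id"],
}
```

Publishing the registry as data, not prose, lets clients generate validators and lets the compatibility matrix in section 6 be checked mechanically.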

2. Partial state shape (REQUIRED)

Streaming emits incomplete data. The spec MUST define how a consumer reconstructs the in-progress state at every event boundary.

  • Token deltas. Describe how to concatenate text_delta.value into the current assistant message; specify whitespace and newline preservation rules.
  • Tool argument deltas. When tool arguments stream as JSON fragments, document whether deltas are byte-level, token-level, or field-level, and whether partial JSON is guaranteed parsable at any boundary.
  • Content blocks. When blocks are indexed (Anthropic's content_block_* events use a stable index), document how interleaved blocks (text, tool_use, thinking) are reconstructed.
  • Cumulative vs incremental fields. Explicitly mark which fields are cumulative (Anthropic's usage.output_tokens in message_delta is cumulative) and which are incremental.
  • Partial-state schema. Publish a PartialAgentState type so consumers can render an in-progress view without guessing.
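The reconstruction rules above can be sketched as a small accumulator. This is a hedged Python example assuming hypothetical text_delta and tool_arguments_delta events whose value fields carry raw fragments; real runtimes use their own event and field names:

```python
import json

class PartialAgentState:
    """Reconstructs in-progress state from a stream of events (hypothetical names)."""

    def __init__(self):
        self.text = ""       # concatenated assistant text so far
        self.tool_args = {}  # tool_call_id -> accumulated argument fragment

    def apply(self, event):
        kind = event["type"]
        if kind == "text_delta":
            # Deltas are appended verbatim; whitespace and newlines preserved.
            self.text += event["value"]
        elif kind == "tool_arguments_delta":
            # Fragments are NOT guaranteed to parse until the final delta arrives.
            buf = self.tool_args.get(event["tool_call_id"], "") + event["value"]
            self.tool_args[event["tool_call_id"]] = buf

    def tool_arguments(self, tool_call_id):
        # Only call once the stream marks the arguments complete.
        return json.loads(self.tool_args[tool_call_id])

state = PartialAgentState()
for ev in [
    {"type": "text_delta", "value": "Checking "},
    {"type": "text_delta", "value": "the weather."},
    {"type": "tool_arguments_delta", "tool_call_id": "t1", "value": '{"city": '},
    {"type": "tool_arguments_delta", "tool_call_id": "t1", "value": '"Paris"}'},
]:
    state.apply(ev)
# state.text == "Checking the weather."
# state.tool_arguments("t1") == {"city": "Paris"}
```

The spec's job is to make this accumulator writable without guessing: each rule in the list above corresponds to one branch of apply.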

3. Error frames (REQUIRED)

Errors that occur inside a stream are not the same as transport errors. The spec MUST distinguish:

  • Transport errors — connection drop, HTTP 5xx before any event. Document the HTTP semantics and reconnection policy.
  • Stream-level errors — emitted as a typed event (for example, error parts in Vercel AI SDK's fullStream). Document whether the stream terminates after the event or continues with degraded output.
  • Tool-level errors — tool_failed or tool_output with is_error: true. Document whether the agent retries internally and whether retries are visible to the consumer.
  • Recoverable vs terminal — every error event MUST carry a boolean or enum that tells the consumer whether to expect more events.
  • Error code stability — publish a closed list of error codes and their HTTP analog. Avoid leaking provider-specific codes through unmodified.
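A consumer's dispatch on the recoverable/terminal marker can be sketched in a few lines. The field names (recoverable, code) are assumptions for illustration, not a specific vendor's error schema:

```python
# Sketch of a consumer reacting to an in-stream error frame. A recoverable
# frame means more events may follow; a terminal frame ends the run.
def handle_error_frame(frame, reconnect, fail_run):
    if frame.get("recoverable", False):
        reconnect(frame["code"])   # keep the stream open; expect more events
        return "continue"
    fail_run(frame["code"])        # terminal: no further events will arrive
    return "stop"

outcome = handle_error_frame(
    {"recoverable": True, "code": "rate_limited"},
    reconnect=lambda code: None,
    fail_run=lambda code: None,
)
# outcome == "continue"
```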

4. Cancellation, resume, and replay (REQUIRED)

Long-running agents disconnect, get cancelled, and need to be replayed. The spec MUST document:

  • Cancellation semantics. How a consumer cancels (close the iterator, send a control message, abort the request) and which events are guaranteed to be emitted before the stream closes. The OpenAI Agents SDK guarantees the stream is not complete until the iterator finishes, so post-processing events arrive after the last visible token.
  • Resume semantics. Whether a stream is resumable by run_id, what window of events is replayable, and whether replayed events are marked (for example, replayed: true).
  • Checkpoint contracts. When the runtime saves checkpoints (Microsoft Agent Framework exposes on_checkpoint_save / on_checkpoint_restore), document what state is captured and what is reconstructed from the event log.
  • Idempotency. Every emitted event SHOULD carry an event_id that lets a consumer deduplicate after a reconnect.
  • Heartbeats. Document the keep-alive mechanism (Anthropic uses ping events) and the maximum acceptable silence before a consumer should reconnect.
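The idempotency and heartbeat requirements combine naturally in one consumer-side loop. A minimal sketch, assuming every event carries an event_id and heartbeats arrive as ping events (both hypothetical field names; the 30-second timeout is an arbitrary illustration):

```python
import time

class StreamConsumer:
    """Deduplicates replayed events after a reconnect and tracks heartbeat silence."""

    HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before reconnecting (assumed)

    def __init__(self):
        self.seen = set()
        self.last_event_at = time.monotonic()

    def on_event(self, event):
        self.last_event_at = time.monotonic()
        if event["type"] == "ping":
            return None                   # heartbeat only resets the silence timer
        if event["event_id"] in self.seen:
            return None                   # duplicate replayed from the buffer
        self.seen.add(event["event_id"])
        return event                      # first delivery: hand to the application

    def should_reconnect(self):
        return time.monotonic() - self.last_event_at > self.HEARTBEAT_TIMEOUT

consumer = StreamConsumer()
first = consumer.on_event({"type": "text_delta", "event_id": "e1", "value": "Hi"})
dup = consumer.on_event({"type": "text_delta", "event_id": "e1", "value": "Hi"})
# first is the event; dup is None
```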

5. Transport, ordering, and backpressure (REQUIRED)

Most current agent runtimes deliver streams over Server-Sent Events, because SSE runs on standard HTTP and gets automatic reconnection from the browser-native EventSource API. The spec MUST document:

  • Transport. SSE, chunked transfer, WebSocket, or framework-specific iterator.
  • Wire format. JSON-per-event, named SSE events, or framed binary.
  • Ordering guarantee. Whether events from a single run are strictly ordered (almost always yes) and whether parallel branches (for example, parallel tool calls) interleave.
  • Backpressure. How a slow consumer affects the producer; whether the runtime buffers, drops, or applies flow control.
  • Maximum payload size. Per-event limits and chunking rules for large tool outputs.
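For the common JSON-per-event SSE wire format, the parsing contract is small enough to state in code. A simplified Python sketch; a production parser must also handle multi-line data fields, comment lines, and id: lines per the SSE specification:

```python
import json

def parse_sse(raw: str):
    """Parse a minimal SSE body of `event:`/`data:` lines into (name, payload) pairs."""
    events, name, data = [], None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:
            # A blank line terminates one SSE event; decode its JSON payload.
            events.append((name, json.loads("\n".join(data))))
            name, data = [], []
            name = None
    return events

raw = (
    "event: text_delta\n"
    'data: {"value": "Hi"}\n'
    "\n"
    "event: run_completed\n"
    'data: {"run_id": "r1"}\n'
    "\n"
)
# parse_sse(raw) == [("text_delta", {"value": "Hi"}), ("run_completed", {"run_id": "r1"})]
```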

6. Versioning and compatibility

Document a stream_protocol_version field on every event or on the run_started event. Treat additions of new event types as minor; renaming or repurposing an event is a major change. Provide a deprecation window of at least one major release before removing an event, and publish a compatibility matrix listing which event names are supported per version.
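The versioning rule above implies a concrete client-side check: same major version is compatible, and unknown event names within a compatible stream are additive, so a client skips rather than fails. A hedged sketch with assumed field names (stream_protocol_version on run_started):

```python
def compatible(client_major: int, stream_version: str) -> bool:
    # Minor additions (new event types) are safe; a major bump is not.
    return int(stream_version.split(".")[0]) == client_major

def handle(event, known_events, client_major):
    if event["type"] == "run_started":
        if not compatible(client_major, event["stream_protocol_version"]):
            raise RuntimeError("incompatible stream protocol major version")
    elif event["type"] not in known_events:
        return None  # unknown event within the same major: skip, don't fail
    return event

ok = handle({"type": "run_started", "stream_protocol_version": "1.2"},
            {"run_started", "text_delta"}, client_major=1)
unknown = handle({"type": "shiny_new_event"},
                 {"run_started", "text_delta"}, client_major=1)
# ok is the event; unknown is None
```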

Documentation checklist

Use this list when shipping a new streaming agent runtime or updating an existing one:

  • [ ] Closed event taxonomy with category, cardinality, payload schema, and ordering for each event.
  • [ ] Stable correlation ids (run_id, message_id, tool_call_id, block_index).
  • [ ] Partial-state reconstruction algorithm with worked examples.
  • [ ] Cumulative vs incremental field annotations on every numeric field.
  • [ ] Error frame taxonomy distinct from transport errors, with recoverable/terminal markers.
  • [ ] Cancellation semantics including which events are emitted after cancel.
  • [ ] Resume and replay semantics with event-id idempotency.
  • [ ] Heartbeat / keep-alive event documented.
  • [ ] Transport, wire format, ordering, and backpressure documented.
  • [ ] Versioning rule and deprecation policy.
  • [ ] End-to-end example showing every event type in a single run.

Mapping to existing runtimes

| Spec requirement | OpenAI Agents SDK                                        | Anthropic Messages          | Vercel AI SDK            | AG-UI                   |
|------------------|----------------------------------------------------------|-----------------------------|--------------------------|-------------------------|
| Lifecycle events | AgentUpdatedStreamEvent, run completion via iterator end | message_start, message_stop | start, finish parts      | Lifecycle category      |
| Text deltas      | RawResponseStreamEvent                                   | content_block_delta (text)  | text part                | Text Message Events     |
| Tool events      | tool_called, tool_output RunItemStreamEvent              | content_block_* with tool_use | tool-call, tool-result | Tool Call Events        |
| State sync       | Run items, history compaction                            | Not native                  | Custom data parts        | State Management Events |
| Errors           | Surfaced via exceptions and run state                    | error SSE event             | error part on fullStream | Lifecycle RUN_ERROR     |
| Resume           | Iterator-bound; no built-in replay                       | None native                 | None native              | Spec-defined            |

The takeaway: every popular runtime covers most categories, but none documents all six requirements above. A geodocs-grade spec is the union — and the layer that lets a single client adapter target any of them.

FAQ

Q: Is a streaming spec different from an SSE protocol?

Yes. SSE is a transport — it defines how named events are delivered over HTTP. A streaming spec defines the event semantics: which events exist, what payload they carry, how partial state is reconstructed, and how errors and resume work. Most agent runtimes use SSE as the transport but each defines its own event semantics on top.

Q: Do I need separate event types for tokens and tool calls?

Yes. Tokens are textual deltas appended to the current assistant message; tool calls are typed objects with arguments, results, and ids. Conflating them forces consumers to parse text to detect tool calls — which is exactly what fine-grained tool streaming standards exist to avoid.

Q: Should I emit cumulative or incremental fields?

Both, but mark each one. Cumulative fields (running token counts, full message text) are easier for late-joining consumers; incremental fields (deltas) are smaller and lower latency. Anthropic, for example, documents usage in message_delta as cumulative — the spec must say so explicitly so clients do not double-count.
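The double-counting trap is worth one concrete illustration: a cumulative field must be assigned, never summed. A sketch assuming Anthropic-style cumulative usage.output_tokens on each message_delta (field names per the source text; the helper itself is illustrative):

```python
def track_usage(events):
    output_tokens = 0
    for ev in events:
        if ev["type"] == "message_delta":
            # WRONG: output_tokens += ev["usage"]["output_tokens"]  # double-counts
            output_tokens = ev["usage"]["output_tokens"]  # cumulative: overwrite
    return output_tokens

deltas = [
    {"type": "message_delta", "usage": {"output_tokens": 12}},
    {"type": "message_delta", "usage": {"output_tokens": 47}},
]
# track_usage(deltas) == 47, not 59
```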

Q: How do I document resume after disconnect?

Publish (1) an event_id on every event, (2) a window — typically minutes — during which the runtime keeps the event log, (3) the resume endpoint and the last_event_id parameter, and (4) a flag (e.g. replayed: true) on events emitted from the buffer.

Q: What is the minimum viable streaming spec for a v1 agent?

Lifecycle (run_started, run_completed, run_failed), text (text_delta), tool (tool_called, tool_output), and a single error frame with a recoverable/terminal flag. Add state, approvals, and resume in v2, once you have shipped consumers using v1 and have real telemetry showing which events they actually depend on.
