Agent Observability Documentation Checklist: Tracing, Logs, and Trajectory Replay for Production AI Agents

Production-ready AI agents need three layers of observability evidence in their docs: trace spans aligned with OpenTelemetry GenAI conventions, structured logs that capture every model and tool call, and trajectory replay artifacts that let reviewers reproduce any past run. This 30-point checklist defines the minimum documentation contract before an agent ships.

TL;DR

Agent observability documentation must cover three things: (1) a trace span schema based on the OpenTelemetry GenAI semantic conventions, (2) a structured log contract for every model and tool call with shared conversation and run identifiers, and (3) a replayable trajectory artifact stored per run. If any of the 30 checklist items below is missing or partial, the agent is not yet production-ready.

Why agent observability documentation needs its own checklist

Agent runtimes are non-deterministic, fan out across sub-agents, tools, and external APIs, and fail in ways that only surface late in long workflows. Generic SRE runbooks miss the agent-specific signals: tool selection, reasoning steps, handoffs, and trajectories. The OpenTelemetry community has now standardized GenAI spans, attributes, and events for exactly this gap, and major vendors — Datadog, Dynatrace, Microsoft Foundry, SigNoz — consume those conventions natively.

Most teams already have monitoring dashboards. What they lack is a written contract: the documentation that tells the next on-call engineer, auditor, or LLM reader what each span and log field means and how to replay a run from cold storage. This checklist is that contract. It is intentionally documentation-first; tooling choices come second.

For broader context, see the AI Agents hub and the related Agent Evaluation Harness reference.

How to use this checklist

Treat each item as a documentation deliverable, not an implementation task. The artifact is a docs section, schema file, or example payload that lives next to the agent spec. Mark every item as present, partial, or missing. Anything below "present" must have a tracked follow-up before the agent ships to production.

The checklist groups 30 items into three sections — Tracing (A1-A12), Structured logs (B13-B22), and Trajectory replay (C23-C30) — followed by cross-cutting requirements that apply to all three.

Section A — Trace span schema (12 items)

  • [ ] A1. Root span per agent run. Document the root gen_ai.invoke_agent span name, its required attributes, and its lifecycle (one span per user-visible request); a minimal span sketch follows this list.
  • [ ] A2. OpenTelemetry GenAI conventions cited. The schema explicitly references the OpenTelemetry GenAI semantic conventions as the source of truth for span and attribute names; provider-specific extensions are listed separately.
  • [ ] A3. Operation name attribute. Every span sets gen_ai.operation.name to one of the documented values (chat, embeddings, invoke_agent, execute_tool, create_agent).
  • [ ] A4. Provider name attribute. gen_ai.provider.name is set per call (for example openai, anthropic, aws.bedrock) and documented as the discriminator for provider-specific attribute sets.
  • [ ] A5. Model and request attributes. Required attributes include gen_ai.request.model, gen_ai.request.temperature, and gen_ai.request.max_tokens, with default-null behaviour spelled out.
  • [ ] A6. Token and cost attributes. Spans record gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and a derived cost attribute, with the cost formula and currency documented.
  • [ ] A7. Tool call spans. Each tool invocation is its own child span (gen_ai.execute_tool) with gen_ai.tool.name, an arguments hash, and a result status enum.
  • [ ] A8. Sub-agent spans. Multi-agent handoffs emit nested spans with gen_ai.agent.name and a documented parent/child relationship for fan-out workflows.
  • [ ] A9. Error type attribute. Failed spans set error.type to a low-cardinality string from a documented enum: provider error, timeout, guardrail violation, tool error, schema violation.
  • [ ] A10. Conversation and run IDs. All spans carry gen_ai.conversation.id and gen_ai.run.id so traces can be filtered to a single user journey across long sessions.
  • [ ] A11. Content recording mode. The docs declare whether prompts and outputs are recorded inline (gen_ai.input.messages, gen_ai.output.messages) or stored externally with reference URIs, and the privacy rationale for the choice.
  • [ ] A12. Sampling and retention. Sampling rate, retention window, and PII redaction rules are written next to the span schema, not buried in a separate runbook.
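
To make the span contract concrete, here is a minimal sketch using the OpenTelemetry Python SDK. Span and attribute names follow the items above; the agent name, identifiers, token counts, and the derived gen_ai.usage.cost_usd attribute are illustrative assumptions (and gen_ai.run.id is this checklist's own field) rather than values prescribed by the upstream conventions.

    from opentelemetry import trace

    tracer = trace.get_tracer("support-agent")

    # A1: one root span per user-visible request.
    with tracer.start_as_current_span("invoke_agent support_agent") as run_span:
        run_span.set_attribute("gen_ai.operation.name", "invoke_agent")   # A3
        run_span.set_attribute("gen_ai.provider.name", "openai")          # A4
        run_span.set_attribute("gen_ai.agent.name", "support_agent")      # A8
        run_span.set_attribute("gen_ai.conversation.id", "conv-8842")     # A10
        run_span.set_attribute("gen_ai.run.id", "run-2024-0001")          # A10 (spec-local field)

        # A5/A6: child span for a single model call.
        with tracer.start_as_current_span("chat gpt-4o") as llm_span:
            llm_span.set_attribute("gen_ai.operation.name", "chat")
            llm_span.set_attribute("gen_ai.request.model", "gpt-4o")
            llm_span.set_attribute("gen_ai.request.temperature", 0.2)
            llm_span.set_attribute("gen_ai.request.max_tokens", 1024)
            llm_span.set_attribute("gen_ai.usage.input_tokens", 412)
            llm_span.set_attribute("gen_ai.usage.output_tokens", 96)
            # A6: derived cost attribute; name, formula, and currency are this spec's own.
            llm_span.set_attribute("gen_ai.usage.cost_usd", 0.0031)

        # A7/A9: child span for a tool call; error.type comes from the closed enum.
        with tracer.start_as_current_span("execute_tool search_orders") as tool_span:
            tool_span.set_attribute("gen_ai.operation.name", "execute_tool")
            tool_span.set_attribute("gen_ai.tool.name", "search_orders")
            tool_span.set_attribute("error.type", "timeout")  # set only when the call fails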

Section B — Structured log contract (10 items)

  • [ ] B13. JSON-only log format. All agent logs are emitted as single-line JSON; the docs explicitly reject free-text logs because they cannot be joined to traces. A conforming record is sketched after this list.
  • [ ] B14. Required correlation IDs. Every log line carries trace_id, span_id, run_id, and conversation_id — same names as the trace schema, no aliases.
  • [ ] B15. Event taxonomy. A closed enum of event values is documented (agent.started, agent.tool_selected, agent.tool_failed, agent.guardrail_blocked, agent.run_completed, agent.handoff).
  • [ ] B16. Decision logs. Every reasoning step that selects a tool, sub-agent, or branch emits a decision log with decision.input_summary and decision.rationale.
  • [ ] B17. Tool I/O logs. Tool invocations log a request/response pair with truncated payloads and a hash of the full payload that points back to the replay store.
  • [ ] B18. Guardrail and policy logs. Blocked or rewritten outputs produce a log with policy.id, policy.version, and policy.action so audits can reconstruct enforcement decisions.
  • [ ] B19. Cost and quota logs. Token usage, dollar cost, and rate-limit headers are logged on every model call, not only sampled ones.
  • [ ] B20. PII and secret redaction. A documented redactor runs before emit; the docs list which fields are redacted by default, the regex set used, and how to opt out for debugging.
  • [ ] B21. Log levels mapped to severity. debug/info/warn/error are mapped to user-visible severity bands, and the docs state which levels page on-call.
  • [ ] B22. Schema version field. Every log carries schema_version; the docs describe how breaking schema changes are introduced and migrated without dropping correlation.
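
The following is a minimal sketch of a conforming log record, assuming a small emit helper; the identifier values, the decision payload, and the helper itself are illustrative, while the required field names mirror B14, B15, B16, and B22 above.

    import json
    import sys
    from datetime import datetime, timezone

    from opentelemetry import trace

    def emit(event: str, **fields) -> None:
        """Emit one single-line JSON log record correlated to the active span (B13/B14)."""
        ctx = trace.get_current_span().get_span_context()
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "schema_version": "1.0",                      # B22: bump on breaking changes
            "event": event,                               # B15: value from the closed enum
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
            **fields,
        }
        sys.stdout.write(json.dumps(record, separators=(",", ":")) + "\n")

    # B16: a decision log for a tool-selection step (all values made up).
    emit(
        "agent.tool_selected",
        run_id="run-2024-0001",
        conversation_id="conv-8842",
        decision={
            "input_summary": "user asked for order status",
            "rationale": "order lookup requires the search_orders tool",
        },
    )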

Section C — Trajectory replay artifact (8 items)

  • [ ] C23. Trajectory format defined. A canonical JSON or JSONL trajectory schema is documented: an ordered list of {step, role, action, observation, state_hash} tuples per run (see the sketch after this list).
  • [ ] C24. Storage location and lifecycle. The docs name the bucket, retention window, encryption setting, and access controls for stored trajectories.
  • [ ] C25. Deterministic replay harness. A CLI or SDK call that takes a run_id and reproduces the trajectory step-by-step is documented, including how fixture-based model responses are loaded.
  • [ ] C26. Tool stubs for replay. Replays do not call live tools; the docs define how recorded tool responses are looked up and how mismatches between recorded and live tools are surfaced.
  • [ ] C27. Diff view for trajectory drift. A documented diff format compares two trajectories field by field for regression review during pull-request approval.
  • [ ] C28. Failure-to-success relabeling. Following the AgentHER paradigm, failed trajectories are kept and relabeled rather than discarded; the docs describe the relabeling pipeline and where outputs land.
  • [ ] C29. Privacy review for replay data. Stored trajectories are subject to the same redaction policy as logs; the docs link the policy explicitly and define a deletion SLA for user-initiated requests.
  • [ ] C30. Reproducibility evidence in releases. Each release notes which trajectories were replayed, the diff result, and who signed off. No release ships without this evidence.
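
Below is a minimal sketch of the step record from C23 plus a replay loop that checks recorded tool observations against their state hashes, in the spirit of C26 and C27. The JSONL layout, the hashing scheme, the file name, and every field value beyond the documented tuple are assumptions for illustration.

    import hashlib
    import json
    from pathlib import Path

    def state_hash(observation: dict) -> str:
        """Hash of a recorded tool observation, stored as the step's state_hash."""
        return hashlib.sha256(json.dumps(observation, sort_keys=True).encode()).hexdigest()

    def load_trajectory(path: Path) -> list[dict]:
        """Read one run's trajectory: one JSON object per line, ordered by step (C23)."""
        steps = [json.loads(line) for line in path.read_text().splitlines() if line]
        return sorted(steps, key=lambda s: s["step"])

    def replay(steps: list[dict]) -> None:
        """Replay a recorded run against stubbed tools -- no live calls (C26)."""
        for step in steps:
            if step["role"] == "tool":
                # The recorded observation stands in for the live tool response.
                if state_hash(step["observation"]) != step["state_hash"]:
                    raise ValueError(f"trajectory drift at step {step['step']}")  # C27
            # user/assistant steps are re-fed verbatim to the agent under test.

    # One illustrative step record; all values are made up.
    obs = {"order_id": "A-1042", "status": "shipped"}
    record = {"step": 3, "role": "tool", "action": "search_orders",
              "observation": obs, "state_hash": state_hash(obs)}

    path = Path("run-2024-0001.jsonl")
    path.write_text(json.dumps(record) + "\n")
    replay(load_trajectory(path))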

Cross-cutting documentation requirements

  • Every checklist item lives in a section of the agent spec — not in a tooling runbook — so the documentation survives vendor changes.
  • The agent spec links the OpenTelemetry GenAI semantic conventions as the upstream source for span and attribute names.
  • A change-log column tracks who edited each section and when, so observability docs do not silently drift behind the implementation.
  • The spec includes at least one worked example: a sample run with its full span tree, log stream, and trajectory file, redacted for the docs.

Common mistakes

  • Documenting dashboards, not contracts. Screenshots of a vendor dashboard age fast. Document the schema; vendors are interchangeable.
  • Recording prompts inline by default. Inline content recording breaks privacy reviews. Default to external content storage with span references, per the GenAI conventions.
  • One giant trace per conversation. Long conversations swallow regressions. Use a root span per request, with gen_ai.conversation.id as the join key for multi-turn analysis.
  • Replay that calls live tools. Replay must use recorded tool responses; otherwise it is just a re-run, not a trajectory test, and it cannot prove regressions are fixed.
  • Sparse error taxonomy. A free-text error.message is high-cardinality and useless for alerting. The docs must define the closed enum on error.type, as in the sketch below.
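
A sketch of that closed enum, assuming string values derived from the categories listed in item A9:

    from enum import Enum

    class ErrorType(str, Enum):
        """Low-cardinality error.type values; categories taken from item A9."""
        PROVIDER_ERROR = "provider_error"
        TIMEOUT = "timeout"
        GUARDRAIL_VIOLATION = "guardrail_violation"
        TOOL_ERROR = "tool_error"
        SCHEMA_VIOLATION = "schema_violation"

    # On failure the span records the enum value, never free text:
    # span.set_attribute("error.type", ErrorType.TIMEOUT.value)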

FAQ

Q: What is agent observability documentation?

It is the written contract that defines how an AI agent's runs are traced, logged, and replayed. It includes a span schema, a structured log schema, and a trajectory replay format, and it lives next to the agent's product spec rather than in a separate ops wiki.

Q: How is this different from LLM monitoring?

LLM monitoring tracks individual model calls — latency, tokens, errors. Agent observability tracks complete agent cycles: multi-step reasoning, tool execution, sub-agent handoffs, and how individual calls combine into workflows. The documentation must reflect that broader scope, including a defined trajectory artifact.

Q: Do we have to use the OpenTelemetry GenAI conventions?

Strongly recommended. Microsoft Agent Framework, Microsoft Foundry, Inkeep, Maxim, and most vendor SDKs already emit spans that follow the GenAI conventions, and Datadog, Dynatrace, and SigNoz consume them natively. Custom span names create migration debt with no upside.

Q: What is a trajectory replay and why document it?

A trajectory replay is a re-execution of a recorded agent run against stubbed tools and fixture model responses. Documenting the replay format makes regressions reproducible across releases and lets reviewers confirm that a fix actually changes the behaviour of the failing case rather than producing a coincidentally similar result.

Q: How long does it take to apply this checklist?

For a single-agent system, expect about one engineering week to draft the schemas and one review cycle to land them. Multi-agent systems with several tools take two to three weeks. The cost is paid back the first time on-call needs to debug a long-running session in production.

Related Articles

Agent Authentication Documentation Spec
Document authentication for autonomous agents: OAuth flows, API keys, scopes, error states, and consent UX patterns AI agents need to operate safely.

Agent Circuit Breaker Specification
Specification for circuit breakers protecting AI agent calls to LLM providers and tools, including state transitions, threshold tuning, fallback strategies, and observability hooks.

Agent Citation Attribution Specification: Verifiable Source Tracking for Autonomous AI Agents
Specification defining HTTP headers, provenance manifests, and chain-of-citation markup so autonomous AI agents produce verifiable citations to source content.
