Agent Trace Instrumentation Specification: OpenTelemetry for AI Agents
AI agents emit nested LLM calls, tool invocations, and retrieval lookups. OpenTelemetry's gen_ai semantic conventions provide a vendor-neutral way to capture them. This specification defines the span hierarchy, attribute set, sampling, and vendor integration patterns needed to debug production agents.
TL;DR
Wrap each agent run in a root span and nest child spans for every LLM call, tool call, and retrieval lookup. Use OpenTelemetry's gen_ai.* semantic conventions for model, prompt, response, and token attributes. Capture prompts behind a feature flag and redact PII at write time. Sample by error and latency, not at random. Export to any OTLP-compatible backend (Langfuse, Datadog, Honeycomb, or a self-hosted collector).
Why this specification exists
An agent run is a tree, not a single LLM call. Without structured tracing, you cannot answer basic questions: which tool failed, how many tokens went into the second LLM call, why the agent looped. OpenTelemetry already solves distributed tracing for HTTP services; the gen_ai semantic conventions extend it to LLM and agent workflows (OpenTelemetry gen_ai semantic conventions). Adopting them avoids vendor lock-in and lets you swap observability backends without re-instrumenting.
Span hierarchy
An agent run produces a tree of spans:
```
agent.run
├─ agent.plan
│  └─ gen_ai.client.operation      (LLM call)
├─ agent.tool_call                 (notion.searchPages)
│  └─ db.client.operation          (downstream HTTP/DB call)
├─ agent.tool_call                 (notion.updatePage)
│  └─ db.client.operation
└─ agent.reflect
   └─ gen_ai.client.operation
```
Rules:
- Exactly one root span per agent run, named agent.run.
- Each LLM call is a gen_ai.client.operation span.
- Each tool call is an agent.tool_call span; downstream HTTP/DB work nests inside.
- Sub-agent calls become a child agent.run span linked via span_link.
- Retrieval lookups are spans (agent.retrieval) with attributes for store, top_k, and latency.
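A minimal sketch of this hierarchy with the OpenTelemetry Python SDK. The span names and nesting follow the rules above; the agent id, tool name, and the elided bodies are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.instrumentation")

def run_agent(task: str) -> None:
    # Rule 1: exactly one root span per agent run.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.id", "geodocs-writer")
        # The planning phase wraps its own LLM call.
        with tracer.start_as_current_span("agent.plan"):
            with tracer.start_as_current_span("gen_ai.client.operation"):
                ...  # LLM call goes here
        # Each tool call gets its own span; downstream work nests inside it.
        with tracer.start_as_current_span("agent.tool_call") as tool_span:
            tool_span.set_attribute("tool.name", "notion.searchPages")
            with tracer.start_as_current_span("db.client.operation"):
                ...  # downstream HTTP/DB call
```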
Required attributes
agent.run
- agent.id — stable agent identity.
- agent.principal — on-behalf-of user.
- agent.session_id — session/run identifier.
- agent.input.tokens — total input tokens across LLM calls.
- agent.output.tokens — total output tokens across LLM calls.
- agent.tool_calls.count — number of tool invocations.
- agent.outcome — success, failure, cancelled.
gen_ai.client.operation
Use OpenTelemetry gen_ai conventions (gen_ai docs):
- gen_ai.system — e.g., openai, anthropic, google_gemini.
- gen_ai.request.model — model identifier.
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens.
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
- gen_ai.response.id, gen_ai.response.finish_reasons.
- gen_ai.cost.input_usd, gen_ai.cost.output_usd (custom or via processor).
agent.tool_call
- tool.name — fully-qualified tool name (e.g., notion.updatePage).
- tool.namespace, tool.verb, tool.noun for routing analytics.
- tool.argument_size_bytes (do not store raw arguments by default).
- tool.outcome — success, denied, error.
- tool.error.code, tool.error.type if non-success.
agent.retrieval
- retrieval.store — e.g., pgvector, pinecone.
- retrieval.top_k, retrieval.filter_count.
- retrieval.latency_ms.
- retrieval.result_count.
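A sketch of recording these attributes on a single LLM span. The values are illustrative, and `tracer` is the one defined in the hierarchy sketch above:

```python
# tracer as defined in the hierarchy sketch above
with tracer.start_as_current_span("gen_ai.client.operation") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-3.5-sonnet")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    # ... make the model call, then read usage from the response ...
    span.set_attribute("gen_ai.usage.input_tokens", 5400)
    span.set_attribute("gen_ai.usage.output_tokens", 1200)
    span.set_attribute("gen_ai.response.finish_reasons", ["end_turn"])
```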
Token cost attribution
Attribute cost on the LLM span itself, then aggregate at the run span via a span processor. This avoids double-counting and keeps per-call detail.
```
run.input_tokens  = sum(child.gen_ai.usage.input_tokens)
run.output_tokens = sum(child.gen_ai.usage.output_tokens)
run.cost_usd      = sum(child.gen_ai.cost.input_usd + child.gen_ai.cost.output_usd)
```
Compute cost from a model price table in a processor, not in per-call instrumentation code. That keeps the price table in one place and lets prices change without redeploying agents.
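A sketch of the lookup such a processor performs, assuming a hypothetical PRICE_TABLE keyed by (gen_ai.system, gen_ai.request.model) with USD prices per million tokens; the prices below are illustrative, not current list prices:

```python
# Hypothetical price table: USD per 1M tokens, keyed by (system, model).
PRICE_TABLE = {
    ("anthropic", "claude-3.5-sonnet"): {"input": 3.00, "output": 15.00},
    ("openai", "gpt-4o"): {"input": 2.50, "output": 10.00},
}

def llm_span_cost(attrs: dict) -> tuple[float, float]:
    """Return (input_usd, output_usd) for one gen_ai.client.operation span."""
    prices = PRICE_TABLE.get(
        (attrs.get("gen_ai.system"), attrs.get("gen_ai.request.model"))
    )
    if prices is None:
        return 0.0, 0.0  # unknown model: record zero rather than guess
    return (
        attrs.get("gen_ai.usage.input_tokens", 0) / 1_000_000 * prices["input"],
        attrs.get("gen_ai.usage.output_tokens", 0) / 1_000_000 * prices["output"],
    )
```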
Privacy-aware prompt and response capture
- Default: do not log prompts or responses; record only sizes and token counts.
- Opt-in for debugging: gate verbose capture behind a feature flag scoped to a tenant.
- Redaction: if you do capture, run the same PII redactor used for episodic memory before persisting.
- Retention: shorter than your default trace retention. Treat verbose captures like raw logs.
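A sketch of flag-gated capture under these rules. The attribute names, the `VERBOSE_TENANTS` flag store, and the regex redactor are all stand-ins; production reuses the episodic-memory redactor and a real feature-flag service:

```python
import re

VERBOSE_TENANTS: set[str] = set()  # hypothetical stand-in for a feature-flag store

def redact_pii(text: str) -> str:
    # Placeholder redactor: mask email addresses only.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)

def capture_prompt(span, tenant_id: str, prompt: str) -> None:
    # Default: sizes only, never content.
    span.set_attribute("gen_ai.prompt.size_chars", len(prompt))
    # Opt-in: per-tenant flag, redacted before persisting.
    if tenant_id in VERBOSE_TENANTS:
        span.set_attribute("gen_ai.prompt.redacted", redact_pii(prompt))
```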
Correlation IDs
- Propagate W3C traceparent and tracestate headers to every downstream HTTP call.
- Surface the trace ID to the user-facing UI when an agent fails so users can quote it in a bug report.
- Include the trace ID in the audit log entry for the action; cross-reference both directions.
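A sketch of both halves using the OTel propagation API; `requests` is just the assumed HTTP client:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

def call_downstream(url: str) -> requests.Response:
    headers: dict[str, str] = {}
    inject(headers)  # writes traceparent/tracestate from the current context
    return requests.get(url, headers=headers)

def current_trace_id() -> str:
    # Hex trace ID to show in user-facing errors and write to the audit log.
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x")
```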
Sampling
A fixed-rate random sampler hides the failures you actually want to see.
- Tail-based sampling on the collector: keep traces with errors, slow LLM calls, or unusual tool sequences.
- Always-on sampling for low-traffic admin agents.
- Per-tenant caps so a noisy tenant cannot evict everyone else's traces.
Document the sampling policy alongside the SLOs the team is committing to.
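As a concrete statement of the keep decision, here is a Python paraphrase; in production this logic lives in the collector's tail-sampling processor, not in application code, and the thresholds are illustrative:

```python
# Per-trace keep decision, evaluated over all spans once the trace completes.
SLOW_LLM_MS = 10_000  # illustrative latency threshold

def keep_trace(spans: list[dict]) -> bool:
    has_error = any(s.get("status") == "ERROR" for s in spans)
    slow_llm = any(
        s.get("name") == "gen_ai.client.operation"
        and s.get("duration_ms", 0) > SLOW_LLM_MS
        for s in spans
    )
    # Everything else falls through to a low probabilistic rate,
    # subject to per-tenant caps.
    return has_error or slow_llm
```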
Vendor integration
Any OTLP-compatible backend can ingest these traces.
| Backend | Notes |
|---|---|
| Langfuse | Native gen_ai support; built for LLM observability (Langfuse OTel) |
| Datadog | LLM Observability product reads gen_ai attributes (Datadog LLM Observability) |
| Honeycomb | Strong query model for high-cardinality agent attributes (Honeycomb AI) |
| Self-hosted | OTel Collector + Tempo/Jaeger + Grafana |
Keep the SDK and instrumentation backend-agnostic. The collector decides where data lands.
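A sketch of that wiring in the Python SDK; the endpoint points at a local collector and is illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
# The SDK only speaks OTLP; the collector routes to Langfuse, Datadog,
# Honeycomb, or Tempo/Jaeger without touching instrumentation code.
```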
Sample trace (truncated)
```yaml
agent.run:
  agent.id: "geodocs-writer"
  agent.outcome: "success"
  agent.input.tokens: 12450
  agent.output.tokens: 3120
  children:
    - gen_ai.client.operation:
        gen_ai.system: "anthropic"
        gen_ai.request.model: "claude-3.5-sonnet"
        gen_ai.usage.input_tokens: 5400
        gen_ai.usage.output_tokens: 1200
    - agent.tool_call:
        tool.name: "notion.searchPages"
        tool.outcome: "success"
    - agent.tool_call:
        tool.name: "notion.updatePage"
        tool.outcome: "success"
```
Validation checklist
- [ ] Every agent run produces exactly one root span.
- [ ] LLM calls use gen_ai.* conventions.
- [ ] Tool calls record outcome and error type.
- [ ] Token usage rolls up to the run span.
- [ ] Verbose capture is gated and redacted.
- [ ] Trace IDs are surfaced to user-facing errors.
- [ ] Sampling keeps errors and slow calls.
- [ ] Backend is interchangeable via OTLP.
FAQ
Q: Should I roll my own tracing or use OpenTelemetry?
Use OpenTelemetry. Vendor-specific SDKs lock you in and rarely cover both HTTP and LLM spans cleanly. The gen_ai conventions are stable enough for production.
Q: How do I avoid leaking prompts in traces?
Default to size-only capture. Gate verbose capture per tenant and run the same redactor you use for memory. Keep verbose retention short.
Q: How do I attribute cost across many models?
Keep a price table keyed by gen_ai.system + gen_ai.request.model. Compute cost in a span processor and write it back as a span attribute, then aggregate at the run span.
Q: How should I sample agent traces?
Use tail-based sampling: keep errors, slow runs, and runs that exceed token budgets; downsample the rest. Random sampling at the SDK loses the most useful traces.
Q: What about sub-agents and parallel calls?
Make each sub-agent run a child agent.run span. Use span_link for parallel calls so the trace shows the fan-out structure clearly.