Agent Trace Instrumentation Specification: OpenTelemetry for AI Agents
AI agents emit nested LLM calls, tool invocations, and retrieval lookups. OpenTelemetry's gen_ai semantic conventions provide a vendor-neutral way to capture them. This specification defines the span hierarchy, attribute set, sampling, and vendor integration patterns needed to debug production agents.
TL;DR
Wrap each agent run in a root span and nest child spans for every LLM call, tool call, and retrieval lookup. Use OpenTelemetry's gen_ai.* semantic conventions for model, prompt, response, and token attributes. Capture prompts behind a feature flag and redact PII at write time. Sample by error and latency, not at random. Export to any OTLP-compatible backend (Langfuse, Datadog, Honeycomb, or a self-hosted collector).
Why this specification exists
An agent run is a tree, not a single LLM call. Without structured tracing, you cannot answer basic questions: which tool failed, how many tokens went into the second LLM call, why the agent looped. OpenTelemetry already solves distributed tracing for HTTP services; the gen_ai semantic conventions extend it to LLM and agent workflows (OpenTelemetry gen_ai semantic conventions). Adopting them avoids vendor lock-in and lets you swap observability backends without re-instrumenting.
Span hierarchy
An agent run produces a tree of spans:
```
agent.run
├─ agent.plan
│  └─ gen_ai.client.operation      (LLM call)
├─ agent.tool_call                 (notion.searchPages)
│  └─ db.client.operation          (downstream HTTP/DB call)
├─ agent.tool_call                 (notion.updatePage)
│  └─ db.client.operation
└─ agent.reflect
   └─ gen_ai.client.operation
```
Rules:
- Exactly one root span per agent run, named agent.run.
- Each LLM call is a gen_ai.client.operation span.
- Each tool call is an agent.tool_call span; downstream HTTP/DB work nests inside.
- Sub-agent calls become a child agent.run span linked via span_link.
- Retrieval lookups are spans (agent.retrieval) with attributes for store, top_k, and latency.
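A minimal sketch of this hierarchy with the OpenTelemetry Python SDK. The span names and nesting follow the rules above; the agent id, tool name, and the elided bodies are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.instrumentation")

def run_agent(task: str) -> None:
    # Rule 1: exactly one root span per agent run.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.id", "geodocs-writer")
        # The planning phase wraps its own LLM call.
        with tracer.start_as_current_span("agent.plan"):
            with tracer.start_as_current_span("gen_ai.client.operation"):
                ...  # LLM call goes here
        # Each tool call gets its own span; downstream work nests inside it.
        with tracer.start_as_current_span("agent.tool_call") as tool_span:
            tool_span.set_attribute("tool.name", "notion.searchPages")
            with tracer.start_as_current_span("db.client.operation"):
                ...  # downstream HTTP/DB call
```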
Required attributes
agent.run
- agent.id — stable agent identity.
- agent.principal — on-behalf-of user.
- agent.session_id — session/run identifier.
- agent.input.tokens — total input tokens across LLM calls.
- agent.output.tokens — total output tokens across LLM calls.
- agent.tool_calls.count — number of tool invocations.
- agent.outcome — success, failure, cancelled.
gen_ai.client.operation
Use OpenTelemetry gen_ai conventions (gen_ai docs):
- gen_ai.system — e.g., openai, anthropic, google_gemini.
- gen_ai.request.model — model identifier.
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens.
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
- gen_ai.response.id, gen_ai.response.finish_reasons.
- gen_ai.cost.input_usd, gen_ai.cost.output_usd (custom or via processor).
agent.tool_call
- tool.name — fully-qualified tool name (e.g., notion.updatePage).
- tool.namespace, tool.verb, tool.noun for routing analytics.
- tool.argument_size_bytes (do not store raw arguments by default).
- tool.outcome — success, denied, error.
- tool.error.code, tool.error.type if non-success.
agent.retrieval
- retrieval.store — e.g., pgvector, pinecone.
- retrieval.top_k, retrieval.filter_count.
- retrieval.latency_ms.
- retrieval.result_count.
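A sketch of recording these attributes on a single LLM span. The values are illustrative, and `tracer` is the one defined in the hierarchy sketch above:

```python
# tracer as defined in the hierarchy sketch above
with tracer.start_as_current_span("gen_ai.client.operation") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-3.5-sonnet")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    # ... make the model call, then read usage from the response ...
    span.set_attribute("gen_ai.usage.input_tokens", 5400)
    span.set_attribute("gen_ai.usage.output_tokens", 1200)
    span.set_attribute("gen_ai.response.finish_reasons", ["end_turn"])
```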
Token cost attribution
Attribute cost on the LLM span itself, then aggregate at the run span via a span processor. This avoids double-counting and keeps per-call detail.
```
run.input_tokens  = sum(child.gen_ai.usage.input_tokens)
run.output_tokens = sum(child.gen_ai.usage.output_tokens)
run.cost_usd      = sum(child.gen_ai.cost.input_usd + child.gen_ai.cost.output_usd)
```
Compute cost from a model price table in a processor, not in per-call instrumentation code. That keeps the price table in one place and lets prices change without redeploying agents.
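A sketch of the lookup such a processor performs, assuming a hypothetical PRICE_TABLE keyed by (gen_ai.system, gen_ai.request.model) with USD prices per million tokens; the prices below are illustrative, not current list prices:

```python
# Hypothetical price table: USD per 1M tokens, keyed by (system, model).
PRICE_TABLE = {
    ("anthropic", "claude-3.5-sonnet"): {"input": 3.00, "output": 15.00},
    ("openai", "gpt-4o"): {"input": 2.50, "output": 10.00},
}

def llm_span_cost(attrs: dict) -> tuple[float, float]:
    """Return (input_usd, output_usd) for one gen_ai.client.operation span."""
    prices = PRICE_TABLE.get(
        (attrs.get("gen_ai.system"), attrs.get("gen_ai.request.model"))
    )
    if prices is None:
        return 0.0, 0.0  # unknown model: record zero rather than guess
    return (
        attrs.get("gen_ai.usage.input_tokens", 0) / 1_000_000 * prices["input"],
        attrs.get("gen_ai.usage.output_tokens", 0) / 1_000_000 * prices["output"],
    )
```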
Privacy-aware prompt and response capture
- Default: do not log prompts or responses; record only sizes and token counts.
- Opt-in for debugging: gate verbose capture behind a feature flag scoped to a tenant.
- Redaction: if you do capture, run the same PII redactor used for episodic memory before persisting.
- Retention: shorter than your default trace retention. Treat verbose captures like raw logs.
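A sketch of flag-gated capture under these rules. The attribute names, the `VERBOSE_TENANTS` flag store, and the regex redactor are all stand-ins; production reuses the episodic-memory redactor and a real feature-flag service:

```python
import re

VERBOSE_TENANTS: set[str] = set()  # hypothetical stand-in for a feature-flag store

def redact_pii(text: str) -> str:
    # Placeholder redactor: mask email addresses only.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)

def capture_prompt(span, tenant_id: str, prompt: str) -> None:
    # Default: sizes only, never content.
    span.set_attribute("gen_ai.prompt.size_chars", len(prompt))
    # Opt-in: per-tenant flag, redacted before persisting.
    if tenant_id in VERBOSE_TENANTS:
        span.set_attribute("gen_ai.prompt.redacted", redact_pii(prompt))
```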
Correlation IDs
- Propagate W3C traceparent and tracestate headers to every downstream HTTP call.
- Surface the trace ID to the user-facing UI when an agent fails so users can quote it in a bug report.
- Include the trace ID in the audit log entry for the action; cross-reference both directions.
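A sketch of both halves using the OTel propagation API; `requests` is just the assumed HTTP client:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

def call_downstream(url: str) -> requests.Response:
    headers: dict[str, str] = {}
    inject(headers)  # writes traceparent/tracestate from the current context
    return requests.get(url, headers=headers)

def current_trace_id() -> str:
    # Hex trace ID to show in user-facing errors and write to the audit log.
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x")
```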
Sampling
A fixed-rate random sampler hides the failures you actually want to see.
- Tail-based sampling on the collector: keep traces with errors, slow LLM calls, or unusual tool sequences.
- Always-on sampling for low-traffic admin agents.
- Per-tenant caps so a noisy tenant cannot evict everyone else's traces.
Document the sampling policy alongside the SLOs the team is committing to.
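As a concrete statement of the keep decision, here is a Python paraphrase; in production this logic lives in the collector's tail-sampling processor, not in application code, and the thresholds are illustrative:

```python
# Per-trace keep decision, evaluated over all spans once the trace completes.
SLOW_LLM_MS = 10_000  # illustrative latency threshold

def keep_trace(spans: list[dict]) -> bool:
    has_error = any(s.get("status") == "ERROR" for s in spans)
    slow_llm = any(
        s.get("name") == "gen_ai.client.operation"
        and s.get("duration_ms", 0) > SLOW_LLM_MS
        for s in spans
    )
    # Everything else falls through to a low probabilistic rate,
    # subject to per-tenant caps.
    return has_error or slow_llm
```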
Vendor integration
Any OTLP-compatible backend can ingest these traces.
| Backend | Notes |
|---|---|
| Langfuse | Native gen_ai support; built for LLM observability (Langfuse OTel) |
| Datadog | LLM Observability product reads gen_ai attributes (Datadog LLM Observability) |
| Honeycomb | Strong query model for high-cardinality agent attributes (Honeycomb AI) |
| Self-hosted | OTel Collector + Tempo/Jaeger + Grafana |
Keep the SDK and instrumentation backend-agnostic. The collector decides where data lands.
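A sketch of that wiring in the Python SDK; the endpoint points at a local collector and is illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
# The SDK only speaks OTLP; the collector routes to Langfuse, Datadog,
# Honeycomb, or Tempo/Jaeger without touching instrumentation code.
```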
Sample trace (truncated)
```yaml
agent.run:
  agent.id: "geodocs-writer"
  agent.outcome: "success"
  agent.input.tokens: 12450
  agent.output.tokens: 3120
  children:
    - gen_ai.client.operation:
        gen_ai.system: "anthropic"
        gen_ai.request.model: "claude-3.5-sonnet"
        gen_ai.usage.input_tokens: 5400
        gen_ai.usage.output_tokens: 1200
    - agent.tool_call:
        tool.name: "notion.searchPages"
        tool.outcome: "success"
    - agent.tool_call:
        tool.name: "notion.updatePage"
        tool.outcome: "success"
```
Validation checklist
- [ ] Every agent run produces exactly one root span.
- [ ] LLM calls use gen_ai.* conventions.
- [ ] Tool calls record outcome and error type.
- [ ] Token usage rolls up to the run span.
- [ ] Verbose capture is gated and redacted.
- [ ] Trace IDs are surfaced to user-facing errors.
- [ ] Sampling keeps errors and slow calls.
- [ ] Backend is interchangeable via OTLP.
FAQ
Q: Should I roll my own tracing or use OpenTelemetry?
Use OpenTelemetry. Vendor-specific SDKs lock you in and rarely cover both HTTP and LLM spans cleanly. The gen_ai conventions are stable enough for production.
Q: How do I avoid leaking prompts in traces?
Default to size-only capture. Gate verbose capture per tenant and run the same redactor you use for memory. Keep verbose retention short.
Q: How do I attribute cost across many models?
Keep a price table keyed by gen_ai.system + gen_ai.request.model. Compute cost in a span processor and write it back as a span attribute, then aggregate at the run span.
Q: How should I sample agent traces?
Use tail-based sampling: keep errors, slow runs, and runs that exceed token budgets; downsample the rest. Random sampling at the SDK loses the most useful traces.
Q: What about sub-agents and parallel calls?
Make each sub-agent run a child agent.run span. Use span_link for parallel calls so the trace shows the fan-out structure clearly.