Geodocs.dev

Agent Trace Instrumentation Specification: OpenTelemetry for AI Agents

AI agents emit nested LLM calls, tool invocations, and retrieval lookups. OpenTelemetry's gen_ai semantic conventions provide a vendor-neutral way to capture them. This specification defines the span hierarchy, attribute set, sampling, and vendor integration patterns needed to debug production agents.

TL;DR

Wrap each agent run in a root span and nest child spans for every LLM call, tool call, and retrieval lookup. Use OpenTelemetry's gen_ai.* semantic conventions for model, prompt, response, and token attributes. Capture prompts behind a feature flag and redact PII at write time. Sample by error and latency, not at random. Export to any OTLP-compatible backend (Langfuse, Datadog, Honeycomb, or a self-hosted collector).

Why this specification exists

An agent run is a tree, not a single LLM call. Without structured tracing, you cannot answer basic questions: which tool failed, how many tokens went into the second LLM call, why the agent looped. OpenTelemetry already solves distributed tracing for HTTP services; the gen_ai semantic conventions extend it to LLM and agent workflows (OpenTelemetry gen_ai semantic conventions). Adopting them avoids vendor lock-in and lets you swap observability backends without re-instrumenting.

Span hierarchy

An agent run produces a tree of spans:

agent.run
  agent.plan
    gen_ai.client.operation (LLM call)
  agent.tool_call (notion.searchPages)
    db.client.operation (downstream HTTP/DB call)
  agent.tool_call (notion.updatePage)
    db.client.operation
  agent.reflect
    gen_ai.client.operation

Rules:

  • Exactly one root span per agent run, named agent.run.
  • Each LLM call is a gen_ai.client.operation span.
  • Each tool call is an agent.tool_call span; downstream HTTP/DB work nests inside.
  • Sub-agent calls become a child agent.run span linked via span_link.
  • Retrieval lookups are spans (agent.retrieval) with attributes for store, top_k, and latency.

Required attributes

agent.run

  • agent.id — stable agent identity.
  • agent.principal — on-behalf-of user.
  • agent.session_id — session/run identifier.
  • agent.input.tokens — total input tokens across LLM calls.
  • agent.output.tokens — total output tokens across LLM calls.
  • agent.tool_calls.count — number of tool invocations.
  • agent.outcome — success, failure, cancelled.

gen_ai.client.operation

Use OpenTelemetry gen_ai conventions (gen_ai docs):

  • gen_ai.system — e.g., openai, anthropic, google_gemini.
  • gen_ai.request.model — model identifier.
  • gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens.
  • gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
  • gen_ai.response.id, gen_ai.response.finish_reasons.
  • gen_ai.cost.input_usd, gen_ai.cost.output_usd (custom or via processor).

agent.tool_call

  • tool.name — fully-qualified tool name (e.g., notion.updatePage).
  • tool.namespace, tool.verb, tool.noun for routing analytics.
  • tool.argument_size_bytes (do not store raw arguments by default).
  • tool.outcome — success, denied, error.
  • tool.error.code, tool.error.type if non-success.
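A small helper can derive the tool.* attribute set from a qualified tool name without ever persisting raw arguments. This is a sketch under the naming convention used in this spec (namespace.verbNoun); the camelCase split is an assumption about how your tool names are formed.

```python
import json

def tool_call_attributes(tool_name: str, arguments: dict, outcome: str) -> dict:
    """Build agent.tool_call span attributes. Raw arguments are never stored;
    only their serialized size is recorded."""
    namespace, _, action = tool_name.partition(".")
    # Split a camelCase action like "updatePage" into verb + noun for routing analytics.
    verb, noun = action, ""
    for i, ch in enumerate(action):
        if ch.isupper():
            verb, noun = action[:i], action[i:].lower()
            break
    return {
        "tool.name": tool_name,
        "tool.namespace": namespace,
        "tool.verb": verb,
        "tool.noun": noun,
        "tool.argument_size_bytes": len(json.dumps(arguments).encode()),
        "tool.outcome": outcome,
    }

attrs = tool_call_attributes("notion.updatePage", {"page_id": "abc", "title": "Q3"}, "success")
print(attrs["tool.verb"], attrs["tool.noun"])  # update page
```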

agent.retrieval

  • retrieval.store — e.g., pgvector, pinecone.
  • retrieval.top_k, retrieval.filter_count.
  • retrieval.latency_ms.
  • retrieval.result_count.

Token cost attribution

Attribute cost on the LLM span itself, then aggregate at the run span via a span processor. This avoids double-counting and keeps per-call detail.

run.input_tokens = sum(child.gen_ai.usage.input_tokens)

run.output_tokens = sum(child.gen_ai.usage.output_tokens)

run.cost_usd = sum(child.gen_ai.cost.input_usd + child.gen_ai.cost.output_usd)

Compute cost in a span processor, not in per-call instrumentation code. That keeps the model price table in one place instead of scattered across every call site.
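The processor logic reduces to a lookup against a central price table keyed by system and model. The prices below are illustrative placeholders, not current vendor pricing.

```python
# Hypothetical price table (USD per 1M tokens); real prices vary by provider and date.
PRICE_PER_1M_TOKENS = {
    ("anthropic", "claude-3.5-sonnet"): {"input": 3.00, "output": 15.00},
    ("openai", "gpt-4o"): {"input": 2.50, "output": 10.00},
}

def add_cost_attributes(span_attrs: dict) -> dict:
    """Processor step: derive gen_ai.cost.* from usage and the central price table."""
    key = (span_attrs["gen_ai.system"], span_attrs["gen_ai.request.model"])
    prices = PRICE_PER_1M_TOKENS.get(key)
    if prices is None:
        return span_attrs  # unknown model: leave cost unattributed rather than guess
    span_attrs["gen_ai.cost.input_usd"] = span_attrs["gen_ai.usage.input_tokens"] / 1e6 * prices["input"]
    span_attrs["gen_ai.cost.output_usd"] = span_attrs["gen_ai.usage.output_tokens"] / 1e6 * prices["output"]
    return span_attrs

llm_span = {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-3.5-sonnet",
    "gen_ai.usage.input_tokens": 5400,
    "gen_ai.usage.output_tokens": 1200,
}
add_cost_attributes(llm_span)
# Run-level cost is then the sum over child LLM spans, so nothing is double-counted.
run_cost_usd = llm_span["gen_ai.cost.input_usd"] + llm_span["gen_ai.cost.output_usd"]
print(round(run_cost_usd, 4))  # 0.0342
```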

Privacy-aware prompt and response capture

  • Default: do not log prompts or responses; record only sizes and token counts.
  • Opt-in for debugging: gate verbose capture behind a feature flag scoped to a tenant.
  • Redaction: if you do capture, run the same PII redactor used for episodic memory before persisting.
  • Retention: shorter than your default trace retention. Treat verbose captures like raw logs.
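The capture policy above fits in one small function: size-only by default, verbose only behind a tenant-scoped flag, and redaction applied before anything is written. The flag set and the email-only redactor here are deliberately simplified stand-ins; a real deployment would use its feature-flag service and the same PII redactor as episodic memory.

```python
import re

# Hypothetical tenant-scoped feature flag; in practice this comes from a flag service.
VERBOSE_TENANTS = {"tenant-debug-42"}

# Simplified redactor (emails only) standing in for the shared PII redactor.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def capture_prompt(tenant_id: str, prompt: str) -> dict:
    """Default: record size only. Verbose capture is opt-in and redacted at write time."""
    attrs = {"gen_ai.prompt.size_chars": len(prompt)}
    if tenant_id in VERBOSE_TENANTS:
        attrs["gen_ai.prompt"] = EMAIL.sub("[REDACTED_EMAIL]", prompt)
    return attrs

default = capture_prompt("tenant-a", "Email alice@example.com about Q3")
verbose = capture_prompt("tenant-debug-42", "Email alice@example.com about Q3")
print(default)
print(verbose["gen_ai.prompt"])  # Email [REDACTED_EMAIL] about Q3
```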

Correlation IDs

  • Propagate W3C traceparent and tracestate headers to every downstream HTTP call.
  • Surface the trace ID to the user-facing UI when an agent fails so users can quote it in a bug report.
  • Include the trace ID in the audit log entry for the action; cross-reference both directions.
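The W3C traceparent header has a fixed shape: version, 32-hex-digit trace ID, 16-hex-digit parent span ID, and flags. A minimal sketch of building it for a downstream call (the IDs below are the example values from the W3C Trace Context spec):

```python
def traceparent_header(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

# Attach to every downstream HTTP call so the tool's work joins the same trace.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
headers = {"traceparent": traceparent_header(trace_id, "00f067aa0ba902b7")}
print(headers["traceparent"])  # 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# On failure, surface trace_id in the user-facing error so support can look it up.
```

In practice the SDK's propagator injects this header for you; the point of the sketch is the format users will quote back in bug reports.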

Sampling

A fixed-rate random sampler hides the failures you actually want to see.

  • Tail-based sampling on the collector: keep traces with errors, slow LLM calls, or unusual tool sequences.
  • Always-on sampling for low-traffic admin agents.
  • Per-tenant caps so a noisy tenant cannot evict everyone else's traces.

Document the sampling policy alongside the SLOs the team is committing to.
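A tail-sampling policy of this shape can be expressed in the OpenTelemetry Collector's tail_sampling processor. The thresholds and percentages below are illustrative, not recommendations:

```yaml
# Collector config sketch: keep errors and slow traces, downsample the rest.
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans until the trace is complete enough to judge
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-llm
        type: latency
        latency:
          threshold_ms: 5000   # illustrative threshold for slow LLM calls
      - name: downsample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```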

Vendor integration

Any OTLP-compatible backend can ingest these traces.

Backend        Notes
Langfuse       Native gen_ai support; built for LLM observability (Langfuse OTel)
Datadog        LLM Observability product reads gen_ai attributes (Datadog LLM Observability)
Honeycomb      Strong query model for high-cardinality agent attributes (Honeycomb AI)
Self-hosted    OTel Collector + Tempo/Jaeger + Grafana

Keep the SDK and instrumentation backend-agnostic. The collector decides where data lands.

Sample trace (truncated)

agent.run:
  agent.id: "geodocs-writer"
  agent.outcome: "success"
  agent.input.tokens: 12450
  agent.output.tokens: 3120
  children:
    - gen_ai.client.operation:
        gen_ai.system: "anthropic"
        gen_ai.request.model: "claude-3.5-sonnet"
        gen_ai.usage.input_tokens: 5400
        gen_ai.usage.output_tokens: 1200
    - agent.tool_call:
        tool.name: "notion.searchPages"
        tool.outcome: "success"
    - agent.tool_call:
        tool.name: "notion.updatePage"
        tool.outcome: "success"

Validation checklist

  • [ ] Every agent run produces exactly one root span.
  • [ ] LLM calls use gen_ai.* conventions.
  • [ ] Tool calls record outcome and error type.
  • [ ] Token usage rolls up to the run span.
  • [ ] Verbose capture is gated and redacted.
  • [ ] Trace IDs are surfaced to user-facing errors.
  • [ ] Sampling keeps errors and slow calls.
  • [ ] Backend is interchangeable via OTLP.

FAQ

Q: Should I roll my own tracing or use OpenTelemetry?

Use OpenTelemetry. Vendor-specific SDKs lock you in and rarely cover both HTTP and LLM spans cleanly. The gen_ai conventions are stable enough for production.

Q: How do I avoid leaking prompts in traces?

Default to size-only capture. Gate verbose capture per tenant and run the same redactor you use for memory. Keep verbose retention short.

Q: How do I attribute cost across many models?

Keep a price table keyed by gen_ai.system + gen_ai.request.model. Compute cost in a span processor and write it back as a span attribute, then aggregate at the run span.

Q: How should I sample agent traces?

Use tail-based sampling: keep errors, slow runs, and runs that exceed token budgets; downsample the rest. Random sampling at the SDK loses the most useful traces.

Q: What about sub-agents and parallel calls?

Make each sub-agent run a child agent.run span. Use span_link for parallel calls so the trace shows the fan-out structure clearly.

Related Articles

specification

Agent Knowledge Base Specification: Structure, Refresh, and Versioning

Production specification for AI agent knowledge bases: document model, chunking strategies, metadata enrichment, refresh cadence, version pinning, and rollback.

specification

Agent Multi-Step Reasoning Specification: ReAct, Plan-and-Execute, and Reflection

Specification for AI agent multi-step reasoning patterns: ReAct, Plan-and-Execute, Reflexion, Tree of Thoughts, and Self-Consistency.

specification

Agent Permission Model Specification: RBAC, Scopes, and Tool-Level Auth

Production specification for AI agent permissions: RBAC, OAuth scope mapping, tool-level auth, consent prompts, time-bound grants, and MCP propagation.
