Agent Tool Result Caching Spec: Keys, TTL, Invalidation

Agent tool result caching is the runtime contract that stores tool-call outputs keyed by tool name and canonical arguments, with TTLs chosen by idempotency class and explicit invalidation hooks. It eliminates duplicate side effects on retry and resume while keeping data fresh.

TL;DR

  • Cache keys MUST be (tool_name, canonical_args_hash) plus an optional namespace, never the raw prompt.
  • TTL policy follows the tool's idempotency class: pure functions cache forever, read-only API calls minutes-to-hours, mutations zero or single-use.
  • Invalidation is event-driven: upstream data-change webhooks, schema versions, or explicit cache-bust flags from the agent prompt.
  • Cache stampede protection (single-flight, request coalescing) is required for any tool with non-trivial latency.
  • Never cache non-idempotent tools without a per-call idempotency key; doing so risks duplicate side effects.

Definition

Agent tool result caching is the runtime mechanism that records the output of a tool call so that subsequent calls with equivalent inputs return the stored result instead of re-invoking the tool. In an agent runtime, tool calls are the most expensive operations — LLM-orchestrated API calls, code execution, retrieval queries — and the dominant source of latency, cost, and side effects. A correctly specified cache turns retries, resumes, and partial reruns into cheap operations.

The cache is distinct from prompt caching, which lives at the LLM provider boundary and stores tokenized prefixes for reuse. Prompt caching is a model-side performance optimization; tool-result caching is an orchestrator-side correctness and efficiency mechanism. The two are complementary, but their contracts and TTLs differ.

Why this matters

Agents commonly retry tool calls due to transient errors, recover from checkpoints, or revisit earlier states during reasoning. Without caching, each retry re-executes the underlying tool, which is wasteful for read operations and dangerous for writes. A read-only HTTP GET hit twice merely doubles cost; a payment-API POST hit twice can double-charge a customer.

A second motivation is determinism for testing. Agent test suites that mock tool calls with cached fixtures can replay agent traces deterministically, which is otherwise nearly impossible given LLM non-determinism and live API drift. Practitioners typically observe that adopting a cache layer materially reduces flakiness in agent end-to-end tests.

Finally, caching turns long-horizon agents into iterative editors. When a user nudges the agent ("redo step 3 with this constraint"), the cache lets the runtime preserve all unaffected work and re-run only the diff.

How it works

A compliant cache exposes three operations: get(key), put(key, value, ttl), and invalidate(pattern). The key construction is the most important part of the contract.

The canonical key is a hash of the tuple (tool_name, canonical_args, namespace). Canonical arguments are produced by sorting object keys, stripping non-deterministic fields (timestamps, request IDs), and serializing in a stable format such as JCS (JSON Canonicalization Scheme, RFC 8785). The namespace lets a single tool serve multiple cache scopes — per-user, per-tenant, per-environment.
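
A minimal sketch of this key construction, assuming sorted-keys JSON as a stand-in for full RFC 8785 canonicalization and an illustrative STRIP_FIELDS list (both are assumptions, not part of the contract):

import hashlib
import json

# Illustrative set of non-deterministic fields to strip before hashing.
STRIP_FIELDS = {"timestamp", "request_id", "trace_id"}

def canonical_key(tool_name: str, args: dict, namespace: str = "default") -> str:
    """Hash (tool_name, canonical_args, namespace) into a stable cache key."""
    clean = {k: v for k, v in args.items() if k not in STRIP_FIELDS}
    # sort_keys plus tight separators approximates JCS for flat payloads;
    # a production implementation should use a real RFC 8785 library.
    canonical = json.dumps(clean, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{tool_name}\x00{canonical}\x00{namespace}".encode())
    return f"{tool_name}:{digest.hexdigest()[:32]}"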

TTL is chosen by idempotency class:

Idempotency class                  | TTL                        | Examples
Pure function                      | indefinite                 | Currency conversion at fixed rate, hash, math
Read-only with stable data         | hours to days              | Document fetch, schema lookup, DNS
Read-only with volatile data       | seconds to minutes         | Stock price, weather, search ranking
Mutating with idempotency key      | single-use, then permanent | Stripe charge, message send
Mutating without idempotency key   | 0 (do not cache)           | Random side-effect tool
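
Such a policy maps naturally onto a lookup keyed by the tool's declared class. A sketch with illustrative class names and TTL constants (none of the values are normative):

# Illustrative TTLs per idempotency class, in seconds.
# None = cache indefinitely; 0 = do not cache.
TTL_BY_CLASS: dict[str, int | None] = {
    "pure": None,                 # pure function: indefinite
    "read_stable": 24 * 3600,     # stable read: hours to days
    "read_volatile": 60,          # volatile read: seconds to minutes
    "mutating_idempotent": None,  # single-use write, permanent record under its key
    "mutating": 0,                # never cache
}

def ttl_for(tool) -> int | None:
    return TTL_BY_CLASS[tool.idempotency_class]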

Invalidation has three drivers. Time-based expiry handles slow drift. Event-based invalidation listens for upstream signals — webhooks, change-data-capture streams, schema-version bumps — and evicts affected keys. Explicit invalidation lets the agent or operator force-bust a key, useful when an agent detects suspect data and wants to retry from source.
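
A sketch of the event-driven path, assuming the runtime maintains a reverse index from upstream resource IDs to derived cache keys at write time (hashed keys cannot otherwise be pattern-matched); the event shape and the cache handle are illustrative:

from collections import defaultdict

# Reverse index from upstream resource ID to the cache keys derived from it,
# populated whenever an entry is written to the cache.
keys_by_resource: dict[str, set[str]] = defaultdict(set)

async def on_upstream_change(event: dict) -> None:
    """Translate an upstream change event (webhook, CDC row) into evictions."""
    # Assumed event shape: {"resource_id": "doc-42", ...}
    for key in keys_by_resource.pop(event["resource_id"], set()):
        await cache.invalidate(key)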

Stampede protection prevents N concurrent agents from all missing the cache and hammering the upstream tool. Single-flight (coalesce concurrent identical requests onto one upstream call) and probabilistic early expiry are both common solutions.
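
A minimal in-process single-flight sketch in asyncio; a distributed deployment would need a shared lock (for example in Redis) instead:

import asyncio

_inflight: dict[str, asyncio.Future] = {}

async def single_flight(key: str, fetch):
    """Coalesce concurrent identical cache misses onto one upstream call."""
    if key in _inflight:
        # Another coroutine is already fetching this key; wait for its result.
        return await _inflight[key]
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        result = await fetch()  # the single upstream call
        fut.set_result(result)
        return result
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        # Remove the entry so later misses trigger a fresh upstream call.
        del _inflight[key]

On a miss, a wrapper would call single_flight(key, lambda: tool.invoke(args)) before writing the result through to the cache.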

Practical application

A five-step adoption plan:

  1. Classify every tool by idempotency class. Audit the tool registry once and record the class on each tool definition. New tools cannot be registered without a class (see the registry sketch after this list).
  2. Build the canonical-args function once and reuse it. A library that takes (tool_name, args) and returns a stable hash is the single most reusable piece of infrastructure in the cache layer.
  3. Pick a backend matched to TTL and size. In-process LRU is fine for sub-second pure-function caches. Redis with TTL eviction is the production default for read-only caches. Postgres or DynamoDB is appropriate for caches that must survive restarts and be queryable.
  4. Wire invalidation hooks before launch. Identify the top three sources of stale data (upstream DB updates, schema migrations, manual operator edits) and ensure each emits invalidation events the cache subscribes to.
  5. Instrument hit/miss rates per tool. A cache without metrics is impossible to tune. Track hit rate, latency saved, and invalidation event counts as first-class signals.
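
As referenced in step 1, a sketch of a registry entry that refuses tools lacking a valid idempotency class (the dataclass and vocabulary are illustrative, not a prescribed API):

from dataclasses import dataclass

# Illustrative class vocabulary; any fixed set works as long as it is enforced.
IDEMPOTENCY_CLASSES = {"pure", "read_stable", "read_volatile",
                       "mutating_idempotent", "mutating"}

@dataclass(frozen=True)
class ToolDefinition:
    name: str
    idempotency_class: str  # mandatory: registration fails without a valid class

    def __post_init__(self):
        if self.idempotency_class not in IDEMPOTENCY_CLASSES:
            raise ValueError(f"tool {self.name!r}: missing or unknown idempotency class")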

A typical pseudo-code wrapper:

async def cached_call(tool, args, ctx):
    # Derive the cache key from the tool name and canonicalized arguments.
    key = canonical_key(tool.name, args, ctx.namespace)
    if cached := await cache.get(key):
        return cached
    # The cache key doubles as the upstream idempotency key, so a retry
    # after a cache miss still resolves to the original upstream result.
    result = await tool.invoke(args, idempotency_key=key)
    await cache.put(key, result, ttl=tool.ttl)
    return result

Note that the same key is also passed as the upstream idempotency key, so even a cache miss on retry reuses the previous upstream result.

Common mistakes

Caching non-idempotent tools without an idempotency key is the most dangerous mistake. The cache hides the duplicate-call problem until the cache evicts, at which point the side effect fires twice. The fix is to enforce idempotency keys in the tool registry as a precondition for cache eligibility.

Keying on the raw LLM-generated argument string is the second-most-common error. LLMs serialize objects with non-deterministic key order and whitespace, producing cache misses for semantically identical requests. Always pass arguments through canonicalization before hashing.

Ignoring stale-data drift is a third pitfall. Read-only caches with hour-long TTLs feel safe but quietly serve outdated data when the upstream changes mid-window. Pair every read-only cache with at least time-based expiry plus, where available, an invalidation hook.

Finally, cache stampedes — hundreds of agents hitting an empty cache simultaneously — melt downstream services. Single-flight coalescing or randomized expiry jitter prevents this.

FAQ

Q: How is this different from prompt caching?

Prompt caching, as offered by Anthropic and OpenAI, stores tokenized prompt prefixes inside the LLM provider so the model skips re-processing them. It is invisible to the orchestrator and saves only model tokens and latency. Tool-result caching lives in the agent runtime, stores tool outputs, prevents duplicate side effects, and is the layer at which deterministic replay is possible. Both are useful and typically stacked: a checkpointed agent often reads from the prompt cache for the LLM step and from the tool-result cache for tool steps.

Q: When should TTL be zero?

Whenever a tool has side effects that cannot be reversed and lacks an idempotency key. Examples include sending an email through a non-idempotent transport, posting to a write-only webhook, or any stateful operation where the caller cannot prove the operation has not already happened. A TTL of zero forces every call to hit the upstream, which then becomes the single source of truth.

Q: How do I invalidate on upstream data change?

Three mechanisms in order of robustness. First, change-data-capture: subscribe to a database CDC stream (Debezium, native CDC) and translate row-level events into key-pattern invalidations. Second, application-emitted invalidation events: services that own data emit "data X changed" events to a pub/sub topic the cache subscribes to. Third, version stamps: include a schema or content version in the cache key so a version bump implicitly invalidates the entire prior generation.
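
The version-stamp mechanism is compact enough to show directly. This sketch folds an assumed schema_version into the namespace of the canonical_key helper sketched earlier:

def versioned_key(tool_name: str, args: dict, namespace: str, schema_version: str) -> str:
    # Scoping the namespace by version means a version bump implicitly
    # invalidates the entire prior generation without touching the store.
    return canonical_key(tool_name, args, namespace=f"{namespace}:v{schema_version}")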

Q: Should the cache survive process restarts?

For pure-function and read-only caches, yes — a Redis or Postgres backing store is cheap insurance. For caches whose entries are bound to a specific run (one-shot mutation results, partial computations), no — those should live on the per-run checkpoint, not the global cache, so they cannot leak between users.

Q: How does this interact with agent checkpointing?

The checkpoint is the per-run system of record; the cache is a global side-channel that opportunistically returns prior results. On resume, the runtime reads the checkpoint to find pending tool calls, then issues each call with its idempotency key, which the cache layer recognizes and serves. The two layers cooperate but are independent; you can have either without the other, but production agents typically run both.

