Geodocs.dev

Agent Context Window Budgeting Specification

This specification defines how agent builders pre-allocate a fixed token budget across system prompt, tool definitions, conversation history, tool outputs, and a reserved working buffer, then declare summarization and eviction triggers when any bucket overflows. It pairs with prompt caching and tool-response truncation to keep long-running agents reliable.

TL;DR

Treat the context window as a budget with five named buckets (system, tools, history, tool-outputs, working) plus reserved headroom. Declare per-bucket caps, an overflow trigger, and an eviction policy in the agent's runtime config. Pair the budget with prompt caching for the static prefix and a hard cap on tool outputs (Claude Code's default is 25,000 tokens) so a single noisy response can never push the agent over its limit.

Why budgeting beats letting the window fill

Anthropic's context-engineering guidance frames context as "a critical but finite resource": the engineering task is to maintain the optimal set of tokens during inference, not to maximize how many fit. OpenAI surfaces the same constraint through tier-specific window caps (32K Plus, 128K Pro, 256K+ for Thinking models). Both providers price every input token on every turn; an unbudgeted agent carrying a 100K-token conversation history pays for those 100K tokens on each subsequent call.

Unbudgeted agents fail in three predictable ways: tool definitions crowd out history, a single oversized tool output pushes the agent over its limit mid-task, or recent reasoning gets evicted because no rule said which messages to drop first. A budget makes those decisions before they become incidents.

The five-bucket model

Declare a fixed share of the window for each bucket. Caps are illustrative for a 128K window; scale proportionally for other model sizes.

Bucket         Typical share   Contents
system         5-10%           Identity, persistent instructions, output-format rules.
tools          10-20%          Tool definitions and schemas the agent can call.
history        30-40%          Prior user/assistant turns, summarized as needed.
tool_outputs   25-35%          Results returned from tool calls in the current task.
working        10-15%          Reserved headroom for the model's response and reasoning.

The shares should sum to no more than 90%; the remaining 10% is hard headroom that the agent never plans to use. Hitting headroom is itself a signal that triggers compaction.
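
As a sanity check, the shares can be converted into absolute caps and validated against the 90% ceiling before the agent ever runs. The sketch below is illustrative only: the share values are one possible allocation within the ranges in the table, not a recommendation.

# Illustrative: convert proportional shares into absolute per-bucket caps and
# confirm the plan leaves the hard headroom untouched. These shares sit inside
# the ranges in the table above and sum to exactly 90%.
SHARES = {
    "system": 0.06,
    "tools": 0.12,
    "history": 0.35,
    "tool_outputs": 0.25,
    "working": 0.12,
}

def derive_caps(total_window: int, shares: dict[str, float]) -> dict[str, int]:
    planned = sum(shares.values())
    if planned > 0.90:
        raise ValueError(f"buckets claim {planned:.0%}; leave at least 10% headroom")
    caps = {name: int(total_window * share) for name, share in shares.items()}
    caps["headroom"] = total_window - sum(caps.values())
    return caps

print(derive_caps(128_000, SHARES))
# {'system': 7680, 'tools': 15360, 'history': 44800, 'tool_outputs': 32000,
#  'working': 15360, 'headroom': 12800}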

Per-bucket caps and triggers

Declare each cap explicitly in a runtime config so the agent runtime can enforce it.

context_budget:
  total_window: 128000
  buckets:
    system: { max_tokens: 8000, on_overflow: "reject" }
    tools: { max_tokens: 16000, on_overflow: "prune-unused" }
    history: { max_tokens: 48000, on_overflow: "summarize" }
    tool_outputs: { max_tokens: 32000, on_overflow: "truncate" }
    working: { max_tokens: 16000, on_overflow: "reject" }
  headroom_tokens: 8000

Four overflow strategies cover the common cases (a minimal dispatcher is sketched after this list):

  • reject: refuse to add the new content; agent must shrink first.
  • summarize: invoke an LLM-based compressor on the bucket.
  • truncate: drop oldest entries up to a target watermark.
  • prune-unused: remove tool definitions the agent has not invoked in the last N turns.
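
A minimal sketch of how a runtime might dispatch these strategies when a bucket would overflow. Nothing here comes from a published SDK: the caller supplies count_tokens plus the summarizer and pruner callbacks, since those depend on the runtime.

# Illustrative overflow dispatcher for the four strategies above. count_tokens,
# summarize, and prune_unused are caller-supplied placeholders.
class BucketOverflow(Exception):
    pass

def handle_overflow(bucket, policy, entries, max_tokens, count_tokens,
                    summarize=None, prune_unused=None):
    used = sum(count_tokens(e) for e in entries)
    if used <= max_tokens:
        return entries                                    # nothing to do
    if policy == "reject":
        raise BucketOverflow(f"{bucket}: {used} > {max_tokens} tokens")
    if policy == "truncate":
        while entries and sum(count_tokens(e) for e in entries) > max_tokens:
            entries.pop(0)                                # drop oldest first
        return entries
    if policy == "summarize":
        return [summarize(entries, target_tokens=max_tokens)]
    if policy == "prune-unused":
        return prune_unused(entries)                      # drop tools unused in recent turns
    raise ValueError(f"unknown overflow policy: {policy}")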

Summarization and eviction policy

Letta's agent-memory pattern recommends evicting only a portion (e.g., 70%) of messages once a bucket is full, keeping the rest so continuity is preserved, then recursively summarizing the evicted messages alongside any earlier summary. The multi-layer cascade pattern formalizes this further: compress tool outputs first, slide a window over older history second, and invoke LLM summarization only as a last resort. Each layer is cheaper than the next.

A minimal eviction policy declaration:

eviction:
  layers:
    - { target: "tool_outputs", strategy: "truncate-largest", at_pct: 80 }
    - { target: "history", strategy: "sliding-window", keep_recent_turns: 8, at_pct: 85 }
    - { target: "history", strategy: "recursive-summarize", evict_pct: 70, at_pct: 95 }

The at_pct field declares the fill threshold that fires the layer. Cascading thresholds (80 → 85 → 95) prevent every overflow from invoking a model call.
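
One way to enforce the cascade is to walk the layers in order and fire only the first one whose threshold is crossed, so the cheaper layers absorb most overflows. The sketch below assumes the layer schema above; fill_pct and apply_strategy are caller-supplied placeholders.

# Illustrative cascade walker for the eviction layers declared above.
# fill_pct(bucket) is assumed to return current fill as a percentage of the
# bucket's cap; apply_strategy runs whatever the layer's strategy names.
EVICTION_LAYERS = [
    {"target": "tool_outputs", "strategy": "truncate-largest", "at_pct": 80},
    {"target": "history", "strategy": "sliding-window", "at_pct": 85},
    {"target": "history", "strategy": "recursive-summarize", "at_pct": 95},
]

def run_cascade(fill_pct, apply_strategy):
    for layer in EVICTION_LAYERS:                 # ordered cheapest to most expensive
        if fill_pct(layer["target"]) >= layer["at_pct"]:
            apply_strategy(layer)
            return layer["strategy"]              # report which layer fired
    return None                                   # no threshold crossed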

Pairing with prompt caching

The system and tools buckets are static across most turns, which makes them ideal cache surfaces. Anthropic's Claude exposes both automatic caching and explicit cache_control breakpoints on every active model; OpenAI exposes prompt caching on the Responses and Chat Completions APIs. Cached tokens read at roughly 10% of full input cost on Claude, and the 5-minute TTL is a common gotcha — cache cost is amortized only when invocations are frequent enough to keep the cache warm.

Declare the cache breakpoint at the boundary between tools and history. Everything above the breakpoint is cacheable; everything below is per-turn. If the agent's tool list mutates within a session, that change invalidates the cache and resets the TTL clock.
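
With Anthropic's Messages API, for example, the breakpoint can be expressed by attaching cache_control to the last block of the static prefix; since the prompt is assembled as tools, then system, then messages, a breakpoint on the system block caches everything above the history. This is a sketch, not a prescribed integration: the model id and the surrounding variables (tool_definitions, system_prompt, history, new_user_turn) are placeholders.

import anthropic

client = anthropic.Anthropic()

# Sketch: cache the static prefix (tools + system) and keep history per-turn.
# The cache_control block marks the end of the cacheable prefix.
response = client.messages.create(
    model="claude-sonnet-4-20250514",        # placeholder model id
    max_tokens=2048,
    tools=tool_definitions,                  # static within the session
    system=[{
        "type": "text",
        "text": system_prompt,               # static within the session
        "cache_control": {"type": "ephemeral"},
    }],
    messages=history + [new_user_turn],      # per-turn, below the breakpoint
)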

Tool-output truncation

The single most common cause of context overflow is an uncapped tool response. Anthropic's Claude Code restricts tool responses to 25,000 tokens by default and recommends pagination, range selection, filtering, and truncation with sensible defaults for any tool that could plausibly return more.

Apply the same rule to every tool the agent owns:

  • Cap the response at a known token count.
  • Return a structured indicator (truncated: true, next_cursor: "...") so the agent knows to paginate.
  • Strip non-essential fields by default; expose them behind an explicit include argument.

A tool that returns 200K tokens of HTML on a typo is not a tool problem; it is a budget breach.
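
A sketch of the wrapper pattern described above, assuming a caller-supplied count_tokens helper for the deployed model's tokenizer; the field names mirror the structured indicator in the list.

# Illustrative tool-response wrapper: enforce a per-call token cap and return
# the truncated / next_cursor indicator so the agent knows to paginate.
MAX_TOOL_RESPONSE_TOKENS = 25_000            # mirrors the Claude Code default

def capped_tool_response(items: list[str], count_tokens, cursor: int = 0) -> dict:
    out, used = [], 0
    for i, item in enumerate(items[cursor:], start=cursor):
        tokens = count_tokens(item)
        if used + tokens > MAX_TOOL_RESPONSE_TOKENS:
            return {"items": out, "truncated": True, "next_cursor": str(i)}
        out.append(item)
        used += tokens
    return {"items": out, "truncated": False, "next_cursor": None}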

Five worked examples

1. Customer-support agent (128K window)

8K system, 12K tools, 56K history, 32K tool outputs, 16K working, 4K headroom. History summarizes at 85%; tool outputs truncate at 80%. Cache breakpoint after tools.

2. Code-editing agent (200K window)

10K system, 20K tools, 80K history (large because file contents accumulate), 60K tool outputs (file reads), 24K working, 6K headroom. Tool outputs truncate aggressively after the model has summarized them into history.

3. Research agent with web search (256K window)

12K system, 16K tools, 64K history, 120K tool outputs (search + page text), 32K working, 12K headroom. Tool outputs evict via truncate-largest before any history summarization fires.

4. Voice agent on a 32K window

2K system, 4K tools, 12K history, 8K tool outputs, 4K working, 2K headroom. History uses an aggressive 4-turn sliding window; summarization fires at 70% rather than 95%.

5. Long-running automation agent (128K window, hours of runtime)

8K system, 12K tools, 40K history with episodic summaries persisted to external memory, 48K tool outputs, 16K working, 4K headroom. Eviction writes to a vector store on the way out so older context can be re-retrieved on demand.

Common mistakes

  • Letting tool defs grow unbounded. Adding ten new tools to a session can silently double the static prefix. Cap the bucket and prune unused tools.
  • Single-strategy eviction. Summarizing on every overflow is expensive; truncating on every overflow loses information. Use a cascade.
  • Caching a mutating prefix. Putting the user's name in the cached system bucket invalidates the cache on every new user. Move per-user data below the cache breakpoint.
  • Ignoring tool-output caps. A search tool that returns full page HTML eats half the budget on one call.
  • Counting characters instead of tokens. Token counts vary by tokenizer; budget in tokens against the actual encoder for the deployed model (see the sketch after this list).
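
A minimal sketch of budgeting in real tokens rather than characters, using tiktoken for OpenAI-family models. The encoding name is an assumption; use whichever encoding matches the deployed model, or the provider's own token-counting endpoint for non-OpenAI models.

import tiktoken

# Count tokens against the actual encoder, not characters. "o200k_base" is an
# assumption here; pick the encoding that matches the deployed model.
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def fits_bucket(text: str, max_tokens: int) -> bool:
    return count_tokens(text) <= max_tokens   # budget check in tokens, not len(text)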

FAQ

Q: How big should the headroom be?

Reserve at least 5-10% of the total window as hard headroom the agent never plans to use. Headroom absorbs the variance between the agent's pre-call estimate and the true tokenized size, plus the model's response. Hitting headroom should be treated as an alarm, not a normal operating condition.

Q: When should summarization fire instead of truncation?

Truncation is correct for tool outputs and ephemeral data: cheap, fast, lossy in a way the agent already expects. Summarization is correct for history that contains decisions, facts, or commitments the agent will need later. A cascade firing truncation first and summarization only at higher fill levels gets the cost-quality tradeoff right.

Q: How does prompt caching interact with the budget?

Caching does not change the budget — cached tokens still count against the window — but it changes the cost. Place the cache breakpoint at the boundary of the static prefix (system + tools) so per-turn input cost drops by roughly an order of magnitude on cached models. Account for the 5-minute TTL by either keeping calls frequent or accepting that the first call after a gap pays full price.

Q: Should the budget change with model size?

Proportional shares hold across model sizes; absolute caps scale with the window. A 32K voice agent and a 256K research agent can both target ~10% system, ~10% tools, and so on. The cascade thresholds typically tighten as the absolute window shrinks because the cost of a single oversized tool output is a much larger share of the budget.

Q: How does the budget interact with external memory?

External memory (vector stores, episodic logs) is the destination for evicted context. Eviction should write summaries and key facts to the store on the way out, and a retrieval tool should pull them back into the tool_outputs bucket on demand. The budget governs the live window; the store governs the cumulative session.
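
A sketch of the write-on-eviction pattern, assuming a generic vector-store client with add and search methods; store, embed, and summarize are placeholders here, not a specific library.

# Illustrative write-on-eviction: evicted history leaves the live window but
# stays retrievable. All three callables are caller-supplied placeholders.
def evict_to_external_memory(store, embed, summarize, session_id: str,
                             evicted_msgs: list[dict]) -> str:
    summary = summarize(evicted_msgs)
    store.add(
        id=f"{session_id}:{evicted_msgs[0].get('turn', 0)}",
        text=summary,
        embedding=embed(summary),
        metadata={"session": session_id, "kind": "episodic-summary"},
    )
    return summary                            # keep a one-line pointer in live history

def recall(store, embed, session_id: str, query: str, k: int = 3) -> list[str]:
    # Retrieved memories land in the tool_outputs bucket, so they are budgeted
    # like any other tool result.
    hits = store.search(embedding=embed(query), k=k, filter={"session": session_id})
    return [h["text"] for h in hits]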

