Geodocs.dev

Agent Conversation Summarization: Triggers, Schema, and Retention


Agent conversation summarization compresses long turn histories into a structured running summary—covering decisions, facts, open questions, and tool-call traces—triggered when token usage approaches a budget threshold while retaining the last N raw turns for fidelity. The pattern preserves decision-relevant signal across multi-hour or multi-day agent sessions without exceeding model context windows.

TL;DR

  • Trigger on token budget, not turn count. Run summarization when running token usage exceeds roughly two-thirds of the model's context window, since turns vary widely in length.
  • Summarize into a fixed schema, not free-form prose: decisions made, facts established, open questions, pending tool calls, and user preferences.
  • Always keep the last N raw turns (commonly 4-8) untouched so the agent can answer follow-ups grounded in exact recent wording.
  • Use a separate, cheaper model for summarization, and validate that the resulting summary plus retained tail still fits the budget before continuing.

Definition

Agent conversation summarization is the runtime pattern of compressing the historical turns of an agent's conversation into a smaller running summary that the agent can read on every subsequent turn, while preserving the most recent turns verbatim. It sits between two extremes: the naive pattern of replaying every turn (which exhausts the context window and inflates cost), and the lossy pattern of truncating old turns by hard cutoff (which silently drops decisions, user preferences, and tool results the agent still needs).

In production agent systems, conversation summarization typically operates on three inputs—the prior running summary, a buffer of recent raw turns, and any newly produced turns—and emits a new running summary plus an updated tail buffer. This is the explicit memory pattern documented in LangChain's ConversationSummaryMemory and the buffered hybrid in ConversationSummaryBufferMemory, and it is the conceptual basis for hierarchical memory systems such as MemGPT.

Why this matters

Agent sessions that span complex multi-step work routinely accumulate hundreds of turns, dozens of tool calls, and large pasted documents. Without summarization, three failure modes appear together:

  1. Context-window exhaustion. Even long-context models such as Claude 3.5 Sonnet and GPT-4o have hard ceilings, and Anthropic's long-context guidance explicitly recommends compressing low-signal history rather than relying on raw extension.
  2. Cost amplification. Per-turn cost scales with input tokens, so replaying the entire history on every turn means cost grows roughly quadratically with conversation length—a pattern practitioners commonly observe when instrumenting provider usage dashboards.
  3. Attention dilution. As the prompt grows, models typically allocate less attention to any individual fact, increasing the rate of "lost-in-the-middle" failures where the agent forgets a constraint stated earlier.
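The cost-amplification point can be made concrete with a small illustration (not from any specific codebase): if every turn replays the full history, billed input tokens are the sum of all prefix lengths, which grows roughly quadratically with conversation length.

```python
def cumulative_replayed_tokens(turn_lengths):
    """Total input tokens billed when each turn replays all prior turns."""
    total = 0
    history = 0
    for length in turn_lengths:
        history += length  # this turn is appended to the history
        total += history   # the whole history is sent as input again
    return total

# 100 turns of 200 tokens each: full-history replay bills ~1M input tokens,
# versus 20k if each turn were sent alone.
flat = [200] * 100
print(cumulative_replayed_tokens(flat))  # 1010000
print(sum(flat))                         # 20000
```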

A correctly designed summarization layer addresses all three by keeping the prompt close to a stable, predictable size while carrying forward the semantic content the agent actually needs to act.

How it works

A conversation summarization layer is a small state machine that runs after each agent turn (or every K turns) and decides whether to compress.

stateDiagram-v2
    [*] --> Append
    Append --> Measure: turn complete
    Measure --> Append: tokens below threshold
    Measure --> Summarize: tokens at or above threshold
    Summarize --> Validate
    Validate --> Append: summary plus tail fits budget
    Validate --> Summarize: still over budget

The components map to four concrete responsibilities.

Trigger. The recommended trigger is total prompt tokens crossing a threshold (commonly 60-75% of context window) rather than a fixed turn count, because turns vary by orders of magnitude (a 5-word user reply vs. a 20-page pasted contract). Token counts can be obtained from the model provider's tokenizer or, for OpenAI, the usage field returned on every completion.
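A minimal sketch of that trigger, using a rough characters-per-token heuristic as a stand-in (in production you would swap in the provider's tokenizer, e.g. tiktoken for OpenAI models, or read the `usage` field from each completion):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (~4 chars per token); replace with the provider's
    tokenizer in production."""
    return max(1, len(text) // 4)

def should_summarize(summary: str, turns: list[str],
                     ctx_window: int, threshold_ratio: float = 0.65) -> bool:
    """Fire the trigger when estimated prompt tokens cross the threshold."""
    used = estimate_tokens(summary) + sum(estimate_tokens(t) for t in turns)
    return used >= ctx_window * threshold_ratio
```

Because the trigger is token-based, a single 20-page paste can fire it on the very next turn, while hundreds of one-line replies may never reach it.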

Summary schema. Free-form summaries drift in quality. A fixed schema is more reliable:

  • decisions: list of decisions the agent or user has committed to.
  • facts: stable facts established during the conversation (user identity, environment, preferences).
  • open_questions: items the agent has flagged as needing follow-up.
  • pending_tool_calls: tool invocations awaiting user confirmation or external completion.
  • recent_tool_results: pointers (and short digests) to recent tool outputs whose full payloads were dropped from the tail.

This schema is auditable, testable, and survives model swaps better than narrative prose.
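One possible shape for that schema, sketched as a Python dataclass (the field names follow the list above; the `render` helper is an illustrative assumption, not a fixed API):

```python
from dataclasses import dataclass, field

@dataclass
class RunningSummary:
    decisions: list[str] = field(default_factory=list)
    facts: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    pending_tool_calls: list[dict] = field(default_factory=list)
    recent_tool_results: list[dict] = field(default_factory=list)

    def render(self) -> str:
        """Render non-empty sections as a compact block for the agent prompt."""
        sections = []
        for name in ("decisions", "facts", "open_questions"):
            items = getattr(self, name)
            if items:
                sections.append(name + ":\n" + "\n".join("- " + i for i in items))
        return "\n\n".join(sections)
```

Keeping the fields typed and enumerable is what makes the summary testable: an evaluation suite can assert that a known decision survives compaction.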

Tail retention. Even with a strong summary, models perform better on follow-ups when they can see the exact recent wording. A common policy is "keep the last 4-8 turns raw"; the exact value should be tuned against the agent's evaluation suite.

Summarizer model. Summarization is well-suited to a smaller, cheaper model than the agent's primary reasoning model, because it is a constrained extraction task. OpenAI's cookbook and most production stacks use a smaller side model for this step.

Validation. After summarization, the new prompt size must be re-measured. If the summary plus tail still exceeds budget, the system should compress more aggressively (e.g., drop older recent_tool_results digests) before proceeding.
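The validation step can be sketched as a loop that re-measures and sheds the oldest tool-result digests until the budget holds (here `count_tokens` is a caller-supplied measuring function, an assumed interface):

```python
def fit_to_budget(summary_fields: dict, tail: list[str],
                  budget_tokens: int, count_tokens) -> dict:
    """Re-measure after summarization; drop the oldest recent_tool_results
    digests until the summary plus retained tail fits the budget."""
    def size(fields, turns):
        return count_tokens(str(fields) + "".join(turns))

    while (size(summary_fields, tail) > budget_tokens
           and summary_fields.get("recent_tool_results")):
        summary_fields["recent_tool_results"].pop(0)  # oldest digest first
    return summary_fields
```

The loop terminates even when nothing more can be dropped; a production system would escalate at that point (e.g. shrink the tail) rather than proceed with an oversized prompt.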

Practical application

A minimal production-grade implementation typically has the following surface:

class ConversationSummarizer:
    def __init__(self, model, schema, threshold_ratio=0.65, tail_turns=6):
        self.model = model
        self.schema = schema
        self.threshold_ratio = threshold_ratio
        self.tail_turns = tail_turns

    def maybe_summarize(self, summary, history, ctx_window):
        # count_tokens is an assumed helper wrapping the provider's tokenizer
        prompt_tokens = count_tokens(summary, history)
        if prompt_tokens / ctx_window < self.threshold_ratio:
            return summary, history  # under budget: nothing to compress
        head = history[:-self.tail_turns]  # older turns folded into the summary
        tail = history[-self.tail_turns:]  # recent turns kept raw
        new_summary = self.model.summarize(
            prior=summary, turns=head, schema=self.schema
        )
        return new_summary, tail

Three implementation choices recur in well-instrumented systems:

  1. Idempotent summarization. Make the summarizer accept the prior summary as input so re-summarizing is additive, not destructive. This protects against partial failures.
  2. Tool-result handling. Tool results (e.g., a 50KB JSON payload from a search API) should not enter the conversation history raw. Store them in a side store keyed by tool-call ID and inject only a short digest into the running summary.
  3. Recoverable on failure. If the summarizer model errors, the system should fall back to a deterministic truncation policy (drop oldest non-tail turn) rather than failing the user-facing turn.
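The failure-recovery choice can be sketched as a small deterministic fallback (an illustrative function, not a library API): keep the prior summary untouched and drop only the single oldest turn outside the protected tail.

```python
def fallback_truncate(summary, history, tail_turns=6):
    """Deterministic fallback when the summarizer model errors:
    drop the oldest non-tail turn and keep the prior summary as-is."""
    if len(history) > tail_turns:
        history = history[1:]  # shed exactly one turn outside the tail
    return summary, history
```

Because it is deterministic and loss-bounded, this fallback keeps the user-facing turn alive while the summarizer is retried in the background.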

Common mistakes

  • Summarizing too aggressively. Compressing on every turn loses recent fidelity and adds latency. Trigger only when needed.
  • Free-form prose summaries. Without a schema, summaries drift in tone and granularity across runs and across model versions.
  • Dropping tool-call traces. A summary that omits tool calls forces the agent to re-execute identical lookups. Always keep at least a digest with the tool name and key arguments.
  • Single-shot summarization. Re-summarizing without including the prior summary loses information accumulated in earlier compactions.
  • No size validation after summary. The summary itself can be large; without re-measuring, the next turn can still overflow.

FAQ

Q: How is conversation summarization different from context-window budgeting?

Context-window budgeting is the broader discipline of allocating a fixed token budget across system prompt, tools schema, retrieved documents, and conversation history. Summarization is one technique inside that budget specifically for the conversation-history slice. A complete agent runtime usually does both: a budgeter decides how many tokens conversation history may consume, and the summarizer compresses history to fit that allocation.
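The relationship can be sketched as a budgeter that hands the summarizer its allocation (the bucket ratios below are placeholder assumptions to be tuned per agent, not a standard split):

```python
def allocate_budget(ctx_window: int, reserve_for_output: int = 4096) -> dict:
    """Split a context window across prompt buckets; the summarizer's job
    is to keep conversation history inside its bucket."""
    usable = ctx_window - reserve_for_output
    return {
        "system_prompt": int(usable * 0.10),
        "tool_schemas": int(usable * 0.10),
        "retrieved_docs": int(usable * 0.30),
        "conversation_history": int(usable * 0.50),
    }
```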

Q: Should the agent itself summarize, or should a separate model do it?

A separate, smaller model is typically preferred. Summarization is a constrained extraction task, not a reasoning task, so cheaper models perform well, and decoupling avoids polluting the agent's reasoning state with the summarization sub-task. Anthropic and OpenAI both document patterns where a side model handles auxiliary compression while the primary model focuses on the user-facing task.

Q: How do I preserve tool-call traces across summarization?

Store the full tool result in a side store keyed by tool-call ID and place only a structured digest (tool name, key arguments, short outcome description, and the ID for retrieval) in the summary's recent_tool_results field. If the agent later needs the full payload, it can fetch it by ID rather than relying on it being inlined in the prompt.
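A minimal sketch of that side-store pattern, assuming an in-memory dict as the store (production systems would use a durable key-value store; the function names here are illustrative):

```python
TOOL_RESULT_STORE: dict[str, str] = {}  # side store keyed by tool-call ID

def record_tool_result(call_id: str, tool_name: str,
                       args: dict, payload: str) -> dict:
    """Store the full payload out-of-band; return only a digest
    suitable for the summary's recent_tool_results field."""
    TOOL_RESULT_STORE[call_id] = payload
    return {
        "id": call_id,
        "tool": tool_name,
        "args": args,
        "outcome": payload[:120],  # short digest, not the full payload
    }

def fetch_tool_result(call_id: str) -> str:
    """Retrieve the full payload later, by ID, instead of inlining it."""
    return TOOL_RESULT_STORE[call_id]
```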

