Geodocs.dev

Agent State Management Patterns Specification


Agent state management is the discipline of choosing the right storage layer for each class of state — short-term, working, long-term, and durable execution — and committing checkpoints often enough that any agent run can be resumed exactly where it left off, even after a crash.

TL;DR

LLMs are stateless; agents are not. A production agent runtime must explicitly model four state classes — short-term context, working scratchpad, long-term memory, and durable workflow state — and back each by an appropriate storage layer. This spec defines the required state classes, storage backends, checkpoint contract, and recovery semantics every Geodocs-aligned agent platform must implement.

Scope

This specification covers what an agent runtime stores, where, and how it recovers state across crashes, restarts, and human-in-the-loop pauses. It is the persistence companion to Agent Error Recovery Patterns Specification. Cross-thread sharing, multi-agent coordination, and memory pruning policy are downstream concerns that build on this layer.

1. State Classes

Every agent runtime MUST distinguish at least four state classes. Conflating them leads to either expensive over-persistence or fatal under-persistence.

| Class | Lifetime | Typical content | Read frequency | Latency budget |
|---|---|---|---|---|
| Short-term context | Current LLM call | Recent N messages, current tool results | Every step | <10 ms |
| Working scratchpad | Single agent run / thread | Plan, intermediate results, partial output | Every step | <50 ms |
| Long-term memory | User / tenant lifetime | Preferences, episodic facts, semantic notes | Per relevant query | <200 ms |
| Durable workflow state | Workflow lifetime (minutes to months) | Step status, signals, retry counters | Per workflow event | <500 ms |

These classes map to the temporal scopes used in agent-memory literature and to LangGraph's distinction between thread-level and cross-thread state, made explicit in the LangGraph persistence docs.
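One way to keep the four classes distinct in code is to give each one an explicit policy record. The sketch below is illustrative only: the names `StateClass`, `StatePolicy`, `POLICIES`, and `within_latency_budget` are assumptions made for this example, and the numbers are simply the latency budgets from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class StateClass(Enum):
    SHORT_TERM_CONTEXT = "short_term_context"
    WORKING_SCRATCHPAD = "working_scratchpad"
    LONG_TERM_MEMORY = "long_term_memory"
    DURABLE_WORKFLOW_STATE = "durable_workflow_state"


@dataclass(frozen=True)
class StatePolicy:
    lifetime: str
    read_frequency: str
    latency_budget_ms: int  # reads should finish in strictly less than this


# Hypothetical policy table; values copied from the state-class table above.
POLICIES = {
    StateClass.SHORT_TERM_CONTEXT: StatePolicy(
        "current LLM call", "every step", 10),
    StateClass.WORKING_SCRATCHPAD: StatePolicy(
        "single agent run / thread", "every step", 50),
    StateClass.LONG_TERM_MEMORY: StatePolicy(
        "user / tenant lifetime", "per relevant query", 200),
    StateClass.DURABLE_WORKFLOW_STATE: StatePolicy(
        "workflow lifetime", "per workflow event", 500),
}


def within_latency_budget(state_class: StateClass, observed_ms: float) -> bool:
    """Return True when an observed read latency fits the class budget."""
    return observed_ms < POLICIES[state_class].latency_budget_ms
```

A runtime could call `within_latency_budget` from its storage wrapper to flag reads that exceed the budget of the class they belong to.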

2. Storage Backends

| Class | Recommended primary | Acceptable alternatives | Avoid |
|---|---|---|---|
| Short-term context | In-process memory | Redis (when stateless workers) | SQL row-per-message |
| Working scratchpad | Redis / in-process | LangGraph InMemorySaver for prototypes | S3 / object stores |
| Long-term memory | Vector DB + SQL | DynamoDB, Postgres + pgvector | Append-only logs |
| Durable workflow state | Workflow engine (Temporal, LangGraph + checkpointer) | Postgres / DynamoDB / Redis as checkpoint backend | In-memory only |

Reference architectures: the AWS DynamoDB + LangGraph guide, the Redis langgraph-checkpoint-redis integration, and Temporal's durable execution model are canonical and SHOULD be preferred over hand-rolled persistence.
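The "Avoid" column can be enforced as a startup check on the runtime's configuration. The following is a minimal sketch, assuming a hypothetical config shape that maps each state class to a backend label; the labels (`in_memory`, `object_store`, and so on) and the function `validate_backend_config` are invented for illustration and are not part of any library.

```python
# Backend labels to reject per state class, mirroring the "Avoid" column.
# The label strings are hypothetical; a real deployment would use its own.
AVOID = {
    "short_term_context": {"sql_row_per_message"},
    "working_scratchpad": {"object_store"},
    "long_term_memory": {"append_only_log"},
    "durable_workflow_state": {"in_memory"},
}


def validate_backend_config(config: dict) -> list:
    """Return a list of human-readable problems; empty means the config passes."""
    problems = []
    for state_class, avoided in AVOID.items():
        backend = config.get(state_class)
        if backend is None:
            problems.append(f"{state_class}: no backend configured")
        elif backend in avoided:
            problems.append(
                f"{state_class}: backend '{backend}' is on the avoid list")
    return problems
```

Failing fast at startup keeps a prototype setting, such as an in-memory checkpointer, from silently reaching production.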

3. Checkpoint Contract

A checkpoint is a snapshot of the agent's state at a specific point in execution. The runtime MUST emit checkpoints to durable storage at every transition between major steps. Each checkpoint MUST contain:

  • A unique, monotonically increasing ID (per the LangGraph Checkpoint API).
  • The thread / run identifier.
  • The serialized working scratchpad (plan, intermediate results).
  • The current step pointer (which node / activity is next).
  • The error history and retry counters.
  • A timestamp and the agent / model version.

Checkpoints MUST be written before any externally visible side effect is taken. Pairing this with idempotency keys (see Agent Error Recovery Patterns Specification) makes execution crash-tolerant: on resume, the runtime replays from the last checkpoint, and idempotent tools collapse duplicate calls.
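The contract above can be sketched as a record type plus a write-then-act ordering. This is an illustrative sketch, not the LangGraph Checkpoint API: `Checkpoint`, `InMemoryCheckpointStore`, and `run_step_with_side_effect` are hypothetical names, the in-memory store stands in for a durable backend, and deriving the idempotency key from the thread and checkpoint IDs is an assumption of this example.

```python
import json
import time
from dataclasses import asdict, dataclass
from itertools import count


@dataclass
class Checkpoint:
    checkpoint_id: int      # unique, monotonically increasing
    thread_id: str
    scratchpad: dict        # plan, intermediate results
    next_step: str          # step pointer: which node / activity is next
    error_history: list
    retry_counters: dict
    timestamp: float
    agent_version: str


class InMemoryCheckpointStore:
    """Stand-in for a durable store; only the write ordering is the point here."""

    def __init__(self):
        self._rows = {}          # thread_id -> list of serialized checkpoints
        self._ids = count(1)

    def write(self, thread_id, scratchpad, next_step, agent_version,
              error_history=None, retry_counters=None):
        cp = Checkpoint(next(self._ids), thread_id, scratchpad, next_step,
                        error_history or [], retry_counters or {},
                        time.time(), agent_version)
        self._rows.setdefault(thread_id, []).append(json.dumps(asdict(cp)))
        return cp

    def latest(self, thread_id):
        rows = self._rows.get(thread_id)
        return Checkpoint(**json.loads(rows[-1])) if rows else None


def run_step_with_side_effect(store, thread_id, scratchpad, next_step,
                              side_effect, agent_version="v1"):
    # 1. Persist the checkpoint first.
    cp = store.write(thread_id, scratchpad, next_step, agent_version)
    # 2. Only then perform the side effect, tagged with a key that a replay
    #    from this checkpoint would reproduce (assumed key scheme).
    side_effect(f"{thread_id}:{cp.checkpoint_id}")
    return cp
```

Because the key depends only on data stored in the checkpoint, a replay after a crash sends the same key again, which lets an idempotent tool recognize the duplicate.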

4. Recovery Semantics

On restart, the runtime MUST:

  1. Locate the most recent checkpoint for the thread (latest row by monotonic ID, no full scan).
  2. Verify the checkpoint matches the current agent / model version policy. If incompatible, route to manual review rather than auto-resume.
  3. Rehydrate the working scratchpad and resume from the next step pointer.
  4. Re-issue any in-flight tool call using its original idempotency key, allowing the target service to short-circuit duplicates.
  5. Record a runtime.recovery span (see Agent Tracing and Spans Specification) with the checkpoint ID and gap duration.

The runtime MUST NOT auto-resume across breaking schema changes. Schema migrations require an explicit replay or compensation policy.
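The five recovery steps can be sketched as a single function. This is a minimal illustration under stated assumptions: the `Checkpoint` fields, `Recovery`, `ManualReviewRequired`, and `recover` are hypothetical names; `history` is assumed to be one thread's checkpoints already ordered by monotonic ID (a real store would serve this from an index rather than a list); and the span is represented as a plain dictionary rather than a tracing-library object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    checkpoint_id: int
    thread_id: str
    scratchpad: dict
    next_step: str
    agent_version: str
    timestamp: float
    in_flight_key: Optional[str] = None  # idempotency key of a pending tool call


@dataclass
class Recovery:
    scratchpad: dict
    next_step: str
    span: dict


class ManualReviewRequired(Exception):
    """Raised instead of auto-resuming an incompatible checkpoint."""


def recover(history, compatible_versions, reissue, now):
    # Step 1: latest checkpoint by monotonic ID (last element, no scan).
    if not history:
        return None
    latest = history[-1]
    # Step 2: version policy check; incompatible checkpoints go to review.
    if latest.agent_version not in compatible_versions:
        raise ManualReviewRequired(
            f"checkpoint {latest.checkpoint_id} was written by "
            f"{latest.agent_version}")
    # Step 3: rehydrate the scratchpad; the caller resumes at next_step.
    scratchpad = dict(latest.scratchpad)
    # Step 4: re-issue the in-flight call with its original idempotency key.
    if latest.in_flight_key is not None:
        reissue(latest.next_step, latest.in_flight_key)
    # Step 5: describe the recovery for the tracing layer.
    span = {"name": "runtime.recovery",
            "checkpoint_id": latest.checkpoint_id,
            "gap_seconds": now - latest.timestamp}
    return Recovery(scratchpad, latest.next_step, span)
```

Raising an exception on a version mismatch, rather than returning a flag, makes it harder for a caller to resume an incompatible checkpoint by accident.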

5. Long-Term Memory

Long-term memory persists across runs and threads. The runtime MUST implement at least:

  • Episodic store: append-only log of (timestamp, actor, event, summary) records, indexed by user / tenant.
  • Semantic store: vector index of distilled facts, preferences, and patterns derived from the episodic log. The A-MEM paper (Xu et al., 2025) is one principled approach; simpler RAG-over-history setups are acceptable for smaller systems.
  • Pruning policy: a documented retention window per record class, plus a redaction path for user-requested deletion.

Long-term memory writes MUST be explicit, not a side effect of every step. The agent (or a dedicated memory writer) decides what to remember; uncontrolled writes inflate cost and leak signal.
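The episodic store, retention window, and redaction path can be sketched together. The example below is an assumption-laden illustration: `Episode`, `EpisodicStore`, and its methods are hypothetical names, a single retention window is used instead of one per record class, and the semantic (vector) store is left out because it would depend on an embedding model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    timestamp: float
    tenant_id: str
    user_id: str
    actor: str
    event: str
    summary: str


class EpisodicStore:
    """Append-only during normal operation; pruning and redaction rewrite it."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._log = []

    def remember(self, episode: Episode) -> None:
        # The only write path: the agent or a memory writer calls it
        # explicitly, rather than every step writing implicitly.
        self._log.append(episode)

    def for_user(self, tenant_id: str, user_id: str) -> list:
        return [e for e in self._log
                if e.tenant_id == tenant_id and e.user_id == user_id]

    def prune(self, now: float) -> int:
        """Drop records older than the retention window; return how many."""
        cutoff = now - self.retention_seconds
        before = len(self._log)
        self._log = [e for e in self._log if e.timestamp >= cutoff]
        return before - len(self._log)

    def redact_user(self, tenant_id: str, user_id: str) -> int:
        """Delete all of one user's records (user-requested deletion)."""
        before = len(self._log)
        self._log = [e for e in self._log
                     if not (e.tenant_id == tenant_id and e.user_id == user_id)]
        return before - len(self._log)
```

Keeping `remember` as the single entry point makes it straightforward to audit what the agent chose to store and why.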

6. Multi-Tenancy and Isolation

State MUST be partitioned by tenant and by user. The runtime MUST:

  • Include tenant_id and user_id in every checkpoint and memory record.
  • Enforce tenant isolation at the storage layer (separate keyspaces, row-level security, or per-tenant tables) — not only at the application layer.
  • Encrypt sensitive fields at rest (PII, secrets, tool credentials).
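Partitioning by tenant and user can be illustrated with a keyspace scheme. This sketch covers only the key layout and the stamping of `tenant_id` and `user_id` onto every record; it does not show storage-layer enforcement (for example row-level security) or encryption, both of which the requirements above still call for. `TenantScopedStore` and its key format are assumptions made for this example.

```python
class TenantScopedStore:
    """Dictionary-backed stand-in that namespaces every key by tenant and user."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _key(tenant_id: str, user_id: str, name: str) -> str:
        # Reject empty IDs and IDs containing the separator, so one tenant
        # cannot craft an ID that lands in another tenant's keyspace.
        for part in (tenant_id, user_id):
            if not part or ":" in part:
                raise ValueError(f"invalid identifier: {part!r}")
        return f"tenant:{tenant_id}:user:{user_id}:{name}"

    def put(self, tenant_id: str, user_id: str, name: str, value) -> None:
        # Every record carries tenant_id and user_id alongside its payload.
        self._data[self._key(tenant_id, user_id, name)] = {
            "tenant_id": tenant_id,
            "user_id": user_id,
            "value": value,
        }

    def get(self, tenant_id: str, user_id: str, name: str):
        record = self._data.get(self._key(tenant_id, user_id, name))
        return None if record is None else record["value"]
```

With a backend such as Redis, the same prefix could double as the unit for access-control rules, so the application check and the storage-layer check agree on what a tenant boundary is.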

7. Observability

For every state operation, the runtime SHOULD emit:

  • Counter: checkpoints written per minute, by thread.
  • Histogram: checkpoint write latency.
  • Counter: recoveries per minute, with recovery_reason.
  • Gauge: active threads, by state.
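The four signals can be prototyped with an in-process registry before wiring up a real metrics backend. The sketch below uses only the standard library; `StateMetrics`, its method names, and the bucket boundaries are assumptions for illustration, and per-minute rates are assumed to be derived downstream from the raw counts.

```python
import bisect
from collections import Counter


class StateMetrics:
    # Hypothetical histogram bucket upper bounds, in milliseconds.
    LATENCY_BUCKETS_MS = (5, 10, 50, 200, 500)

    def __init__(self):
        self.checkpoints_written = Counter()    # counter, by thread_id
        self.write_latency_buckets = Counter()  # histogram, by upper bound
        self.recoveries = Counter()             # counter, by recovery_reason
        self.active_threads = Counter()         # gauge, by state
        self._thread_state = {}

    def record_checkpoint(self, thread_id: str, latency_ms: float) -> None:
        self.checkpoints_written[thread_id] += 1
        # Find the first bucket whose upper bound is >= the observed latency.
        idx = bisect.bisect_left(self.LATENCY_BUCKETS_MS, latency_ms)
        bound = (self.LATENCY_BUCKETS_MS[idx]
                 if idx < len(self.LATENCY_BUCKETS_MS) else float("inf"))
        self.write_latency_buckets[bound] += 1

    def record_recovery(self, recovery_reason: str) -> None:
        self.recoveries[recovery_reason] += 1

    def set_thread_state(self, thread_id: str, state) -> None:
        """Move a thread between states; pass None when the thread ends."""
        previous = self._thread_state.pop(thread_id, None)
        if previous is not None:
            self.active_threads[previous] -= 1
        if state is not None:
            self._thread_state[thread_id] = state
            self.active_threads[state] += 1
```

Labeling a counter by thread can produce a very large number of series in some metrics systems, so a production setup might aggregate by tenant or workflow type instead.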

FAQ

Q: Do I need a workflow engine like Temporal for every agent?

No. Short-running agents (single user turn, no multi-step external side effects) can use LangGraph with an in-memory or Redis checkpointer. Workflow engines pay off when runs span minutes to months, cross many tools, or require strong durability guarantees.

Q: Can I store all state in the LLM context window?

No. The context window is short-term context only — it is volatile and bounded by token limits. Working scratchpad, long-term memory, and durable workflow state must live outside the model.

Q: How often should the runtime checkpoint?

At minimum: before every externally visible side effect, after every model step, and at every human-in-the-loop pause. Checkpointing more often than this is rarely harmful with append-only or LSM-style checkpoint stores.

Q: What about cross-thread memory?

Cross-thread memory is the long-term store, not the working scratchpad. Sharing scratchpad state across threads creates race conditions; long-term memory writes are explicit and serialized through the memory writer.
