Agent State Management Patterns Specification
Agent state management is the discipline of choosing the right storage layer for each class of state — short-term, working, long-term, and durable execution — and committing checkpoints often enough that any agent run can be resumed exactly where it left off, even after a crash.
TL;DR
LLMs are stateless; agents are not. A production agent runtime must explicitly model four state classes — short-term context, working scratchpad, long-term memory, and durable workflow state — and back each by an appropriate storage layer. This spec defines the required state classes, storage backends, checkpoint contract, and recovery semantics every Geodocs-aligned agent platform must implement.
Scope
This specification covers what an agent runtime stores, where, and how it recovers state across crashes, restarts, and human-in-the-loop pauses. It is the persistence companion to Agent Error Recovery Patterns Specification. Cross-thread sharing, multi-agent coordination, and memory pruning policy are downstream concerns that build on this layer.
1. State Classes
Every agent runtime MUST distinguish at least four state classes. Conflating them leads to either expensive over-persistence or fatal under-persistence.
| Class | Lifetime | Typical content | Read frequency | Latency budget |
|---|---|---|---|---|
| Short-term context | Current LLM call | Recent N messages, current tool results | Every step | <10 ms |
| Working scratchpad | Single agent run / thread | Plan, intermediate results, partial output | Every step | <50 ms |
| Long-term memory | User / tenant lifetime | Preferences, episodic facts, semantic notes | Per relevant query | <200 ms |
| Durable workflow state | Workflow lifetime (minutes to months) | Step status, signals, retry counters | Per workflow event | <500 ms |
These classes map to the temporal scopes used in agent-memory literature and to LangGraph's distinction between thread-level and cross-thread state, made explicit in the LangGraph persistence docs.
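One way to keep the four classes from being conflated is to name them in code and attach the latency budget from the table to each. The following is a minimal sketch, not a prescribed API: `StateClass`, `StateClassPolicy`, `POLICIES`, and `within_budget` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class StateClass(Enum):
    SHORT_TERM_CONTEXT = "short_term_context"
    WORKING_SCRATCHPAD = "working_scratchpad"
    LONG_TERM_MEMORY = "long_term_memory"
    DURABLE_WORKFLOW_STATE = "durable_workflow_state"


@dataclass(frozen=True)
class StateClassPolicy:
    lifetime: str
    latency_budget_ms: int  # upper bound taken from the state-class table


# Values mirror the table; the structure itself is an illustrative assumption.
POLICIES = {
    StateClass.SHORT_TERM_CONTEXT: StateClassPolicy("current LLM call", 10),
    StateClass.WORKING_SCRATCHPAD: StateClassPolicy("single agent run / thread", 50),
    StateClass.LONG_TERM_MEMORY: StateClassPolicy("user / tenant lifetime", 200),
    StateClass.DURABLE_WORKFLOW_STATE: StateClassPolicy("workflow lifetime", 500),
}


def within_budget(state_class: StateClass, observed_ms: float) -> bool:
    """Return True when an observed access latency is under the class budget."""
    return observed_ms < POLICIES[state_class].latency_budget_ms
```

A runtime could call `within_budget` in its storage wrappers to flag a backend that is too slow for the class it serves.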
2. Storage Backends
| Class | Recommended primary | Acceptable alternatives | Avoid |
|---|---|---|---|
| Short-term context | In-process memory | Redis (when stateless workers) | SQL row-per-message |
| Working scratchpad | Redis / in-process | LangGraph InMemorySaver for prototypes | S3 / object stores |
| Long-term memory | Vector DB + SQL | DynamoDB, Postgres + pgvector | Append-only logs |
| Durable workflow state | Workflow engine (Temporal, LangGraph + checkpointer) | Postgres / DynamoDB / Redis as checkpoint backend | In-memory only |
Reference architectures: the AWS DynamoDB + LangGraph guide, the Redis langgraph-checkpoint-redis integration, and Temporal's durable execution model are canonical and SHOULD be preferred over hand-rolled persistence.
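The "Avoid" column can be enforced when backends are registered rather than left to convention. The sketch below assumes a hypothetical `StateStore` protocol with a `durable` flag and a hypothetical `StateRouter`; neither comes from LangGraph, Redis, or Temporal, and `InProcessStore` only stands in for a real backend.

```python
from typing import Dict, Optional, Protocol


class StateStore(Protocol):
    durable: bool  # True only if data survives a process crash

    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...


class InProcessStore:
    """Dict-backed store; acceptable for volatile classes only."""

    durable = False

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


class StateRouter:
    """Maps each state class name to exactly one registered backend."""

    def __init__(self) -> None:
        self._stores: Dict[str, StateStore] = {}

    def register(self, state_class: str, store: StateStore,
                 requires_durable: bool = False) -> None:
        # Reject the "in-memory only" anti-pattern for classes that need durability.
        if requires_durable and not store.durable:
            raise ValueError(f"{state_class} requires a durable backend")
        self._stores[state_class] = store

    def store_for(self, state_class: str) -> StateStore:
        return self._stores[state_class]
```

Registering durable workflow state with `requires_durable=True` makes a misconfigured deployment fail at startup instead of at the first crash.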
3. Checkpoint Contract
A checkpoint is a snapshot of the agent's state at a specific point in execution. The runtime MUST emit checkpoints to durable storage at every transition between major steps. Each checkpoint MUST contain:
- A unique, monotonically increasing ID (per the LangGraph Checkpoint API).
- The thread / run identifier.
- The serialized working scratchpad (plan, intermediate results).
- The current step pointer (which node / activity is next).
- The error history and retry counters.
- A timestamp and the agent / model version.
Checkpoints MUST be written before any externally visible side effect is performed. Pairing this with idempotency keys (see Agent Error Recovery Patterns Specification) makes execution crash-tolerant: on resume, the runtime replays from the last checkpoint, and idempotent tools collapse the duplicate calls that the replay produces.
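The contract above can be sketched with the standard-library `sqlite3` module standing in for a durable checkpoint backend. Everything named here (`Checkpoint`, `CheckpointStore`, `run_step_with_side_effect`, the table layout, and the `thread_id:step` idempotency key format) is a hypothetical illustration, not the LangGraph checkpoint schema.

```python
import json
import sqlite3
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Checkpoint:
    thread_id: str
    scratchpad: dict          # plan, intermediate results
    next_step: str            # step pointer: which node / activity runs next
    error_history: list
    retry_counts: dict
    agent_version: str
    created_at: float = field(default_factory=time.time)


class CheckpointStore:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # AUTOINCREMENT yields a unique, monotonically increasing ID.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "thread_id TEXT NOT NULL, payload TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_thread "
            "ON checkpoints (thread_id, id)"
        )

    def write(self, cp: Checkpoint) -> int:
        with self.conn:  # commits on success, rolls back on exception
            cur = self.conn.execute(
                "INSERT INTO checkpoints (thread_id, payload) VALUES (?, ?)",
                (cp.thread_id, json.dumps(asdict(cp))),
            )
        return cur.lastrowid


def run_step_with_side_effect(store, cp, tool, args):
    # Deterministic key: a replay of the same step reuses the same key.
    idempotency_key = f"{cp.thread_id}:{cp.next_step}"
    cp.scratchpad["pending_call"] = {"key": idempotency_key, "args": args}
    checkpoint_id = store.write(cp)        # durable BEFORE the side effect
    result = tool(idempotency_key, args)   # externally visible call
    return checkpoint_id, result
```

Because the pending call and its key are persisted first, a crash between the write and the tool call leaves enough information to re-issue the call safely.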
4. Recovery Semantics
On restart, the runtime MUST:
- Locate the most recent checkpoint for the thread (the highest monotonic ID for that thread, found by an index lookup rather than a full scan).
- Verify the checkpoint matches the current agent / model version policy. If incompatible, route to manual review rather than auto-resume.
- Rehydrate the working scratchpad and resume from the next step pointer.
- Re-issue any in-flight tool call using its original idempotency key, allowing the target service to short-circuit duplicates.
- Record a runtime.recovery span (see Agent Tracing and Spans Specification) with the checkpoint ID and gap duration.
The runtime MUST NOT auto-resume across breaking schema changes. Schema migrations require an explicit replay or compensation policy.
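The recovery steps above can be sketched as a single function. This is an illustrative outline under stated assumptions: the per-thread `index` dictionary stands in for an indexed checkpoint table, `call_tool` and `record_span` are hypothetical callbacks supplied by the runtime, and `IncompatibleCheckpoint` is a made-up exception that a caller would map to manual review.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Checkpoint:
    checkpoint_id: int
    thread_id: str
    agent_version: str
    scratchpad: dict
    next_step: str
    created_at: float
    pending_call: Optional[dict] = None  # {"key": ..., "args": ...} if in flight


class IncompatibleCheckpoint(Exception):
    """Raised so the caller can route the thread to manual review."""


def latest_checkpoint(index: Dict[str, List[Checkpoint]],
                      thread_id: str) -> Optional[Checkpoint]:
    # Per-thread lists are appended in ID order, so the last entry is the latest.
    history = index.get(thread_id)
    return history[-1] if history else None


def recover(index: Dict[str, List[Checkpoint]], thread_id: str,
            compatible_versions: Set[str],
            call_tool: Callable[[str, dict], object],
            record_span: Callable[[str, dict], None],
            now: float) -> Optional[Tuple[dict, str]]:
    cp = latest_checkpoint(index, thread_id)
    if cp is None:
        return None  # nothing to resume
    if cp.agent_version not in compatible_versions:
        raise IncompatibleCheckpoint(f"checkpoint {cp.checkpoint_id}")
    scratchpad = dict(cp.scratchpad)  # rehydrate a working copy
    if cp.pending_call is not None:
        # Re-issue with the ORIGINAL key so the target service can deduplicate.
        scratchpad["last_result"] = call_tool(
            cp.pending_call["key"], cp.pending_call["args"])
    record_span("runtime.recovery", {
        "checkpoint_id": cp.checkpoint_id,
        "gap_seconds": now - cp.created_at,
    })
    return scratchpad, cp.next_step
```

Raising on a version mismatch, instead of returning a partial state, keeps the "no auto-resume across breaking changes" rule hard to bypass by accident.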
5. Long-Term Memory
Long-term memory persists across runs and threads. The runtime MUST implement at least:
- Episodic store: append-only log of (timestamp, actor, event, summary) records, indexed by user / tenant.
- Semantic store: vector index of distilled facts, preferences, and patterns derived from the episodic log. The A-MEM paper (Xu et al., 2025) is one principled approach; simpler RAG-over-history setups are acceptable for smaller systems.
- Pruning policy: a documented retention window per record class, plus a redaction path for user-requested deletion.
Long-term memory writes MUST be explicit, not a side effect of every step. The agent (or a dedicated memory writer) decides what to remember; uncontrolled writes inflate cost and leak signal.
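A minimal sketch of the episodic store, an explicit memory writer, and the retention and redaction paths follows. It keeps records in a dictionary purely for illustration; `EpisodicStore`, `MemoryWriter`, and the `should_remember` predicate are hypothetical names, and the semantic (vector) store is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EpisodicRecord:
    timestamp: float
    actor: str
    event: str
    summary: str


class EpisodicStore:
    """Append-only log indexed by (tenant_id, user_id)."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._log: Dict[Tuple[str, str], List[EpisodicRecord]] = {}

    def append(self, tenant_id: str, user_id: str, record: EpisodicRecord) -> None:
        self._log.setdefault((tenant_id, user_id), []).append(record)

    def for_user(self, tenant_id: str, user_id: str) -> List[EpisodicRecord]:
        return list(self._log.get((tenant_id, user_id), []))

    def prune(self, now: float) -> int:
        """Drop records older than the retention window; return how many."""
        cutoff = now - self.retention_seconds
        dropped = 0
        for key, records in list(self._log.items()):
            kept = [r for r in records if r.timestamp >= cutoff]
            dropped += len(records) - len(kept)
            self._log[key] = kept
        return dropped

    def redact_user(self, tenant_id: str, user_id: str) -> int:
        """User-requested deletion: remove every record for this user."""
        return len(self._log.pop((tenant_id, user_id), []))


class MemoryWriter:
    """The only write path; agent steps must call remember() explicitly."""

    def __init__(self, store: EpisodicStore,
                 should_remember: Callable[[str, str], bool]) -> None:
        self._store = store
        self._should_remember = should_remember

    def remember(self, tenant_id: str, user_id: str, actor: str,
                 event: str, summary: str, now: float) -> bool:
        if not self._should_remember(event, summary):
            return False  # not worth persisting; nothing is written
        self._store.append(tenant_id, user_id,
                           EpisodicRecord(now, actor, event, summary))
        return True
```

Routing every write through one object also gives a natural place to serialize writes and to audit what the agent chose to remember.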
6. Multi-Tenancy and Isolation
State MUST be partitioned by tenant and by user. The runtime MUST:
- Include tenant_id and user_id in every checkpoint and memory record.
- Enforce tenant isolation at the storage layer (separate keyspaces, row-level security, or per-tenant tables) — not only at the application layer.
- Encrypt sensitive fields at rest (PII, secrets, tool credentials).
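The separate-keyspace option can be sketched as a handle bound to one tenant and user, so that calling code cannot construct a key outside its own prefix. `TenantKeyspace` and the `tenant:...:user:...:` key layout are assumptions made for this example; a real deployment would also rely on the backend's own access controls, and field encryption is not shown.

```python
from typing import Dict, List, Optional


class TenantKeyspace:
    """Store handle that can only read and write keys under one tenant/user."""

    def __init__(self, backing: Dict[str, dict],
                 tenant_id: str, user_id: str) -> None:
        for part in (tenant_id, user_id):
            # Reject empty IDs and the separator, which could forge a prefix.
            if not part or ":" in part:
                raise ValueError(f"invalid identifier: {part!r}")
        self._backing = backing
        self._tenant_id = tenant_id
        self._user_id = user_id
        self._prefix = f"tenant:{tenant_id}:user:{user_id}:"

    def put(self, key: str, record: dict) -> None:
        # Stamp tenant_id and user_id into every stored record.
        stored = dict(record, tenant_id=self._tenant_id, user_id=self._user_id)
        self._backing[self._prefix + key] = stored

    def get(self, key: str) -> Optional[dict]:
        return self._backing.get(self._prefix + key)

    def keys(self) -> List[str]:
        n = len(self._prefix)
        return [k[n:] for k in self._backing if k.startswith(self._prefix)]
```

Two handles that share the same backing store still see disjoint data, which is the property the storage layer is expected to guarantee.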
7. Observability
For state operations, the runtime SHOULD emit the following metrics:
- Counter: checkpoints written per minute, by thread.
- Histogram: checkpoint write latency.
- Counter: recoveries per minute, with recovery_reason.
- Gauge: active threads, by state.
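The four signals can be sketched with standard-library counters before wiring them to a real metrics backend. `StateMetrics`, its method names, and the bucket boundaries are hypothetical; the histogram here is non-cumulative, and per-minute rates are assumed to be derived by whatever system scrapes the counters.

```python
from collections import Counter
from typing import Optional, Tuple


class StateMetrics:
    def __init__(self, buckets_ms: Tuple[float, ...] = (5, 10, 50, 100, 500)) -> None:
        self.buckets_ms = buckets_ms
        self.checkpoints_written = Counter()  # counter, keyed by thread_id
        self.write_latency = Counter()        # histogram: bucket upper bound -> count
        self.recoveries = Counter()           # counter, keyed by recovery_reason
        self.active_threads = Counter()       # gauge, keyed by thread state

    def observe_write(self, thread_id: str, latency_ms: float) -> None:
        self.checkpoints_written[thread_id] += 1
        for bound in self.buckets_ms:
            if latency_ms <= bound:
                self.write_latency[bound] += 1
                break
        else:
            # Slower than every finite bucket: count in the overflow bucket.
            self.write_latency[float("inf")] += 1

    def record_recovery(self, recovery_reason: str) -> None:
        self.recoveries[recovery_reason] += 1

    def thread_transition(self, old_state: Optional[str],
                          new_state: Optional[str]) -> None:
        # Gauge update: a thread leaves one state and enters another.
        if old_state is not None:
            self.active_threads[old_state] -= 1
        if new_state is not None:
            self.active_threads[new_state] += 1
```

Passing `None` as the old state models a newly started thread, and `None` as the new state models a thread that has finished.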
FAQ
Q: Do I need a workflow engine like Temporal for every agent?
No. Short-running agents (single user turn, no multi-step external side effects) can use LangGraph with an in-memory or Redis checkpointer. Workflow engines pay off when runs span minutes to months, cross many tools, or require strong durability guarantees.
Q: Can I store all state in the LLM context window?
No. The context window is short-term context only — it is volatile and bounded by token limits. Working scratchpad, long-term memory, and durable workflow state must live outside the model.
Q: How often should the runtime checkpoint?
At minimum: before every externally visible side effect, after every model step, and at every human-in-the-loop pause. More frequent checkpointing is rarely harmful with append-only or LSM-style checkpoint stores.
Q: What about cross-thread memory?
Cross-thread memory is the long-term store, not the working scratchpad. Sharing scratchpad state across threads creates race conditions; long-term memory writes are explicit and serialized through the memory writer.
Related Articles
Agent Conversation Summarization: Triggers, Schema, and Retention
Specification for compressing agent conversation history into running summaries: triggers, summary schema, retention rules, and recovery patterns for long-running chats.
Agent Error Recovery Patterns Specification
Specification for agent error recovery — retry strategies, idempotency keys, Saga compensation, poison-message handling, and runbook-friendly error codes.
Agent Evaluation Harness Documentation: How to Spec an Eval Suite for AI Agents
Specification for documenting an AI agent evaluation harness — eval suites, scorers, datasets, and trajectory grading that humans and docs agents can both consume.