Geodocs.dev

Agent Knowledge Base Specification: Structure, Refresh, and Versioning

An agent knowledge base is the curated retrieval corpus that grounds responses. This specification defines the document model, chunking strategies, metadata schema, refresh cadence, version pinning, and rollback procedures so an agent can answer reliably and recover from bad ingests.

TL;DR

Model each document as an immutable versioned record with chunk children. Pick one chunking strategy per content type: fixed for short, semantic for prose, hierarchical for structured docs. Enrich every chunk with source URL, version, ACL, and freshness. Refresh on a cadence matched to source volatility, pin a known-good version per agent, and keep an audit trail so any answer can be traced back to a specific KB revision.

Why a knowledge base spec exists

An agent that grounds on an unstructured pile of files cannot be debugged. Without versioning, you cannot reproduce why an agent answered the way it did yesterday. Without refresh discipline, the agent quietly drifts off the source of truth. Without rollback, a bad ingest poisons every conversation. This specification fixes those failure modes.

Document model

A knowledge base contains documents. Each document is a versioned record of a real-world artifact (a help-doc page, a product spec, a runbook). Documents own one or more chunks that retrieval returns to the agent.

Document record

{
  "document_id": "kb_doc_94c1",
  "source_uri": "https://help.example.com/articles/refunds",
  "source_type": "web",
  "version": "2026-05-03T08:00:00Z",
  "checksum": "sha256:abc...",
  "title": "How refunds work",
  "language": "en",
  "acl": ["role:customer", "role:agent"],
  "tags": ["billing", "policy"],
  "published_at": "2026-04-30T00:00:00Z",
  "deprecated": false,
  "deprecated_at": null,
  "deprecated_reason": null
}

Chunk record

{
  "chunk_id": "kb_doc_94c1#chunk_07",
  "document_id": "kb_doc_94c1",
  "document_version": "2026-05-03T08:00:00Z",
  "text": "Refunds are issued to the original payment method within 5-10 business days...",
  "position": 7,
  "heading_path": ["Refunds", "Timing"],
  "vector": null,
  "metadata": {
    "source_uri": "https://help.example.com/articles/refunds#timing",
    "updated_at": "2026-05-03T08:00:00Z",
    "acl": ["role:customer"],
    "freshness_window_days": 30
  }
}

Chunking strategies

Strategy        | When to use                                | Typical size           | Overlap
----------------|--------------------------------------------|------------------------|----------------------------------
Fixed           | Short, uniform content (FAQs, snippets)    | 200-400 tokens         | 0-20%
Semantic        | Long-form prose where boundaries matter    | 300-800 tokens         | 10-20%
Hierarchical    | Docs with clear headings (specs, runbooks) | parent + child levels  | parent always returned with child
Sentence-window | Q&A and chat logs                          | 1-3 sentences + window | window 2-6 sentences

Chunk size and overlap influence both retrieval recall and context-window cost. Recent Anthropic research on contextual retrieval reports meaningful recall improvements from prepending a short generated context blurb to each chunk, especially for long documents (Anthropic, 2024 — Contextual Retrieval). Treat exact percentage gains as workload-dependent.

Default rule. Fixed chunking for utility content, semantic for marketing pages, hierarchical for technical specs. Pick one strategy per source type and document the choice in the source registry.
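As a minimal illustration of the fixed strategy, a token-window chunker with overlap might look like the sketch below. Whitespace splitting stands in for a real tokenizer, and the function name and defaults are illustrative, not part of the spec:

```python
def fixed_chunks(text, size=300, overlap=0.15):
    """Split text into fixed-size token windows with fractional overlap.

    Tokens are approximated by whitespace splitting; a production system
    would use the embedding model's own tokenizer instead.
    """
    tokens = text.split()
    step = max(1, int(size * (1 - overlap)))  # advance by size minus overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

With size 4 and 25% overlap, consecutive windows share one token, matching the 0-20% guidance scaled down for a toy input.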

Metadata enrichment

Every chunk must carry the metadata the agent needs to (a) cite, (b) filter by ACL, (c) reason about freshness, and (d) recover provenance.

  • source_uri — deep link, ideally with a fragment to the section.
  • document_version — ISO timestamp or content hash of the source.
  • acl — list of role or tenant identifiers required for access.
  • updated_at — source change time, not ingest time.
  • freshness_window_days — how long the chunk is considered current.
  • deprecated — set true to soft-delete a chunk without removing it from history.
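The propagation rules above can be sketched as a helper that derives a chunk record from its parent document. Field names follow the records shown earlier; the helper itself is hypothetical:

```python
def make_chunk(doc, position, text, heading_path, freshness_window_days=30):
    """Build a chunk record that inherits provenance and ACL from its document.

    ACL and version are copied from the parent, never set independently,
    so retrieval-time filters always agree with the source document.
    """
    return {
        "chunk_id": f"{doc['document_id']}#chunk_{position:02d}",
        "document_id": doc["document_id"],
        "document_version": doc["version"],
        "text": text,
        "position": position,
        "heading_path": heading_path,
        "vector": None,
        "metadata": {
            "source_uri": doc["source_uri"],
            "updated_at": doc["version"],
            "acl": list(doc["acl"]),  # copy, don't alias the parent list
            "freshness_window_days": freshness_window_days,
        },
    }
```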

Refresh cadence

Source volatility                        | Cadence        | Trigger
-----------------------------------------|----------------|----------------------------
Static reference (legal text, glossary)  | Quarterly      | Manual review
Slow-moving docs                         | Weekly batch   | Source webhook or daily diff
Operational runbooks                     | Daily          | CI publish hook
Product catalog                          | Hourly         | CDC stream
Live conversations / tickets             | Near real-time | Event stream

Ingestion runs must be idempotent. Re-running a refresh on the same source version must produce the same chunks and the same chunk IDs.
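Idempotency falls out naturally when chunk identity is derived from content rather than ingest order. A sketch, with an illustrative hashing scheme:

```python
import hashlib

def chunk_id(document_id, document_version, position, text):
    """Derive a stable chunk ID from source identity and content.

    Re-running ingestion on the same source version yields identical IDs,
    so refresh jobs can upsert instead of appending duplicates.
    """
    digest = hashlib.sha256(
        f"{document_id}|{document_version}|{position}|{text}".encode("utf-8")
    ).hexdigest()[:12]
    return f"{document_id}#{digest}"
```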

Version pinning

Agents pin to a release tag of the knowledge base, not to "latest". A release tag is a labelled snapshot of all document versions, chunks, and embedding vectors at a point in time.

  • Production agents pin to a vetted release.
  • Canary agents pin to a candidate release for a fraction of traffic.
  • Eval agents pin to known-good releases used by regression tests.

Swapping the live release should be a single config change, not a re-ingest.
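One way to realise "a single config change" is a small registry that maps agents to immutable release tags, with snapshots frozen behind it. Class and method names here are hypothetical:

```python
class ReleaseRegistry:
    """Maps agents to immutable KB release tags; swapping is one assignment."""

    def __init__(self):
        self._releases = {}  # tag -> frozen snapshot metadata
        self._pins = {}      # agent_id -> tag

    def publish(self, tag, snapshot):
        if tag in self._releases:
            raise ValueError(f"release tag {tag!r} is immutable")
        self._releases[tag] = snapshot

    def pin(self, agent_id, tag):
        if tag not in self._releases:
            raise KeyError(f"unknown release tag {tag!r}")
        self._pins[agent_id] = tag  # the single config change

    def resolve(self, agent_id):
        return self._releases[self._pins[agent_id]]
```

Because `publish` refuses to overwrite an existing tag, a "swap" can only ever repoint the pin, never mutate a vetted release in place.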

Rollback procedure

  1. Detect a bad ingest via eval regression, traffic anomaly, or operator report.
  2. Move the production agent's release pin back to the previous known-good tag.
  3. Mark the bad release as quarantined; do not delete it (auditors need the trail).
  4. File a postmortem with the source diff that caused the regression.
  5. Re-ingest with a fix and create a new release tag; never reuse the bad tag's name.
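Steps 2 and 3 can be automated as a single operation so an operator cannot repoint the pin and forget the quarantine. A sketch against hypothetical in-memory state that would wrap your real config store:

```python
def rollback(pins, quarantined, agent_id, bad_tag, known_good_tag):
    """Repoint an agent to a known-good release and quarantine the bad one.

    The bad release is flagged, never deleted, so auditors keep the trail.
    """
    assert pins.get(agent_id) == bad_tag, "agent is not pinned to the bad release"
    pins[agent_id] = known_good_tag  # step 2: move the pin back
    quarantined.add(bad_tag)         # step 3: quarantine, don't delete
    return {"agent": agent_id, "from": bad_tag, "to": known_good_tag}
```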

Source-of-truth governance

  • Every chunk traces to exactly one document.
  • Every document traces to exactly one source artifact.
  • When two sources disagree, the document model records both versions and a canonical: true flag identifies the chosen one.
  • Agents must never paraphrase across two conflicting documents in the same answer without explicit conflict handling.

Conflict resolution

When retrieval returns chunks from sources that disagree:

  1. Prefer the chunk with the higher-precedence source (configured per domain).
  2. If precedence ties, prefer the more recent updated_at.
  3. If still tied, surface both to the agent and require an explicit citation in the response.
  4. Log the conflict so reviewers can update precedence rules.
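The first three steps map onto a small resolver. Precedence values and the `source_type` field are illustrative (higher number means higher precedence); logging is left out for brevity:

```python
def resolve_conflict(chunks, precedence):
    """Pick the winning chunk(s) from conflicting sources.

    Returns one chunk when precedence or recency decides, or every tied
    chunk so the agent must surface both with explicit citations.
    """
    top = max(precedence.get(c["source_type"], 0) for c in chunks)
    leaders = [c for c in chunks if precedence.get(c["source_type"], 0) == top]
    if len(leaders) == 1:
        return leaders  # step 1: precedence decided
    newest = max(c["updated_at"] for c in leaders)
    # steps 2-3: prefer recency; if still tied, all survivors are returned
    return [c for c in leaders if c["updated_at"] == newest]
```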

Audit trail

For every retrieved answer, persist:

  • The agent run ID and prompt hash.
  • The release tag of the KB that was queried.
  • The list of chunk IDs and versions that were returned.
  • The final response with inline references back to those chunk IDs.

Audit records are append-only and retained for at least the regulatory window your domain requires.
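A minimal audit entry covering the four fields above might be built and appended like this. Field names follow the list; the JSON-lines storage choice and function names are assumptions, not part of the spec:

```python
import hashlib
import json

def audit_record(run_id, prompt, release_tag, chunks, response):
    """Build one append-only audit entry for a retrieved answer.

    The prompt is stored as a hash so the trail stays verifiable against
    logs without retaining raw user text.
    """
    return {
        "run_id": run_id,
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "release_tag": release_tag,
        "chunks": [
            {"chunk_id": c["chunk_id"], "version": c["document_version"]}
            for c in chunks
        ],
        "response": response,
    }

def append_audit(log_path, record):
    """Append as one JSON line; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```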

Validation checklist

  • [ ] Every document has a stable document_id and a versioned record.
  • [ ] Every chunk references its parent document and version.
  • [ ] ACL fields propagate from document to chunk.
  • [ ] Refresh jobs are idempotent.
  • [ ] Release tags are immutable.
  • [ ] Rollback runbook is tested at least quarterly.
  • [ ] Audit trail captures release tag + chunk IDs per response.

FAQ

Q: Should I store the embedding inside the chunk record or in the vector store?

Keep the canonical chunk in the document store and write embeddings to the vector store keyed by chunk_id. That way you can re-embed without losing the source of truth.
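Sketched below, assuming a generic vector-store client exposing an `upsert` call keyed by ID; the client API and function name are hypothetical:

```python
def sync_embeddings(chunks, embed, vector_store):
    """Write embeddings keyed by chunk_id; the document store stays canonical.

    `embed` maps text -> vector; `vector_store` is any client exposing
    upsert(id, vector, metadata). Re-embedding is just re-running this
    over the same chunks with a new embed function.
    """
    for chunk in chunks:
        vector_store.upsert(
            id=chunk["chunk_id"],
            vector=embed(chunk["text"]),
            metadata=chunk["metadata"],
        )
```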

Q: How big should a chunk be?

Usually 200-800 tokens. Smaller chunks improve precision; larger chunks preserve context. Test both ends of the range with your eval set (Chroma research on chunking).

Q: How do I version a knowledge base across embedding model changes?

Treat an embedding model change as a new release of the KB. Re-embed every chunk and create a new release tag; do not mix embeddings from two models in the same retrieval index.

Q: How do I delete sensitive content?

Mark the document as deprecated: true and re-ingest with the redaction. Keep the original record in cold storage if your policy requires audit access; never silently overwrite history.

Q: What goes in the metadata vs the chunk text?

Facts live in text; routing signals live in metadata. ACL, source URL, and freshness are metadata. Section headings can live in both — in metadata for filtering, in text for the LLM to read.

Related Articles

specification

Agent Memory Pattern Specification: Short-Term, Long-Term, and Episodic

Specification for AI agent memory: working, episodic, semantic, and procedural tiers with consolidation, eviction, and PII handling.

specification

Agent Permission Model Specification: RBAC, Scopes, and Tool-Level Auth

Production specification for AI agent permissions: RBAC, OAuth scope mapping, tool-level auth, consent prompts, time-bound grants, and MCP propagation.

specification

Agent Tool Naming Conventions Specification for LLM Routing Reliability

Specification for naming AI agent tools to maximize LLM routing reliability: verb-noun, namespaces, length, anti-collision, and deprecation rules.
