Agent Health Check Specification

A production AI agent health check defines three probe contracts — liveness, readiness, and startup — plus dependency probing for LLM providers and external tools, with a degraded-mode flag that lets orchestrators route around partially failing agents instead of restarting them. Each probe answers exactly one question and stays cheap enough to run every few seconds.

TL;DR

  • Liveness answers "is the process alive?" — restarts the container if it fails.
  • Readiness answers "should this instance receive traffic?" — drains traffic without restarting.
  • Startup answers "has initialization finished?" — gates the other probes during slow boot.
  • LLM-provider and tool dependencies live in readiness, never liveness — outages should drain, not restart.
  • Pair the probes with a degraded-mode flag so partial-capability agents stay serving instead of flapping.

Definition

An agent health check is a contract between an AI agent and its orchestrator (Kubernetes, Nomad, ECS, Cloud Run) that exposes the agent's operational state through one or more HTTP, TCP, or exec probes. Unlike a generic web service, an AI agent has multiple external dependencies — at minimum an LLM inference provider, often a vector store, and one or more tool APIs — that fail independently and at different rates than the agent process itself.

A complete agent health check spec defines:

  1. Liveness probe — a near-zero-cost endpoint that confirms the agent process is running and not deadlocked.
  2. Readiness probe — a slightly more expensive endpoint that confirms the agent can accept traffic and that critical dependencies respond.
  3. Startup probe — a probe that gates the first two during initialization (model warmup, vector index loading, tool catalog fetch).
  4. Dependency probes — internal checks against the LLM provider, vector store, and tool APIs, surfaced through readiness.
  5. Degraded-mode signal — a flag the agent flips when secondary capabilities fail but core service still works.

The Kubernetes probe model (Kubernetes documentation) is the de facto baseline; this spec extends it for the agent-specific concerns of LLM ping cost, partial-capability degradation, and tool dependency probing.

Why it matters

AI agents have failure modes that traditional services do not. A pod can stay alive while its LLM provider returns 5xx errors for an hour. A vector store can return stale embeddings without ever timing out. A single tool API outage can cascade into failures across entire multi-step reasoning chains. Without dedicated readiness probes for these dependencies, orchestrators send traffic to instances that will fail every request — a pattern the Kubernetes best-practices guide calls out as the primary reason readiness exists (Google Cloud Blog).

The cost asymmetry is also new. A traditional readiness probe pings a database and returns in milliseconds. An LLM "alive" check that runs an actual completion can cost cents per probe and add latency to inference budgets. A health check spec must therefore ration LLM probes — typically a cached result with a refresh window of 30 to 60 seconds — instead of pinging the provider on every request from the orchestrator.

Finally, restart amplification is a real risk. If liveness checks include LLM dependency status and the provider degrades for ten minutes, every agent pod restarts in a cascade, often hitting rate limits and prolonging the incident. The bright-line rule is that liveness must never depend on anything outside the process; everything external belongs in readiness.

How it works

The three probe types map to three orchestrator actions:

Probe        Question answered           Failure action        Typical period    LLM call?
Liveness     Is the process alive?       Restart container     10s               No
Readiness    Can this instance serve?    Drain traffic         5s                Cached only
Startup      Has init finished?          Delay other probes    5s, up to 60s     One-time

A liveness endpoint should be a process-internal health flag — a heartbeat updated by the main event loop, exposed through /healthz. If the loop deadlocks, the heartbeat goes stale and liveness fails. The endpoint must complete in under 100 ms and must not call any external API (Kubernetes documentation).
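
A minimal sketch of such a heartbeat, assuming an asyncio-based agent that serves its probes with FastAPI; HEARTBEAT_MAX_AGE and the one-second cadence are illustrative choices, not part of the spec:

import asyncio
import time

from fastapi import FastAPI, Response

app = FastAPI()

_last_heartbeat = time.monotonic()
HEARTBEAT_MAX_AGE = 5.0  # seconds; tune to the main loop's expected cadence

async def heartbeat_loop() -> None:
    # Runs on the same event loop as the agent's main work, so a deadlocked
    # loop stops refreshing the timestamp and liveness goes stale.
    global _last_heartbeat
    while True:
        _last_heartbeat = time.monotonic()
        await asyncio.sleep(1)

@app.on_event("startup")
async def _start_heartbeat() -> None:
    asyncio.create_task(heartbeat_loop())

@app.get("/healthz")
async def healthz() -> Response:
    # No external calls: the only check is heartbeat freshness.
    if time.monotonic() - _last_heartbeat > HEARTBEAT_MAX_AGE:
        return Response(status_code=503)
    return Response(status_code=200)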

A readiness endpoint at /readyz aggregates dependency status from a background poller. The poller checks each dependency on its own schedule:

  • LLM provider: a single chat.completions call with a one-token max, cached for 30 seconds. Some teams use the provider's /models endpoint instead, which is free but only confirms API reachability, not inference capacity.
  • Vector store: a known-key lookup that exercises the index, cached for 15 seconds.
  • Tool APIs: a HEAD or lightweight GET against each tool's health route, cached for 30 seconds.

The readiness response combines those statuses. If any critical dependency fails, readiness returns 503. If only non-critical dependencies fail, readiness returns 200 with an X-Agent-Mode: degraded header that downstream callers can inspect.
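
A sketch of that aggregation logic, assuming the same FastAPI app as above; DependencyStatus, STATUS_REGISTRY, and the critical flag are illustrative names, and the background poller that fills the registry is sketched under Practical application below:

import time
from dataclasses import dataclass

from fastapi import Response

MAX_STALENESS = 60.0  # seconds; a status older than this counts as failed

@dataclass
class DependencyStatus:
    healthy: bool
    critical: bool     # critical failures drain traffic; others only degrade
    checked_at: float  # freshness timestamp written by the poller

# Written by the background poller, only ever read by the handler.
STATUS_REGISTRY: dict[str, DependencyStatus] = {}

@app.get("/readyz")
async def readyz() -> Response:
    if not STATUS_REGISTRY:
        # Pollers have not run yet: stay unready so the startup probe gates boot.
        return Response(status_code=503)
    now = time.monotonic()
    failed = [
        name for name, s in STATUS_REGISTRY.items()
        if not s.healthy or now - s.checked_at > MAX_STALENESS
    ]
    if any(STATUS_REGISTRY[n].critical for n in failed):
        return Response(status_code=503)  # critical dependency down: drain
    if failed:
        # Only non-critical dependencies failed: keep serving, flag degraded.
        return Response(status_code=200, headers={"X-Agent-Mode": "degraded"})
    return Response(status_code=200)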

A startup probe runs before liveness and readiness begin. For agents that load vector indexes or warm caches, set failureThreshold * periodSeconds to cover the longest expected startup, commonly 60 to 120 seconds; the manifest below budgets 24 * 5s = 120 seconds. Once startup succeeds, Kubernetes hands off to the other probes (Kubernetes documentation).

Practical application

A reference Kubernetes manifest for an agent pod:

livenessProbe:
  httpGet:
    path: /healthz        # process-internal heartbeat only, no external calls
    port: 8080
  periodSeconds: 10
  failureThreshold: 3     # 3 * 10s = 30s of failures before restart
  timeoutSeconds: 1
readinessProbe:
  httpGet:
    path: /readyz         # reads cached dependency statuses
    port: 8080
  periodSeconds: 5
  failureThreshold: 2     # 2 * 5s = 10s of failures before draining
  timeoutSeconds: 2
startupProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24    # 24 * 5s = 120s budget for warmup and index loading

Inside the agent process, run a dependency poller in a background task. Persist last-known status in memory with a freshness timestamp, and have the readiness handler read from that cache rather than calling dependencies inline — this keeps probe latency bounded even when a provider is slow.
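
A sketch of that poller, assumed to run in the same process as the /readyz handler above and reusing its illustrative STATUS_REGISTRY and DependencyStatus; the OpenAI Python client, the model name, and the cadences are placeholder choices:

import asyncio
import time

from openai import AsyncOpenAI

llm = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def probe_llm() -> bool:
    # One-token completion: confirms inference capacity, not just reachability.
    try:
        await llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
            timeout=5,
        )
        return True
    except Exception:
        return False

async def poll(name: str, probe, critical: bool, period: float) -> None:
    # Each dependency polls on its own schedule; /readyz never waits on this.
    while True:
        healthy = await probe()
        STATUS_REGISTRY[name] = DependencyStatus(
            healthy=healthy, critical=critical, checked_at=time.monotonic()
        )
        await asyncio.sleep(period)

@app.on_event("startup")
async def _start_pollers() -> None:
    asyncio.create_task(poll("llm", probe_llm, critical=True, period=30.0))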

For alerting, page when readiness has been failing for more than two minutes (sustained) or when more than 30 percent of pods report degraded mode for more than five minutes. Do not page on individual liveness failures unless the same pod restarts more than three times in ten minutes.

For cost control, cap LLM probe spend by sharing the cached health result across all replicas in a deployment via a sidecar or a dedicated health-check service. With 100 replicas independently pinging the LLM every 30 seconds, the deployment makes 2,880 probe calls per replica per day (288,000 in total), and the bill grows linearly with scale; with one shared poller it stays at 2,880 regardless of replica count.

Common mistakes

  • Putting LLM checks in liveness. A provider blip restarts every pod and amplifies the outage. LLM status belongs in readiness only.
  • Calling dependencies inline. An inline 30-second LLM call blows past the probe's timeoutSeconds, so the kubelet marks the pod unready and drains traffic the moment the provider is merely slow. Always read from the cache.
  • Ignoring partial capability. Agents that lose one of five tools should serve the four still-working capabilities, not return 503 for everything. Use a degraded-mode flag.
  • Same probe for liveness and readiness without isolation. Reusing the readiness endpoint for liveness with a higher failure threshold is sometimes recommended, but only safe if readiness does not call external services. For agents, this almost always fails the safety check.
  • No startup probe. Agents with cold-start vector indexes get killed by liveness during boot. Add a startup probe with realistic thresholds.

FAQ

Q: Should the LLM provider be in liveness or readiness?

Always readiness. Liveness checks should restart the container only when something inside the process is wrong; an LLM outage is external. Putting it in liveness creates a restart cascade that worsens the incident.

Q: How often should I probe the LLM provider for health?

Every 30 to 60 seconds, with results cached and shared across replicas if possible. More frequent probes drive cost without improving signal — LLM outages last minutes, not seconds.

Q: What's the difference between a degraded-mode flag and a failed readiness probe?

A failed readiness probe drains all traffic from the instance. A degraded-mode flag keeps the instance serving traffic but signals callers (through a header, response field, or service-mesh metadata) that some capabilities are unavailable. Use degraded mode when partial service is better than no service.
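
A minimal caller-side sketch of inspecting that flag, using httpx; the agent URL and request path are illustrative:

import httpx

resp = httpx.post("http://agent:8080/v1/run", json={"task": "summarize"})
if resp.headers.get("X-Agent-Mode") == "degraded":
    # Some capabilities are down: skip optional tool-dependent steps
    # rather than treating the whole agent as unavailable.
    print("agent degraded; proceeding with reduced capabilities")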

Q: Do I need a startup probe for short-boot agents?

Only if total startup exceeds your liveness initialDelaySeconds + periodSeconds * failureThreshold. For agents that load embeddings or warm caches, a startup probe is almost always required. For agents that boot in under five seconds, liveness alone is sufficient.

Q: Can I share probe endpoints across agent capabilities?

Yes. One /readyz aggregating all dependencies is simpler and matches Kubernetes conventions. Internally, structure it as a fan-out over a status registry so each capability's status remains separable for observability.

