Agent Multi-Tool Orchestration Pattern Specification
This specification defines how agents compose multiple tool calls in a single turn — parallel calls when independent, sequential when dependent — and how tool builders should document orchestration contracts (fan-out limits, batching rules, dependencies, error propagation) so OpenAI, Anthropic, Gemini, and MCP-driven agents can call them safely without trial and error.
TL;DR
Multi-tool orchestration is the pattern an agent uses when one user turn requires more than one tool call. Tool builders should publish four contracts in the tool definition: parallelism mode (parallel-safe vs sequential-only), dependency declaration (which inputs come from which other tools), fan-out limits (max concurrent calls), and error-propagation behavior (fail-fast vs partial-success). Without these, agents waste turns rediscovering ordering through retries.
Why orchestration needs a specification
Modern agent runtimes routinely emit multiple tool calls in a single model turn. OpenAI's Chat Completions and Responses APIs return an array of tool_calls and accept a parallel_tool_calls boolean to enable or disable this behavior. Anthropic's Claude returns multiple tool_use blocks per turn by default and exposes disable_parallel_tool_use to gate it. Both vendors document the surface, but neither specifies how an individual tool should declare whether it is safe to be called in parallel with others, or what happens when one call's input depends on another's output.
The gap shows up in production. Practitioner reports describe agents that issue parallel calls with a hidden ordering dependency, retrieve incoherent partial state, then loop until they backtrack into a sequential call. Anthropic's Programmatic Tool Calling work and Microsoft's Agent Framework superstep model both attempt to solve the problem at the runtime layer, but the underlying contract — what does each tool promise about parallel invocation — is still expected to live in tool documentation. This specification defines that contract.
Core orchestration modes
A tool definition should declare exactly one orchestration mode.
| Mode | Meaning | When to use |
|---|---|---|
| parallel-safe | The tool may be invoked any number of times concurrently within a single turn with no shared-state side effects. | Read-only retrieval, idempotent lookups, stateless transforms. |
| sequential-only | The tool must be the only invocation of itself in a turn; the agent should serialize calls. | Stateful writes, rate-limited APIs, transactional operations. |
| fan-out-bounded | Parallel calls allowed up to a declared concurrency limit. | High-cost reads (search, embeddings), webhook fan-out, batch APIs. |
| dependent | The tool's input requires another tool's output; agents must call dependencies first. | Lookups keyed on IDs returned by an earlier tool, post-write verification. |
Declare the mode in the tool's JSON Schema under an x-orchestration extension so agents that read OpenAPI or MCP tool descriptors can surface it without a custom parser.
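For instance, a read-only retrieval tool can declare the simplest mode inline in its definition (the surrounding function-definition keys here are illustrative; only the x-orchestration block is defined by this spec):

```json
{
  "name": "search_documents",
  "description": "Full-text search over the document store. Parallel-safe: stateless, read-only.",
  "x-orchestration": { "mode": "parallel-safe" }
}
```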
Dependency declaration
Dependencies are declared as a list of upstream tool names whose output is required to populate this tool's input. The minimum shape:
```json
{
  "x-orchestration": {
    "mode": "dependent",
    "depends_on": [
      {
        "tool": "search_documents",
        "required_fields": ["document_id"]
      }
    ]
  }
}
```

Agents that respect the declaration can construct a topological ordering before they emit calls. Agents that ignore it still receive enough context in the description to recover after one failed turn rather than ten.
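A minimal sketch of how a runtime might consume those depends_on declarations to order calls, using Python's standard-library graphlib (the `tools` mapping shape is an assumption; the tool names are the ones used above):

```python
from graphlib import TopologicalSorter

def call_order(tools: dict) -> list:
    """Return a call order that respects depends_on declarations.

    `tools` maps tool name -> its x-orchestration block.
    """
    graph = {}
    for name, orchestration in tools.items():
        deps = orchestration.get("depends_on", [])
        graph[name] = {d["tool"] for d in deps}
    # TopologicalSorter yields dependencies before dependents.
    return list(TopologicalSorter(graph).static_order())

tools = {
    "fetch_document": {
        "mode": "dependent",
        "depends_on": [{"tool": "search_documents",
                        "required_fields": ["document_id"]}],
    },
    "search_documents": {"mode": "parallel-safe"},
}
order = call_order(tools)
# search_documents is ordered before fetch_document.
```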
Fan-out limits
Fan-out is the number of concurrent invocations of the same tool inside a single turn. Microsoft's Agent Framework treats fan-out as a superstep with a synchronization barrier; the framework collects pending messages, routes them concurrently to executors, and waits on all results before advancing. Cloud-agnostic agents implement the same pattern manually.
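A hand-rolled version of that superstep, bounding concurrency with a semaphore and blocking on all results before the agent advances (the get_weather stub and the limit of 3 are illustrative, not part of the spec):

```python
import asyncio

async def bounded_fan_out(call, inputs, max_concurrency):
    """Run `call` over all inputs, at most max_concurrency at a time,
    and wait for every result before returning (the barrier)."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await call(item)

    # gather() is the synchronization barrier: it completes only
    # when all sub-calls have completed, and preserves input order.
    return await asyncio.gather(*(guarded(i) for i in inputs))

async def get_weather(city):  # stand-in for a real tool call
    await asyncio.sleep(0)
    return {"city": city, "temp_c": 21}

results = asyncio.run(
    bounded_fan_out(get_weather, ["Lisbon", "Oslo", "Kyoto"], max_concurrency=3)
)
```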
Declare a fan-out limit when:
- The downstream API has a per-second or per-minute quota.
- Cost scales linearly with parallel calls (LLM-as-tool, paid search).
- Context pressure on the agent grows with every parallel response: token cost scales with the number of concurrent calls multiplied by per-response size.
The shape:
```json
{
  "x-orchestration": {
    "mode": "fan-out-bounded",
    "max_concurrency": 5,
    "backoff": { "strategy": "exponential", "initial_ms": 200 }
  }
}
```

Batching patterns
When an agent will plausibly call the same tool more than three times in one turn, expose a batch variant. The Anthropic engineering team's example pattern is to wrap the loop inside a single Code Execution call so the model writes a script that orchestrates the workflow rather than round-tripping through the API. For tools that cannot be batched server-side, document the equivalent client pattern explicitly:
- Provide a batch argument that accepts an array.
- Return an array of per-item results, including per-item errors, in the same order as the input.
- Cap the batch size and document the cap.
A batch endpoint replaces N parallel calls with one call, eliminating the orchestration surface entirely for the common case.
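A server-side handler following those three rules might look like this sketch (create_contact, the error code, and the cap of 100 are assumptions for illustration):

```python
MAX_BATCH = 100  # cap the batch size and document the cap

def create_contact(row):
    """Stand-in for the real single-item write."""
    if "email" not in row:
        raise ValueError("missing field: email")
    return {"id": "c_" + row["email"]}

def create_contacts_batch(batch):
    """Per-item results in input order, with per-item errors."""
    if len(batch) > MAX_BATCH:
        raise ValueError("batch size exceeds cap of %d" % MAX_BATCH)
    results = []
    for index, row in enumerate(batch):
        try:
            results.append({"index": index, "status": "ok",
                            "data": create_contact(row)})
        except Exception as exc:
            results.append({"index": index, "status": "error",
                            "error": {"code": "INVALID_ROW",
                                      "message": str(exc)}})
    ok = sum(1 for r in results if r["status"] == "ok")
    return {"results": results,
            "summary": {"ok": ok, "error": len(results) - ok}}
```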
Error propagation contracts
Define how partial failure is reported. Two modes are sufficient:
- fail-fast: any failed sub-call aborts the batch or fan-out. Use for transactional writes.
- partial-success: each sub-result carries its own status; the agent decides how to proceed. Use for read-heavy workloads where stale data is acceptable.
The response envelope should always include:
```json
{
  "results": [
    { "index": 0, "status": "ok", "data": {} },
    { "index": 1, "status": "error", "error": { "code": "NOT_FOUND", "message": "..." } }
  ],
  "summary": { "ok": 1, "error": 1 }
}
```

Exposing per-item status lets the agent retry only the failed entries instead of replaying the entire batch.
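On the agent side, a retry pass over that envelope replays only the failed indices (a sketch; the envelope fields are the ones shown above, and a retry that fails again keeps its original error entry):

```python
def retry_failed(envelope, inputs, call):
    """Replay only failed entries from a partial-success envelope."""
    results = {r["index"]: r for r in envelope["results"]}
    for index in list(results):
        if results[index]["status"] != "error":
            continue
        try:
            results[index] = {"index": index, "status": "ok",
                              "data": call(inputs[index])}
        except Exception:
            pass  # still failing; keep the original error entry
    return [results[i] for i in sorted(results)]
```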
Five worked examples
1. Document search + fetch (dependent)
search_documents returns IDs; fetch_document resolves an ID to content. Declare fetch_document as dependent on search_documents.document_id. Agents that read the contract issue search first, then fan out fetches up to the declared concurrency.
2. Multi-region weather lookup (fan-out-bounded)
A get_weather tool with max_concurrency: 10 and partial-success semantics. The agent fans out to ten cities at once; one timeout does not cascade.
3. Vector search + reranker (sequential-only on rerank)
Vector search is parallel-safe. The reranker is sequential-only because it carries internal cache state; the agent must serialize rerank calls per turn.
4. CRM record write (fail-fast batch)
A create_contacts tool exposes a batch array and fail-fast semantics, wrapped in a server-side transaction. One bad row aborts the batch; agents do not need to think about partial state.
5. Webhook fan-out from a workflow tool
A dispatch_event tool fans out to N webhook URLs. Declared as fan-out-bounded with max_concurrency: 25 and HMAC signature in each call — agents do not orchestrate per-URL retries; the tool surface owns retry.
Common mistakes
- Implicit dependencies. Returning IDs from tool A that tool B silently consumes, with no depends_on declaration. Agents discover the order through failed calls, costing turns.
- Unbounded fan-out. Letting agents emit hundreds of parallel calls because no concurrency cap is declared. Quota exhaustion looks like a model bug from the user's side.
- Hidden statefulness. Marking a tool parallel-safe while it mutates a shared cache. The first parallel batch is fast; the second one corrupts state.
- Inconsistent error envelopes. Mixing top-level errors with per-item errors in the same response shape forces the agent to branch on response structure instead of status code.
- Conflating parallelism with batching. Five parallel calls are not the same as one batch call. Parallel calls multiply context tokens; a batch call collapses them.
FAQ
Q: When should a tool default to parallel-safe?
Default to parallel-safe only for read-only operations with no shared mutable state and no per-second quotas tighter than ten requests. Anything that writes, that paginates internal state, or that hits a low-quota third-party API should default to sequential-only or fan-out-bounded.
Q: How do agents discover the orchestration contract?
Agents read it from the tool's JSON Schema or MCP descriptor. OpenAI function definitions accept arbitrary keys outside the validated schema; place orchestration metadata under x-orchestration so it travels with the definition without breaking validation.
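Reading the contract back out of a tool definition is a small accessor (the fallback to sequential-only when no block is declared is a judgment call, not part of any vendor API; it is the conservative default):

```python
def orchestration_mode(tool_def: dict) -> str:
    """Pull the declared mode; fall back to the conservative
    choice when a tool ships no x-orchestration block."""
    return tool_def.get("x-orchestration", {}).get("mode", "sequential-only")

tool_def = {
    "name": "get_weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}}},
    "x-orchestration": {"mode": "fan-out-bounded", "max_concurrency": 10},
}
# → "fan-out-bounded"
```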
Q: What is the difference between fan-out-bounded and a batch endpoint?
Fan-out-bounded keeps N independent calls but caps concurrency on the agent side. A batch endpoint collapses those N calls into one server-side request. Batch is more efficient when the agent reliably knows the inputs upfront; fan-out is more flexible when each call's input depends on per-call decisions.
Q: How does this spec interact with MCP?
MCP tool definitions accept annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint). Treat readOnlyHint: true as a strong signal for parallel-safe, and destructiveHint: true as a strong signal for sequential-only. This spec extends MCP's hints with explicit dependency and fan-out shape.
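One way to fold MCP annotations into a default mode when no x-orchestration block is present (the precedence order here is a judgment call for this sketch, not something MCP defines):

```python
def mode_from_mcp_annotations(annotations: dict) -> str:
    """Map MCP tool annotations onto an orchestration-mode default.
    destructiveHint wins over readOnlyHint when both are set."""
    if annotations.get("destructiveHint"):
        return "sequential-only"
    if annotations.get("readOnlyHint"):
        return "parallel-safe"
    return "sequential-only"  # conservative default for unknown tools
```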
Q: Should partial-success or fail-fast be the default?
For read-heavy tools, default to partial-success so a single failure does not invalidate the whole turn. For write-heavy tools and transactional operations, default to fail-fast to avoid leaving the system in a half-applied state. Make the choice explicit in the tool description either way.