What Is Tool Calling for AI Agents? Definition, Patterns, and Best Practices
Tool calling is the mechanism that lets an AI agent invoke external functions, APIs, or services by emitting a structured, schema-validated request that a host application executes and returns to the model. It is the foundational primitive behind modern agent loops in OpenAI, Anthropic Claude, Google Gemini, and the Model Context Protocol.
TL;DR
Tool calling — also called function calling or tool use — is how an LLM bridges from text generation to real-world action. The model receives a list of tool definitions (typically JSON Schema), decides when a user request requires external data or side effects, emits a structured tool call, waits for the host to execute it, and resumes generation with the tool's result. OpenAI, Anthropic, and Google Gemini implement this pattern with near-identical semantics; the Model Context Protocol (MCP) standardizes tool exposure across vendors.
Definition
Tool calling is a structured mechanism by which a large language model requests the execution of an externally defined function and incorporates the function's return value into its subsequent reasoning and output. Instead of fabricating answers from training data alone, the model emits a deterministic, schema-validated payload — typically a JSON object containing a name and arguments — that the host application parses, executes, and returns to the model in a follow-up turn.
The terms function calling and tool calling are used interchangeably across the industry. OpenAI's documentation explicitly notes that "Function calling (also known as tool calling) provides a powerful and flexible way for OpenAI models to interface with external systems" (OpenAI, 2026). Anthropic uses tool use as the canonical term for the same primitive in Claude's API (Anthropic, 2026).
In practice, tool calling has three components: (1) a tool definition declaring the function's name, description, and JSON-schema-typed parameters; (2) a tool call emitted by the model containing the chosen function name and concrete argument values; and (3) a tool result that the host returns to the model after executing the call. Together these form the agentic loop that distinguishes a passive chatbot from an action-taking agent.
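In code, the three components map onto three small payload shapes. Here is a vendor-neutral sketch in TypeScript (field names vary slightly between OpenAI, Anthropic, and Gemini):

```typescript
// 1. Tool definition: registered by the host before the conversation starts.
interface ToolDefinition {
  name: string
  description: string
  parameters: Record<string, unknown> // JSON Schema for the arguments
}

// 2. Tool call: emitted by the model instead of (or alongside) prose.
interface ToolCall {
  id: string
  name: string
  arguments: string // JSON-encoded argument values chosen by the model
}

// 3. Tool result: appended by the host after executing the call.
interface ToolResult {
  tool_call_id: string
  content: string // stringified return value, fed back to the model
}
```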
Why tool calling matters
Without tool calling, an LLM is bounded by its training cutoff and its inability to perform side effects. It cannot fetch live data, mutate a database, send an email, run a calculation deterministically, or interact with any system outside the prompt window. Tool calling collapses this boundary: with a single well-typed function, the model gains the ability to look up today's weather, query a CRM, file a support ticket, or invoke another model.
For AI search optimization specifically, tool calling matters in two distinct ways. First, modern answer engines such as ChatGPT, Perplexity, and Gemini use tool calls internally to ground their responses — web_search, file_search, fetch_url, and code_interpreter are all implemented as tools, and the citations users see in answer cards are tool-call artifacts. Second, when sites publish structured agent-facing surfaces (/agents.json, /llms.txt, MCP servers), they are effectively offering tool definitions that downstream agents can invoke.
For product teams, tool calling matters because it is the cheapest path from prototype to production agent. A 50-line tool definition replaces what would otherwise be brittle prompt engineering and regex parsing. Anthropic notes that introducing tools "reliably eliminates a class of hallucinations" because the model is steered toward emitting structured output rather than freeform prose (Anthropic, 2026).
Finally, tool calling is the substrate for composable agents. Once any capability is a typed tool, agents can be chained, supervised, and audited. The entire ecosystem of MCP servers, agent frameworks (LangChain, LlamaIndex, Anthropic's Agents SDK), and orchestration platforms is built on top of this single primitive.
How tool calling works
The canonical tool-calling loop has five steps that are essentially identical across OpenAI, Anthropic, and Google Gemini:
```mermaid
flowchart TD
    A["User prompt + tool definitions"] --> B["Model decides: text or tool call?"]
    B -->|"Text"| F["Final assistant message"]
    B -->|"Tool call"| C["Model emits name + arguments JSON"]
    C --> D["Host executes the function"]
    D --> E["Tool result returned to model"]
    E --> B
```
Step 1 — Tool definition
The host registers each callable function with a JSON Schema describing its parameters. A typical OpenAI definition looks like:
```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": { "type": "string", "description": "City name, e.g. Hanoi" },
        "unit": { "type": "string", "enum": ["c", "f"] }
      },
      "required": ["city"]
    }
  }
}
```
Anthropic uses a similar shape under an input_schema key, and Google Gemini accepts either OpenAPI 3.0 schema or auto-generated schemas from Python type hints (Google, 2026).
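For comparison, a sketch of the same tool in Anthropic's shape, with the schema under input_schema and no nested function wrapper:

```typescript
// The same get_weather tool as Anthropic's API expects it: a flat object
// with the JSON Schema under input_schema rather than parameters.
const anthropicTool = {
  name: "get_weather",
  description: "Get current weather for a city",
  input_schema: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name, e.g. Hanoi" },
      unit: { type: "string", enum: ["c", "f"] }
    },
    required: ["city"]
  }
}
```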
Step 2 — Routing
When the host sends a user message together with the tool list, the model performs an internal routing decision: respond with text, or emit a tool call. Routing quality depends primarily on (a) the clarity of the tool's description field, (b) parameter naming, and (c) the richness of the system prompt. Sparse, ambiguous descriptions are the leading cause of mis-routed calls.
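The difference is easy to see side by side (a hypothetical pair of definitions):

```typescript
// Likely to be mis-routed: the model has almost nothing to route on.
const sparse = { name: "lookup", description: "Looks things up" }

// Routable: states what the tool does, what it needs, and when to use it.
const rich = {
  name: "search_tickets",
  description:
    "Search the support ticket system by free-text query. " +
    "Use when the user asks about the status or history of an existing issue. " +
    "Returns up to 10 matching tickets with id, subject, and status."
}
```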
Step 3 — Tool call emission
If routing chooses a tool, the model emits a structured object — never freeform prose — containing name and arguments. The schema constrains generation, so well-defined tools rarely produce malformed JSON when used with modern models. Some vendors (OpenAI, Anthropic) support parallel tool calls: the model can emit multiple independent calls in a single turn for the host to execute concurrently.
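Hosts can execute a parallel batch concurrently. A minimal sketch against the Chat Completions shapes, assuming the executeTool helper described under Practical implementation below:

```typescript
import OpenAI from "openai"

// Execute all function calls from one assistant turn concurrently and
// return the tool-role messages to append to the transcript.
async function runParallelCalls(
  toolCalls: OpenAI.Chat.Completions.ChatCompletionMessageToolCall[]
): Promise<OpenAI.Chat.Completions.ChatCompletionToolMessageParam[]> {
  return Promise.all(
    toolCalls
      .filter(call => call.type === "function")
      .map(async call => ({
        role: "tool" as const,
        tool_call_id: call.id,
        content: JSON.stringify(
          await executeTool(call.function.name, JSON.parse(call.function.arguments))
        )
      }))
  )
}
```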
Step 4 — Host execution
The host is responsible for executing the function and capturing the result. Anthropic distinguishes between client tools (executed by your application) and server tools like web_search or code_execution that run on Anthropic's infrastructure (Anthropic, 2026). OpenAI and Gemini follow similar bifurcations.
Step 5 — Result feedback and loop continuation
The tool's return value is appended to the conversation as a tool role message and the model is invoked again. The model can then either respond with final text, or chain another tool call. Gemini calls this compositional function calling when the second call depends on the first call's result, e.g. get_current_location() followed by get_weather(location) (Google, 2026).
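In transcript form, a compositional chain looks roughly like this (tool names from Gemini's documentation example; the message shapes here follow the OpenAI style and are illustrative):

```typescript
// Each tool result unlocks the next call; the host just keeps looping.
const transcript = [
  { role: "user", content: "What's the weather where I am?" },
  { role: "assistant", tool_calls: [{ id: "c1", type: "function",
      function: { name: "get_current_location", arguments: "{}" } }] },
  { role: "tool", tool_call_id: "c1", content: '{"city": "Hanoi"}' },
  { role: "assistant", tool_calls: [{ id: "c2", type: "function",
      function: { name: "get_weather", arguments: '{"city": "Hanoi"}' } }] },
  { role: "tool", tool_call_id: "c2", content: '{"temp_c": 31, "condition": "humid"}' },
  { role: "assistant", content: "It's 31°C and humid in Hanoi right now." }
]
```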
OpenAI vs Anthropic vs Gemini vs MCP
While the conceptual loop is identical, the four major surfaces differ in shape, server-tool catalog, and orchestration extensions.
| Feature | OpenAI | Anthropic Claude | Google Gemini | MCP |
|---|---|---|---|---|
| Canonical term | Function calling / tool calling | Tool use | Function calling | Tools (protocol-level) |
| Schema format | JSON Schema | JSON Schema (input_schema) | OpenAPI 3.0 / auto from types | JSON Schema |
| Parallel calls | Yes | Yes | Yes | Yes |
| Server-side tools | web_search, file_search, code_interpreter | web_search, code_execution, computer_use, tool_search | Built-in retrieval, code execution | Server-defined |
| Free-form / custom tool inputs | Custom tools (free-form text) | Text editor tools | Schema-required | Schema-required |
| Cross-vendor portability | No | No | No | Yes — primary purpose |
| Reference | OpenAI docs | Anthropic docs | Google docs | MCP spec |
The Model Context Protocol (MCP) is the cross-vendor standard. Released by Anthropic as an open protocol and now adopted across Claude Desktop, ChatGPT desktop clients, Cursor, and many IDEs, MCP defines a wire-level contract for LLM-tool integration (Anthropic, 2024). An MCP server exposes a list of tools through a standardized schema; any compliant MCP client (the LLM application) can discover and invoke them without per-vendor adapters. MCP is best understood as a transport and discovery layer above the per-vendor function-calling primitives — it does not replace them, it standardizes their exposure.
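On the wire, a client discovers an MCP server's tools with a tools/list JSON-RPC request. A sketch of the response shape (values illustrative; note MCP's camelCase inputSchema):

```typescript
// Illustrative tools/list response from an MCP server (JSON-RPC 2.0).
// inputSchema is MCP's name for the JSON Schema parameter payload.
const toolsListResponse = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    tools: [{
      name: "get_weather",
      description: "Get current weather for a city",
      inputSchema: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"]
      }
    }]
  }
}
```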
Anthropic has further extended tool use with Programmatic Tool Calling, where Claude writes a Python orchestration script that runs in the Code Execution sandbox and calls your tools internally rather than round-tripping every result through the model. This pattern reduces latency and token consumption for parallel-heavy workflows (Anthropic, 2026).
Practical implementation
A production-grade tool-calling implementation has six concerns beyond the basic loop; a combined sketch follows the list:
- Schema discipline. Treat tool schemas as a public API. Use precise types, enum constraints where possible, required-field markers, and rich descriptions. The model's routing accuracy is bounded by the descriptive quality of your schemas.
- Idempotency and retries. Tools should be idempotent where possible. Wrap external API calls in retry-with-backoff because models will sometimes re-emit identical calls during ambiguous loops.
- Error envelopes. Return errors as structured tool results, not exceptions. A well-formed {"error": "...", "code": "..."} lets the model recover gracefully; an unhandled exception forces full conversation restart.
- Tool budget. Cap the number of tool calls per turn (10-25 is a common ceiling) to bound runaway loops. Anthropic and OpenAI both expose max_tool_uses or stop-condition parameters.
- Authentication context. Tools that touch user data must thread the authenticated user identity from your application, not from the model. Never let the LLM choose which user to act as.
- Observability. Log every tool call's name, arguments, result, and latency. This is essential for debugging mis-routed calls, prompt regressions, and cost attribution.
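Several of these concerns can be pulled together in one place. Below is a minimal sketch of a hypothetical executeTool helper (the registry, retry policy, and logging are illustrative choices, not a vendor API); the loop in the next section assumes it:

```typescript
// Hypothetical registry of tool implementations, keyed by tool name.
const registry: Record<string, (args: any) => Promise<unknown>> = {
  get_order: async ({ order_id }) => ({ order_id, status: "shipped" }) // stub
}

async function executeTool(name: string, args: unknown): Promise<unknown> {
  const started = Date.now()
  // Validate the name first: never execute a hallucinated tool blindly.
  const impl = registry[name]
  if (!impl) return { error: `unknown tool: ${name}`, code: "unknown_tool" }
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      const result = await impl(args)
      // Observability: log name, arguments, and latency for every call.
      console.log({ tool: name, args, ms: Date.now() - started })
      return result
    } catch (err) {
      // Structured error envelope on final failure so the model can recover.
      if (attempt === 2) return { error: String(err), code: "tool_failed" }
      await new Promise(r => setTimeout(r, 250 * 2 ** attempt)) // backoff
    }
  }
  return { error: "retries exhausted", code: "tool_failed" }
}
```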
A minimal production implementation in TypeScript with the OpenAI SDK looks like:
```typescript
import OpenAI from "openai"

const openai = new OpenAI()

const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [{
  type: "function",
  function: {
    name: "get_order",
    description: "Look up an order by its ID",
    parameters: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"]
    }
  }
}]

async function runAgent(userMessage: string) {
  const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
    { role: "user", content: userMessage }
  ]
  // Bound the loop so a confused model cannot ping-pong forever.
  for (let step = 0; step < 10; step++) {
    const res = await openai.chat.completions.create({ model: "gpt-5", messages, tools })
    const msg = res.choices[0].message
    if (!msg.tool_calls?.length) return msg.content
    // Append the assistant message that carries the tool calls before
    // the tool results, or the API will reject the transcript.
    messages.push(msg)
    for (const call of msg.tool_calls) {
      if (call.type !== "function") continue
      const result = await executeTool(call.function.name, JSON.parse(call.function.arguments))
      messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(result) })
    }
  }
  throw new Error("tool budget exhausted")
}
```
Examples
- Weather lookup. A travel-planning agent with a get_weather(city, date) tool answers "Should I bring a jacket to Hanoi tomorrow?" by calling the tool once and incorporating the forecast.
- CRM ticketing. A support agent exposes search_tickets(query), create_ticket(subject, body), and escalate(ticket_id, level). The agent can answer status questions and file new tickets without a human handoff.
- Code execution for math. Rather than risk an arithmetic hallucination, an analytics assistant calls python_exec(code) (a server tool in Claude and Gemini) to compute regressions deterministically.
- Web search for grounding. Perplexity, ChatGPT Search, and Gemini Grounded answers all use a web_search tool internally. Each citation chip you see in the UI corresponds to a tool-call result.
- Multi-tool composition. A travel agent calls find_flights(from, to, date), find_hotels(city, checkin, checkout), and book_calendar(start, end) in sequence, with each call's output feeding the next — Gemini's compositional function calling pattern.
- MCP-bridged enterprise tools. A team exposes their internal docs through an MCP server. Claude Desktop, Cursor, and ChatGPT all consume the same server unchanged; no vendor-specific adapters are written.
Common mistakes
- Hallucinated tools. Models sometimes invoke a tool name not present in the registered list, especially when the user's request implies a capability the tool list does not cover. Mitigation: always validate the emitted name against the registered set and return a structured "unknown tool" error rather than executing blindly, as in the executeTool sketch above.
- Sparse descriptions. A tool named lookup with description "Looks things up" will be mis-routed. Treat descriptions as if you were writing API docs for a junior engineer who has never seen your system.
- Unbounded loops. Without a step cap, a confused model can ping-pong between two tools indefinitely. Always bound the loop and surface the failure to the user.
- Assuming one call per turn. OpenAI and Anthropic models can emit several parallel tool calls in a single assistant turn. Code that assumes exactly one call will drop or misorder results when the model legitimately emits a batch.
- Leaking secrets through arguments. Never pass API keys or user passwords as tool arguments — the model will see them and may surface them in subsequent turns. Inject secrets at the host layer.
- Conflating tool calls with structured output. If you only need typed JSON output and not external execution, use the model's structured-output mode (response format / schema) — it is cheaper and avoids the loop entirely.
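For the structured-output-only case, a minimal sketch using OpenAI's json_schema response format, reusing the openai client from the earlier example:

```typescript
// Typed JSON without tool execution: the model fills the schema directly,
// so there is no loop, no host execution, and no tool result turn.
const res = await openai.chat.completions.create({
  model: "gpt-5",
  messages: [{ role: "user", content: "Extract the order ID from: 'Where is #A-1042?'" }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "order_ref",
      strict: true,
      schema: {
        type: "object",
        properties: { order_id: { type: "string" } },
        required: ["order_id"],
        additionalProperties: false
      }
    }
  }
})
const parsed = JSON.parse(res.choices[0].message.content!)
```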
FAQ
Q: Is tool calling the same as function calling?
Yes. The two terms are used interchangeably across vendors. OpenAI's docs explicitly equate them; Anthropic prefers "tool use" but documents the same primitive. Treat them as synonymous in design discussions and reserve the precise term for whichever vendor surface you are coding against.
Q: Do I need MCP if I am only targeting one model vendor?
No. If your agent is locked to a single vendor, the native function-calling API is simpler and lower-overhead. MCP becomes valuable when you want the same tool inventory to be consumable by Claude Desktop, ChatGPT, Cursor, and custom agents without re-implementing per vendor.
Q: How are tool calls billed?
The tool definitions and tool-call payloads count as input and output tokens respectively. A long tool list inflates every turn's input cost, so prune unused tools. Some vendors offer tool search or deferred-loading patterns (Anthropic's Tool Search Tool, for example) to keep large catalogs cheap.
Q: Can a tool return non-text data such as images or files?
Yes for vision-capable models. Anthropic and OpenAI both accept image content blocks in tool results; Gemini accepts inline images and file references. Plan for multimodal results when your tool produces charts, screenshots, or generated assets.
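As one illustration, an Anthropic-style tool result carrying an image block looks roughly like this (id illustrative, payload elided):

```typescript
// A tool_result content block returning an image to a vision-capable model.
const toolResult = {
  type: "tool_result",
  tool_use_id: "toolu_abc123", // id from the model's tool_use block
  content: [{
    type: "image",
    source: { type: "base64", media_type: "image/png", data: "<base64>" }
  }]
}
```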
Q: How do I prevent the model from calling tools when I just want a chat answer?
Use the tool_choice parameter (OpenAI, Anthropic) to force none for that turn, or omit the tool list entirely. For mixed workflows, set tool_choice: "auto" and rely on the system prompt to instruct the model on when tool use is appropriate.
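For example, with the OpenAI client and tool list from earlier:

```typescript
// Force a plain-text answer for this turn even though tools are registered.
const res = await openai.chat.completions.create({
  model: "gpt-5",
  messages,
  tools,
  tool_choice: "none"
})
```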
Q: What is the difference between client tools and server tools?
Client tools execute on your application's infrastructure — you write the implementation. Server tools (e.g., web_search, code_execution, computer_use) execute on the model vendor's infrastructure; you only enable them. Server tools simplify development but reduce control over execution environment, logging, and data residency.
Q: Does tool calling work with streaming responses?
Yes. All major vendors stream tool-call deltas as the model emits them. Your client must accumulate arguments across deltas before executing, since arguments arrive token-by-token rather than as a single complete JSON object.
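A minimal accumulator for OpenAI's streamed tool-call deltas, again assuming the client, messages, and tools from earlier:

```typescript
// Accumulate tool-call fragments across stream chunks before executing.
const calls: { id?: string; name?: string; args: string }[] = []
const stream = await openai.chat.completions.create({
  model: "gpt-5", messages, tools, stream: true
})
for await (const chunk of stream) {
  for (const delta of chunk.choices[0]?.delta?.tool_calls ?? []) {
    const slot = (calls[delta.index] ??= { args: "" })
    if (delta.id) slot.id = delta.id
    if (delta.function?.name) slot.name = delta.function.name
    if (delta.function?.arguments) slot.args += delta.function.arguments // token-by-token
  }
}
// Only now is calls[i].args a complete JSON string, safe to JSON.parse.
```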
Related Articles
Agent Tool Use Documentation Specification
Specification for documenting tools so AI agents can discover, understand, and correctly invoke them: structured schemas, examples, error semantics, and idempotency hints.
Function Calling Documentation Spec: How to Document Tools for AI Agents
Function calling documentation spec: how to describe tools, parameters, errors, and examples so AI agents can reliably invoke them in production.
MCP vs Function Calling vs OpenAI Plugins: AI Agent Tool Integration Architectures Compared
MCP vs function calling vs plugins compared for AI agent tool integration: discovery scope, maintainability, and documentation patterns for 2026 stacks.