Agent Tool Manifest QA Checklist: Validating SKILL.md and Tool Schemas Before Agent Discovery
Run this 45-point QA checklist over every agent tool manifest—SKILL.md, MCP tool definition, or platform-specific YAML—before publishing it for agent discovery. Each section targets a specific failure mode in how LLMs read metadata, parse descriptions, validate inputs, and decide whether to invoke a tool.
TL;DR. Agents pick the wrong tool, skip the right one, or call it with bad arguments when the manifest is sloppy. This checklist locks down five things—identity, description, input schema, examples, and security—so the manifest acts as an enforceable contract between the LLM and the runtime. Run it locally before you commit, and again in CI.
Use this guide alongside the Skill Manifest Specification, the MCP tool schema reference, and tool description best practices. For the wider context, start at the AI Agents hub.
When to run this checklist
- Before opening a pull request that adds or modifies a SKILL.md, MCP tool, or platform agent manifest.
- During code review for any change to inputSchema, description, name, or example payloads.
- Quarterly, as part of a manifest hygiene audit.
- Whenever an agent picks the wrong tool, hallucinates arguments, or silently skips a tool—these are nearly always manifest defects, not model defects.
1. Identity and metadata
- [ ] name is unique across the agent's tool catalog and does not overlap conceptually with another tool (e.g., not both search_contacts and list_contacts without a description disambiguating them).
- [ ] name is kebab-case with no spaces or capitals, and matches the folder or file name for SKILL.md skills.
- [ ] version is set and follows the platform's versioning rule (semantic version, integer, or date).
- [ ] license and compatibility fields are present when the manifest is open-sourced or environment-specific (Anthropic recommends both for shared skills).
- [ ] Owner or maintainer is recorded so triage has a routing target when the tool misbehaves.
- [ ] Manifest format version matches what the runtime expects (for example, Microsoft 365 agents require manifest v1.13+).
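The naming rules above are easy to lint locally. A minimal sketch of a kebab-case check (the exact character rules are platform-specific; adjust the regex to match your runtime's requirements):

```python
import re

# Kebab-case: lowercase words separated by single hyphens,
# no spaces, capitals, underscores, or leading/trailing hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_skill_name(name: str) -> bool:
    """Return True if the manifest name follows the kebab-case rule."""
    return bool(KEBAB_CASE.fullmatch(name))

assert is_valid_skill_name("pdf-form-filler")
assert not is_valid_skill_name("PDF Form Filler")   # spaces and capitals
assert not is_valid_skill_name("pdf_form_filler")   # underscores
```

Running the same check against the folder name catches the name/path mismatch called out above.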
2. Description quality
The description is the single most important field: it is the text the LLM reads to decide whether to load or call the tool.
- [ ] Description states what the tool does in one plain sentence.
- [ ] Description states when to use the tool, including 2-3 trigger phrases an end user might say.
- [ ] Description names file types, domains, or systems the tool touches.
- [ ] Description is under the platform limit—1024 characters for SKILL.md; check your platform's quota.
- [ ] No XML tags (< or >) in SKILL.md descriptions—they break Claude's frontmatter parser.
- [ ] Side effects are disclosed: state writes, deletes, outbound network calls, costs, or rate limits.
- [ ] Idempotency is noted: state explicitly whether calling the tool twice with the same arguments has additional effects beyond the first call.
- [ ] Counter-examples are included when two tools could plausibly handle the same request, telling the agent which to prefer and when.
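The length and XML-tag rules above are mechanical enough to lint before commit. A sketch assuming the 1024-character SKILL.md limit (substitute your platform's quota):

```python
MAX_DESCRIPTION_CHARS = 1024  # SKILL.md limit; check your platform's quota

def lint_description(desc: str) -> list[str]:
    """Return a list of lint findings for a manifest description."""
    findings = []
    if not desc.strip():
        findings.append("description is empty")
    if len(desc) > MAX_DESCRIPTION_CHARS:
        findings.append(
            f"description is {len(desc)} chars, limit {MAX_DESCRIPTION_CHARS}"
        )
    if "<" in desc or ">" in desc:
        findings.append("description contains XML angle brackets")
    return findings

assert lint_description(
    "Fills PDF form fields. Use when the user says "
    "'fill out this form' or 'complete this PDF'."
) == []
```

Wire this into the same pre-commit hook as the name check so both fail fast.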
3. Input schema (JSON Schema)
- [ ] type: "object" on the root of inputSchema or input_schema.
- [ ] Every property has a type—runtimes including Gemini CLI silently skip tools whose schema lacks types.
- [ ] Every property has a description that explains the parameter and gives an example value when ambiguous.
- [ ] Parameter names are unambiguous: user_id not user, start_date_iso not start.
- [ ] required array lists only truly required fields; optional fields are omitted on call, not passed as null.
- [ ] additionalProperties: false is set on objects to prevent argument drift.
- [ ] No unsupported keywords for your runtime—Microsoft Foundry rejects anyOf and nullable unions and requires exactly one type per property.
- [ ] Enums are used for fields with a fixed set of values, with each value documented.
- [ ] Numeric fields declare ranges (minimum, maximum) where business rules apply.
- [ ] String fields declare format or pattern (format: "email", pattern: "^[A-Z]{2}$") where applicable.
- [ ] Output schema is defined when the platform supports it, so downstream tools and evals can assert on results.
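The schema rules above can be checked structurally without a full JSON Schema validator. A sketch using a hypothetical create_invoice tool (the tool and its fields are illustrative):

```python
# Hypothetical inputSchema for a create_invoice tool, following the rules above.
input_schema = {
    "type": "object",
    "properties": {
        "customer_id": {
            "type": "string",
            "description": "Internal customer identifier, e.g. 'cus_8842'.",
        },
        "amount_cents": {
            "type": "integer",
            "description": "Invoice total in cents.",
            "minimum": 1,
        },
        "currency": {
            "type": "string",
            "description": "ISO 4217 currency code.",
            "enum": ["USD", "EUR", "GBP"],
        },
    },
    "required": ["customer_id", "amount_cents", "currency"],
    "additionalProperties": False,
}

def lint_input_schema(schema: dict) -> list[str]:
    """Check the structural rules from the checklist above."""
    findings = []
    if schema.get("type") != "object":
        findings.append("root must declare type: object")
    if schema.get("additionalProperties") is not False:
        findings.append("additionalProperties should be false")
    props = schema.get("properties", {})
    for name, spec in props.items():
        if "type" not in spec:
            findings.append(f"property '{name}' has no type")
        if "description" not in spec:
            findings.append(f"property '{name}' has no description")
    for req in schema.get("required", []):
        if req not in props:
            findings.append(f"required field '{req}' is not defined")
    return findings

assert lint_input_schema(input_schema) == []
```

This catches the missing-type and undeclared-required defects that make runtimes silently drop a tool; a real JSON Schema validator should still run in CI.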
4. Examples and trigger phrases
- [ ] At least one example invocation is included with realistic argument values.
- [ ] Examples cover the common path and at least one edge case (empty list, paginated result, error).
- [ ] Trigger phrases mirror real user intent: "schedule a meeting," "find recent invoices," not abstract verbs.
- [ ] A negative example shows when the tool should not be invoked—the cheapest fix for wrong-tool selection.
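One way to hold yourself to the example rules above is to store examples as structured data an eval harness can replay. A sketch for a hypothetical find_invoices tool (the field names are illustrative, not a standard):

```python
# Illustrative example set for a hypothetical find_invoices tool.
examples = [
    {   # common path
        "prompt": "find recent invoices for Acme Corp",
        "should_invoke": True,
        "arguments": {"customer_id": "cus_8842", "limit": 10},
    },
    {   # edge case: no customer filter
        "prompt": "show me this month's invoices",
        "should_invoke": True,
        "arguments": {"limit": 10},
    },
    {   # negative example: creating an invoice is a different tool
        "prompt": "create an invoice for Acme Corp",
        "should_invoke": False,
        "arguments": None,
    },
]

assert any(not e["should_invoke"] for e in examples), "need a negative example"
assert sum(e["should_invoke"] for e in examples) >= 2, "need common path + edge case"
```

Keeping the negative example in the same file as the positives makes the wrong-tool boundary explicit and testable.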
5. Bundled files and references
- [ ] Referenced files exist at the path the manifest claims (reference.md, forms.md, scripts/*).
- [ ] Referenced files are under the platform's loadable size—oversized references silently truncate.
- [ ] scripts/ content is reviewed for outbound network calls, file system writes, and credential reads.
- [ ] No secrets are committed: API keys, tokens, and customer data are read from environment or vault, not the manifest.
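The file-existence and secret checks above lend themselves to a pre-commit script. A sketch with illustrative patterns only; a dedicated scanner covers far more in CI:

```python
import re
from pathlib import Path

# Illustrative secret patterns; a real scanner covers many more formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # common API-key shape
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key id shape
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]+"),
]

def check_references(manifest_dir: Path, referenced: list[str]) -> list[str]:
    """Return findings for missing referenced files and embedded secrets."""
    findings = []
    for rel in referenced:
        path = manifest_dir / rel
        if not path.exists():
            findings.append(f"referenced file missing: {rel}")
            continue
        text = path.read_text(errors="ignore")
        for pat in SECRET_PATTERNS:
            if pat.search(text):
                findings.append(f"possible secret in {rel}")
    return findings
```

Feed it the paths the manifest claims (reference.md, forms.md, scripts/*) and fail the commit on any finding.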
6. Security review
- [ ] The manifest does not instruct the agent to exfiltrate data to third-party endpoints (Cisco and Snyk have published findings on prompt-injection exfiltration via community skills).
- [ ] Prompt-injection surfaces are minimized: inputs that flow into shell commands or HTTP bodies are explicitly described as untrusted.
- [ ] Auth scope is least-privilege: the tool requests only the permissions it actually uses.
- [ ] Destructive operations are gated behind an explicit confirm parameter or a separately named tool.
- [ ] mcp-scan or equivalent SAST has been run on the manifest and its bundled scripts.
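The confirm-gate rule above can be a one-line guard in the tool handler. A sketch; delete_records and its arguments are hypothetical:

```python
def delete_records(record_ids: list[str], confirm: bool = False) -> str:
    """Destructive tool: refuses to act unless confirm=True is passed.

    Requiring an explicit confirm argument means the model must
    deliberately opt in, which surfaces destructive intent in logs
    and eval traces instead of hiding it in a default.
    """
    if not confirm:
        return ("Refusing to delete: call again with confirm=true "
                f"to delete {len(record_ids)} records.")
    # ... perform the deletion against the real backend here ...
    return f"Deleted {len(record_ids)} records."
```

The alternative named in the checklist, a separately named tool such as delete_records_confirmed, trades one extra catalog entry for an even clearer signal in tool-selection logs.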
7. Runtime validation
- [ ] Manifest passes the platform validator: mcp-validation, Microsoft 365 Agents Toolkit validate, Foundry tool catalog, or the equivalent.
- [ ] The tool loads in a clean agent session without warnings.
- [ ] A scripted eval invokes the tool with at least three realistic prompts and asserts on tool selection, argument shape, and outcome.
- [ ] Schema validation is wired into CI, so the manifest cannot regress between sprints—an unenforced spec drifts within two sprints.
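The CI gate above can start as a small script that parses SKILL.md frontmatter and fails the build on the defects this checklist names. A sketch assuming simple `key: value` frontmatter; real pipelines should use a proper YAML parser:

```python
def parse_frontmatter(text: str) -> dict:
    """Naive parser for simple 'key: value' SKILL.md frontmatter."""
    lines = text.splitlines()
    assert lines and lines[0] == "---", "missing frontmatter fence"
    fields = {}
    for line in lines[1:]:
        if line == "---":
            return fields
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    raise AssertionError("unterminated frontmatter")

def validate_manifest(text: str) -> None:
    """Fail the build if required fields are missing or malformed."""
    fm = parse_frontmatter(text)
    assert fm.get("name"), "name is required"
    assert fm.get("description"), "description is required"
    assert "<" not in fm["description"], "no XML tags in description"
    assert len(fm["description"]) <= 1024, "description over SKILL.md limit"

SAMPLE = """---
name: pdf-form-filler
description: Fills PDF form fields. Use when the user asks to complete a PDF form.
---
# PDF Form Filler
"""
validate_manifest(SAMPLE)  # passes silently; raises AssertionError on defects
```

Run it over every SKILL.md in the repo on each pull request so the manifest cannot merge in a broken state.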
8. Discoverability and documentation
- [ ] The tool appears in tools/list (MCP) or the platform tool catalog after registration.
- [ ] Public README or doc page explains install, configuration, and example prompts for human users.
- [ ] Changelog is updated with the version bump and the behavioral change.
Common failure modes this checklist catches
- Wrong-tool selection. Two near-identical descriptions cause the agent to pick list_contacts when it should pick search_contacts. Fixed by section 2 (description) and section 4 (counter-examples).
- Malformed arguments. A loose inputSchema lets the LLM invent fields. Fixed by section 3 (additionalProperties: false, types on every property).
- Silent skipping. A missing type or unsupported keyword causes the runtime to drop the tool from the catalog. Fixed by section 3 and section 7.
- Prompt-injection exfiltration. A bundled script makes outbound calls under the user's auth. Fixed by section 5 and section 6.
- Spec drift. The manifest passes locally but fails in production because nothing asserts on it. Fixed by section 7 (CI enforcement).
FAQ
Q: How long should an agent tool description be?
Keep it tight—usually 200 to 600 characters of plain prose, well under the SKILL.md limit of 1024. Long descriptions get truncated or deprioritized by the LLM's tool-selection step. Aim for one sentence on what the tool does, one on when to use it, and one on side effects or destructive behavior.
Q: Does this checklist apply to MCP tools as well as SKILL.md?
Yes. Identity, descriptions, JSON Schema inputs, examples, and security review apply to every manifest format—SKILL.md, MCP tool definitions, Microsoft agent manifests, and proprietary platform tools. The runtime-validation row is where you swap in the platform-specific validator (mcp-validation, Microsoft 365 Agents Toolkit, Foundry catalog).
Q: Should I use anyOf, oneOf, or nullable unions in tool input schemas?
Avoid them unless your runtime explicitly supports them. Microsoft Foundry, for example, rejects anyOf and nullable unions, and several MCP clients are similarly strict. Define each property with one type and mark optional fields by leaving them out of required rather than allowing null.
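The pattern in the answer above—one type per property, optionality expressed by omission from required—looks like this in practice (a sketch; the fields are hypothetical):

```python
# Instead of {"type": ["string", "null"]} or anyOf, keep one type per
# property and express optionality by leaving the field out of "required".
schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "Search text."},
        "limit": {
            "type": "integer",
            "description": "Max results to return; defaults to 20 if omitted.",
            "minimum": 1,
        },
    },
    "required": ["query"],  # limit is optional: omit it, never pass null
    "additionalProperties": False,
}

assert "limit" not in schema["required"]
```

This shape passes strict runtimes and still tells the model exactly what an omitted field means, via the default documented in the description.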
Q: Where should this checklist live in our workflow?
Put it in your pull-request template and run the runtime-validation row in CI as a non-negotiable quality gate. Treat the manifest as an executable contract: if no test asserts against it, it will drift within two sprints.
Q: How often should we re-audit existing manifests?
Every 90 days, or whenever the underlying API changes. Manifests rot quickly because the LLM behavior, the platform validator rules, and the bundled tool semantics all evolve independently of the manifest file itself.
Related Articles
Agent Authentication Documentation Spec
Document authentication for autonomous agents: OAuth flows, API keys, scopes, error states, and consent UX patterns AI agents need to operate safely.
Agent Circuit Breaker Specification
Specification for circuit breakers protecting AI agent calls to LLM providers and tools, including state transitions, threshold tuning, fallback strategies, and observability hooks.
Agent Citation Attribution Specification: Verifiable Source Tracking for Autonomous AI Agents
Specification defining HTTP headers, provenance manifests, and chain-of-citation markup so autonomous AI agents produce verifiable citations to source content.