Designing Agent Skills with Progressive Disclosure: SKILL.md Best Practices and Practical Eval Methods

Design Agent Skills with progressive disclosure, write SKILL.md so the description block uniquely matches user intent, and verify selection accuracy with focused evals.

Agent Skills, which Anthropic published on October 16, 2025 for Claude Code and the Claude apps, are different from OpenAI Assistants "tools" and Semantic Kernel "skills." Each skill is a folder whose entry point is SKILL.md; the name and description in that file are what the model sees first. (Anthropic)

When building agent skills (or tools/commands), most people focus on "making it work." What actually decides user experience and stability is usually:

Whether the agent selects the correct skill (selection accuracy)
Whether it executes according to rules after selection (execution reliability)
Whether misselections and misfires increase rapidly as the number of skills grows (confusability)

Here's an engineering approach: design skill structures (especially SKILL.md) with progressive disclosure, and use simple but effective evals to check which version is more accurate. For where skills fit into the larger picture, see our agent design overview.

1) Progressive Disclosure: Why It's the Core Principle of Skill Design

The spirit of progressive disclosure is: Show essential information first, defer advanced or less-used information until needed. This reduces learning costs and error rates. (Nielsen Norman Group)

Applied to agent skills, this naturally forms three loading stages:

Discovery: Read only name + description (determine "potentially relevant")
Activation: Load the SKILL.md body only once the agent decides to use this skill (load rules, constraints, workflow)
Execution: Follow SKILL.md instructions to run scripts/tools, read additional files as needed
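The three stages above can be sketched as a lazy loader. This is a minimal illustration, not Anthropic's implementation: it assumes each SKILL.md opens with a flat `key: value` metadata block between two `---` lines, and the `Skill`, `discover`, and `activate` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Skill:
    name: str
    description: str
    source: str                 # full SKILL.md text, standing in for a file on disk
    body: Optional[str] = None  # stays None until the skill is activated


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a document into (metadata, body).

    Handles only flat `key: value` lines between two `---` markers;
    a real loader would use a proper YAML parser.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing marker: treat everything as body


def discover(source: str) -> Skill:
    """Discovery: keep only name + description in the agent's context."""
    meta, _ = split_frontmatter(source)
    return Skill(name=meta["name"], description=meta["description"], source=source)


def activate(skill: Skill) -> str:
    """Activation: load the rules and workflow the first time they are needed."""
    if skill.body is None:
        _, skill.body = split_frontmatter(skill.source)
    return skill.body


# Hypothetical skill used only for illustration.
EXAMPLE = """---
name: pipeline-status
description: Show pipeline run status. Use when the user asks about a run's state. Read-only.
---
# Pipeline Status

Never modify pipeline state.
"""

skill = discover(EXAMPLE)   # cheap: metadata only
rules = activate(skill)     # body loaded on demand
```

Execution would then follow whatever `rules` says, reading further files only when the instructions point to them.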

This "manual-like" layered loading is the design principle skills systems use for extensibility and token efficiency. Anthropic's engineering team puts it directly: "Progressive disclosure is the core design principle that makes Agent Skills flexible and scalable." (Anthropic)

2) The Truth About Discovery: It's Closer to Pattern Matching Than Full Understanding

Many assume the agent reads and fully understands every skill when selecting one. In most agent frameworks, the skill's "entry signal" is just the description. It should work like a table of contents or index: help the model decide whether to dig deeper. (Claude)

So the description's goal is simple: raise the probability of being selected correctly, not cram in execution details. The Anthropic engineering write-up also calls "loading more context only when needed" the key to scaling skills. (Anthropic)

3) SKILL.md Best Practices: Treat SKILL.md as "Agent-Executable API Documentation"

3.1 Description (Discovery) Writing: One Sentence Should Answer Three Things

A good description should answer:

What I do (clear verb + specific object)
When to use (common user intent)
Any hard constraints (e.g., read-only)

Recommended template:

<Verb> <object>. Use when <specific user intent>. <critical constraint>.
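One way to keep descriptions on this template is a small lint step. The sketch below is an assumption-laden illustration: `check_description` is a hypothetical helper, the sentence splitter is deliberately naive, and the 1024-character ceiling is an assumed limit that you should confirm against your platform's documentation.

```python
import re
from typing import List

MAX_DESCRIPTION_CHARS = 1024  # assumed limit; confirm in your platform's docs


def check_description(description: str) -> List[str]:
    """Return a list of template problems; an empty list means it passes."""
    problems: List[str] = []
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", description.strip()) if s]

    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description is too long")
    if sentences and sentences[0].lower().startswith("use when"):
        problems.append("open with what the skill does, not when to use it")
    if not any(s.lower().startswith("use when") for s in sentences):
        problems.append("missing a 'Use when ...' sentence")
    return problems


# Hypothetical descriptions for illustration.
good = ("Show pipeline run status. "
        "Use when the user asks whether a run passed or failed. Read-only.")
vague = "Helps with pipelines."

good_problems = check_description(good)
vague_problems = check_description(vague)
```

The check does not require a constraint sentence, since not every skill has a hard constraint to state.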

This matches Anthropic's framing of SKILL.md as an overview that points the model to detailed material: entry first, details later. (Claude)

3.2 SKILL.md Structure (Activation) Recommendations: Behavior Boundaries First, Then Execution Path

An agent-friendly SKILL.md that reduces misfires should follow this section order:

Role definition (what you are, your task; 1–2 sentences)
Capabilities (what you can do)
Safety & Constraints (what you cannot do; use strong language)
Execution method (how to run: script/command entry point)
Usage examples (typical commands)
Arguments & defaults (parameter parsing and defaults)
Workflow (Parse → Resolve defaults → Execute → Present output)
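The section order above can be enforced mechanically. The following sketch assumes each section is a level-two Markdown heading (`## ...`) and that the heading texts match the hypothetical names in `EXPECTED_ORDER`; adjust both to your own SKILL.md conventions.

```python
from typing import List

# Hypothetical heading names mirroring the recommended order.
EXPECTED_ORDER = [
    "Role",
    "Capabilities",
    "Safety & Constraints",
    "Execution",
    "Usage examples",
    "Arguments & defaults",
    "Workflow",
]


def section_order_problems(skill_md: str) -> List[str]:
    """Report missing sections and adjacent sections that are out of order."""
    headings = [
        line[3:].strip() for line in skill_md.splitlines() if line.startswith("## ")
    ]
    position = {name: i for i, name in enumerate(headings)}

    problems = [f"missing section: {name}" for name in EXPECTED_ORDER
                if name not in position]
    present = [name for name in EXPECTED_ORDER if name in position]
    # Compare each expected neighbor pair against its actual position.
    for earlier, later in zip(present, present[1:]):
        if position[earlier] > position[later]:
            problems.append(f"'{later}' appears before '{earlier}'")
    return problems


# A document with every section in the recommended order.
ordered_doc = "\n\n".join(f"## {name}\n..." for name in EXPECTED_ORDER)
ordered_problems = section_order_problems(ordered_doc)
```

Running this in CI keeps the behavior boundaries ahead of the execution path even as several people edit the file.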

The official docs also recommend "progressive disclosure patterns" to keep the overview separate from detailed material, with concrete tips for SKILL.md (split files when content grows, keep the overview concise). (Claude)

3.3 Constraints Writing: Use "Absolute Language" to Prevent the Model from Treating Constraints as Suggestions

For safety boundaries or operational boundaries, use:

✅ You may only ...
✅ Never ...

Not:

❌ You should not ...
❌ Please avoid ...

Agents tend to scan instructions, and strong language holds up better as a constraint (especially for read-only, no secrets, and similar boundaries). (Claude)
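A simple lint can flag soft phrasing before it ships. This is a sketch under stated assumptions: the first three patterns come from the examples above, "try not to" is an extra pattern added for illustration, and `find_soft_constraints` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Phrases that read as suggestions rather than rules.
SOFT_PATTERNS = [
    r"\bshould not\b",
    r"\bshouldn't\b",
    r"\bplease avoid\b",
    r"\btry not to\b",   # not from the article; added as an assumption
]


def find_soft_constraints(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for each line containing soft phrasing."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(re.search(p, line, flags=re.IGNORECASE) for p in SOFT_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical constraints section for illustration.
constraints = """You may only read pipeline data.
Please avoid printing secrets.
Never modify pipeline state.
You should not retry more than once."""

soft_lines = find_soft_constraints(constraints)
```

Each flagged line is a candidate for rewriting with "You may only ..." or "Never ...".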

4) Eval: How Do You Know Which Version Is More Accurate?

There's no theoretical answer; you can only verify with evals. And you need to be specific about which "accuracy" you're verifying:

Selection accuracy: Should this skill have been selected?
Execution correctness: After selection, did it follow rules and produce correct results?

OpenAI's eval guide treats tool selection itself as a capability you need to test, not just the output text. (OpenAI Platform)

4.1 Cheapest and Most Effective: Activation Test (Small Test Set)

Approach:

Prepare 10–30 user intents
Label each with "expected skill to select" (or "expected to select no skill")
Using the same agent setup, compare precision/recall across different description versions
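The activation test above reduces to a few lines of scoring code. In this sketch, `stub_select` is a keyword-based stand-in for the real agent call (an assumption; replace it with whatever returns the skill your agent actually picked), and the intents and skill names are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# (user intent, expected skill name, or None when no skill should fire)
Case = Tuple[str, Optional[str]]


def selection_metrics(
    cases: Sequence[Case],
    select: Callable[[str], Optional[str]],
    skill: str,
) -> Dict[str, float]:
    """Precision and recall for one skill over a labeled intent set."""
    tp = fp = fn = 0
    for intent, expected in cases:
        chosen = select(intent)
        if chosen == skill and expected == skill:
            tp += 1          # selected, and should have been
        elif chosen == skill:
            fp += 1          # selected, but should not have been
        elif expected == skill:
            fn += 1          # should have been selected, but was not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def stub_select(intent: str) -> Optional[str]:
    """Stand-in for the agent; a real harness would query the model."""
    if "status" in intent or "haiku" in intent:
        return "pipeline-status"
    if "fail" in intent:
        return "pipeline-logs"
    return None


CASES: Sequence[Case] = [
    ("is the nightly build green?", "pipeline-status"),
    ("show status of run 1234", "pipeline-status"),
    ("why did the last deploy fail?", "pipeline-logs"),
    ("write a haiku about CI", None),
]

metrics = selection_metrics(CASES, stub_select, "pipeline-status")
```

To compare description versions, keep `CASES` fixed, swap in each version, and compare the resulting numbers per skill.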

Same idea as OpenAI's eval guide: turn the behaviors you care about into repeatable tests, so upgrades and prompt changes don't silently regress. (OpenAI Platform)

4.2 Confusable Skills Testing (Essential When Skills Multiply)

When you have multiple similar skills (e.g., all related to ADO), your test set should intentionally include easily confused sentences, such as:

"status / latest / monitor / details / history"
"find a run id"
"show last failed run"

The goal is to verify "collision-prone" intents separately: can the agent consistently select the skill you expect?
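For collision-prone intents, it helps to count which skill was chosen in place of which. A minimal sketch, assuming you have already collected `(expected, chosen)` pairs from your agent runs; the skill names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Outcome = Tuple[Optional[str], Optional[str]]  # (expected skill, chosen skill)


def confusion_pairs(outcomes: Iterable[Outcome]) -> Counter:
    """Count each (expected, chosen) pair where the agent picked wrong."""
    return Counter((expected, chosen) for expected, chosen in outcomes
                   if expected != chosen)


# Hypothetical outcomes from a confusable-intent test set.
outcomes = [
    ("run-status", "run-status"),
    ("run-status", "run-history"),
    ("run-history", "run-status"),
    ("run-status", "run-history"),
    (None, "run-status"),
]

pairs = confusion_pairs(outcomes)
worst = pairs.most_common(1)  # the pair to disambiguate first
```

The most frequent pair points at the two descriptions that most need mutually exclusive wording.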

OpenAI's eval guide and cookbook examples treat this as a deterministic, behavior-oriented check (tool selection, parameter validity). (OpenAI Developers)

5) Recommended Iteration Process: Use Evals to Drive Skill Documentation Evolution

Treat skill documentation as a "product interface" with an engineering loop:

Write description + SKILL.md
Run activation test (selection eval)
Review the misselection cases, then adjust wording and constraint clauses (make skills more mutually exclusive)
Run regression tests whenever adding skills or significantly changing descriptions
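The regression step in this loop can be as simple as diffing two runs of the same test set. The sketch below assumes each run is stored as a mapping from intent to the skill that was chosen; `find_regressions` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional

Selections = Dict[str, Optional[str]]  # intent -> chosen skill (None = no skill)


def find_regressions(
    expected: Selections,
    baseline: Selections,
    candidate: Selections,
) -> List[str]:
    """Intents the baseline got right that the candidate now gets wrong.

    All three mappings must contain the same intents.
    """
    return sorted(
        intent for intent, want in expected.items()
        if baseline[intent] == want and candidate[intent] != want
    )


# Hypothetical labeled set and two runs: before and after a description change.
expected = {
    "show last failed run": "run-history",
    "status of run 42": "run-status",
    "tell me a joke": None,
}
baseline = dict(expected)  # the previous version selected correctly everywhere
candidate = {**expected, "show last failed run": "run-status"}

regressed = find_regressions(expected, baseline, candidate)
```

A non-empty result is a reasonable signal to block the change until the affected descriptions are reworded.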

OpenAI's eval guide treats regression tests for upgrades and iterations as core to building reliable LLM apps. (OpenAI Platform)

Conclusion

Building skill functionality is easy. The hard part over the long run is getting it selected correctly and following the rules reliably once selected.

Use progressive disclosure to separate entry signals (description) from execution rules (SKILL.md), then use selection-focused evals to test the "skill selection" step itself. That way, your skill library stays stable as it grows instead of seeing misfire rates spike each time you add a skill. (Anthropic)
