Language: 中文 · English (this page)
This is a case-driven tutorial, not a skill API reference. Intended audience: developers who have written Claude Code prompts but never authored a skill. After reading, you should be able to answer: "When I face a new task, how do I decide whether to make it a skill, how do I design it, and how do I avoid common pitfalls?"
Claude Code ships with a meta-skill called skill-development inside the plugin-dev plugin. It covers the syntax-level territory — YAML frontmatter fields, Markdown writing style, progressive disclosure — thoroughly enough. Anyone about to write their first skill should read it first.
What that official documentation deliberately leaves blank, however, is the territory this walkthrough fills: when you face a real task, how do you decide whether it deserves to be a skill, how do you design it, how do you avoid common pitfalls, and how do you maintain it. There is no canonical answer to any of those questions — they can only be transmitted through a complete case that shows the reasoning behind each decision.
I recently built a skill called sanitize-stack to handle a Chromium crash-stack scrubbing task, and I walked the full path from "should I even do this" to "how to keep it maintained." The decision chain is typical enough to be worth unpacking for anyone who hasn't written a skill yet. Below, in chronological order.
Not every repetitive task deserves to become a skill. Building one has real cost (writing SKILL.md, maintaining references, understanding the trigger mechanism), and that cost has to be amortized by work it saves later. I apply three tests.
Pure string replacement or a one-line bash command almost never justifies a skill — there's no judgment space, so a shell alias or a git hook does the job.
A skill's sweet spot is tasks where every step has rules, but those rules have judgment calls inside them. Take sanitize-stack:
- Which frames are signal vs noise? A judgment call (when the crash happens inside `base::RunLoop`, that frame stops being noise)
- How aggressively should C++ template signatures be collapsed? A judgment call (`std::__Cr::basic_string<char16_t, ...>` becomes `std::u16string`, but `scoped_refptr<T>` stays verbatim)
- What text does the elision marker use? A judgment call (UI thread vs worker thread need different phrasing)
These judgments can be written down (that's precisely where the skill's value lies), but making them requires looking at each specific input. That's the sweet spot.
Counterexample: replacing every `downstream.dll` in a file with `chrome.dll`. No judgment, a one-line `sed` suffices, not skill-worthy.
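The template-collapse judgment above can be pictured as code. The following is a hypothetical sketch of one row of the collapse table (the real SKILL.md table has six rows); `collapse` is an illustrative name, not part of the skill itself:

```python
def collapse(frame: str) -> str:
    """Collapse one known template spelling to its short alias.

    Hypothetical sketch of a single collapse rule. scoped_refptr<T>
    is simply never listed as a rule, so it stays verbatim by default.
    """
    prefix = "std::__Cr::basic_string<char16_t"
    alias = "std::u16string"
    i = frame.find(prefix)
    if i < 0:
        return frame  # not a collapsible spelling: leave verbatim
    # Walk from the opening '<', tracking angle-bracket depth, so
    # nested template arguments don't truncate the match.
    j = frame.index("<", i)
    depth = 0
    while j < len(frame):
        if frame[j] == "<":
            depth += 1
        elif frame[j] == ">":
            depth -= 1
            if depth == 0:
                break
        j += 1
    return frame[:i] + alias + frame[j + 1:]
```

The depth tracking matters: a naive pattern like `<[^>]*>` would stop at the first `>` inside a nested argument such as `char_traits<char16_t>`.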
The second threshold is usage frequency. Building a skill takes roughly 30 minutes to 2 hours (structure, SKILL.md, references, smoke test). That investment only pays off if you'll use the result repeatedly.
A quick way to test this: count how many times in the last 1–3 months you've done something similar. More than three times, and the pattern will almost certainly keep recurring. Only once, and it's probably "just one quick script" territory.
For sanitize-stack: I'm someone who repeatedly ferries crash fixes from a downstream Chromium branch to upstream trackers, and every single crash fix brings another round of the same scrubbing task. Frequency high enough to amortize.
Some tasks are "wrong is fine, try again." Others are "wrong once, major problem." Only the latter justify the extra investment of codifying rules in a skill.
Scrubbing is a textbook "wrong once, major problem" case: a single downstream module name leaking into a public tracker is permanently exposed — the comment can be edited, but crawlers and mirror sites won't follow the edit. In this kind of task, the skill's value isn't saving time; it's preventing mistakes.
Counterexample: code formatting. Get it wrong, and clang-format fixes it back. No skill needed.
sanitize-stack passed all three. That was an unhesitating "build" decision.
Conversely, if any one test fails, stop and reconsider — a bash alias or a frequently-pasted prompt snippet might be enough, and a skill would be overkill. Overengineering is the first trap in skill development.
Once you've decided to build, three design questions follow: where does it live, what do you call it, how is its internal structure cut.
Claude Code supports two locations for skills:
- User-level: `~/.claude/skills/<name>/`, travels with you, the person
- Project-level: `<project>/.claude/skills/<name>/`, travels with a specific checkout
My rule: follow the skill's knowledge domain, not convenience.
- If the skill's knowledge is only meaningful for one specific codebase (say, a monorepo's internal build workflow) → project-level
- If the skill's knowledge spans multiple projects (say, the Chromium stack boilerplate list, which is true of any Chromium checkout) → user-level
sanitize-stack's knowledge domain is Chromium-the-project, not "the chromium source tree I happen to have checked out right now." Tomorrow I might be working on a different Chromium branch, and this skill should follow me there. User-level.
A common misjudgment: dropping a skill into the current project's directory just because that's where you happened to be working. That's locking cross-project knowledge inside a single tree, and the next time you work in a different tree you'll end up recreating it. Follow the knowledge domain, not your current working directory.
Look at Claude Code's existing skills for style: `simplify`, `loop`, `schedule`, `commit`, `review-pr`. Three common patterns:
- Short — one or two words
- Verb-forward (or a word that functions as a verb)
- Describes an action, not a domain
There's a subtle psychology behind naming: you trigger a skill by typing `/` followed by its name, and the longer the name, the less you'll reach for it. Long names drain usage willingness the same way long paths do.
For sanitize-stack I filtered four candidates:
| Candidate | Verdict |
|---|---|
| `sanitize-stack` | Chosen. Verb-forward, neutral, unambiguous. |
| `scrub-stack` | Shorter, but "scrub" carries a "covering up evidence" connotation. |
| `clean-stack` | Too generic — could be confused with "reformat." |
| `chromium-stack` | Wrong focus — a name should convey the action, not the domain. |
Before writing your first skill, read at least two existing SKILL.md files. Not one, two:
- One meta: tells you the format conventions (frontmatter fields, section layout, writing-style requirements)
- One concrete: tells you how a real skill actually reads (specific trigger phrases, a real pipeline description)
What I read this time:
- Meta: `plugin-dev/skills/skill-development/SKILL.md` (a skill about how to write skills — meta-recursion)
- Concrete: `claude-md-management/skills/claude-md-improver/SKILL.md` (a real user-facing skill)
If you don't know where to find these, glob for them:

```shell
find ~/.claude -name "SKILL.md" 2>/dev/null
```

Any Claude Code installation with the official plugin marketplace has dozens of SKILL.md files to draw from. Three to five is plenty.
The heart of a skill is its SKILL.md file, which consists of YAML frontmatter plus a Markdown body.
```yaml
---
name: sanitize-stack
description: This skill should be used when the user asks to "sanitize a crash stack", "scrub a stack trace", ...
tools: Read
version: 0.1.0
---
```

`name`: the skill's invocation name. Keep it identical to the directory name.
`description`: the most important field. Claude Code uses this string to decide when to auto-trigger your skill. Two hard rules:

- Use third person. Write `This skill should be used when the user asks to X`, not `Use this skill when you want X`. Why: the description is read by a different Claude instance — the one deciding whether to trigger this skill — and third person gives it a clear observer's viewpoint. Second person would confuse "should I invoke this skill" with "am I the target user of this skill."
- List concrete trigger phrases, not abstract descriptions:
  - ❌ Bad: `Provides guidance for sanitizing crash stacks.`
  - ✅ Good: `This skill should be used when the user asks to "sanitize a crash stack", "scrub a stack trace", "prepare a stack for crbug", "脱敏崩溃堆栈", or pastes a native crash stack...`

Concrete phrases turn the trigger decision into pattern matching. Abstract descriptions turn it into semantic reasoning, which is substantially less accurate.
Note the "脱敏崩溃堆栈" (Chinese for "sanitize the crash stack") in the example — I deliberately included a Chinese trigger phrase because I often mix Chinese and English when talking to Claude. If you're an English-only user, skip it; if you're bilingual, list both.
`tools`: restrict the tool set this skill is allowed to use. sanitize-stack only needs `Read` (to read a stack from a file path, if the user provides one), so nothing else is listed. The benefit: it makes the skill's behavior more predictable and prevents it from wandering off into calling `Bash` or `Write` for something weird.

`version`: start at 0.1.0 and bump on substantial changes. There's no strict convention — semver is the usual reference.
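For illustration, here is a minimal sketch of how such flat frontmatter could be parsed. It assumes the simple one-line-per-field shape shown above; a real loader would use a YAML library, and `parse_frontmatter` is a hypothetical helper, not Claude Code's actual mechanism:

```python
def parse_frontmatter(skill_md: str) -> dict:
    """Parse the flat key: value block between the two '---' markers.

    Hypothetical helper, for illustration only. Assumes every field
    fits on one line, as in the frontmatter example above.
    """
    lines = skill_md.splitlines()
    assert lines[0] == "---", "frontmatter must open the file"
    end = lines.index("---", 1)  # closing marker
    fields = {}
    for line in lines[1:end]:
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields
```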
This rule is the one the plugin-dev meta-skill states most bluntly:
- ✅ `Scan the stack for module name patterns.`
- ❌ `You should scan the stack for module name patterns.`

Two reasons imperative wins:
- Consistency: a SKILL.md written entirely in imperative reads like a specification. Mixing in "you should" and "if you want" makes it read like a casual blog post and look unprofessional.
- AI consumption: skills are read by another Claude instance executing a task. Imperative is an instruction; second person is a conversation. The former is what Claude needs during execution.
A simple check: read your SKILL.md, and if any sentence starts with "You", that's a violation. Rewrite it to start with a verb.
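That check is mechanical enough to script. A hypothetical five-line lint (not part of any official tooling) might look like:

```python
import re

def second_person_lines(skill_md: str) -> list[str]:
    """Flag lines that open in second person instead of the imperative.

    Crude, illustrative check: catches "You", "You're", "You'll" at the
    start of a line; "Your" passes because the word boundary fails.
    """
    pattern = re.compile(r"\s*You(\b|'re|'ll)")
    return [line.strip() for line in skill_md.splitlines()
            if pattern.match(line)]
```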
Claude Code's skill file system has three layers:
- Metadata (the `name` + `description` in frontmatter): always loaded, ~100 words
- SKILL.md body: loaded when the skill triggers, target 1500–2000 words
- `references/` and other bundled files: loaded on demand by Claude
This three-layer structure is called progressive disclosure. The core idea: don't dump every detail on Claude at once; let Claude pull them in as needed.
In practice this means: put only the core workflow in SKILL.md, and sink the detailed rule tables into `references/`.
Here's how I cut sanitize-stack:
| Content | Location | Reason |
|---|---|---|
| The six pipeline step names and summaries | SKILL.md | Core workflow, must always be visible |
| Step 1's module-name allowlist | SKILL.md | Short, and a critical decision point |
| Step 3's template-collapse table | SKILL.md | Six rows — small enough to live inline |
| Step 4's complete noise-frame list | `references/noise-frames.md` | Hundreds of lines, drifts with Chromium releases — must be isolated |
| UI / worker / IO thread variants of the elision marker | `references/noise-frames.md` | Edge-case detail, not consulted every time |
The most common mistake in a skill is being too abstract:
❌ Bad:

> Apply reasonable judgment to decide which frames are noise.

This is equivalent to saying nothing. Claude reads it and can only go by feel, producing a different result on every invocation.

✅ Good:

> Elide frames matching the following regex families:
>
> - `^\s*base::internal::`
> - `^\s*base::TaskAnnotator::`
> - `^\s*base::MessagePump`
> - ...
Concrete regexes, concrete function-name prefixes, concrete "always keep" / "always elide" lists — these produce consistent results on every invocation.
A skill's value lies in consistency; specific rules beat vibes. Every time you're tempted to write "apply reasonable judgment," stop and ask: can this judgment be decomposed into a few explicit rules? The part that truly can't be decomposed should stay as judgment; everything else should be pinned down.
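As a sketch of how executable such rules become once pinned down, here is the classification expressed as code. The three patterns are a hypothetical subset of the elide list; the real `references/noise-frames.md` catalog is far longer:

```python
import re

# Hypothetical three-pattern subset of the elide list.
NOISE_PATTERNS = [re.compile(p) for p in (
    r"^\s*base::internal::",
    r"^\s*base::TaskAnnotator::",
    r"^\s*base::MessagePump",
)]

def classify(frames: list[str]) -> tuple[list[str], int]:
    """Split frames into kept signal and a count of elided noise."""
    kept = [f for f in frames
            if not any(p.match(f) for p in NOISE_PATTERNS)]
    return kept, len(frames) - len(kept)
```

Run twice on the same input, this produces the same split both times, which is the whole point: the rule, not the vibe, decides.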
What's worth sinking into references/?
Three criteria:
- Content that drifts. For example, sanitize-stack's noise-frame list — every Chromium release adds or renames some `base::Bind` variant. If a drifting list sits inside SKILL.md, every maintenance pass has to modify the main flow. Splitting it out makes maintenance cost drop immediately.
- Tables or lists over about 300 words. Short tables can live in SKILL.md; long ones bloat the skeleton.
- Material that might be grep'd independently. For example, "all variants of `base::MessagePump`" — that kind of list has value outside the SKILL.md workflow too.
What does NOT belong in references/:
- Core workflow steps (those are SKILL.md's job)
- High-level decisive classifications (like the coarse signal-vs-noise split)
- Metadata that affects trigger matching (that belongs in the frontmatter description)
Concretely this time:

- SKILL.md says: "Step 4: Classify Signal vs Noise. Keep frames under `chrome/`, `components/`, `content/`...; elide `base::internal::`, `base::TaskAnnotator::`...; see `references/noise-frames.md` for the complete list."
- `references/noise-frames.md` contains the full multi-dozen-row list, split into UI / worker / IO thread variants, plus a regex summary.
Result: SKILL.md stays readable as a skeleton, the complete list is always available when needed without cluttering the main flow.
This is the step most commonly skipped, and the step you should skip the least.
A skill's correctness can't be checked by a linter. You can't run skill-lint and see a green light — a skill is a natural-language instruction executed by Claude, and its correctness can only be verified by running it.
How to run: find a real task you've already done by hand once, feed it to your new skill's rules, and compare the output to your hand-crafted version.
For sanitize-stack, the smoke test used a 35-frame crash stack that I had manually scrubbed just before writing the skill, originally captured from a downstream Chromium-based browser. The examples/example-before-after.md file in this repository is a structurally faithful reconstruction of that scenario: module names have been replaced with the placeholder downstream.*, while function names, source paths, line numbers, and template signatures are preserved 1:1 with the original. This makes the file safe to publish while keeping it valid as a golden test input — every rule in the skill behaves identically on the synthetic version and on the original.
I walked through the 6-step pipeline from SKILL.md on this input:
- Step 1 detected `downstream.dll` → replaced with `chrome.dll`
- Step 2 detected 35 `行` (Chinese for "line") tokens → replaced with `line`
- Step 3 found 3 collapsible templates (2 × `std::u16string`, 1 × `std::unique_ptr`)
- Step 4 classified per list: 7 frames kept, 28 elided
- Step 5 rendered per format
- Step 6 scanned for PII, none found
Then I compared against the hand-crafted version — character-for-character identical. Same 7 frames in the same order, same elision-marker wording.
Two reasons:
- It proves the rules are complete. If the smoke-test output and the hand-crafted version differ, that means the hand-crafted version made some judgment the rules can't express — either go back and extend the rules, or acknowledge it as a judgment call that must be left to future invocations.
- It gives you a golden test. Later, when you change SKILL.md or `references/noise-frames.md`, you can re-run the same input and check for regressions. This is the closest thing to a unit test that a skill can have.
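A hedged sketch of what that golden test can look like in practice. `regression_diff` is a hypothetical helper, not part of the skill; in this repo, examples/example-before-after.md plays the golden-file role:

```python
import difflib
from pathlib import Path

def regression_diff(new_output: str, golden_path: str) -> list[str]:
    """Diff a fresh run of the skill against the stored golden output.

    Returns unified-diff lines; an empty list means no regression.
    Hypothetical helper for illustration.
    """
    golden = Path(golden_path).read_text().splitlines()
    return list(difflib.unified_diff(
        golden, new_output.splitlines(),
        fromfile="golden", tofile="current", lineterm=""))
```

Keeping the check this small is deliberate: the cost of re-running it after every edit to the rules rounds to zero.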
You'll likely discover a bug in the rules a month later, the first time you actually use the skill for real — and by then you'll have forgotten why you wrote the rules the way you did, making debugging 10× more expensive than it would have been up front. The smoke test costs 5 minutes; the return is avoiding that debugging hell.
Writing the skill is only the start; maintenance comes next. The core principle: put things that change at different frequencies into different files.
sanitize-stack's maintenance paths fall into three categories:
Scenario: a new Chromium release introduces a new boilerplate frame family (say, a new `base::ThreadPoolImpl::Worker` variant), and I need to add it to the elide list.

Maintenance path: edit `references/noise-frames.md`. SKILL.md's main flow doesn't change at all.

This is the whole reason for sinking the list into `references/` — isolate high-frequency changes from the main flow.
Scenario: deciding to add a Step 4.5 that runs deduplication before the Step 5 render, or changing the output format from two lines per frame to one.
Maintenance path: edit the body of SKILL.md. This kind of change should be rare — maybe once a year.
Scenario: noticing that "condense this stack" doesn't trigger the skill when it should, or that "show this file" triggers the skill when it shouldn't.
Maintenance path: edit the description field in SKILL.md's frontmatter to adjust the trigger-phrase list. Only the description; don't touch the body.
Each category maps to a different file. The benefit: when you know what kind of change you want to make, you immediately know which file to open, without re-reading the whole skill to decide where the change belongs.
That's why progressive disclosure isn't only a loading-efficiency concern — it's also a maintenance-efficiency concern.
You'll inevitably run into things you don't know while writing a skill. Mine this time: does Claude Code actually auto-discover user-level skills in ~/.claude/skills/?
The plugin-dev docs only describe the discovery mechanism for plugin-bundled skills, and stop short of explicitly confirming whether user-level locations work. Glob'ing around on this machine, I found that every existing SKILL.md on disk lived under plugins/marketplaces/... — the user-level location had never been populated before.
Faced with this kind of uncertainty, my approach is not to pretend to know the answer and not to freeze in place, but to:
- Explicitly declare "I don't know". Let the user know this is an unverified assumption.
- Offer fallback plans A / B: if auto-discovery works, great; if it doesn't, plan A (manually `Read` the SKILL.md as a prompt template) or plan B (wrap it as a local plugin).
- Suggest the cheapest verification step: open a new session and check the available-skills list.
A fresh session verified the assumption: user-level `~/.claude/skills/` is auto-discovered, and `/sanitize-stack` triggered correctly. The uncertainty was resolved in about thirty seconds of testing — but notice that the methodology (flag → fallback → verify) is independent of how that particular test came out. If auto-discovery hadn't worked, plan A or B would have covered me without delaying development. A secondary lesson lurks here too: verification cost is usually much lower than your anticipation of it. Carrying a question for an hour when thirty seconds of testing would settle it is a bad trade — just run the test.
This habit matters for skill development in particular. Skill development is experimental — you're writing prompts, and prompt behavior cannot be derived from first principles, unlike writing a compiler where you can derive correctness from a language spec. Flagging unknowns has more engineering value than pretending to know everything.
A corollary: don't treat a skill as a one-shot effort. The first version is a prototype for your own use; iterating based on real-world feedback is the normal flow.
Compressed into eight lines, pinnable to a wall:
- Ask the three questions before building a skill: repeatable with judgment? High frequency? Expensive when wrong? All three → build. Any one missing → reconsider.
- Placement follows knowledge domain: cross-project knowledge goes to `~/.claude/skills/`, project-private knowledge goes to `<project>/.claude/skills/`.
- Names should be short, verb-forward, and consistent with existing skills.
- Read two reference implementations: one meta (teaches you format), one concrete (teaches you voice).
- Descriptions use third person with concrete trigger phrases, not abstract summaries.
- Bodies use imperative, not second person; rules should be concrete enough to execute; decompose "judgment" into explicit rules wherever possible.
- Keep SKILL.md as a 1500–2000-word skeleton; sink drifting detail into `references/`.
- After writing, run a smoke test: feed a real historical case through the new rules and compare against the hand-crafted version.
My sanitize-stack is version 0.1.0 and is incomplete in several places:
- No companion skill to turn the scrubbed output into a Gerrit description automatically
- The noise-frame list only covers the common UI / worker / IO thread variants; GPU and utility processes aren't covered
- The template-collapse rules only handle libc++, not MSVC STL
All of this is deliberate. The right strategy for a first skill is ship it when it's good enough, then iterate based on real use. A skill's iteration cost is much lower than you'd expect — editing a references/ list might take five minutes.
If you're agonizing over details for more than two hours on a skill, you're almost certainly overengineering. Ship a 0.1.0, use it two or three times, then decide what actually needs to change. Polish is a byproduct of use, not something you type in at the keyboard.
```
sanitize-stack/ (GitHub repo root)
├── README.md                      this file (English case study)
├── README.zh.md                   Chinese case study (1:1 sibling)
├── SKILL.md                       skill skeleton, 1543 words
├── references/
│   └── noise-frames.md            Chromium noise-frame catalog, 775 words
├── examples/
│   └── example-before-after.md    synthetic case + golden test
└── LICENSE                        MIT
```
Read SKILL.md and noise-frames.md as a concrete sample, then use this walkthrough as the decision guide, and you'll have enough context to write your first skill from scratch.
Have fun, and iterate often.