
Building a Claude Code Skill from Scratch: A Worked Example with sanitize-stack

Language: 中文 · English (this page)

This is a case-driven tutorial, not a skill API reference. Intended audience: developers who have written Claude Code prompts but never authored a skill. After reading, you should be able to answer: "When I face a new task, how do I decide whether to make it a skill, how do I design it, and how do I avoid common pitfalls?"


Why this walkthrough exists

Claude Code ships with a meta-skill called skill-development inside the plugin-dev plugin. It covers the syntax-level territory — YAML frontmatter fields, Markdown writing style, progressive disclosure — thoroughly enough. Anyone about to write their first skill should read it first.

What that official documentation deliberately leaves blank, however, is the territory this walkthrough fills: when you face a real task, how do you decide whether it deserves to be a skill, how do you design it, how do you avoid common pitfalls, and how do you maintain it. There is no canonical answer to any of those questions — they can only be transmitted through a complete case that shows the reasoning behind each decision.

I recently built a skill called sanitize-stack to handle a Chromium crash-stack scrubbing task, and I walked the full path from "should I even do this" to "how to keep it maintained." The decision chain is typical enough to be worth unpacking for anyone who hasn't written a skill yet. Below, in chronological order.


Step 1: First ask, "Is this task worth making into a skill?"

Not every repetitive task deserves to become a skill. Building one has real cost (writing SKILL.md, maintaining references, understanding the trigger mechanism), and that cost has to be amortized by work it saves later. I apply three tests.

Test 1: The task is repeatable AND involves judgment

Pure string replacement or a one-line bash command almost never justifies a skill — there's no judgment space, so a shell alias or a git hook does the job.

A skill's sweet spot is tasks where every step has rules, but those rules have judgment calls inside them. Take sanitize-stack:

  • Which frames are signal vs noise? A judgment call (when the crash happens inside base::RunLoop, that frame stops being noise)
  • How aggressively should C++ template signatures be collapsed? A judgment call (std::__Cr::basic_string<char16_t, ...> becomes std::u16string, but scoped_refptr<T> stays verbatim)
  • What text does the elision marker use? A judgment call (UI thread vs worker thread need different phrasing)

These judgments can be written down (that's precisely where the skill's value lies), but making them requires looking at each specific input. That's the sweet spot.

Counterexample: replacing every downstream.dll in a file with chrome.dll. No judgment, a one-line sed suffices, not skill-worthy.

Test 2: High frequency for the intended user

The second threshold is usage frequency. Building a skill takes roughly 30 minutes to 2 hours (structure, SKILL.md, references, smoke test). That investment only pays off if you'll use the result repeatedly.

A quick way to test this: count how many times in the last 1–3 months you've done something similar. More than three times, and the pattern will almost certainly keep recurring. Only once, and it's probably "just one quick script" territory.

For sanitize-stack: I'm someone who repeatedly ferries crash fixes from a downstream Chromium branch to upstream trackers, and every single crash fix brings another round of the same scrubbing task. Frequency high enough to amortize.

Test 3: Getting it wrong is expensive

Some tasks are "wrong is fine, try again." Others are "wrong once, major problem." Only the latter justify the extra investment of codifying rules in a skill.

Scrubbing is a textbook "wrong once, major problem" case: a single downstream module name leaking into a public tracker is permanently exposed — the comment can be edited, but crawlers and mirror sites won't follow the edit. In this kind of task, the skill's value isn't saving time; it's preventing mistakes.

Counterexample: code formatting. Get it wrong, and clang-format fixes it back. No skill needed.

All three satisfied → build it

sanitize-stack passed all three. That was an unhesitating "build" decision.

Conversely, if any one test fails, stop and reconsider — a bash alias or a frequently-pasted prompt snippet might be enough, and a skill would be overkill. Overengineering is the first trap in skill development.


Step 2: Design decisions

Once you've decided to build, three design questions follow: where does it live, what do you call it, how is its internal structure cut.

2.1 Placement: user-level or project-level?

Claude Code supports two locations for skills:

  • User-level: ~/.claude/skills/<name>/, travels with you, the person
  • Project-level: <project>/.claude/skills/<name>/, travels with a specific checkout

My rule: follow the skill's knowledge domain, not convenience.

  • If the skill's knowledge is only meaningful for one specific codebase (say, a monorepo's internal build workflow) → project-level
  • If the skill's knowledge spans multiple projects (say, the Chromium stack boilerplate list, which is true of any Chromium checkout) → user-level

sanitize-stack's knowledge domain is Chromium-the-project, not "the chromium source tree I happen to have checked out right now." Tomorrow I might be working on a different Chromium branch, and this skill should follow me there. User-level.

A common misjudgment: dropping a skill into the current project's directory just because that's where you happened to be working. That's locking cross-project knowledge inside a single tree, and the next time you work in a different tree you'll end up recreating it. Follow the knowledge domain, not your current working directory.

2.2 Naming

Look at Claude Code's existing skills for style: simplify, loop, schedule, commit, review-pr. Three common patterns:

  1. Short — one or two words
  2. Verb-forward (or a word that functions as a verb)
  3. Describes an action, not a domain

There's a subtle psychology behind naming: you trigger a skill by typing / followed by its name, and the longer the name, the less you'll reach for it. Long names sap your willingness to use a skill the same way long paths do.

For sanitize-stack I filtered four candidates:

| Candidate | Verdict |
| --- | --- |
| sanitize-stack | Chosen. Verb-forward, neutral, unambiguous. |
| scrub-stack | Shorter, but "scrub" carries a "covering up evidence" connotation. |
| clean-stack | Too generic — could be confused with "reformat." |
| chromium-stack | Wrong focus — a name should convey the action, not the domain. |

2.3 Read reference implementations

Before writing your first skill, read at least two existing SKILL.md files. Not one, two:

  • One meta: tells you the format conventions (frontmatter fields, section layout, writing-style requirements)
  • One concrete: tells you how a real skill actually reads (specific trigger phrases, a real pipeline description)

What I read this time:

  • Meta: plugin-dev/skills/skill-development/SKILL.md (a skill about how to write skills — meta-recursion)
  • Concrete: claude-md-management/skills/claude-md-improver/SKILL.md (a real user-facing skill)

If you don't know where to find these, glob for them:

find ~/.claude -name "SKILL.md" 2>/dev/null

Any Claude Code installation with the official plugin marketplace has dozens of SKILL.md files to draw from; reading three to five is plenty.


Step 3: Writing SKILL.md

The heart of a skill is its SKILL.md file, which consists of YAML frontmatter plus a Markdown body.

3.1 The four frontmatter fields

---
name: sanitize-stack
description: This skill should be used when the user asks to "sanitize a crash stack", "scrub a stack trace", ...
tools: Read
version: 0.1.0
---

name: the skill's invocation name. Keep it identical to the directory name.

description: the most important field. Claude Code uses this string to decide when to auto-trigger your skill. Two hard rules:

  1. Use third person. Write This skill should be used when the user asks to X, not Use this skill when you want X. Why: the description is read by a different Claude instance — the one deciding whether to trigger this skill — and third person gives it a clear observer's viewpoint. Second person would confuse "should I invoke this skill" with "am I the target user of this skill."

  2. List concrete trigger phrases, not abstract descriptions:

    • ❌ Bad: Provides guidance for sanitizing crash stacks.
    • ✅ Good: This skill should be used when the user asks to "sanitize a crash stack", "scrub a stack trace", "prepare a stack for crbug", "脱敏崩溃堆栈", or pastes a native crash stack...

    Concrete phrases turn the trigger decision into pattern matching. Abstract descriptions turn it into semantic reasoning, which is substantially less accurate.

Note the "脱敏崩溃堆栈" in the example — I deliberately included a Chinese trigger phrase because I often mix Chinese and English when talking to Claude. If you're an English-only user, skip it; if you're bilingual, list both.

tools: restrict the tool set this skill is allowed to use. sanitize-stack only needs Read (to read a stack from a file path, if the user provides one), so nothing else is listed. The benefit: it makes the skill's behavior more predictable and prevents it from wandering off into calling Bash or Write for something weird.

version: start at 0.1.0 and bump on substantial changes. There's no strict convention — semver is the usual reference.
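The structural parts of these conventions can be sanity-checked mechanically. Below is a minimal sketch in Python, illustrative only — there is no official lint tool for skills, the field rules are taken from this section, and the naive line-by-line parse does not handle multi-line YAML values:

```python
from pathlib import Path

REQUIRED_FIELDS = ("name", "description", "version")

def check_frontmatter(skill_dir: Path) -> list[str]:
    """Return a list of convention violations for skill_dir/SKILL.md."""
    text = (skill_dir / "SKILL.md").read_text()
    if not text.startswith("---\n"):
        return ["missing YAML frontmatter block"]
    # Naive parse: one "key: value" per line, no multi-line values.
    block = text.split("---", 2)[1]
    fields = {}
    for line in block.strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in fields]
    if fields.get("name") != skill_dir.name:
        problems.append("name should match the directory name")
    return problems
```

A check like this catches the mechanical mistakes (name/directory drift, a forgotten field) but says nothing about whether the description will actually trigger — that part only a smoke test can tell you.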

3.2 Body style: imperative, not second person

This rule is the one the plugin-dev meta-skill states most bluntly:

✅ Scan the stack for module name patterns.
❌ You should scan the stack for module name patterns.

Two reasons imperative wins:

  1. Consistency: a SKILL.md written entirely in imperative reads like a specification. Mixing in "you should" and "if you want" makes it read like a casual blog post and look unprofessional.
  2. AI consumption: skills are read by another Claude instance executing a task. Imperative is an instruction; second person is a conversation. The former is what Claude needs during execution.

A simple check: read your SKILL.md, and if any sentence starts with "You", that's a violation. Rewrite it to start with a verb.
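The "starts with You" check lends itself to a mechanical pass. A rough sketch — an approximation, not an official lint, since it looks at line starts rather than true sentence starts and will also flag quoted counterexamples:

```python
import re

def second_person_violations(body: str) -> list[str]:
    """Lines of a SKILL.md body that open with the word "You"."""
    return [line.strip() for line in body.splitlines()
            if re.match(r"\s*You\b", line)]
```

Run it over the body before committing; every hit is a sentence to rewrite so it starts with a verb.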

3.3 Progressive disclosure: keep SKILL.md as a skeleton

Claude Code's skill file system has three layers:

  1. Metadata (the name + description in frontmatter): always loaded, ~100 words
  2. SKILL.md body: loaded when the skill triggers, target 1500–2000 words
  3. references/ and other bundled files: loaded on demand by Claude

This three-layer structure is called progressive disclosure. The core idea: don't dump every detail on Claude at once; let Claude pull them in as needed.

In practice this means: put only the core workflow in SKILL.md, and sink the detailed rule tables to references/.

Here's how I cut sanitize-stack:

| Content | Location | Reason |
| --- | --- | --- |
| The six pipeline step names and summaries | SKILL.md | Core workflow, must always be visible |
| Step 1's module-name allowlist | SKILL.md | Short, and a critical decision point |
| Step 3's template-collapse table | SKILL.md | Six rows — small enough to live inline |
| Step 4's complete noise-frame list | references/noise-frames.md | Hundreds of lines, drifts with Chromium releases — must be isolated |
| UI / worker / IO thread variants of the elision marker | references/noise-frames.md | Edge-case detail, not consulted every time |

3.4 Rules must be concrete enough to execute

The most common mistake in a skill is being too abstract:

❌ Bad:

Apply reasonable judgment to decide which frames are noise.

This is equivalent to saying nothing. Claude reads it and can only go by feel, producing a different result on every invocation.

✅ Good:

Elide frames matching the following regex families:

  • ^\s*base::internal::
  • ^\s*base::TaskAnnotator::
  • ^\s*base::MessagePump
  • ...

Concrete regexes, concrete function-name prefixes, concrete "always keep" / "always elide" lists — these produce consistent results on every invocation.

A skill's value lies in consistency; specific rules beat vibes. Every time you're tempted to write "apply reasonable judgment," stop and ask: can this judgment be decomposed into a few explicit rules? The part that truly can't be decomposed should stay as judgment; everything else should be pinned down.
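To make the contrast concrete: once the regex families are written down, the elide decision reduces to plain pattern matching. A sketch — the three patterns are the ones from the example above; real frame strings and the full list would come from references/noise-frames.md:

```python
import re

# Always-elide families (the three from the example above; illustrative subset).
ELIDE_PATTERNS = [re.compile(p) for p in (
    r"^\s*base::internal::",
    r"^\s*base::TaskAnnotator::",
    r"^\s*base::MessagePump",
)]

def is_noise(frame: str) -> bool:
    """True if the frame matches an always-elide pattern."""
    return any(p.search(frame) for p in ELIDE_PATTERNS)
```

No "reasonable judgment" anywhere in that function — which is exactly why two invocations a month apart produce the same output.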


Step 4: The references/ split decision

What's worth sinking into references/?

Three criteria:

  1. Content that drifts. For example, sanitize-stack's noise-frame list — every Chromium release adds or renames some base::Bind variant. If a drifting list sits inside SKILL.md, every maintenance pass has to modify the main flow. Splitting it out makes maintenance cost drop immediately.
  2. Tables or lists over about 300 words. Short tables can live in SKILL.md; long ones bloat the skeleton.
  3. Material that might be grep'd independently. For example, "all variants of base::MessagePump" — that kind of list has value outside the SKILL.md workflow too.

What does NOT belong in references/:

  • Core workflow steps (those are SKILL.md's job)
  • High-level decisive classifications (like the coarse signal-vs-noise split)
  • Metadata that affects trigger matching (that belongs in the frontmatter description)

Concretely this time:

  • SKILL.md says: "Step 4: Classify Signal vs Noise. Keep frames under chrome/, components/, content/...; elide base::internal::, base::TaskAnnotator::...; see references/noise-frames.md for the complete list."
  • references/noise-frames.md contains the full multi-dozen-row list, split into UI / worker / IO thread variants, plus a regex summary.

Result: SKILL.md stays readable as a skeleton, the complete list is always available when needed without cluttering the main flow.


Step 5: The smoke test — validate the skill with a real case

This is the step most commonly skipped, and the step you should skip the least.

The principle

A skill's correctness can't be checked by a linter. You can't run skill-lint and see a green light — a skill is a natural-language instruction executed by Claude, and its correctness can only be verified by running it.

How to run: find a real task you've already done by hand once, feed it to your new skill's rules, and compare the output to your hand-crafted version.

For sanitize-stack, the smoke test used a 35-frame crash stack that I had manually scrubbed just before writing the skill, originally captured from a downstream Chromium-based browser. The examples/example-before-after.md file in this repository is a structurally faithful reconstruction of that scenario: module names have been replaced with the placeholder downstream.*, while function names, source paths, line numbers, and template signatures are preserved 1:1 with the original. This makes the file safe to publish while keeping it valid as a golden test input — every rule in the skill behaves identically on the synthetic version and on the original.

I walked through the 6-step pipeline from SKILL.md on this input:

  1. Step 1 detected downstream.dll → replaced with chrome.dll
  2. Step 2 detected 35 tokens → replaced with line
  3. Step 3 found 3 collapsible templates (2 × std::u16string, 1 × std::unique_ptr)
  4. Step 4 classified per list: 7 frames kept, 28 elided
  5. Step 5 rendered per format
  6. Step 6 scanned for PII, none found

Then I compared against the hand-crafted version — character-for-character identical. Same 7 frames in the same order, same elision-marker wording.

Why this comparison matters

Two reasons:

  1. It proves the rules are complete. If the smoke-test output and the hand-crafted version differ, that means the hand-crafted version made some judgment the rules can't express — either go back and extend the rules, or acknowledge it as a judgment call that must be left to future invocations.
  2. It gives you a golden test. Later, when you change SKILL.md or references/noise-frames.md, you can re-run the same input and check for regressions. This is the closest thing to a unit test that a skill can have.
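The golden test itself can be a few lines of code. A minimal harness sketch, assuming the scrubbed output and the hand-crafted golden version are both plain text — the file paths in the usage comment are hypothetical:

```python
import difflib

def golden_diff(actual: str, golden: str) -> list[str]:
    """Changed lines between the skill's output and the golden file.

    An empty list means the outputs are identical (PASS).
    """
    return [line for line in difflib.unified_diff(
                golden.splitlines(), actual.splitlines(),
                fromfile="golden", tofile="actual", lineterm="")
            if line[:1] in "+-" and line[:3] not in ("+++", "---")]

# Usage after editing SKILL.md or references/noise-frames.md
# (paths are placeholders for your own layout):
#   from pathlib import Path
#   diffs = golden_diff(Path("out/actual.md").read_text(),
#                       Path("out/golden.md").read_text())
#   assert not diffs, "\n".join(diffs)
```

Re-running this after every rule change is the regression check; a non-empty diff means either the rules regressed or the golden file needs a deliberate update.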

What happens if you skip the smoke test?

You'll likely discover a bug in the rules a month later, the first time you actually use the skill for real — and by then you'll have forgotten why you wrote the rules the way you did, making debugging 10× more expensive than it would have been up front. The smoke test costs 5 minutes; the return is avoiding that debugging hell.


Step 6: Maintenance strategy

Writing the skill is only the start; maintenance comes next. The core principle: put things that change at different frequencies into different files.

sanitize-stack's maintenance paths fall into three categories:

Type A: Rules need adjusting (high frequency)

Scenario: a new Chromium release introduces a new boilerplate frame family (say, a new base::ThreadPoolImpl::Worker variant), and I need to add it to the elide list.

Maintenance path: edit references/noise-frames.md. SKILL.md's main flow doesn't change at all.

This is the whole reason for sinking the list into references/: isolate high-frequency changes from the main flow.

Type B: Pipeline structure changed (low frequency)

Scenario: deciding to add a Step 4.5 that runs deduplication before the Step 5 render, or changing the output format from two lines per frame to one.

Maintenance path: edit the body of SKILL.md. This kind of change should be rare — maybe once a year.

Type C: Trigger conditions miss or overfire (medium frequency)

Scenario: noticing that "condense this stack" doesn't trigger the skill when it should, or that "show this file" triggers the skill when it shouldn't.

Maintenance path: edit the description field in SKILL.md's frontmatter to adjust the trigger-phrase list. Only the description; don't touch the body.

Why these three categories matter

Each category maps to a different file. The benefit: when you know what kind of change you want to make, you immediately know which file to open, without re-reading the whole skill to decide where the change belongs.

That's why progressive disclosure isn't only a loading-efficiency concern — it's also a maintenance-efficiency concern.


Step 7: Facing uncertainty

You'll inevitably run into things you don't know while writing a skill. Mine this time: does Claude Code actually auto-discover user-level skills in ~/.claude/skills/?

The plugin-dev docs only describe the discovery mechanism for plugin-bundled skills, and stop short of explicitly confirming whether user-level locations work. Glob'ing around on this machine, I found that every existing SKILL.md on disk lived under plugins/marketplaces/... — the user-level location had never been populated before.

Faced with this kind of uncertainty, my approach is not to pretend to know the answer and not to freeze in place, but to:

  1. Explicitly declare "I don't know". Let the user know this is an unverified assumption.
  2. Offer fallback plans A / B: if auto-discovery works, great; if it doesn't, plan A (manually Read the SKILL.md as a prompt template) or plan B (wrap it as a local plugin).
  3. Suggest the cheapest verification step: open a new session and check the available-skills list.

A fresh session verified the assumption: user-level ~/.claude/skills/ is auto-discovered, and /sanitize-stack triggered correctly. The uncertainty was resolved in about thirty seconds of testing — but notice that the methodology (flag → fallback → verify) is independent of how that particular test came out. If auto-discovery hadn't worked, plan A or B would have caught me without delaying development. A secondary lesson lurks here too: verification cost is usually much lower than your anticipation of it. Thirty seconds to settle a question you've been carrying for an hour is a bad trade — just run the test.

This habit matters for skill development in particular. Skill development is experimental — you're writing prompts, and prompt behavior cannot be derived from first principles, unlike writing a compiler where you can derive correctness from a language spec. Flagging unknowns has more engineering value than pretending to know everything.

A corollary: don't treat a skill as a one-shot effort. The first version is a prototype for your own use; iterating based on real-world feedback is the normal flow.


Eight transferable takeaways

Compressed into eight lines, pinnable to a wall:

  1. Ask the three questions before building a skill: repeatable with judgment? high frequency? expensive when wrong? All three → build. Any one missing → reconsider.
  2. Placement follows knowledge domain: cross-project knowledge goes to ~/.claude/skills/, project-private knowledge goes to <project>/.claude/skills/.
  3. Names should be short, verb-forward, and consistent with existing skills.
  4. Read two reference implementations: one meta (teaches you format), one concrete (teaches you voice).
  5. Descriptions use third person with concrete trigger phrases, not abstract summaries.
  6. Bodies use imperative, not second person; rules should be concrete enough to execute; decompose "judgment" into explicit rules wherever possible.
  7. Keep SKILL.md as a 1500–2000-word skeleton; sink drifting detail into references/.
  8. After writing, run a smoke test: feed a real historical case through the new rules and compare against the hand-crafted version.

One final piece of advice: don't aim for perfection on your first skill

My sanitize-stack is version 0.1.0 and is incomplete in several places:

  • No companion skill to turn the scrubbed output into a Gerrit description automatically
  • The noise-frame list only covers the common UI / worker / IO thread variants; GPU and utility processes aren't covered
  • The template-collapse rules only handle libc++, not MSVC STL

All of this is deliberate. The right strategy for a first skill is ship it when it's good enough, then iterate based on real use. A skill's iteration cost is much lower than you'd expect — editing a references/ list might take five minutes.

If you're agonizing over details for more than two hours on a skill, you're almost certainly overengineering. Ship a 0.1.0, use it two or three times, then decide what actually needs to change. Polish is a byproduct of use, not something you type in at the keyboard.


Appendix: the complete deliverables from this session

sanitize-stack/                       (GitHub repo root)
├── README.md                         this file (English case study)
├── README.zh.md                      Chinese case study (1:1 sibling)
├── SKILL.md                          skill skeleton, 1543 words
├── references/
│   └── noise-frames.md               Chromium noise-frame catalog, 775 words
├── examples/
│   └── example-before-after.md       synthetic case + golden test
└── LICENSE                           MIT

Read SKILL.md and noise-frames.md as a concrete sample, then use this walkthrough as the decision guide, and you'll have enough context to write your first skill from scratch.

Have fun, and iterate often.
