
How to Test a Claude Skill: SKILL.md Evals

The two ways a skill fails, how to write test prompts, the draft-test-evaluate-rewrite loop, and the eval tooling that exists.

jordan · June 3, 2026 · 7 min read

A skill that looks perfect in the editor and a skill that works are two different objects. You write a clean SKILL.md, the description reads well, the body carries good instructions, and you ship it. Then it sits there. Maybe Claude never loads it. Maybe it loads at the wrong moment and gives you output worse than no skill at all. Either way, knowing how to test a Claude skill before you publish is what separates a skill you trust from one you only suspect. The method is not eyeballing it and hoping. You run it against real prompts and grade what comes back, on a loop, using the skill eval tooling Anthropic now ships inside the skill-creator.

I'll walk through that loop. First, though, it helps to know the two ways a skill goes wrong, because they fail in opposite directions and you fix them differently.

Testing a Claude Skill Starts With the Two Ways It Fails

The first failure mode is that the skill never triggers at all. At startup Claude reads only the name and description from every skill (Anthropic's skills documentation calls this progressive disclosure), which means the description is the entire targeting mechanism. Write it vague, or write it for a human reader instead of for a matching algorithm, and Claude will not pull the skill in when it should. So you ask for the exact thing the skill was built to do, and Claude answers from its base behavior while 200 lines of careful instructions sit unread on disk. The skill is invisible. The whytryai writeup on testing skills names this directly: a sub-optimal description means the skill does not trigger reliably.

The second failure mode is the inverse, and it is sneakier. The skill triggers, loads its full body, and then does the wrong thing. Maybe the instructions are not as specific as they need to be, or two sections quietly contradict each other, or one step sends Claude down a path that goes nowhere. The skill is active and making the output worse the whole time. Because it clearly loaded, it looks like it is "working," so you blame the model instead of the SKILL.md.

You cannot tell these two apart by reading the file. You tell them apart by running prompts against it, and that is the part most people skip.

Write test prompts that look like real work

A test prompt is just a realistic user message that should make the skill fire. Write two or three of them. Anthropic's skill-creator suggests covering a typical case, an edge case where the input is minimal and the user barely says what they want, and something closer to a stress test. Each one probes a different behavior.

The rule that matters most here is to use real prompts from your actual day-to-day rather than invented ones. Invented prompts have a way of coming out too clean. They reach for the exact vocabulary your description already uses, so triggering looks great in the test and then falls apart in production the moment a real user phrases the request differently. Have logs of how people actually ask for this thing? Pull from those. If you do not, write the prompt the way you would type it at 4 p.m. when you are tired and terse, not the way you would write documentation.
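One rough way to catch a prompt that came out too clean is to measure how much of its vocabulary it shares with the skill description. The sketch below is a heuristic of my own, not anything the skill-creator does, and it says nothing about how Claude actually decides to load a skill. The function names and the stopword list are assumptions for illustration.

```python
import re

# Small, assumed stopword list; extend it for your own domain.
STOPWORDS = {"a", "an", "the", "to", "of", "for", "and", "or", "in", "on",
             "is", "my", "this", "these", "into", "with"}

def content_words(text):
    """Lowercase the text and return its distinct non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def vocabulary_overlap(prompt, description):
    """Fraction of the prompt's content words that also appear in the description."""
    prompt_words = content_words(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & content_words(description)) / len(prompt_words)
```

A prompt that scores close to 1.0 is probably echoing the description back at it. Keep it if you like, but pair it with at least one prompt that scores low and should still trigger the skill.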

Then save them somewhere structured. The skill-creator format is a JSON file with an id, the prompt, an expected_output description, optional files, and assertions you add later. The exact shape matters less than the habit it builds, which is keeping a frozen set of prompts you can rerun, unchanged, every single time you touch the skill.
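Here is a minimal sketch of what such a file could look like, written and read back from Python. The five field names follow the list above. The file name, the example prompts, and the helpers `save_evals` and `load_evals` are assumptions for illustration, not the skill-creator's confirmed schema.

```python
import json
from pathlib import Path

# Field names mirror the list above; the values are invented examples.
EVALS = [
    {
        "id": "typical-case",
        "prompt": "turn these meeting notes into a status update for my manager",
        "expected_output": "A short status update with a summary and next steps.",
        "files": ["notes.txt"],  # optional input files
        "assertions": [],        # added later, once you know what to check
    },
    {
        "id": "minimal-input",
        "prompt": "status update pls",
        "expected_output": "Infers or asks for the missing context, then drafts an update.",
        "files": [],
        "assertions": [],
    },
]

def save_evals(evals, path):
    """Write the frozen prompt set to disk so every rerun uses identical prompts."""
    ids = [case["id"] for case in evals]
    if len(ids) != len(set(ids)):
        raise ValueError("eval ids must be unique")
    path = Path(path)
    path.write_text(json.dumps(evals, indent=2), encoding="utf-8")
    return path

def load_evals(path):
    """Read the prompt set back exactly as it was saved."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Checking the file into version control next to the SKILL.md is one easy way to keep the set frozen.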

The draft, test, evaluate, rewrite loop

The workflow is a cycle, and the trick that makes it pay off is running each prompt twice. Once with the skill loaded, once without it. Same prompt, two outputs, and the without-skill run is your baseline. When the two outputs come out indistinguishable, your skill is doing nothing, and that itself is a finding. When the with-skill output is the worse of the two, that is a much louder one.
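A crude first pass at the "indistinguishable" check can be automated with a text similarity score. This is a sketch under assumptions: the function name and the 0.95 threshold are mine, and similarity only tells you whether the skill changed the output, never which output is better. That second call still needs grading.

```python
from difflib import SequenceMatcher

def compare_runs(with_skill: str, baseline: str, threshold: float = 0.95) -> dict:
    """Flag runs where the skill made no visible difference to the output."""
    similarity = SequenceMatcher(None, with_skill, baseline).ratio()
    return {
        "similarity": round(similarity, 3),
        # Below the (assumed) threshold, treat the two outputs as different.
        "skill_changed_output": similarity < threshold,
    }
```

Two identical outputs score 1.0 and come back with `skill_changed_output` set to `False`, which is the "your skill is doing nothing" finding.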

Anthropic's skill-creator (the official version lives in claude-plugins-official) automates this. It spawns independent agents to run each prompt in parallel, each one in a clean context so the runs cannot contaminate each other, and each reporting its own token count and timing. With-skill outputs land in one directory and baseline outputs in another. For a brand-new skill the baseline is simply "no skill at all." For an edit to a skill that already exists, the baseline becomes a snapshot of the old version, which is how you check that you did not break what already worked.
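If you want the same shape without the tool, it can be approximated in a few lines. The sketch below is not the skill-creator's implementation. `runner` is a hypothetical callable you supply that starts a fresh session, sends one prompt with or without the skill, and returns a dict with `output` and `tokens`. The directory names `with_skill` and `baseline` are also assumptions.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_case(runner, case, use_skill, out_root):
    """Run one prompt in one configuration and save the result in its own directory."""
    started = time.perf_counter()
    # Hypothetical runner: must return {"output": str, "tokens": int}.
    result = runner(case["prompt"], use_skill)
    record = {
        "id": case["id"],
        "output": result["output"],
        "tokens": result["tokens"],
        "seconds": round(time.perf_counter() - started, 3),
    }
    out_dir = Path(out_root) / ("with_skill" if use_skill else "baseline")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{case['id']}.json").write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record

def run_all(runner, cases, out_root):
    """Run every prompt with and without the skill, in parallel."""
    jobs = [(case, use_skill) for case in cases for use_skill in (True, False)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_case, runner, case, use_skill, out_root)
                   for case, use_skill in jobs]
        return [future.result() for future in futures]
```

Threads only give you the parallelism. The clean context has to come from the runner itself, which should open a new session on every call instead of reusing one.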

From there you evaluate, rewrite the weak parts of the SKILL.md, and rerun the exact same prompts. On the second iteration, the first iteration's outputs are shown collapsed underneath the new ones, so you can see side by side whether the rewrite actually helped or just shoved the problem somewhere else. You stop when the outputs stop improving. Not when you run out of patience.

One instruction buried in the skill-creator is worth stealing even if you never touch the tool. Read the transcripts, not only the final outputs. If the skill made Claude burn three steps on something pointless before it arrived at a fine answer, the output looks good but the skill is bloated underneath. Cut those steps out of the SKILL.md.

Qualitative judgment versus assertions you can count

Not every check belongs in an automated harness, and forcing the subjective ones to fit is how you end up with confident, useless green checkmarks.

Quantitative assertions earn their keep when the criterion is objectively verifiable. A character limit. An exact item count. A required field present, valid JSON, a heading sitting where it should. The skill-creator runs a component called the Grader against checks like these. It reads each output, decides pass or fail per assertion, and writes a result with three fields: the assertion text, whether it passed, and the evidence for the call. Aggregate those across all your prompts and you get a pass rate, plus timing and token deltas against baseline. That is the quantitative side of the ledger, answering whether the new version passed more assertions and what it cost in tokens to do so.
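To make the three-field result concrete, here is a deterministic stand-in for that grading step. It is a sketch, not the Grader itself: I have swapped in plain Python checks, and the key names `text`, `passed`, and `evidence`, along with every function name, are assumptions.

```python
import json

def check_max_chars(limit):
    """Build a check that passes when the output fits within a character limit."""
    def check(output):
        n = len(output)
        return n <= limit, f"output is {n} characters (limit {limit})"
    return check

def check_valid_json(output):
    """Pass when the whole output parses as JSON."""
    try:
        json.loads(output)
        return True, "output parsed as JSON"
    except json.JSONDecodeError as err:
        return False, f"JSON parse error: {err.msg}"

def grade(output, assertions):
    """Return one result per assertion: its text, whether it passed, and the evidence."""
    results = []
    for text, check in assertions:
        passed, evidence = check(output)
        results.append({"text": text, "passed": passed, "evidence": evidence})
    return results

def pass_rate(results):
    """Share of assertions that passed; 0.0 when there is nothing to grade."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0
```

Each assertion is a pair of a human-readable sentence and a check function, for example `("at most 280 characters", check_max_chars(280))`. Keeping the evidence string next to the verdict is what lets you audit a failure later without rerunning anything.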

Qualitative evaluation then covers everything assertions cannot reach. Writing style, design taste, whether a summary actually captures the point of the thing it summarizes. The skill-creator's own guidance is blunt about this: if the quality is subjective, do not bolt an assertion onto it, because a fake-objective check on a subjective output just lies to you. For those you have to look. The tool generates a browser-based eval viewer, a Python script that opens a local page, with the with-skill and without-skill outputs side by side, a tab for the benchmark numbers, and a feedback box for each test case. The skill-creator instructions push hard on one point here, which is to generate that viewer and get it in front of a human before you start grading anything yourself. Your read of your own skill is the least trustworthy read in the room.

Regression testing across versions

You freeze your prompt set because a skill is never done in one pass, and every edit you make is a fresh chance to regress. You fix the triggering problem and accidentally make the output worse. You tighten the output and the skill quietly stops firing on the edge case you fixed last week.

This is exactly why the old-version baseline exists. Snapshot the current skill before you edit it, then run the new version and that snapshot against the same prompts, and the benchmark hands you the delta per prompt: pass rate up or down, tokens up or down. When you genuinely cannot tell which of two versions is better, the skill-creator has an optional Blind Comparator that passes both outputs to an independent agent without telling it which is which, so the verdict does not get bent by your hope that the new one won. Most skills never need that much rigor, though. The part you cannot skip is humbler: keep the old prompts, and rerun them on every version, forever.
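The per-prompt delta is simple enough to compute by hand if you have the numbers for both versions. A minimal sketch, assuming each version's results are stored as a mapping from prompt id to a pass rate and a token count (that layout and the function name are my assumptions, not the tool's benchmark format):

```python
def benchmark_delta(old_runs, new_runs):
    """Compare two skill versions prompt by prompt.

    Each argument maps a prompt id to {"pass_rate": float, "tokens": int}.
    Only prompts present in both versions are compared.
    """
    deltas = {}
    for prompt_id in sorted(old_runs.keys() & new_runs.keys()):
        old, new = old_runs[prompt_id], new_runs[prompt_id]
        deltas[prompt_id] = {
            "pass_rate_delta": round(new["pass_rate"] - old["pass_rate"], 3),
            "tokens_delta": new["tokens"] - old["tokens"],
            # A regression here means the new version passes fewer assertions.
            "regressed": new["pass_rate"] < old["pass_rate"],
        }
    return deltas
```

Looking at the delta per prompt instead of one overall average matters, because a gain on the typical case can hide a loss on the edge case you fixed last week.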

There is also an MLflow integration for teams that want skill evals tracked the way they already track model experiments, with runs logged and compared over time. For a personal skill it is overkill. For a skill shared across a team, where a regression means someone's workflow quietly breaks on a Tuesday, it starts to look reasonable.

Where this gets baked in

Build skills the way most people do, by writing SKILL.md in an editor and reasoning your way through it, and the eval loop becomes something you bolt on afterward. Usually that means never. Knack takes the other route. You describe the workflow in a 20-minute interview, and the test-and-eval loop sits inside the authoring instead of waiting as a separate chore you can talk yourself out of. What comes out is a standard Anthropic-format SKILL.md that runs in Claude Code, Codex, Cursor, and Gemini CLI, so nothing locks you in.

The reason to care is the one this whole piece keeps circling. A skill you built but never ran against a real prompt is, when you are honest about it, a guess. Running it, grading it, and rerunning it after every edit is how you answer the only question that actually matters: does your skill work. Knack puts that loop where you will actually use it, during the build rather than after it.

For the build itself, see build a skill with no code. For working SKILL.md files you can read and reverse-engineer, see real skill examples.

So, the short version. Write two or three prompts that look like real work. Run them with and without your skill, grade the objective stuff with assertions, and put the rest in front of a human reader. Rewrite the weak parts and rerun the exact same prompts, and keep at it until the outputs stop getting better. Then ship.