Testing
Three rigor levels and three test areas — trigger correctness, functional correctness, and performance vs baseline.
Three Rigor Levels
Skills can be tested at varying levels of rigor depending on audience and blast radius. A skill used by a team of five does not need the same suite as one shipped to thousands of enterprise users. Pick the lightest approach that gives you the signal you need.
| Approach | Where | Best For |
|---|---|---|
| Manual testing | Claude.ai | Fast iteration, no setup required, observing real behavior |
| Scripted testing | Claude Code | Automated test cases for repeatable validation across changes |
| Programmatic testing | Skills API | Evaluation suites that run systematically against defined test sets |
Iterate on a single challenging task until Claude succeeds, then extract the winning approach into the skill. This leverages in-context learning and gives faster signal than broad testing. Expand to multiple test cases only after you have a working foundation.
Area 1 — Triggering Tests
Goal: ensure the skill loads at the right times. Build a suite of queries that should trigger it, queries that should NOT trigger it (especially adjacent topics), and paraphrased versions of both.
Should trigger:
- "Help me set up a new ProjectHub workspace"
- "I need to create a project in ProjectHub"
- "Initialize a ProjectHub project for Q4 planning"
Should NOT trigger:
- "What's the weather in San Francisco?"
- "Help me write Python code"
- "Create a spreadsheet" (unless ProjectHub skill handles sheets)Area 2 — Functional Tests
Area 2 — Functional Tests
Goal: verify the skill produces correct outputs. Each test should assert on the outputs and API calls it produces, and cover error handling and edge cases. Use a Given/When/Then structure so failures are easy to triage.
Test: Create project with 5 tasks
Given: Project name "Q4 Planning", 5 task descriptions
When: Skill executes workflow
Then:
- Project created in ProjectHub
- 5 tasks created with correct properties
- All tasks linked to project
- No API errors
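A minimal pytest version of that test might look like the sketch below. `run_skill` and `FakeProjectHub` are hypothetical helpers: the first drives the skill's workflow end to end, the second is a test double that records API calls instead of touching a real ProjectHub workspace.

```python
# Hypothetical helpers: `run_skill` executes the skill's workflow against a
# client; `FakeProjectHub` records API calls rather than hitting ProjectHub.
from projecthub_skill_testing import FakeProjectHub, run_skill


def test_create_project_with_five_tasks():
    # Given: a project name and five task descriptions
    hub = FakeProjectHub()
    tasks = [f"Task {i}" for i in range(1, 6)]

    # When: the skill executes its workflow
    run_skill(hub, project_name="Q4 Planning", tasks=tasks)

    # Then: project created, five tasks with correct properties, all linked, no API errors
    project = hub.get_project("Q4 Planning")
    assert project is not None
    assert len(project.tasks) == 5
    assert all(task.project_id == project.id for task in project.tasks)
    assert hub.error_count == 0
```

Keeping the Given/When/Then comments in the test body means a failing assertion points straight at the step that broke.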
Area 3 — Performance Comparison
Goal: prove the skill improves results vs baseline. Run the same task without the skill, then with it, on the same inputs. Capture turns, tokens, API failures, and clarifying questions.
Without skill:
- User provides instructions each time
- 15 back-and-forth messages
- 3 failed API calls requiring retry
- 12,000 tokens consumed
With skill:
- Automatic workflow execution
- 2 clarifying questions only
- 0 failed API calls
- 6,000 tokens consumed

Performance tests are also the proof you show to stakeholders. A screenshot of the before/after comparison is worth more than any written justification for adopting the skill.
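To make that before/after capture repeatable, a small harness can run the identical task twice and print the metrics side by side. `run_task` and `RunMetrics` below are assumptions, not a real API: wire `run_task` to however your setup drives Claude with and without the skill, and pull turns, tokens, and failures from the transcript and API logs.

```python
# Baseline-vs-skill comparison sketch. `run_task` is a hypothetical hook that
# executes the same task with or without the skill and returns RunMetrics.
from dataclasses import dataclass, fields


@dataclass
class RunMetrics:
    turns: int
    tokens: int
    failed_api_calls: int
    clarifying_questions: int


def compare(run_task, task_input):
    """Run the identical task twice and print a before/after table."""
    baseline = run_task(task_input, use_skill=False)
    with_skill = run_task(task_input, use_skill=True)
    for field in fields(RunMetrics):
        before = getattr(baseline, field.name)
        after = getattr(with_skill, field.name)
        print(f"{field.name:22} {before:>8} -> {after:>8}")
    return baseline, with_skill
```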