Testing
Three rigor levels and three test areas — trigger correctness, functional correctness, and performance vs baseline.
Three Rigor Levels
Skills can be tested at varying levels of rigor depending on audience and blast radius. A skill used by a team of five does not need the same suite as one shipped to thousands of enterprise users. Pick the lightest approach that gives you the signal you need.
| Approach | Where | Best For |
|---|---|---|
| Manual testing | Claude.ai | Fast iteration, no setup required, observing real behavior |
| Scripted testing | Claude Code | Automated test cases for repeatable validation across changes |
| Programmatic testing | Skills API | Evaluation suites that run systematically against defined test sets |
Iterate on a single challenging task until Claude succeeds, then extract the winning approach into the skill. This leverages in-context learning and gives faster signal than broad testing. Expand to multiple test cases only after you have a working foundation.
Area 1 — Triggering Tests
Goal: ensure the skill loads at the right times. Build a suite of queries that should trigger it, queries that should NOT trigger it (especially adjacent topics), and paraphrased versions of both.
Should trigger:
- "Help me set up a new ProjectHub workspace"
- "I need to create a project in ProjectHub"
- "Initialize a ProjectHub project for Q4 planning"
Should NOT trigger:
- "What's the weather in San Francisco?"
- "Help me write Python code"
- "Create a spreadsheet" (unless ProjectHub skill handles sheets)Area 2 — Functional Tests
Area 2 — Functional Tests
Goal: verify the skill produces correct outputs. Each test should assert on the outputs and API calls it produces, and cover error handling and edge cases. Use a Given/When/Then structure so failures are easy to triage.
Test: Create project with 5 tasks
Given: Project name "Q4 Planning", 5 task descriptions
When: Skill executes workflow
Then:
- Project created in ProjectHub
- 5 tasks created with correct properties
- All tasks linked to project
- No API errors
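A minimal pytest version of that test might look like the sketch below. `run_skill` and `FakeProjectHub` are hypothetical helpers: the first drives the skill's workflow end to end, the second is a test double that records API calls instead of touching a real ProjectHub workspace.

```python
# Hypothetical helpers: `run_skill` executes the skill's workflow against a
# client; `FakeProjectHub` records API calls rather than hitting ProjectHub.
from projecthub_skill_testing import FakeProjectHub, run_skill


def test_create_project_with_five_tasks():
    # Given: a project name and five task descriptions
    hub = FakeProjectHub()
    tasks = [f"Task {i}" for i in range(1, 6)]

    # When: the skill executes its workflow
    run_skill(hub, project_name="Q4 Planning", tasks=tasks)

    # Then: project created, five tasks with correct properties, all linked, no API errors
    project = hub.get_project("Q4 Planning")
    assert project is not None
    assert len(project.tasks) == 5
    assert all(task.project_id == project.id for task in project.tasks)
    assert hub.error_count == 0
```

Keeping the Given/When/Then comments in the test body means a failing assertion points straight at the step that broke.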
Area 3 — Performance Comparison
Goal: prove the skill improves results vs baseline. Run the same task without the skill, then with it, on the same inputs. Capture turns, tokens, API failures, and clarifying questions.
Without skill:
- User provides instructions each time
- 15 back-and-forth messages
- 3 failed API calls requiring retry
- 12,000 tokens consumed
With skill:
- Automatic workflow execution
- 2 clarifying questions only
- 0 failed API calls
- 6,000 tokens consumed

Performance tests are also the proof you show to stakeholders. A screenshot of the before/after comparison is worth more than any written justification for adopting the skill.
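To make that before/after capture repeatable, a small harness can run the identical task twice and print the metrics side by side. `run_task` and `RunMetrics` below are assumptions, not a real API: wire `run_task` to however your setup drives Claude with and without the skill, and pull turns, tokens, and failures from the transcript and API logs.

```python
# Baseline-vs-skill comparison sketch. `run_task` is a hypothetical hook that
# executes the same task with or without the skill and returns RunMetrics.
from dataclasses import dataclass, fields


@dataclass
class RunMetrics:
    turns: int
    tokens: int
    failed_api_calls: int
    clarifying_questions: int


def compare(run_task, task_input):
    """Run the identical task twice and print a before/after table."""
    baseline = run_task(task_input, use_skill=False)
    with_skill = run_task(task_input, use_skill=True)
    for field in fields(RunMetrics):
        before = getattr(baseline, field.name)
        after = getattr(with_skill, field.name)
        print(f"{field.name:22} {before:>8} -> {after:>8}")
    return baseline, with_skill
```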