Testing Skills
Testing skills follows the same Test-Driven Development approach as testing code. Before deploying any skill, you must verify it works through systematic testing with subagents.Why Test Skills?
Skills are documentation that agents use to make decisions. If a skill doesn’t work when tested, it won’t work in production. Testing proves:- The skill addresses the actual problem (not a hypothetical one)
- Agents can find and apply the skill (CSO is effective)
- The skill resists rationalization (discipline skills especially)
- Instructions are complete and clear (no gaps)
Core principle: If you didn’t watch an agent fail without the skill, you don’t know if the skill teaches the right thing.
The TDD Approach: RED-GREEN-REFACTOR
Skill testing follows the same cycle as code testing:RED - Write Failing Test (Baseline)
Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
- What choices did they make?
- What rationalizations did they use (verbatim)?
- Which pressures triggered violations?
GREEN - Write Minimal Skill
Write skill that addresses those specific rationalizations. Don’t add extra content for hypothetical cases.Run same scenarios WITH skill. Agent should now comply.Example:
Add rationalization table addressing exact excuses from baseline:
| Excuse | Reality |
|---|---|
| ”Too simple to test” | Simple code breaks. Test takes 30 seconds. |
| ”I’ll test after” | Tests passing immediately prove nothing. |
Testing Different Skill Types
Different skill types need different test approaches:Discipline-Enforcing Skills
Examples:test-driven-development, verification-before-completion, designing-before-coding
Test with:
Test with:
Academic questions: Do they understand the rules?Pressure scenarios: Do they comply under stress?Multiple pressures combined:
- Time pressure: “Make this quick”
- Sunk cost: “You already wrote 200 lines”
- Exhaustion: “You’ve been working on this for hours”
- Authority: “The client needs this now”
Technique Skills
Examples:condition-based-waiting, root-cause-tracing, defensive-programming
Test with:
Test with:
Application scenarios: Can they apply the technique correctly?Variation scenarios: Do they handle edge cases?Missing information tests: Do instructions have gaps?Success criteria: Agent successfully applies technique to new scenario
Pattern Skills
Examples:reducing-complexity, information-hiding concepts
Test with:
Test with:
Recognition scenarios: Do they recognize when pattern applies?Application scenarios: Can they use the mental model?Counter-examples: Do they know when NOT to apply?Success criteria: Agent correctly identifies when/how to apply pattern
Reference Skills
Examples: API documentation, command references, library guidesTest with:
Test with:
Retrieval scenarios: Can they find the right information?Application scenarios: Can they use what they found correctly?Gap testing: Are common use cases covered?Success criteria: Agent finds and correctly applies reference information
Pressure Scenarios and Baseline Testing
Discipline-enforcing skills need to resist rationalization under pressure. Here’s how to test them:Types of Pressure
Time Pressure
“Make this quick”
“Production is down”
“Client needs it in 10 minutes”
Sunk Cost
“You already wrote 200 lines”
“5 hours of work”
“Don’t waste what you’ve done”
Authority
“The CTO said to ship it”
“Client explicitly requested”
“Your partner approved it”
Exhaustion
“You’ve been debugging for 3 hours”
“End of day, almost done”
“Just one more thing”
Combining Pressures
Test with 3+ pressures combined for discipline skills:Document Exact Rationalizations
Capture verbatim what agents say when violating the rule:Bulletproofing Against Rationalization
Skills that enforce discipline need explicit counters to resist rationalization:1. Close Every Loophole Explicitly
Don’t just state the rule - forbid specific workarounds: Bad:2. Address “Spirit vs Letter” Arguments
Add foundational principle early:3. Build Rationalization Table
Every excuse from testing goes in the table:4. Create Red Flags List
Make it easy for agents to self-check when rationalizing:5. Update CSO for Violation Symptoms
Add to description: symptoms of when you’re ABOUT to violate the rule:Common Rationalizations for Skipping Testing
| Excuse | Reality |
|---|---|
| ”Skill is obviously clear” | Clear to you ≠ clear to other agents. Test it. |
| ”It’s just a reference” | References can have gaps, unclear sections. Test retrieval. |
| ”Testing is overkill” | Untested skills have issues. Always. 15 min testing saves hours. |
| ”I’ll test if problems emerge” | Problems = agents can’t use skill. Test BEFORE deploying. |
| ”Too tedious to test” | Testing is less tedious than debugging bad skill in production. |
| ”I’m confident it’s good” | Overconfidence guarantees issues. Test anyway. |
| ”Academic review is enough” | Reading ≠ using. Test application scenarios. |
| ”No time to test” | Deploying untested skill wastes more time fixing it later. |
Testing Methodology
Step-by-Step Process
Create test scenarios
Write 3-5 scenarios covering:
- Core use case
- Edge cases
- Pressure situations (for discipline skills)
- Counter-examples (when NOT to use)
Run baseline (RED)
For each scenario:
- Create fresh subagent session
- DO NOT load the skill
- Present the scenario
- Document agent behavior verbatim
- Capture all rationalizations
Write minimal skill (GREEN)
- Address specific baseline failures
- Add rationalization counters
- Include examples showing right approach
Test with skill
For each scenario:
- Create fresh subagent session
- Load the skill
- Present same scenario
- Verify agent complies
- Document any NEW rationalizations
Refactor (close loopholes)
- For each new rationalization:
- Add explicit counter to skill
- Add to rationalization table
- Add to red flags if severe
- Re-test until no new violations
Example: Testing TDD Skill
RED Phase - Baseline
Scenario 1: Simple featureGREEN Phase - Write Skill
Create skill addressing these specific rationalizations:REFACTOR Phase - Close Loopholes
Re-test with skill. New rationalization appears:- “I’ll keep it as reference while writing tests”
STOP: Before Moving to Next Skill
Testing Checklist
For EACH skill: RED Phase:- Created 3-5 test scenarios
- Ran scenarios WITHOUT skill
- Documented baseline behavior verbatim
- Captured all rationalizations
- Identified patterns in failures
- Wrote skill addressing specific baseline failures
- Added rationalization table
- Included clear examples
- Ran scenarios WITH skill
- Verified agents comply
- Captured NEW rationalizations
- Added explicit counters
- Updated rationalization table
- Re-tested until bulletproof
- No new violations found
- Scenarios cover real use cases
- Pressures are realistic
- Skill resists maximum pressure
- CSO effective (agents found skill)
Next Steps
Once your skill passes all tests:- Deploy it - Commit to your fork
- Contribute it back - See Contributing
- Monitor in production - Watch for issues
- Iterate - Add counters for new rationalizations
Additional Resources
- Full testing methodology:
skills/writing-skills/SKILL.md(section: RED-GREEN-REFACTOR) - TDD skill example:
skills/test-driven-development/SKILL.md - Creating skills guide: Creating Skills