Skip to main content

Testing Skills

Testing skills follows the same Test-Driven Development approach as testing code. Before deploying any skill, you must verify it works through systematic testing with subagents.
The Iron Law: NO SKILL WITHOUT A FAILING TEST FIRST.Write skill before testing? Delete it. Start over. This applies to NEW skills AND EDITS to existing skills.

Why Test Skills?

Skills are documentation that agents use to make decisions. If a skill doesn’t work when tested, it won’t work in production. Testing proves:
  • The skill addresses the actual problem (not a hypothetical one)
  • Agents can find and apply the skill (CSO is effective)
  • The skill resists rationalization (discipline skills especially)
  • Instructions are complete and clear (no gaps)
Core principle: If you didn’t watch an agent fail without the skill, you don’t know if the skill teaches the right thing.

The TDD Approach: RED-GREEN-REFACTOR

Skill testing follows the same cycle as code testing:
1

RED - Write Failing Test (Baseline)

Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
  • What choices did they make?
  • What rationalizations did they use (verbatim)?
  • Which pressures triggered violations?
This is “watch the test fail” - you must see what agents naturally do before writing the skill.Example Baseline Test:
Scenario: Implement a retry function with exponential backoff
Conditions: No skill loaded, time pressure ("make this quick")

Baseline Behavior:
- Agent writes implementation code first
- Adds tests after implementation
- Rationalizations:
  - "I'll test after to verify it works"
  - "Too simple to need tests first"
  - "Manual testing was sufficient"
2

GREEN - Write Minimal Skill

Write skill that addresses those specific rationalizations. Don’t add extra content for hypothetical cases.Run same scenarios WITH skill. Agent should now comply.Example: Add rationalization table addressing exact excuses from baseline:
ExcuseReality
”Too simple to test”Simple code breaks. Test takes 30 seconds.
”I’ll test after”Tests passing immediately prove nothing.
3

REFACTOR - Close Loopholes

Agent found new rationalization? Add explicit counter. Re-test until bulletproof.Example: New rationalization: “Keep as reference while writing tests”Add to skill:
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

Testing Different Skill Types

Different skill types need different test approaches:

Discipline-Enforcing Skills

Examples: test-driven-development, verification-before-completion, designing-before-coding
Academic questions: Do they understand the rules?
"Explain the TDD cycle"
"Why do we write tests first?"
Pressure scenarios: Do they comply under stress?
Scenario: "Quick bug fix, production is down, just fix it fast"
Pressures: Time + authority + urgency
Multiple pressures combined:
  • Time pressure: “Make this quick”
  • Sunk cost: “You already wrote 200 lines”
  • Exhaustion: “You’ve been working on this for hours”
  • Authority: “The client needs this now”
Success criteria: Agent follows rule under maximum pressure

Technique Skills

Examples: condition-based-waiting, root-cause-tracing, defensive-programming
Application scenarios: Can they apply the technique correctly?
"Write a test for this async operation"
"Debug this flaky test"
Variation scenarios: Do they handle edge cases?
"What if the condition never becomes true?"
"How do you wait for multiple conditions?"
Missing information tests: Do instructions have gaps?
Give partial information, see if they ask for clarification
Success criteria: Agent successfully applies technique to new scenario

Pattern Skills

Examples: reducing-complexity, information-hiding concepts
Recognition scenarios: Do they recognize when pattern applies?
"Look at this code, what problems do you see?"
"Should we apply [pattern] here?"
Application scenarios: Can they use the mental model?
"Refactor this using the pattern"
"Design a solution for this problem"
Counter-examples: Do they know when NOT to apply?
"Would this pattern help here?" (when it wouldn't)
Success criteria: Agent correctly identifies when/how to apply pattern

Reference Skills

Examples: API documentation, command references, library guides
Retrieval scenarios: Can they find the right information?
"How do I create a slide with this library?"
"What's the syntax for this command?"
Application scenarios: Can they use what they found correctly?
"Generate a presentation with 3 slides"
"Run this command with the right flags"
Gap testing: Are common use cases covered?
Test scenarios users would actually encounter
Success criteria: Agent finds and correctly applies reference information

Pressure Scenarios and Baseline Testing

Discipline-enforcing skills need to resist rationalization under pressure. Here’s how to test them:

Types of Pressure

Time Pressure

“Make this quick” “Production is down” “Client needs it in 10 minutes”

Sunk Cost

“You already wrote 200 lines” “5 hours of work” “Don’t waste what you’ve done”

Authority

“The CTO said to ship it” “Client explicitly requested” “Your partner approved it”

Exhaustion

“You’ve been debugging for 3 hours” “End of day, almost done” “Just one more thing”

Combining Pressures

Test with 3+ pressures combined for discipline skills:
Scenario: Fix critical production bug
Pressures:
- Time: "Production is down, users are affected"
- Authority: "CTO needs this fixed in 15 minutes"
- Sunk cost: "You already spent 2 hours investigating"
- Exhaustion: "You've been on-call all night"

Expected violation without skill:
Agent writes fix without test, rationalizes:
- "No time for tests, production is down"
- "I'll add tests after the fix is deployed"
- "Manual testing is faster right now"

Document Exact Rationalizations

Capture verbatim what agents say when violating the rule:
Every excuse goes in the rationalization table. This is the most important output of baseline testing.
| Excuse | Reality |
|--------|----------|
| "Too simple to test" | [From baseline test #1] |
| "I'll test after" | [From baseline test #2] |
| "Already manually tested" | [From baseline test #3] |
| "Keep as reference" | [From refactor iteration #1] |

Bulletproofing Against Rationalization

Skills that enforce discipline need explicit counters to resist rationalization:

1. Close Every Loophole Explicitly

Don’t just state the rule - forbid specific workarounds: Bad:
Write code before test? Delete it.
Good:
Write code before test? Delete it. Start over.

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

2. Address “Spirit vs Letter” Arguments

Add foundational principle early:
**Violating the letter of the rules is violating the spirit of the rules.**
This cuts off entire class of “I’m following the spirit” rationalizations.

3. Build Rationalization Table

Every excuse from testing goes in the table:
## Common Rationalizations

| Excuse | Reality |
|--------|----------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |

4. Create Red Flags List

Make it easy for agents to self-check when rationalizing:
## Red Flags - STOP and Start Over

- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."

**All of these mean: Delete code. Start over with TDD.**

5. Update CSO for Violation Symptoms

Add to description: symptoms of when you’re ABOUT to violate the rule:
description: Use when implementing any feature or bugfix, before writing implementation code
The phrase “before writing implementation code” triggers the skill BEFORE the violation.

Common Rationalizations for Skipping Testing

All of these mean: Test before deploying. No exceptions.
ExcuseReality
”Skill is obviously clear”Clear to you ≠ clear to other agents. Test it.
”It’s just a reference”References can have gaps, unclear sections. Test retrieval.
”Testing is overkill”Untested skills have issues. Always. 15 min testing saves hours.
”I’ll test if problems emerge”Problems = agents can’t use skill. Test BEFORE deploying.
”Too tedious to test”Testing is less tedious than debugging bad skill in production.
”I’m confident it’s good”Overconfidence guarantees issues. Test anyway.
”Academic review is enough”Reading ≠ using. Test application scenarios.
”No time to test”Deploying untested skill wastes more time fixing it later.

Testing Methodology

Step-by-Step Process

1

Create test scenarios

Write 3-5 scenarios covering:
  • Core use case
  • Edge cases
  • Pressure situations (for discipline skills)
  • Counter-examples (when NOT to use)
2

Run baseline (RED)

For each scenario:
  1. Create fresh subagent session
  2. DO NOT load the skill
  3. Present the scenario
  4. Document agent behavior verbatim
  5. Capture all rationalizations
3

Write minimal skill (GREEN)

  1. Address specific baseline failures
  2. Add rationalization counters
  3. Include examples showing right approach
4

Test with skill

For each scenario:
  1. Create fresh subagent session
  2. Load the skill
  3. Present same scenario
  4. Verify agent complies
  5. Document any NEW rationalizations
5

Refactor (close loopholes)

  1. For each new rationalization:
    • Add explicit counter to skill
    • Add to rationalization table
    • Add to red flags if severe
  2. Re-test until no new violations
6

Meta-testing

Test the testing:
  • Did scenarios cover real use cases?
  • Were pressures realistic?
  • Did agent find any loopholes we missed?

Example: Testing TDD Skill

RED Phase - Baseline

Scenario 1: Simple feature
Task: Implement a function that validates email addresses
Pressure: None

Baseline without skill:
- Agent writes implementation first
- Adds tests after
- Rationalization: "Simple function, I'll verify with tests after"
Scenario 2: Time pressure
Task: Fix bug where empty emails are accepted
Pressure: "Production issue, need fix ASAP"

Baseline without skill:
- Agent writes fix immediately
- "I'll add tests after deployment"
- "Manual testing shows it works"
Scenario 3: Sunk cost
Task: You already wrote 150 lines for a new feature
Pressure: "You've been working on this for 3 hours"

Baseline without skill:
- Agent keeps code, adds tests after
- "Deleting this would waste 3 hours of work"
- "I'll write tests to verify it works"

GREEN Phase - Write Skill

Create skill addressing these specific rationalizations:
---
name: test-driven-development
description: Use when implementing any feature or bugfix, before writing implementation code
---

# Test-Driven Development

## The Iron Law
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

Write code before test? Delete it. Start over.

## Common Rationalizations

| Excuse | Reality |
|--------|----------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |

REFACTOR Phase - Close Loopholes

Re-test with skill. New rationalization appears:
  • “I’ll keep it as reference while writing tests”
Add explicit counter:
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Delete means delete
Re-test until bulletproof.

STOP: Before Moving to Next Skill

After writing ANY skill, you MUST STOP and complete the deployment process.Do NOT:
  • Create multiple skills in batch without testing each
  • Move to next skill before current one is verified
  • Skip testing because “batching is more efficient”
Deploying untested skills = deploying untested code. It’s a violation of quality standards.

Testing Checklist

For EACH skill: RED Phase:
  • Created 3-5 test scenarios
  • Ran scenarios WITHOUT skill
  • Documented baseline behavior verbatim
  • Captured all rationalizations
  • Identified patterns in failures
GREEN Phase:
  • Wrote skill addressing specific baseline failures
  • Added rationalization table
  • Included clear examples
  • Ran scenarios WITH skill
  • Verified agents comply
REFACTOR Phase:
  • Captured NEW rationalizations
  • Added explicit counters
  • Updated rationalization table
  • Re-tested until bulletproof
  • No new violations found
Meta-Testing:
  • Scenarios cover real use cases
  • Pressures are realistic
  • Skill resists maximum pressure
  • CSO effective (agents found skill)

Next Steps

Once your skill passes all tests:
  1. Deploy it - Commit to your fork
  2. Contribute it back - See Contributing
  3. Monitor in production - Watch for issues
  4. Iterate - Add counters for new rationalizations

Additional Resources

  • Full testing methodology: skills/writing-skills/SKILL.md (section: RED-GREEN-REFACTOR)
  • TDD skill example: skills/test-driven-development/SKILL.md
  • Creating skills guide: Creating Skills