Testing Skills

Testing skills follows the same Test-Driven Development approach as testing code. Before deploying any skill, you must verify it works through systematic testing with subagents.

The Iron Law: NO SKILL WITHOUT A FAILING TEST FIRST.Write skill before testing? Delete it. Start over. This applies to NEW skills AND EDITS to existing skills.

Why Test Skills?

Skills are documentation that agents use to make decisions. If a skill doesn’t work when tested, it won’t work in production. Testing proves:

The skill addresses the actual problem (not a hypothetical one)
Agents can find and apply the skill (CSO is effective)
The skill resists rationalization (discipline skills especially)
Instructions are complete and clear (no gaps)

Core principle: If you didn’t watch an agent fail without the skill, you don’t know if the skill teaches the right thing.

The TDD Approach: RED-GREEN-REFACTOR

Skill testing follows the same cycle as code testing:

RED - Write Failing Test (Baseline)

Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:

What choices did they make?
What rationalizations did they use (verbatim)?
Which pressures triggered violations?

This is “watch the test fail” - you must see what agents naturally do before writing the skill.Example Baseline Test:

Scenario: Implement a retry function with exponential backoff
Conditions: No skill loaded, time pressure ("make this quick")

Baseline Behavior:
- Agent writes implementation code first
- Adds tests after implementation
- Rationalizations:
  - "I'll test after to verify it works"
  - "Too simple to need tests first"
  - "Manual testing was sufficient"

GREEN - Write Minimal Skill

Write skill that addresses those specific rationalizations. Don’t add extra content for hypothetical cases.Run same scenarios WITH skill. Agent should now comply.Example: Add rationalization table addressing exact excuses from baseline:

Excuse	Reality
”Too simple to test”	Simple code breaks. Test takes 30 seconds.
”I’ll test after”	Tests passing immediately prove nothing.

REFACTOR - Close Loopholes

Agent found new rationalization? Add explicit counter. Re-test until bulletproof.Example: New rationalization: “Keep as reference while writing tests”Add to skill:

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

Testing Different Skill Types

Different skill types need different test approaches:

Discipline-Enforcing Skills

Examples: test-driven-development, verification-before-completion, designing-before-coding

Test with:

Academic questions: Do they understand the rules?

"Explain the TDD cycle"
"Why do we write tests first?"

Pressure scenarios: Do they comply under stress?

Scenario: "Quick bug fix, production is down, just fix it fast"
Pressures: Time + authority + urgency

Multiple pressures combined:

Time pressure: “Make this quick”
Sunk cost: “You already wrote 200 lines”
Exhaustion: “You’ve been working on this for hours”
Authority: “The client needs this now”

Success criteria: Agent follows rule under maximum pressure

Technique Skills

Examples: condition-based-waiting, root-cause-tracing, defensive-programming

Test with:

Application scenarios: Can they apply the technique correctly?

"Write a test for this async operation"
"Debug this flaky test"

Variation scenarios: Do they handle edge cases?

"What if the condition never becomes true?"
"How do you wait for multiple conditions?"

Missing information tests: Do instructions have gaps?

Give partial information, see if they ask for clarification

Success criteria: Agent successfully applies technique to new scenario

Pattern Skills

Examples: reducing-complexity, information-hiding concepts

Test with:

Recognition scenarios: Do they recognize when pattern applies?

"Look at this code, what problems do you see?"
"Should we apply [pattern] here?"

Application scenarios: Can they use the mental model?

"Refactor this using the pattern"
"Design a solution for this problem"

Counter-examples: Do they know when NOT to apply?

"Would this pattern help here?" (when it wouldn't)

Success criteria: Agent correctly identifies when/how to apply pattern

Reference Skills

Examples: API documentation, command references, library guides

Test with:

Retrieval scenarios: Can they find the right information?

"How do I create a slide with this library?"
"What's the syntax for this command?"

Application scenarios: Can they use what they found correctly?

"Generate a presentation with 3 slides"
"Run this command with the right flags"

Gap testing: Are common use cases covered?

Test scenarios users would actually encounter

Success criteria: Agent finds and correctly applies reference information

Pressure Scenarios and Baseline Testing

Discipline-enforcing skills need to resist rationalization under pressure. Here’s how to test them:

Types of Pressure

Time Pressure

“Make this quick” “Production is down” “Client needs it in 10 minutes”

Sunk Cost

“You already wrote 200 lines” “5 hours of work” “Don’t waste what you’ve done”

Authority

“The CTO said to ship it” “Client explicitly requested” “Your partner approved it”

Exhaustion

“You’ve been debugging for 3 hours” “End of day, almost done” “Just one more thing”

Combining Pressures

Test with 3+ pressures combined for discipline skills:

Scenario: Fix critical production bug
Pressures:
- Time: "Production is down, users are affected"
- Authority: "CTO needs this fixed in 15 minutes"
- Sunk cost: "You already spent 2 hours investigating"
- Exhaustion: "You've been on-call all night"

Expected violation without skill:
Agent writes fix without test, rationalizes:
- "No time for tests, production is down"
- "I'll add tests after the fix is deployed"
- "Manual testing is faster right now"

Document Exact Rationalizations

Capture verbatim what agents say when violating the rule:

Every excuse goes in the rationalization table. This is the most important output of baseline testing.

| Excuse | Reality |
|--------|----------|
| "Too simple to test" | [From baseline test #1] |
| "I'll test after" | [From baseline test #2] |
| "Already manually tested" | [From baseline test #3] |
| "Keep as reference" | [From refactor iteration #1] |

Bulletproofing Against Rationalization

Skills that enforce discipline need explicit counters to resist rationalization:

1. Close Every Loophole Explicitly

Don’t just state the rule - forbid specific workarounds: Bad:

Write code before test? Delete it.

Good:

Write code before test? Delete it. Start over.

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

2. Address “Spirit vs Letter” Arguments

Add foundational principle early:

**Violating the letter of the rules is violating the spirit of the rules.**

This cuts off entire class of “I’m following the spirit” rationalizations.

3. Build Rationalization Table

Every excuse from testing goes in the table:

## Common Rationalizations

| Excuse | Reality |
|--------|----------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |

4. Create Red Flags List

Make it easy for agents to self-check when rationalizing:

## Red Flags - STOP and Start Over

- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."

**All of these mean: Delete code. Start over with TDD.**

5. Update CSO for Violation Symptoms

Add to description: symptoms of when you’re ABOUT to violate the rule:

description: Use when implementing any feature or bugfix, before writing implementation code

The phrase “before writing implementation code” triggers the skill BEFORE the violation.

Common Rationalizations for Skipping Testing

All of these mean: Test before deploying. No exceptions.

Excuse	Reality
”Skill is obviously clear”	Clear to you ≠ clear to other agents. Test it.
”It’s just a reference”	References can have gaps, unclear sections. Test retrieval.
”Testing is overkill”	Untested skills have issues. Always. 15 min testing saves hours.
”I’ll test if problems emerge”	Problems = agents can’t use skill. Test BEFORE deploying.
”Too tedious to test”	Testing is less tedious than debugging bad skill in production.
”I’m confident it’s good”	Overconfidence guarantees issues. Test anyway.
”Academic review is enough”	Reading ≠ using. Test application scenarios.
”No time to test”	Deploying untested skill wastes more time fixing it later.

Testing Methodology

Step-by-Step Process

Create test scenarios

Write 3-5 scenarios covering:

Core use case
Edge cases
Pressure situations (for discipline skills)
Counter-examples (when NOT to use)

Run baseline (RED)

For each scenario:

Create fresh subagent session
DO NOT load the skill
Present the scenario
Document agent behavior verbatim
Capture all rationalizations

Write minimal skill (GREEN)

Address specific baseline failures
Add rationalization counters
Include examples showing right approach

Test with skill

For each scenario:

Create fresh subagent session
Load the skill
Present same scenario
Verify agent complies
Document any NEW rationalizations

Refactor (close loopholes)

For each new rationalization:
- Add explicit counter to skill
- Add to rationalization table
- Add to red flags if severe
Re-test until no new violations

Meta-testing

Test the testing:

Did scenarios cover real use cases?
Were pressures realistic?
Did agent find any loopholes we missed?

Example: Testing TDD Skill

RED Phase - Baseline

Scenario 1: Simple feature

Task: Implement a function that validates email addresses
Pressure: None

Baseline without skill:
- Agent writes implementation first
- Adds tests after
- Rationalization: "Simple function, I'll verify with tests after"

Scenario 2: Time pressure

Task: Fix bug where empty emails are accepted
Pressure: "Production issue, need fix ASAP"

Baseline without skill:
- Agent writes fix immediately
- "I'll add tests after deployment"
- "Manual testing shows it works"

Scenario 3: Sunk cost

Task: You already wrote 150 lines for a new feature
Pressure: "You've been working on this for 3 hours"

Baseline without skill:
- Agent keeps code, adds tests after
- "Deleting this would waste 3 hours of work"
- "I'll write tests to verify it works"

GREEN Phase - Write Skill

Create skill addressing these specific rationalizations:

---
name: test-driven-development
description: Use when implementing any feature or bugfix, before writing implementation code
---

# Test-Driven Development

## The Iron Law
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

Write code before test? Delete it. Start over.

## Common Rationalizations

| Excuse | Reality |
|--------|----------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |

REFACTOR Phase - Close Loopholes

Re-test with skill. New rationalization appears:

“I’ll keep it as reference while writing tests”

Add explicit counter:

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Delete means delete

Re-test until bulletproof.

STOP: Before Moving to Next Skill

After writing ANY skill, you MUST STOP and complete the deployment process.Do NOT:

Create multiple skills in batch without testing each
Move to next skill before current one is verified
Skip testing because “batching is more efficient”

Deploying untested skills = deploying untested code. It’s a violation of quality standards.

Testing Checklist

For EACH skill: RED Phase:

Created 3-5 test scenarios
Ran scenarios WITHOUT skill
Documented baseline behavior verbatim
Captured all rationalizations
Identified patterns in failures

GREEN Phase:

Wrote skill addressing specific baseline failures
Added rationalization table
Included clear examples
Ran scenarios WITH skill
Verified agents comply

REFACTOR Phase: Meta-Testing:

Scenarios cover real use cases
Pressures are realistic
Skill resists maximum pressure
CSO effective (agents found skill)

Next Steps

Once your skill passes all tests:

Deploy it - Commit to your fork
Contribute it back - See Contributing
Monitor in production - Watch for issues
Iterate - Add counters for new rationalizations

Additional Resources

Full testing methodology: skills/writing-skills/SKILL.md (section: RED-GREEN-REFACTOR)
TDD skill example: skills/test-driven-development/SKILL.md
Creating skills guide: Creating Skills

Contributing

Architecture

Documentation Index

​Testing Skills

​Why Test Skills?

​The TDD Approach: RED-GREEN-REFACTOR

​Testing Different Skill Types

​Discipline-Enforcing Skills

​Technique Skills

​Pattern Skills

​Reference Skills

​Pressure Scenarios and Baseline Testing

​Types of Pressure

Time Pressure

Sunk Cost

Authority

Exhaustion

​Combining Pressures

​Document Exact Rationalizations

​Bulletproofing Against Rationalization

​1. Close Every Loophole Explicitly

​2. Address “Spirit vs Letter” Arguments

​3. Build Rationalization Table

​4. Create Red Flags List

​5. Update CSO for Violation Symptoms

​Common Rationalizations for Skipping Testing

​Testing Methodology

​Step-by-Step Process

​Example: Testing TDD Skill

​RED Phase - Baseline

​GREEN Phase - Write Skill

​REFACTOR Phase - Close Loopholes

​STOP: Before Moving to Next Skill

​Testing Checklist

​Next Steps

​Additional Resources

Testing Skills

Why Test Skills?

The TDD Approach: RED-GREEN-REFACTOR

Testing Different Skill Types

Discipline-Enforcing Skills

Technique Skills

Pattern Skills

Reference Skills

Pressure Scenarios and Baseline Testing

Types of Pressure

Combining Pressures

Document Exact Rationalizations

Bulletproofing Against Rationalization

1. Close Every Loophole Explicitly

2. Address “Spirit vs Letter” Arguments

3. Build Rationalization Table

4. Create Red Flags List

5. Update CSO for Violation Symptoms

Common Rationalizations for Skipping Testing

Testing Methodology

Step-by-Step Process

Example: Testing TDD Skill

RED Phase - Baseline

GREEN Phase - Write Skill

REFACTOR Phase - Close Loopholes

STOP: Before Moving to Next Skill

Testing Checklist

Next Steps

Additional Resources