writing-plans: optional Invariants block for spec→test traceability #1831
holovchenko wants to merge 2 commits into
Conversation
Two small additions based on an A/B benchmark (cavekit v4 vs superpowers loop on a real ticket; n=1, one small Node CLI):

- systematic-debugging: after "Verify Fix" succeeds, ask what *class* of failure this represents and write the smallest named test that catches a recurrence of that class. Benchmark observation: a loop with this step resolved an unexpected red baseline autonomously (0 human interventions) vs 1 unplanned rescue in the loop without it.
- writing-plans: add an optional Invariants block in Task Structure, with each invariant bound to a named test. This gives spec→test traceability without adding ceremony; omit it when no non-obvious invariants exist.

Both changes are additive and do not alter existing workflow steps.
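To make "each invariant bound to a named test" concrete, here is a minimal sketch of a traceability check over a plan file. The bullet format (`- <invariant> -> test: <name in backticks>`), the `**Invariants:**` header spelling, and the function names are all assumptions made for illustration; the actual template wording in `SKILL.md` is not shown in this thread and may differ.

```python
import re
from typing import List, Optional, Set, Tuple

# Hypothetical header and bullet format; the real SKILL.md template may differ.
HEADER = "**Invariants:**"
INVARIANT_LINE = re.compile(r"^- (?P<text>.+?)\s*->\s*test:\s*`(?P<test>[^`]+)`\s*$")


def parse_invariants(plan_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (invariant, test name or None) pairs from the Invariants block."""
    pairs: List[Tuple[str, Optional[str]]] = []
    in_block = False
    for line in plan_text.splitlines():
        stripped = line.strip()
        if stripped == HEADER:
            in_block = True
            continue
        if not in_block or not stripped:
            continue  # outside the block, or a blank line inside it
        if not stripped.startswith("- "):
            in_block = False  # any non-bullet line ends the block
            continue
        match = INVARIANT_LINE.match(stripped)
        if match:
            pairs.append((match["text"], match["test"]))
        else:
            pairs.append((stripped[2:], None))  # invariant with no named test
    return pairs


def untraced(plan_text: str, known_tests: Set[str]) -> List[str]:
    """Invariants that name no test, or name a test that does not exist."""
    return [
        text
        for text, test in parse_invariants(plan_text)
        if test is None or test not in known_tests
    ]


# Illustrative plan fragment (invented for this sketch).
example_plan = """\
### Task 1: Parse config

**Invariants:**
- Exit code is non-zero on parse failure -> test: `exits_nonzero_on_parse_error`
- Output order matches input order

**Steps:**
- Write the failing test
"""

print(untraced(example_plan, {"exits_nonzero_on_parse_error"}))
# prints ['Output order matches input order']
```

A plan with no `**Invariants:**` block yields an empty list, which matches the "omit when no non-obvious invariants exist" escape hatch: absence of the block is not treated as an error.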
|
Can you show examples of how these changes changed the output of your run?
|
Honest answer up front: these two changes were not applied during the benchmark. Both arms ran stock tooling, so strictly they didn't change the output of that run. The run is what surfaced the gap; the diff is what I'd propose to close it. Here are the actual artifacts, and an offer for real before/after.

Finding 1: the divergence that motivated it. Both arms hit the same surprise: baseline
The cavekit arm resolved it with no human intervention: its loop wrote a bug entry and amended the gate it was checking against, then retried. 0 unplanned rescues. The superpowers arm did the same diagnosis but stopped and surfaced to the driver (from its run-log):
1 unplanned rescue. So the empirical delta is cavekit vs superpowers, not superpowers-without-the-change vs superpowers-with-it.

Honest caveat on the diff: the line I added to Finding 2 (

Offer: if you'd rather have hard before/after than motivation, I can set up a controlled repro of the red-baseline scenario and run stock
|
We can't review this until you do those evals.
|
We ran the evals you asked for on PR #1831. Here's what they show, including that one of

Setup: isolated headless

Finding 1 (systematic-debugging amend-gate): does not replicate; we're withdrawing it.
Stock Sonnet-4.6 already handles the out-of-scope scope conflict on its own. Across 8 stock

Finding 2 (writing-plans Invariants block): real but narrow; a clarity improvement, your call. Judged on the actual plan files:
Both arms write a correct guarding test (5/5 each), so it's not a correctness win; stock

One correction worth flagging: our first F2 pass scored the agent's chat summary, not the

Caveats: N=5, single model (Sonnet-4.6 for both agent and judge), judge non-determinism

Raw transcripts (one JSON per run), fixtures, runner/judge, scorecard, and the full write-up are public here: https://github.com/holovchenko/superpowers/tree/eval/pr-1831-artifacts

-- Claude
… replicate

A controlled stock-vs-patched eval (Sonnet-4.6, N=5/arm, isolated config, blind judge) found stock systematic-debugging already resolves the out-of-scope scope conflict autonomously across 8 stock runs, with 0 genuine rescues. The original n=1 '0 vs 1 rescue' signal did not hold. Keeping only Finding 2 (writing-plans Invariants block), which the eval shows reliably makes plans state the non-obvious invariant explicitly (5/5 vs 2/5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What and why
A small A/B benchmark (cavekit v4 vs the superpowers loop on a real Node CLI ticket; n=1) suggested two possible loop improvements. I then ran a controlled stock-vs-patched eval to check whether either actually changes agent output — full data and methodology are in the comment thread.
Finding 1 (systematic-debugging) has been withdrawn. The eval showed stock Sonnet-4.6 already resolves the out-of-scope scope conflict it targeted — 0 genuine rescues across 8 stock runs — so the original n=1 "0 vs 1 rescue" signal did not replicate. That change has been removed from this PR.
This PR now contains only Finding 2.
Change
`skills/writing-plans/SKILL.md`: optional `Invariants:` block in the Task Structure template, each invariant bound to a named test, with an explicit "omit when no non-obvious invariants exist" escape hatch. Purely additive; no existing step is altered.
Eval result for this change
Judged on the actual generated plan files (Sonnet-4.6 agent + blind Sonnet-4.6 judge, N=5/arm, isolated config, one-block independent variable):
Both arms reliably write a guarding test (5/5 each), so this is not a correctness change: stock already writes a test that fails if the invariant is violated. The measurable effect is explicit articulation: the patched arm states the non-obvious invariant in a dedicated `**Invariants:**` block 5/5 of the time vs 2/5 for stock. A documentation/clarity improvement; whether it's worth the template addition is a maintainer call.
Caveats
N=5, single model (Sonnet-4.6 for both agent and judge), judge non-determinism observed. Directional, not statistically powered. Raw transcripts, fixtures, runner/judge, and scorecard are on branch `evals/pr-1831` in the memory-bus repo.
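To put a rough number on "directional, not statistically powered", the sketch below runs a two-sided Fisher exact test on the reported 5/5 vs 2/5 split. This calculation is not part of the eval artifacts; it is an illustration that assumes the ten runs are independent and takes the judge's labels at face value. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability that row 1 holds x of the col1 "successes".
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table no more likely than the observed one.
    # The small relative tolerance guards against float rounding on ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Rows: patched arm, stock arm. Columns: invariant stated, not stated.
p_value = fisher_exact_two_sided(5, 0, 2, 3)
print(round(p_value, 3))  # prints 0.167
```

Under these assumptions the p-value is 1/6, well above the conventional 0.05 threshold, which is consistent with reading the 5/5 vs 2/5 result as a direction rather than a established effect.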