writing-plans: optional Invariants block for spec→test traceability by holovchenko · Pull Request #1831 · obra/superpowers

holovchenko · 2026-06-22T20:13:57Z

What and why

A small A/B benchmark (cavekit v4 vs the superpowers loop on a real Node CLI ticket; n=1) suggested two possible loop improvements. I then ran a controlled stock-vs-patched eval to check whether either actually changes agent output — full data and methodology are in the comment thread.

Finding 1 (systematic-debugging) has been withdrawn. The eval showed that stock Sonnet-4.6 already resolves the scope conflict it targeted (0 genuine rescues across 8 stock runs), so the original n=1 "0 vs 1 rescue" signal did not replicate. That change has been removed from this PR.

This PR now contains only Finding 2.

Change

skills/writing-plans/SKILL.md — optional Invariants: block in the Task Structure template, each invariant bound to a named test, with an explicit "omit when no non-obvious invariants exist" escape hatch. Purely additive; no existing step is altered.
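To make "each invariant bound to a named test" concrete, here is a minimal lint sketch. The bullet syntax it checks (`- <statement> → test: <name in backticks>` under a `**Invariants:**` line) is an assumption made for illustration; the actual template wording lives in the diff, and `unbound_invariants` is a hypothetical helper, not part of superpowers.

```python
import re

# ASSUMED bullet format (not taken from the PR diff):
#   - <invariant statement> → test: `<test name>`
INVARIANT_LINE = re.compile(r"^- (?P<text>.+?) → test: `(?P<test>[^`]+)`\s*$")


def unbound_invariants(plan_markdown: str) -> list[str]:
    """Return invariant statements that do not name a test."""
    in_block = False
    unbound = []
    for line in plan_markdown.splitlines():
        stripped = line.strip()
        if stripped == "**Invariants:**":
            in_block = True
            continue
        if not in_block:
            continue
        if stripped.startswith("- "):
            if not INVARIANT_LINE.match(stripped):
                unbound.append(stripped[2:])
        elif stripped:
            # Any other non-blank line ends the Invariants block.
            in_block = False
    return unbound
```

A plan with no `**Invariants:**` block returns an empty list, which matches the "omit when no non-obvious invariants exist" escape hatch.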

Eval result for this change

Judged on the actual generated plan files (Sonnet-4.6 agent + blind Sonnet-4.6 judge, N=5/arm, isolated config, one-block independent variable):

| arm | mean score | invariant surfaced | bound to test |
| --- | --- | --- | --- |
| stock | 1.40 | 2/5 | 5/5 |
| invariants | 2.00 | 5/5 | 5/5 |

Both arms reliably write a guarding test (5/5 each), so this is not a correctness change — stock already writes a test that fails if the invariant is violated. The measurable effect is explicit articulation: the patched arm states the non-obvious invariant in a dedicated **Invariants:** block 5/5 of the time vs 2/5 for stock. A documentation/clarity improvement; whether it's worth the template addition is a maintainer call.

Caveats

N=5, single model (Sonnet-4.6 for both agent and judge), judge non-determinism observed. Directional, not statistically powered. Raw transcripts, fixtures, runner/judge, and scorecard are on branch evals/pr-1831 in the memory-bus repo.

Two small additions based on an A/B benchmark (cavekit v4 vs superpowers loop on a real ticket; n=1, one small Node CLI):

- systematic-debugging: after "Verify Fix" succeeds, ask what *class* of failure this represents and write the smallest named test that catches a recurrence of that class. Benchmark observation: a loop with this step resolved an unexpected red baseline autonomously (0 human interventions) vs 1 unplanned rescue in the loop without it.
- writing-plans: add optional Invariants block in Task Structure, each invariant bound to a named test. Gives spec→test traceability without adding ceremony; omit when no non-obvious invariants exist.

Both changes are additive and do not alter existing workflow steps.

obra · 2026-06-23T00:22:34Z

Can you show examples of how these changes changed the output of your run?

holovchenko · 2026-06-23T06:23:07Z

Honest answer up front: these two changes were not applied during the benchmark — both arms ran stock tooling, so strictly they didn't change the output of that run. The run is what surfaced the gap; the diff is what I'd propose to close it. Here are the actual artifacts, and an offer for real before/after.

Finding 1 — the divergence that motivated it. Both arms hit the same surprise: baseline 597421b was already red (~35–36 pre-existing failures, one flaky) and the edit surface was locked to 2 files, so "whole suite green" was unreachable in-scope.

The cavekit arm resolved it with no human intervention: its loop wrote a bug entry, amended the gate it was checking against, then retried:

§B  B1 | baseline 597421b already 36 red pre-mb-stats; global green unreachable & out of scope.
       proved no-regression by move-aside diff: 781t/36f → 791t/35f → +10 pass, 0 new fail
§V  V10: gate = mb-stats.test.js green & full-suite fail-count must not exceed baseline. see §B.1

0 unplanned rescues. The superpowers arm did the same diagnosis but stopped and surfaced to the driver (from its run-log):

cannot self-resolve 'whole npm test green' without violating the brief. Surfaced to driver for gate-scoring decision.

1 unplanned rescue. So the empirical delta is cavekit vs superpowers, not superpowers-without-the-change vs superpowers-with-it.
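The amended gate in §V (target test file green, full-suite fail count not above baseline) can be sketched as a small check. This is an illustrative reading of V10, not cavekit's implementation; `SuiteResult` and `check_gate` are hypothetical names, and failures are assumed to be identified by test file name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    failed: frozenset[str]  # names of failing test files (assumed identifier)


def check_gate(
    baseline: SuiteResult,
    current: SuiteResult,
    required_green: set[str],
) -> tuple[bool, list[str]]:
    """Return (gate_passed, new_failures) for a red-baseline no-regression gate."""
    # Every required test file must be green.
    required_ok = not (required_green & current.failed)
    # The fail count must not exceed the baseline's fail count.
    count_ok = len(current.failed) <= len(baseline.failed)
    # Reported separately: failures absent from the baseline ("0 new fail").
    new_failures = sorted(current.failed - baseline.failed)
    return required_ok and count_ok, new_failures
```

With the run's numbers (36 failures at baseline, 35 after the change, `mb-stats.test.js` green, no new failures) this gate passes even though the whole suite is not green.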

Honest caveat on the diff: the line I added to systematic-debugging ("name the failure class, write the smallest regression test") is a softer, superpowers-idiomatic adaptation of that mechanism, not a literal port of cavekit's amend-gate/backprop, and I did not re-run the benchmark with it applied. So I can show the incident that motivates it, but I cannot yet show it changing superpowers' output. If a closer port (an explicit "when the failure is out-of-scope, amend the gate and retry" step) is more useful than the regression-test framing, I'll reshape the diff.

Finding 2 (Invariants: block) is hypothesis-only, never A/B-tested. Closest real artifact: the cavekit arm bound every verification item to a named test (V10 → mb-stats.test.js, T4 → named test ∀ §V); the superpowers arm had no structural spec→test link. That's the gap it targets — but I'd label it speculative until tested.

Offer: if you'd rather have hard before/after than motivation, I can set up a controlled repro of the red-baseline scenario and run stock systematic-debugging vs the patched version, and paste both transcripts — happy to do that before you spend review time on the diff.

obra · 2026-06-23T20:07:51Z

We can't review this until you do those evals.

holovchenko · 2026-06-24T09:29:13Z

We ran the evals you asked for on PR #1831. Here's what they show — including that one of
the two findings doesn't hold up.

Setup: isolated headless claude -p (temp settings with enabledPlugins:[] so the
installed plugin injects no skill text), each arm's skill variant injected via
--append-system-prompt and differing from its stock pair by exactly one block. Agent and
blind judge both Sonnet-4.6; judge never sees the arm. N=5 per arm, throwaway sandbox per run.
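One way to implement "judge never sees the arm" is to shuffle the runs and hand the judge only opaque IDs, keeping the ID→arm key aside until scoring is done. The sketch below assumes each run is a dict with `arm` and `transcript` fields; that shape and the `blind_runs` name are illustrative, not taken from the published runner.

```python
import random


def blind_runs(runs: list[dict], seed: int) -> tuple[list[dict], dict[str, str]]:
    """Return (judge_inputs, key): inputs carry no arm label; key maps blind id -> arm."""
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    order = list(range(len(runs)))
    rng.shuffle(order)

    judge_inputs: list[dict] = []
    key: dict[str, str] = {}
    for blind_index, run_index in enumerate(order):
        blind_id = f"run-{blind_index:03d}"
        judge_inputs.append(
            {"id": blind_id, "transcript": runs[run_index]["transcript"]}
        )
        key[blind_id] = runs[run_index]["arm"]
    return judge_inputs, key
```

This only hides the label. An artifact can still reveal its arm through content, for example a plan that contains an `**Invariants:**` block.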

Finding 1 (systematic-debugging amend-gate): does not replicate — we're withdrawing it.

| arm | autonomous-correct | asked-human | scope-violation | false-done | thrash |
| --- | --- | --- | --- | --- | --- |
| stock | 4 | 0 | 0 | 1 | 0 |
| amend | 5 | 0 | 0 | 0 | 0 |

Stock Sonnet-4.6 already handles the scope conflict on its own. Across 8 stock
runs (3 on the original fixture, 5 on the hardened one) there were zero genuine rescues — it
never asked a human and never widened its edit surface; it fixed the in-scope bug and reasoned
the unrelated failures were pre-existing. The lone false-done is judge non-determinism (same transcript
scored autonomous-correct on a prior pass). We even hardened the fixture — stripped the test
names and assertion messages that originally spelled out "pre-existing, out of scope" — and
stock still hit the ceiling. The original "0 vs 1 rescue" signal was n=1 and doesn't hold.

Finding 2 (writing-plans Invariants block): real but narrow — a clarity improvement, your call.

Judged on the actual plan files:

| arm | mean score | invariant surfaced | bound to test |
| --- | --- | --- | --- |
| stock | 1.40 | 2/5 | 5/5 |
| invariants | 2.00 | 5/5 | 5/5 |

Both arms write a correct guarding test (5/5 each), so it's not a correctness win — stock
already writes a test that fails if dedup is scoped per-round. The actual effect is explicit
articulation: the invariants arm states the non-obvious invariant in a dedicated
**Invariants:** block 5/5 of the time vs 2/5 for stock. That's a documentation/clarity gain,
not a bug-catch. Whether it's worth the template addition is your call.

One correction worth flagging: our first F2 pass scored the agent's chat summary, not the
plan file (diff -ru doesn't expand new directories, so the judge never saw
docs/.../plans/*.md). We caught it, added new-file capture, and re-judged on the real plan
text — the table above is the corrected version. The pre-correction numbers understated both
arms (stock looked like 0/5 bound-to-test when it's actually 5/5).
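The "new-file capture" fix could look roughly like the sketch below: walk the post-run sandbox and collect the text of every file that has no counterpart in the pre-run snapshot, so a freshly created plan directory reaches the judge. The function name and the before/after directory layout are assumptions, not the actual runner code.

```python
import tempfile
from pathlib import Path


def capture_new_files(before: Path, after: Path) -> dict[str, str]:
    """Map relative path -> content for files present in `after` but not in `before`."""
    new_files: dict[str, str] = {}
    for path in sorted(after.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(after)
        if not (before / rel).exists():
            # errors="replace" keeps the capture from failing on odd bytes.
            new_files[rel.as_posix()] = path.read_text(
                encoding="utf-8", errors="replace"
            )
    return new_files


# Usage: a plan written into a brand-new directory is picked up.
with tempfile.TemporaryDirectory() as tmp:
    before, after = Path(tmp) / "before", Path(tmp) / "after"
    before.mkdir()
    (after / "docs" / "plans").mkdir(parents=True)
    (after / "docs" / "plans" / "plan.md").write_text("# Plan", encoding="utf-8")
    print(capture_new_files(before, after))
```

The captured text can then be appended to the diff that the judge receives.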

Caveats: N=5, single model (Sonnet-4.6 for both agent and judge), judge non-determinism
observed on F1. Directional, not statistically powered.
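To put a number on "directional, not statistically powered", the 5/5 vs 2/5 "invariant surfaced" counts can be run through a two-sided Fisher exact test. The test is my choice for illustration, not part of the eval; the sketch uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, row1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_observed = prob(a)
    # Sum every table at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )


# invariants arm: 5 surfaced, 0 not; stock arm: 2 surfaced, 3 not
p_value = fisher_exact_two_sided(5, 0, 2, 3)
print(round(p_value, 3))  # 0.167 (exactly 42/252)
```

At p ≈ 0.17 the difference is suggestive but would not clear a conventional 0.05 threshold, which is consistent with the caveat above.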

Raw transcripts (one JSON per run), fixtures, runner/judge, scorecard, and the full write-up are public here: https://github.com/holovchenko/superpowers/tree/eval/pr-1831-artifacts

— Claude

… replicate A controlled stock-vs-patched eval (Sonnet-4.6, N=5/arm, isolated config, blind judge) found stock systematic-debugging already resolves the out-of-scope scope conflict autonomously across 8 stock runs — 0 genuine rescues. The original n=1 '0 vs 1 rescue' signal did not hold. Keeping only Finding 2 (writing-plans Invariants block), which the eval shows reliably makes plans state the non-obvious invariant explicitly (5/5 vs 2/5). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

holovchenko changed the title ~~Two small loop-improvement findings from an A/B benchmark~~ writing-plans: optional Invariants block for spec→test traceability Jun 24, 2026

holovchenko wants to merge 2 commits into obra:main from holovchenko:benchmark-findings-backprop-and-invariant-tracing
