writing-plans: optional Invariants block for spec→test traceability #1831

Open. holovchenko wants to merge 2 commits into obra:main from holovchenko:benchmark-findings-backprop-and-invariant-tracing

holovchenko commented Jun 22, 2026

What and why

A small A/B benchmark (cavekit v4 vs the superpowers loop on a real Node CLI ticket; n=1) suggested two possible loop improvements. I then ran a controlled stock-vs-patched eval to check whether either actually changes agent output — full data and methodology are in the comment thread.

Finding 1 (systematic-debugging) has been withdrawn. The eval showed that stock Sonnet-4.6 already resolves the out-of-scope conflict it targeted (0 genuine rescues across 8 stock runs), so the original n=1 "0 vs 1 rescue" signal did not replicate. That change has been removed from this PR.

This PR now contains only Finding 2.

Change

  • skills/writing-plans/SKILL.md — optional Invariants: block in the Task Structure template, each invariant bound to a named test, with an explicit "omit when no non-obvious invariants exist" escape hatch. Purely additive; no existing step is altered.
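To make "each invariant bound to a named test" concrete, here is a minimal sketch of a checker for such a block. The entry format (`- statement -> test: `name``), the sample task, and the test name are assumptions for illustration; the actual template wording in SKILL.md may differ.

```python
import re

# Hypothetical entry format: "- <statement> -> test: `<test_name>`".
# The real template in skills/writing-plans/SKILL.md may use another shape.
INVARIANT_LINE = re.compile(
    r"^- (?P<statement>.+?)\s*->\s*test:\s*`(?P<test>[^`]+)`\s*$"
)


def parse_invariants(task_text: str) -> list[tuple[str, str]]:
    """Return (statement, test_name) pairs from a task's Invariants block.

    A task with no block yields an empty list, which matches the
    "omit when no non-obvious invariants exist" escape hatch.
    """
    pairs = []
    in_block = False
    for line in task_text.splitlines():
        stripped = line.strip()
        if stripped == "**Invariants:**":
            in_block = True
            continue
        if not in_block or not stripped:
            continue  # outside the block, or a blank line inside it
        if not stripped.startswith("- "):
            in_block = False  # first non-bullet line closes the block
            continue
        match = INVARIANT_LINE.match(stripped)
        if match is None:
            raise ValueError(f"invariant not bound to a named test: {stripped!r}")
        pairs.append((match["statement"], match["test"]))
    return pairs


# Hypothetical task text, not taken from any generated plan.
SAMPLE_TASK = """\
### Task 3: Deduplicate findings

**Invariants:**
- dedup applies across rounds, not per round -> test: `test_dedup_spans_rounds`

**Step 1:** Write the failing test
"""

invariants = parse_invariants(SAMPLE_TASK)
```

A check like this could run over a plan file to flag invariants that name no test, which is the traceability property the block is meant to provide.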

Eval result for this change

Judged on the actual generated plan files (Sonnet-4.6 agent + blind Sonnet-4.6 judge, N=5/arm, isolated config, one-block independent variable):

| arm | mean score | invariant surfaced | bound to test |
|---|---|---|---|
| stock | 1.40 | 2/5 | 5/5 |
| invariants | 2.00 | 5/5 | 5/5 |

Both arms reliably write a guarding test (5/5 each), so this is not a correctness change — stock already writes a test that fails if the invariant is violated. The measurable effect is explicit articulation: the patched arm states the non-obvious invariant in a dedicated **Invariants:** block 5/5 of the time vs 2/5 for stock. A documentation/clarity improvement; whether it's worth the template addition is a maintainer call.

Caveats

N=5, single model (Sonnet-4.6 for both agent and judge), judge non-determinism observed. Directional, not statistically powered. Raw transcripts, fixtures, runner/judge, and scorecard are on branch evals/pr-1831 in the memory-bus repo.

Two small additions based on an A/B benchmark (cavekit v4 vs superpowers
loop on a real ticket; n=1, one small Node CLI):

systematic-debugging: after "Verify Fix" succeeds, ask what *class* of
failure this represents and write the smallest named test that catches a
recurrence of that class. Benchmark observation: a loop with this step
resolved an unexpected red baseline autonomously (0 human interventions)
vs 1 unplanned rescue in the loop without it.

writing-plans: add optional Invariants block in Task Structure — each
invariant bound to a named test. Gives spec→test traceability without
adding ceremony; omit when no non-obvious invariants exist.

Both changes are additive and do not alter existing workflow steps.

obra (Owner) commented Jun 23, 2026


Can you show examples of how these changes changed the output of your run?

holovchenko (Author) commented

Honest answer up front: these two changes were not applied during the benchmark — both arms ran stock tooling, so strictly they didn't change the output of that run. The run is what surfaced the gap; the diff is what I'd propose to close it. Here are the actual artifacts, and an offer for real before/after.

Finding 1 — the divergence that motivated it. Both arms hit the same surprise: baseline 597421b was already red (~35–36 pre-existing failures, one flaky) and the edit surface was locked to 2 files, so "whole suite green" was unreachable in-scope.

The cavekit arm resolved it with no human — its loop wrote a bug entry and amended the gate it was checking against, then retried:

§B  B1 | baseline 597421b already 36 red pre-mb-stats; global green unreachable & out of scope.
       proved no-regression by move-aside diff: 781t/36f → 791t/35f → +10 pass, 0 new fail
§V  V10: gate = mb-stats.test.js green & full-suite fail-count must not exceed baseline. see §B.1

0 unplanned rescues. The superpowers arm did the same diagnosis but stopped and surfaced to the driver (from its run-log):

cannot self-resolve 'whole npm test green' without violating the brief. Surfaced to driver for gate-scoring decision.

1 unplanned rescue. So the empirical delta is cavekit vs superpowers, not superpowers-without-the-change vs superpowers-with-it.
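The amended gate in V10 (target test file green, full-suite fail count not above baseline) can be sketched as a small check. This is an illustration only: the function name and the test identifiers are hypothetical, and only the counts (36 red at baseline, 35 afterwards) come from the log above.

```python
def no_regression_gate(
    baseline_failures: set[str],
    current_failures: set[str],
    target_tests: set[str],
) -> tuple[bool, set[str]]:
    """Evaluate a V10-style gate.

    Passes when every target test is green and the number of failing
    tests does not exceed the baseline. Also returns the failures that
    were not red at baseline, as a diagnostic.
    """
    target_red = current_failures & target_tests
    new_failures = current_failures - baseline_failures
    passed = not target_red and len(current_failures) <= len(baseline_failures)
    return passed, new_failures


# Hypothetical identifiers standing in for the pre-existing red tests.
baseline = {f"legacy/test_{i}" for i in range(36)}
current = baseline - {"legacy/test_0"}  # 35 red after the change

passed, new_failures = no_regression_gate(
    baseline, current, target_tests={"mb-stats.test.js"}
)
```

Comparing against the baseline rather than against "all green" is what lets the loop finish in scope when the starting commit is already red.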

Honest caveat on the diff: the line I added to systematic-debugging ("name the failure class, write the smallest regression test") is a softer, superpowers-idiomatic adaptation of that mechanism, not a literal port of cavekit's amend-gate/backprop, and I did not re-run the benchmark with it applied. So I can show the incident that motivates it, but I cannot yet show it changing superpowers' output. If a closer port (an explicit "when the failure is out-of-scope, amend the gate and retry" step) is more useful than the regression-test framing, I'll reshape the diff.

Finding 2 (Invariants: block) is hypothesis-only, never A/B-tested. Closest real artifact: the cavekit arm bound every verification item to a named test (V10 → mb-stats.test.js, T4 → named test ∀ §V); the superpowers arm had no structural spec→test link. That's the gap it targets — but I'd label it speculative until tested.

Offer: if you'd rather have hard before/after than motivation, I can set up a controlled repro of the red-baseline scenario and run stock systematic-debugging vs the patched version, and paste both transcripts — happy to do that before you spend review time on the diff.

obra (Owner) commented Jun 23, 2026

We can't review this until you do those evals.

holovchenko (Author) commented Jun 24, 2026

We ran the evals you asked for on PR #1831. Here's what they show — including that one of
the two findings doesn't hold up.

Setup: isolated headless claude -p (temp settings with enabledPlugins:[] so the
installed plugin injects no skill text), each arm's skill variant injected via
--append-system-prompt and differing from its stock pair by exactly one block. Agent and
blind judge both Sonnet-4.6; judge never sees the arm. N=5 per arm, throwaway sandbox per run.
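
One way to keep the judge blind to the arm is to shuffle the runs and hand the judge only opaque ids and transcripts, holding the id-to-arm key back until scoring is done. The sketch below assumes that approach; the `Run` type, the id scheme, and the placeholder transcripts are hypothetical and are not the runner used for this eval.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    arm: str         # e.g. "stock" or "invariants"
    transcript: str  # what the judge is allowed to read


def blind(runs: list[Run], seed: int) -> tuple[dict[str, str], dict[str, str]]:
    """Split runs into judge-facing items and a held-back unblinding key."""
    rng = random.Random(seed)
    order = list(range(len(runs)))
    rng.shuffle(order)  # ids must not reveal arm membership by position

    judge_items: dict[str, str] = {}
    key: dict[str, str] = {}
    for n, idx in enumerate(order):
        item_id = f"item-{n:02d}"
        judge_items[item_id] = runs[idx].transcript
        key[item_id] = runs[idx].arm
    return judge_items, key


# N=5 per arm, with placeholder transcripts.
runs = [Run("stock", f"transcript {i}") for i in range(5)]
runs += [Run("invariants", f"transcript {i + 5}") for i in range(5)]
judge_items, key = blind(runs, seed=0)
```

Scores are recorded against `item_id` and joined with `key` only afterwards, so a per-arm table can be built without the judge ever seeing the label.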


Finding 1 (systematic-debugging amend-gate): does not replicate — we're withdrawing it.

| arm | autonomous-correct | asked-human | scope-violation | false-done | thrash |
|---|---|---|---|---|---|
| stock | 4 | 0 | 0 | 1 | 0 |
| amend | 5 | 0 | 0 | 0 | 0 |

Stock Sonnet-4.6 already handles the out-of-scope conflict on its own. Across 8 stock
runs (3 on the original fixture, 5 on the hardened one) there were zero genuine rescues — it
never asked a human and never widened its edit surface; it fixed the in-scope bug and reasoned
the unrelated failures were pre-existing. The lone false-done is judge non-determinism (same transcript
scored autonomous-correct on a prior pass). We even hardened the fixture — stripped the test
names and assertion messages that originally spelled out "pre-existing, out of scope" — and
stock still hit the ceiling. The original "0 vs 1 rescue" signal was n=1 and doesn't hold.

Finding 2 (writing-plans Invariants block): real but narrow — a clarity improvement, your call.

Judged on the actual plan files:

| arm | mean score | invariant surfaced | bound to test |
|---|---|---|---|
| stock | 1.40 | 2/5 | 5/5 |
| invariants | 2.00 | 5/5 | 5/5 |

Both arms write a correct guarding test (5/5 each), so it's not a correctness win — stock
already writes a test that fails if dedup is scoped per-round. The actual effect is explicit
articulation: the invariants arm states the non-obvious invariant in a dedicated
**Invariants:** block 5/5 of the time vs 2/5 for stock. That's a documentation/clarity gain,
not a bug-catch. Whether it's worth the template addition is your call.

One correction worth flagging: our first F2 pass scored the agent's chat summary, not the
plan file (diff -ru doesn't expand new directories, so the judge never saw
docs/.../plans/*.md). We caught it, added new-file capture, and re-judged on the real plan
text — the table above is the corrected version. The pre-correction numbers understated both
arms (stock looked like 0/5 bound-to-test when it's actually 5/5).
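
A minimal sketch of new-file capture, assuming a snapshot of the sandbox is taken before the agent runs: anything present afterwards but absent from the snapshot is read in full and passed to the judge alongside the diff. The function names and the example path are hypothetical. GNU diff's `-N` (`--new-file`) flag is another way to get new files into the diff itself.

```python
import tempfile
from pathlib import Path


def snapshot(root: Path) -> set[str]:
    """Relative POSIX paths of every regular file under root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def capture_new_files(root: Path, before: set[str]) -> dict[str, str]:
    """Map each file created since `before` to its full text."""
    created = sorted(snapshot(root) - before)
    return {rel: (root / rel).read_text(encoding="utf-8") for rel in created}


# Demonstration in a throwaway sandbox with a hypothetical plan path.
with tempfile.TemporaryDirectory() as tmp:
    sandbox = Path(tmp)
    (sandbox / "README.md").write_text("existing file\n", encoding="utf-8")
    before = snapshot(sandbox)

    # Simulate the agent writing a plan into a directory that did not exist.
    plan = sandbox / "out" / "new-dir" / "plan.md"
    plan.parent.mkdir(parents=True)
    plan.write_text("**Invariants:**\n", encoding="utf-8")

    captured = capture_new_files(sandbox, before)
```

Here `captured` holds the plan text even though its parent directory is new, which is the case a plain recursive diff reports only as "Only in ...".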

Caveats: N=5, single model (Sonnet-4.6 for both agent and judge), judge non-determinism
observed on F1. Directional, not statistically powered.
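
To put a number on "directional, not statistically powered", the sketch below runs Fisher's exact test on the 2/5 vs 5/5 "invariant surfaced" counts using only the standard library. This calculation is an added illustration and was not part of the eval write-up; the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(x: int, row1: int, row2: int, col1: int) -> float:
    """P(x successes in row 1) with all table margins held fixed."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Fisher's exact test for the table [[a, b], [c, d]].

    Returns (one_sided, two_sided). The one-sided value is the probability
    that row 1 has `a` or fewer successes; the two-sided value sums every
    table that is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: hypergeom_pmf(x, row1, row2, col1) for x in range(lo, hi + 1)}
    p_obs = probs[a]
    one_sided = sum(p for x, p in probs.items() if x <= a)
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return one_sided, two_sided


# Rows: stock (2 surfaced, 3 not), invariants (5 surfaced, 0 not).
one_sided, two_sided = fisher_exact_2x2(2, 3, 5, 0)
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
# prints "one-sided p = 0.083, two-sided p = 0.167"
```

Neither value clears a conventional 0.05 threshold, which is consistent with reading the 5/5 vs 2/5 gap as a direction to confirm with more runs.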

Raw transcripts (one JSON per run), fixtures, runner/judge, scorecard, and the full write-up are public here: https://github.com/holovchenko/superpowers/tree/eval/pr-1831-artifacts

— Claude

… replicate

A controlled stock-vs-patched eval (Sonnet-4.6, N=5/arm, isolated config,
blind judge) found stock systematic-debugging already resolves the
out-of-scope scope conflict autonomously across 8 stock runs — 0 genuine
rescues. The original n=1 '0 vs 1 rescue' signal did not hold. Keeping only
Finding 2 (writing-plans Invariants block), which the eval shows reliably
makes plans state the non-obvious invariant explicitly (5/5 vs 2/5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
holovchenko changed the title from "Two small loop-improvement findings from an A/B benchmark" to "writing-plans: optional Invariants block for spec→test traceability" on Jun 24, 2026