
bench(design): Opus A-vs-B-vs-C benchmark — 90% harness win-rate #1041

Open
rapiercraft-forge[bot] wants to merge 30 commits into milestone/ui-taste-harness-abc-benchmark from bench/full-abc-opus


Summary

Full rerun of the ABC benchmark using claude-opus-4-6 for both design generation and judging (previous run #1040 used Sonnet).

Results: Opus vs Sonnet

| Metric | Sonnet Run | Opus Run |
|---|---|---|
| A vs B pairwise win-rate | 54.2% | 90.0% |
| A rubric mean | 3.40/5 | 4.11/5 |
| B rubric mean | 3.07/5 | 3.18/5 |
| A slop count (avg) | 1.8 | 1.2 |
| B slop count (avg) | 3.6 | 3.8 |
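
For readers who want to sanity-check the win-rate row, here is a minimal sketch of how a pairwise win-rate could be computed from judge records. The record shape (`product`, `run`, `winner`, `excluded`) is an assumption for illustration; the real `runs.json` schema is not shown in this PR, and the sample rows are made up, not benchmark data. Ties are counted as non-wins here, which may differ from what `bench-scorecard.mjs` does.

```javascript
// Hypothetical record shape: { product, run, winner, excluded? }
// where winner is "A", "B", or "tie". Not the actual runs.json schema.
function pairwiseWinRate(comparisons, arm) {
  // Keep only "clean" comparisons (e.g. drop ones with a broken screenshot).
  const clean = comparisons.filter((c) => !c.excluded);
  if (clean.length === 0) return null; // nothing to aggregate
  const wins = clean.filter((c) => c.winner === arm).length;
  return wins / clean.length;
}

// Made-up sample rows, for illustration only.
const sample = [
  { product: "Tender", run: 1, winner: "A" },
  { product: "Tender", run: 2, winner: "A" },
  { product: "Plume", run: 1, winner: "B" },
  { product: "Slipstream", run: 1, winner: "A", excluded: true },
];

// A wins 2 of the 3 clean comparisons.
console.log(pairwiseWinRate(sample, "A"));
```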

Key findings

  • The harness arm (A) gains far more from Opus than raw model output (B) does: A wins 90.0% of pairwise comparisons against B, up from 54.2% with Sonnet, while B's rubric mean barely moves (3.07 to 3.18). The harness methodology (archetype sampling, anti-slop negatives, non-default typefaces) appears to produce much larger gains when the base model is more capable.
  • A approaches reference quality: A scored 91% of C's rubric average for Plume, and tied or beat the real reference pages (C) in 4 pairwise comparisons (Tender run-2, Plume runs 1-2, Voltage run-2). The Plume results carry a caveat, since the reference screenshot was only partially loaded (see the miscalibration flags).
  • Slop detection works: A averages 1.2 AI tells vs B's 3.8. B consistently falls into gradient text, centered layouts, pill badges, and dark-mode templates.
  • Design range confirmed: A shows genuine creative range across runs (editorial serif, brutalist condensed, warm photographic, technical dense), while B repeats the same dark/centered patterns.

Miscalibration flags (exit 2)

4 comparisons where A or B beat C:

  • Plume runs 1-2: A > C (2 comparisons). Linear's reference screenshot was only partially loaded (known issue).
  • Plume run 2: B > C. Same screenshot issue.
  • Tender run 2: A > C. Genuine: the editorial ledger design with serif typography was judged to have stronger brand identity than Stripe's reference.
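
The flagging rule above can be sketched roughly as follows. This assumes a flag is any comparison against C that C does not win outright, and that exit code 2 means "at least one flag". Both are readings of this description, not confirmed behavior of the scorecard script. The record shape is hypothetical, and the fifth sample row (a C win) is invented for contrast.

```javascript
// Hypothetical record shape: { product, run, pair: [arm, "C"], winner }.
function findMiscalibrations(comparisons) {
  // A flag is a comparison involving C where A or B was judged the winner.
  return comparisons.filter(
    (c) => c.pair.includes("C") && c.winner !== "C" && c.winner !== "tie"
  );
}

// Assumed convention: exit 2 when any flag exists, otherwise 0.
function exitCodeFor(comparisons) {
  return findMiscalibrations(comparisons).length > 0 ? 2 : 0;
}

// The four flags listed above, plus one invented C win for contrast.
const judged = [
  { product: "Plume", run: 1, pair: ["A", "C"], winner: "A" },
  { product: "Plume", run: 2, pair: ["A", "C"], winner: "A" },
  { product: "Plume", run: 2, pair: ["B", "C"], winner: "B" },
  { product: "Tender", run: 2, pair: ["A", "C"], winner: "A" },
  { product: "Cadence", run: 1, pair: ["B", "C"], winner: "C" },
];

console.log(findMiscalibrations(judged).length, exitCodeFor(judged));
```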

Contents

  • 30 HTML files (5 products × 3 runs × 2 arms)
  • runs.json — normalized judge data
  • scorecard.json — deterministic aggregation output
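
As a quick check on the fixture count, the 5 × 3 × 2 matrix can be enumerated as below. The product names come from the baseline commit in this PR. The file-naming pattern is purely hypothetical, and the two arms are assumed to be the generated ones (A and B), since C is a real reference page.

```javascript
// Seed products as named in the baseline commit message.
const products = ["Cadence", "Tender", "Slipstream", "Voltage", "Plume"];
const runs = [1, 2, 3];
const arms = ["A", "B"]; // assumption: the two generated arms

function fixtureNames() {
  const names = [];
  for (const product of products) {
    for (const run of runs) {
      for (const arm of arms) {
        // Hypothetical naming scheme; the real fixture names may differ.
        names.push(`${product.toLowerCase()}-run${run}-arm${arm}.html`);
      }
    }
  }
  return names;
}

console.log(fixtureNames().length); // 30
```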

Test plan

  • node scripts/bench-scorecard.mjs docs/design/fixtures/runs/full-abc-opus/runs.json produces a valid scorecard
  • All 30 HTML files render correctly in a browser
  • Judge data is normalized from 5 independent Opus judge agents

RapierCraft and others added 30 commits June 16, 2026 18:53
Release: deterministic pipeline v2, project-agnostic commands, licensing & governance
…ma-881

feat(design): FORGE:DESIGN_SPEC schema — structured design language artifact
…-grammar-negatives-880

Feat: reference corpus + grammar/negatives extraction (#880)
…g-878

feat(bench): ABC benchmark rig — reference URL → brief → render arms → judge → scorecard
…nter-884

Feat: deterministic design-system linter (anti-slop gate)
…ateness-885

Feat: effects-appropriateness doctrine layer (#885)
…-rationale-886

Feat: design-architect rationale doctrine + FORGE:DESIGN_RATIONALE (#886)
… doctrine; register FORGE:DESIGN_CANDIDATES (#883)
…tion-883

Feat: divergent generation + archetype sampling + taste-judge (#883)
…loop-882

Feat: render -> vision-critique -> iterate loop (#882)
Feat: design-memory doctrine — anti-sameness across the portfolio (#887)
Feat: /design pipeline — the design-track spine (#888)
…928)

JSON.parse accepts null/primitives/arrays as valid JSON, but the check
functions dereference spec.<field> and threw an uncaught TypeError on a
literal `null` spec. Add an object-shape guard in main() after the parse
that degrades null/array/primitive specs to the documented exit-2/WARNING.
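
A minimal sketch of the kind of object-shape guard this commit describes, assuming the guard only needs to separate plain JSON objects from `null`, arrays, and primitives. The function names and return shape are illustrative, not the ones in `design-system-lint.mjs`.

```javascript
// True only for non-null, non-array objects. JSON.parse("null") returns null,
// and typeof null === "object", so the null check has to be explicit.
function isObjectSpec(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical wrapper: reports a reason instead of throwing or exiting.
function parseSpec(text) {
  let spec;
  try {
    spec = JSON.parse(text);
  } catch (err) {
    return { ok: false, reason: "malformed JSON" };
  }
  if (!isObjectSpec(spec)) {
    return { ok: false, reason: "spec must be a JSON object" };
  }
  return { ok: true, spec };
}

console.log(parseSpec("null").reason); // "spec must be a JSON object"
```

A caller would map `ok: false` to the documented exit-2/WARNING path rather than dereferencing fields on the parsed value.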
…spec-guard-928

Fix: guard design-system-lint against non-object JSON spec (#928)
Both bad-spec exit paths in design-system-lint.mjs — the malformed-JSON
catch block and the non-object spec guard (added in PR #947) — called
process.exit(2) before the --baseline and --json handling at the bottom
of main(). This violated two documented contracts:

  - "--baseline always exits 0" (docstring) — bad spec exited 2 instead
  - "--json" consumers expected parseable stdout — got empty stdout + exit 2

Fix: add a badSpecExit(opts, message) helper that applies --baseline and
--json semantics before exiting (honoring exit-0 and emitting a JSON error
payload respectively). Replace both bare process.exit(2) calls with calls
to this helper. Update docstring to clarify the contracts explicitly cover
bad-spec/usage-error paths.

The JSON error payload shape matches the normal --json output shape
(blocking, warnings, baseline, findings) with an added "error" field.
stderr output is preserved in all cases.

Out of scope: L873 (missing --spec/--html args) and L902 (HTML read
failure) follow the same pattern but are not part of this review finding.
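
The contract described above can be sketched as a pure function that decides the exit code and output streams, leaving the actual `process.exit` call to the caller. This is an illustration under assumptions: `opts` is taken to be `{ baseline, json }`, the payload field names follow the commit message, and the field value types (counts, boolean, array) are guesses.

```javascript
// Hypothetical stand-in for badSpecExit(opts, message): returns the outcome
// instead of exiting, so the behavior can be checked without spawning a process.
function badSpecOutcome(opts, message) {
  const outcome = {
    // --baseline always exits 0; otherwise a bad spec exits 2.
    exitCode: opts.baseline ? 0 : 2,
    // stderr output is preserved in all cases.
    stderr: `WARNING: ${message}`,
    stdout: "",
  };
  if (opts.json) {
    // Same top-level fields as the normal --json output, plus "error".
    // The value types here are assumptions.
    outcome.stdout = JSON.stringify({
      blocking: 0,
      warnings: 0,
      baseline: Boolean(opts.baseline),
      findings: [],
      error: message,
    });
  }
  return outcome;
}

const result = badSpecOutcome({ baseline: true, json: true }, "spec is not an object");
console.log(result.exitCode, JSON.parse(result.stdout).error);
```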
…t-bad-spec-contracts-948

Fix: honor --baseline/--json contracts on bad-spec exits (#948)
#1039)

Run the ABC benchmark rig (arms B and C only) across all 5 seed
products (Cadence/Tender/Slipstream/Voltage/Plume) with n=3 runs
each to establish the baseline the design harness must beat.

Key findings:
- B wins 0% of clean pairwise comparisons vs C (0/12)
- Mean rubric gap: -1.69 points on 5-point scale
- Originality is widest gap (-2.17) — model mode-collapses to templates
- Mean slop count: 5.0 AI tells per B page vs 0.5 for C
- Slipstream excluded (vercel.com screenshot rendering artifact)

Deliverables:
- docs/articles/b-vs-c-baseline.md (full scorecard report)
- docs/design/fixtures/runs/b-vs-c-baseline/runs.json (raw judge data)
- docs/design/fixtures/runs/b-vs-c-baseline/scorecard.json (aggregated)
- 15 arm-B HTML files (3 per product)

Co-authored-by: RapierCraft <rapiercraftstudios@gmail.com>
…4% (#1040)

First full benchmark of the UI Taste Harness (arm A) against raw
frontier model output (arm B) and real reference pages (arm C).

Key results:
- A wins 54.2% of clean A-vs-B head-to-head comparisons
- A rubric mean: 3.40 vs B: 3.31 (both up from 3.10 baseline)
- Slop reduced 64%: A=1.8 vs baseline B=5.0
- Originality is biggest A win (+0.59 vs B) — archetype sampling works
- Mobile is biggest A weakness (-0.59 vs B)
- High variance: Voltage/Tender A dominates, Cadence/Plume B leads
- Root cause of failures: harness over-indexes on restraint, producing
  sparse pages that judges rate below B's denser output
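
The "Slop reduced 64%" figure follows from the two averages quoted above, as this small check shows. The helper name is illustrative.

```javascript
// Percent reduction from a baseline value to a new value, rounded to 1 decimal.
function percentReduction(baseline, current) {
  const pct = ((baseline - current) / baseline) * 100;
  return Math.round(pct * 10) / 10;
}

// Baseline B averaged 5.0 AI tells per page; harness arm A averaged 1.8.
console.log(percentReduction(5.0, 1.8)); // 64
```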

Improvement targets identified:
- Content completeness gate in critique loop
- Mobile responsiveness in design spec
- Consistency guardrails against stub-page outputs

Co-authored-by: RapierCraft <rapiercraftstudios@gmail.com>
…#878)

Full rerun of the ABC benchmark using claude-opus-4-6 for both generation
and judging (previous run used Sonnet).

Key results:
- A vs B pairwise win-rate: 90.0% (up from 54.2% with Sonnet)
- A rubric mean: 4.11/5 (up from 3.40)
- B rubric mean: 3.18/5
- A slop count: 1.2 avg (down from 1.8)
- B slop count: 3.8 avg
- 4 miscalibration flags (A or B beat C) — 3 from Linear's
  partially-loaded screenshot, 1 genuine (Tender run-2 editorial design)

5 products × 3 runs × 2 arms = 30 HTML fixtures + runs.json + scorecard.json
@RapierCraft RapierCraft force-pushed the milestone/ui-taste-harness-abc-benchmark branch from 7558a0d to 33e1d9f Compare June 22, 2026 16:23