bench(design): Opus A-vs-B-vs-C benchmark — 90% harness win-rate by rapiercraft-forge[bot] · Pull Request #1041 · RapierCraftStudios/ForgeDock

rapiercraft-forge · 2026-06-22T11:22:34Z

Summary

Full rerun of the ABC benchmark using claude-opus-4-6 for both design generation and judging (previous run #1040 used Sonnet).

Results: Opus vs Sonnet

Metric	Sonnet Run	Opus Run
A vs B pairwise win-rate	54.2%	90.0%
A rubric mean	3.40/5	4.11/5
B rubric mean	3.07/5	3.18/5
A slop count (avg)	1.8	1.2
B slop count (avg)	3.6	3.8
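As a rough illustration of how the pairwise win-rate row could be derived, the sketch below counts decisive outcomes. The outcome labels, the definition of a "clean" comparison, and the tallies are all assumptions made for the example; they are not the benchmark's real data or the scorecard script's actual logic.

```javascript
// Toy tallies for illustration only; they are NOT the benchmark's real counts.
// Assumed outcome labels: 'A', 'B', 'tie', or 'excluded' (e.g. a broken screenshot).
function pairwiseWinRate(outcomes, arm) {
  // Assumption: a "clean" comparison is one with a decisive, non-excluded result.
  const clean = outcomes.filter((o) => o === 'A' || o === 'B');
  if (clean.length === 0) return null; // nothing to divide by
  const wins = clean.filter((o) => o === arm).length;
  return wins / clean.length;
}

const toyOutcomes = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'excluded'];
const rate = pairwiseWinRate(toyOutcomes, 'A');
console.log((rate * 100).toFixed(1) + '%'); // prints "90.0%"
```

Whether ties count toward the denominator changes the figure, so that choice would need to match whatever the real scorecard does.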

Key findings

Opus harness (A) wins far more often: 90.0% pairwise win-rate vs B, up from 54.2% with Sonnet. The harness methodology (archetype sampling, anti-slop negatives, non-default typefaces) produces much larger gains when the base model is more capable.
A approaches reference quality: A scored 91% of C's rubric average for Plume, and tied or beat C in 4 pairwise comparisons (Tender run-2, Plume runs 1-2, Voltage run-2).
Slop detection works: A averages 1.2 AI tells vs B's 3.8. B consistently falls into gradient text, centered layouts, pill badges, and dark-mode templates.
Design range confirmed: A shows genuine creative range across runs — editorial serif, brutalist condensed, warm photographic, technical dense — while B repeats the same dark/centered patterns.

Miscalibration flags (exit 2)

4 comparisons where A or B beat C:

Plume runs 1-2: A > C — Linear's screenshot was partially loaded (known issue)
Plume run 2: B > C — same screenshot issue
Tender run 2: A > C — genuine; the editorial ledger design with serif typography was judged as having stronger brand identity than Stripe's reference
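The flagging rule described above (any comparison in which A or B beats the reference C) can be sketched as a small filter. The record shape and the mapping of flags to exit code 2 are assumptions inferred from the heading, not the scorecard script's confirmed behavior; the last sample row is made up purely for contrast.

```javascript
// Hypothetical record shape: { product, run, pair: [arm, 'C'], winner }.
function findMiscalibrations(comparisons) {
  // Flag any comparison against C where C did not win outright.
  return comparisons.filter(
    (c) => c.pair.includes('C') && c.winner !== 'C' && c.winner !== 'tie'
  );
}

// Assumption: exit code 2 signals "scorecard produced, but with flags".
function exitCodeFor(flags) {
  return flags.length > 0 ? 2 : 0;
}

const sample = [
  { product: 'Plume', run: 1, pair: ['A', 'C'], winner: 'A' },
  { product: 'Plume', run: 2, pair: ['A', 'C'], winner: 'A' },
  { product: 'Plume', run: 2, pair: ['B', 'C'], winner: 'B' },
  { product: 'Tender', run: 2, pair: ['A', 'C'], winner: 'A' },
  // Made-up row for contrast: the reference wins, so nothing is flagged.
  { product: 'Cadence', run: 1, pair: ['B', 'C'], winner: 'C' },
];

const flags = findMiscalibrations(sample);
console.log(flags.length, exitCodeFor(flags)); // prints "4 2"
```

A flag is only a prompt for manual review: three of the four cases here trace back to a partially loaded reference screenshot rather than a genuine win.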

Test plan

node scripts/bench-scorecard.mjs docs/design/fixtures/runs/full-abc-opus/runs.json produces valid scorecard
All 30 HTML files render correctly in browser
Judge data normalized from 5 independent Opus judge agents

…928) JSON.parse accepts null, primitives, and arrays as valid JSON, but the check functions dereference spec.<field>, so a literal `null` spec threw an uncaught TypeError. Add an object-shape guard in main(), after the parse, that degrades null/array/primitive specs to the documented exit-2/WARNING.
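A minimal sketch of such an object-shape guard follows. The function names and the `{ ok, spec, reason }` return shape are hypothetical and chosen for illustration; the actual guard in design-system-lint.mjs may look different.

```javascript
// Hypothetical helper: true only for plain JSON objects (not null, not arrays).
function isObjectSpec(spec) {
  return spec !== null && typeof spec === 'object' && !Array.isArray(spec);
}

// Parse a spec string and classify it without ever throwing.
// The return shape is an assumption made for this example.
function parseSpec(text) {
  let spec;
  try {
    spec = JSON.parse(text);
  } catch (err) {
    return { ok: false, reason: 'malformed JSON: ' + err.message };
  }
  if (!isObjectSpec(spec)) {
    const kind =
      spec === null ? 'null' : Array.isArray(spec) ? 'array' : typeof spec;
    return { ok: false, reason: 'spec must be a JSON object, got ' + kind };
  }
  return { ok: true, spec };
}

console.log(parseSpec('null').reason); // prints "spec must be a JSON object, got null"
```

Checking `typeof spec === 'object'` alone is not enough, because both `null` and arrays report `'object'`; that is why the guard tests for them explicitly.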

…spec-guard-928 Fix: guard design-system-lint against non-object JSON spec (#928)

Both bad-spec exit paths in design-system-lint.mjs (the malformed-JSON catch block and the non-object spec guard added in PR #947) called process.exit(2) before the --baseline and --json handling at the bottom of main(). This violated two documented contracts:
- "--baseline always exits 0" (docstring): bad spec exited 2 instead
- "--json" consumers expected parseable stdout: got empty stdout + exit 2

Fix: add a badSpecExit(opts, message) helper that applies --baseline and --json semantics before exiting (honoring exit-0 and emitting a JSON error payload respectively). Replace both bare process.exit(2) calls with calls to this helper. Update docstring to clarify the contracts explicitly cover bad-spec/usage-error paths.

The JSON error payload shape matches the normal --json output shape (blocking, warnings, baseline, findings) with an added "error" field. stderr output is preserved in all cases.

Out of scope: L873 (missing --spec/--html args) and L902 (HTML read failure) follow the same pattern but are not part of this review finding.
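One way to picture the helper is to separate the decision (what to print, which code to exit with) from the exit itself. The sketch below is an assumption-heavy illustration: only the helper name, its two parameters, and the payload field names come from the commit message. The field types, the `badSpecOutcome` function, the "WARNING:" prefix, and the choice to keep exit code 2 under --json alone are guesses.

```javascript
// Sketch only: computes what a bad-spec exit should emit, without exiting.
// Payload field TYPES are assumptions; the source names only the fields.
function badSpecOutcome(opts, message) {
  const outcome = {
    exitCode: opts.baseline ? 0 : 2, // "--baseline always exits 0"
    stderr: 'WARNING: ' + message,   // stderr is preserved in all cases
    stdout: '',
  };
  if (opts.json) {
    // Same keys as the normal --json output, plus an "error" field.
    outcome.stdout = JSON.stringify({
      blocking: 0,
      warnings: 0,
      baseline: Boolean(opts.baseline),
      findings: [],
      error: message,
    });
  }
  return outcome;
}

// Hypothetical wrapper showing how main() might apply the outcome.
function badSpecExit(opts, message) {
  const { exitCode, stderr, stdout } = badSpecOutcome(opts, message);
  console.error(stderr);
  if (stdout) console.log(stdout);
  process.exit(exitCode);
}
```

Keeping the decision in a pure function makes both contracts testable without spawning a child process, which also suits the out-of-scope paths that follow the same pattern.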

…t-bad-spec-contracts-948 Fix: honor --baseline/--json contracts on bad-spec exits (#948)

#1039) Run the ABC benchmark rig (arms B and C only) across all 5 seed products (Cadence/Tender/Slipstream/Voltage/Plume) with n=3 runs each to establish the baseline the design harness must beat.

Key findings:
- B wins 0% of clean pairwise comparisons vs C (0/12)
- Mean rubric gap: -1.69 points on 5-point scale
- Originality is widest gap (-2.17): model mode-collapses to templates
- Mean slop count: 5.0 AI tells per B page vs 0.5 for C
- Slipstream excluded (vercel.com screenshot rendering artifact)

Deliverables:
- docs/articles/b-vs-c-baseline.md (full scorecard report)
- docs/design/fixtures/runs/b-vs-c-baseline/runs.json (raw judge data)
- docs/design/fixtures/runs/b-vs-c-baseline/scorecard.json (aggregated)
- 15 arm-B HTML files (3 per product)

Co-authored-by: RapierCraft <rapiercraftstudios@gmail.com>

…4% (#1040) First full benchmark of the UI Taste Harness (arm A) against raw frontier model output (arm B) and real reference pages (arm C).

Key results:
- A wins 54.2% of clean A-vs-B head-to-head comparisons
- A rubric mean: 3.40 vs B: 3.31 (both up from 3.10 baseline)
- Slop reduced 64%: A=1.8 vs baseline B=5.0
- Originality is biggest A win (+0.59 vs B): archetype sampling works
- Mobile is biggest A weakness (-0.59 vs B)
- High variance: Voltage/Tender A dominates, Cadence/Plume B leads
- Root cause of failures: harness over-indexes on restraint, producing sparse pages that judges rate below B's denser output

Improvement targets identified:
- Content completeness gate in critique loop
- Mobile responsiveness in design spec
- Consistency guardrails against stub-page outputs

Co-authored-by: RapierCraft <rapiercraftstudios@gmail.com>
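The "slop reduced 64%" figure above follows from the two slop counts it quotes, using an ordinary relative-reduction formula. The helper name below is hypothetical and is only meant to make the arithmetic checkable.

```javascript
// Relative reduction of `value` against `baseline`, as a percentage.
function percentReduction(baseline, value) {
  return ((baseline - value) / baseline) * 100;
}

// Baseline B averaged 5.0 AI tells per page; harness arm A averaged 1.8.
console.log(Math.round(percentReduction(5.0, 1.8))); // prints 64
```

The result is rounded because floating-point subtraction does not produce an exact 64.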

…#878) Full rerun of the ABC benchmark using claude-opus-4-6 for both generation and judging (previous run used Sonnet).

Key results:
- A vs B pairwise win-rate: 90.0% (up from 54.2% with Sonnet)
- A rubric mean: 4.11/5 (up from 3.40)
- B rubric mean: 3.18/5
- A slop count: 1.2 avg (down from 1.8)
- B slop count: 3.8 avg
- 4 miscalibration flags (A or B beat C): 3 from Linear's partially-loaded screenshot, 1 genuine (Tender run-2 editorial design)

5 products × 3 runs × 2 arms = 30 HTML fixtures + runs.json + scorecard.json
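The fixture count can be checked by enumerating the product, run, and arm combinations. The product names come from the baseline commit message; treating the two generated arms as A and B, and the file naming scheme, are assumptions for illustration only.

```javascript
const products = ['Cadence', 'Tender', 'Slipstream', 'Voltage', 'Plume'];
const runs = [1, 2, 3];
const arms = ['A', 'B']; // assumption: C is a reference page, not a generated fixture

// Hypothetical naming scheme; the real fixture file names may differ.
function listFixtures() {
  const names = [];
  for (const product of products) {
    for (const run of runs) {
      for (const arm of arms) {
        names.push(`${product.toLowerCase()}-run${run}-arm${arm}.html`);
      }
    }
  }
  return names;
}

console.log(listFixtures().length); // prints 30
```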

RapierCraft and others added 30 commits June 16, 2026 18:53

Merge pull request #843 from RapierCraftStudios/staging

df6c26b

Release: deterministic pipeline v2, project-agnostic commands, licensing & governance

chore: bump version to 1.0.17 [skip ci]

f383ec2

feat(design): add FORGE:DESIGN_SPEC schema and register annotation (#881)

3806d7d

fix(design): use 4-backtick fence to nest jsonc block in DESIGN_SPEC protocol entry (#881)

9003aaf

Merge pull request #911 from RapierCraftStudios/feat/design-spec-schema-881

8eff0cd

feat(design): FORGE:DESIGN_SPEC schema — structured design language artifact

feat(design): add reference corpus + grammar/negatives extraction (#880)

d71ca6f

Merge pull request #919 from RapierCraftStudios/feat/reference-corpus-grammar-negatives-880

2679f97

Feat: reference corpus + grammar/negatives extraction (#880)

feat(bench): ABC benchmark rig — reference URL → brief → render arms → judge → scorecard (#878)

c2b109f

Merge pull request #924 from RapierCraftStudios/feat/abc-benchmark-rig-878

1ebc917

feat(bench): ABC benchmark rig — reference URL → brief → render arms → judge → scorecard

feat(scripts): deterministic design-system linter (anti-slop gate) (#884)

a09b517

Merge pull request #927 from RapierCraftStudios/feat/design-system-linter-884

70c008a

Feat: deterministic design-system linter (anti-slop gate)

feat(design): effects-appropriateness doctrine layer (#885)

1377de6

Merge pull request #936 from RapierCraftStudios/feat/effects-appropriateness-885

bdc5ff2

Feat: effects-appropriateness doctrine layer (#885)

feat(design): commit design-architect rationale doctrine + register FORGE:DESIGN_RATIONALE (#886)

e2c56bc

Merge pull request #937 from RapierCraftStudios/feat/design-architect-rationale-886

256dcf5

Feat: design-architect rationale doctrine + FORGE:DESIGN_RATIONALE (#886)

feat(design): divergent generation + archetype sampling + taste-judge doctrine; register FORGE:DESIGN_CANDIDATES (#883)

9c4d314

Merge pull request #939 from RapierCraftStudios/feat/divergent-generation-883

74ffe40

Feat: divergent generation + archetype sampling + taste-judge (#883)

feat(design): render -> vision-critique -> iterate loop; register FORGE:CRITIQUE (#882)

5a991d3

Merge pull request #940 from RapierCraftStudios/feat/render-critique-loop-882

6048b36

Feat: render -> vision-critique -> iterate loop (#882)

feat(design): commit design-memory doctrine (anti-sameness across portfolio) (#887)

70950c4

Merge pull request #941 from RapierCraftStudios/feat/design-memory-887

450932a

Feat: design-memory doctrine — anti-sameness across the portfolio (#887)

feat(design): /design pipeline router (the /work-on analog) + register FORGE:DESIGN_CONTEXT/DESIGN_SHIPPED (#888)

3499335

Merge pull request #942 from RapierCraftStudios/feat/design-pipeline-888

526d3c5

Feat: /design pipeline — the design-track spine (#888)

Merge pull request #947 from RapierCraftStudios/fix/design-lint-null-spec-guard-928

39f94ca

Fix: guard design-system-lint against non-object JSON spec (#928)

Merge pull request #979 from RapierCraftStudios/fix/design-system-lint-bad-spec-contracts-948

aeacbb9

Fix: honor --baseline/--json contracts on bad-spec exits (#948)

RapierCraft force-pushed the milestone/ui-taste-harness-abc-benchmark branch from 7558a0d to 33e1d9f Compare June 22, 2026 16:23

rapiercraft-forge[bot] wants to merge 30 commits into milestone/ui-taste-harness-abc-benchmark from bench/full-abc-opus
