bench(design): Opus A-vs-B-vs-C benchmark — 90% harness win-rate#1041
Open
rapiercraft-forge[bot] wants to merge 30 commits into
Release: deterministic pipeline v2, project-agnostic commands, licensing & governance
…ma-881 feat(design): FORGE:DESIGN_SPEC schema — structured design language artifact
…-grammar-negatives-880 Feat: reference corpus + grammar/negatives extraction (#880)
…→ judge → scorecard (#878)
…g-878 feat(bench): ABC benchmark rig — reference URL → brief → render arms → judge → scorecard
…nter-884 Feat: deterministic design-system linter (anti-slop gate)
…ateness-885 Feat: effects-appropriateness doctrine layer (#885)
…ORGE:DESIGN_RATIONALE (#886)
…-rationale-886 Feat: design-architect rationale doctrine + FORGE:DESIGN_RATIONALE (#886)
… doctrine; register FORGE:DESIGN_CANDIDATES (#883)
…tion-883 Feat: divergent generation + archetype sampling + taste-judge (#883)
…loop-882 Feat: render -> vision-critique -> iterate loop (#882)
Feat: design-memory doctrine — anti-sameness across the portfolio (#887)
…r FORGE:DESIGN_CONTEXT/DESIGN_SHIPPED (#888)
Feat: /design pipeline — the design-track spine (#888)
…928) JSON.parse accepts null, primitives, and arrays as valid JSON, but the check functions dereferenced spec.<field> and threw an uncaught TypeError on a literal `null` spec. Add an object-shape guard in main() after the parse that degrades null/array/primitive specs to the documented exit-2/WARNING.
…spec-guard-928 Fix: guard design-system-lint against non-object JSON spec (#928)
Both bad-spec exit paths in design-system-lint.mjs (the malformed-JSON catch block and the non-object spec guard added in PR #947) called process.exit(2) before the --baseline and --json handling at the bottom of main(). This violated two documented contracts:
- "--baseline always exits 0" (docstring): a bad spec exited 2 instead
- "--json" consumers expected parseable stdout: they got empty stdout + exit 2

Fix: add a badSpecExit(opts, message) helper that applies --baseline and --json semantics before exiting (honoring exit-0 and emitting a JSON error payload respectively). Replace both bare process.exit(2) calls with calls to this helper. Update docstring to clarify the contracts explicitly cover bad-spec/usage-error paths.

The JSON error payload shape matches the normal --json output shape (blocking, warnings, baseline, findings) with an added "error" field. stderr output is preserved in all cases.

Out of scope: L873 (missing --spec/--html args) and L902 (HTML read failure) follow the same pattern but are not part of this review finding.
…t-bad-spec-contracts-948 Fix: honor --baseline/--json contracts on bad-spec exits (#948)
#1039) Run the ABC benchmark rig (arms B and C only) across all 5 seed products (Cadence/Tender/Slipstream/Voltage/Plume) with n=3 runs each to establish the baseline the design harness must beat.

Key findings:
- B wins 0% of clean pairwise comparisons vs C (0/12)
- Mean rubric gap: -1.69 points on 5-point scale
- Originality is widest gap (-2.17): model mode-collapses to templates
- Mean slop count: 5.0 AI tells per B page vs 0.5 for C
- Slipstream excluded (vercel.com screenshot rendering artifact)

Deliverables:
- docs/articles/b-vs-c-baseline.md (full scorecard report)
- docs/design/fixtures/runs/b-vs-c-baseline/runs.json (raw judge data)
- docs/design/fixtures/runs/b-vs-c-baseline/scorecard.json (aggregated)
- 15 arm-B HTML files (3 per product)

Co-authored-by: RapierCraft <rapiercraftstudios@gmail.com>
…4% (#1040) First full benchmark of the UI Taste Harness (arm A) against raw frontier model output (arm B) and real reference pages (arm C).

Key results:
- A wins 54.2% of clean A-vs-B head-to-head comparisons
- A rubric mean: 3.40 vs B: 3.31 (both up from 3.10 baseline)
- Slop reduced 64%: A=1.8 vs baseline B=5.0
- Originality is biggest A win (+0.59 vs B): archetype sampling works
- Mobile is biggest A weakness (-0.59 vs B)
- High variance: Voltage/Tender A dominates, Cadence/Plume B leads
- Root cause of failures: harness over-indexes on restraint, producing sparse pages that judges rate below B's denser output

Improvement targets identified:
- Content completeness gate in critique loop
- Mobile responsiveness in design spec
- Consistency guardrails against stub-page outputs

Co-authored-by: RapierCraft <rapiercraftstudios@gmail.com>
…#878) Full rerun of the ABC benchmark using claude-opus-4-6 for both generation and judging (previous run used Sonnet).

Key results:
- A vs B pairwise win-rate: 90.0% (up from 54.2% with Sonnet)
- A rubric mean: 4.11/5 (up from 3.40)
- B rubric mean: 3.18/5
- A slop count: 1.2 avg (down from 1.8)
- B slop count: 3.8 avg
- 4 miscalibration flags (A or B beat C): 3 from Linear's partially-loaded screenshot, 1 genuine (Tender run-2 editorial design)

5 products × 3 runs × 2 arms = 30 HTML fixtures + runs.json + scorecard.json
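The two lint fixes in the commit list above (#928 and #948) describe an object-shape guard after JSON.parse and a badSpecExit(opts, message) helper that honors --baseline and --json. A minimal sketch of how those pieces could fit together follows. It is not the code from design-system-lint.mjs: isObjectSpec, badSpecOutcome, and loadSpec are hypothetical names, the numeric types of blocking and warnings are assumed, and the helper returns an outcome object instead of calling process.exit so the behavior can be checked in isolation.

```javascript
// Hypothetical sketch, not the actual design-system-lint.mjs implementation.

// JSON.parse happily returns null, numbers, strings, and arrays,
// so "parsed without throwing" does not imply "is a spec object".
function isObjectSpec(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Decide what a bad-spec exit should look like.
// --baseline forces exit code 0; --json emits a parseable error payload.
// The payload fields mirror the commit message; the value types are assumed.
function badSpecOutcome(opts, message) {
  const outcome = {
    code: opts.baseline ? 0 : 2,
    stdout: "",
    stderr: `WARNING: ${message}`, // stderr is preserved in all cases
  };
  if (opts.json) {
    outcome.stdout = JSON.stringify({
      blocking: 0,
      warnings: 0,
      baseline: Boolean(opts.baseline),
      findings: [],
      error: message,
    });
  }
  return outcome;
}

// Parse the spec text and route both failure modes through one helper.
function loadSpec(text, opts) {
  let spec;
  try {
    spec = JSON.parse(text);
  } catch (err) {
    return { spec: null, bad: badSpecOutcome(opts, `malformed JSON spec: ${err.message}`) };
  }
  if (!isObjectSpec(spec)) {
    return { spec: null, bad: badSpecOutcome(opts, "spec must be a JSON object") };
  }
  return { spec, bad: null };
}

// Usage: a literal null spec with --json but without --baseline.
const result = loadSpec("null", { json: true, baseline: false });
console.log(result.bad.code); // 2
console.log(result.bad.stdout);
```

In a real CLI the caller would write stdout and stderr from the outcome and then exit with outcome.code. Keeping the decision separate from the exit is only a convenience of this sketch.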
7558a0d to 33e1d9f
Summary
Full rerun of the ABC benchmark using claude-opus-4-6 for both design generation and judging (previous run #1040 used Sonnet).
Results: Opus vs Sonnet
Key findings
Miscalibration flags (exit 2)
4 comparisons where A or B beat C: 3 from Linear's partially-loaded screenshot, 1 genuine (Tender run-2 editorial design).
Contents
- `runs.json`: normalized judge data
- `scorecard.json`: deterministic aggregation output

Test plan
- `node scripts/bench-scorecard.mjs docs/design/fixtures/runs/full-abc-opus/runs.json` produces valid scorecard
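
To illustrate what a deterministic aggregation over judge data might involve, here is a small sketch that computes a pairwise win-rate over clean comparisons and collects miscalibration flags (cases where A or B beat the reference arm C). The record shape ({ product, run, arms, winner, clean }), the function names, and the sample rows are all assumptions made for illustration. They are not the real runs.json schema, not the logic of bench-scorecard.mjs, and not benchmark data. Mapping a non-empty flag list to exit code 2 follows the "Miscalibration flags (exit 2)" heading and is likewise an assumption.

```javascript
// Hypothetical comparison record:
//   { product: "X", run: 1, arms: ["A", "B"], winner: "A", clean: true }

// Fraction of clean head-to-head comparisons between two arms won by `arm`.
// Returns null when there is nothing to aggregate.
function pairwiseWinRate(comparisons, arm, opponent) {
  const relevant = comparisons.filter(
    (c) => c.clean && c.arms.includes(arm) && c.arms.includes(opponent)
  );
  if (relevant.length === 0) return null;
  const wins = relevant.filter((c) => c.winner === arm).length;
  return wins / relevant.length;
}

// Comparisons in which a generated arm beat the reference arm.
// Records whose winner is not one of the compared arms (e.g. a tie) are ignored.
function miscalibrationFlags(comparisons, reference = "C") {
  return comparisons.filter(
    (c) =>
      c.arms.includes(reference) &&
      c.arms.includes(c.winner) &&
      c.winner !== reference
  );
}

// Toy rows for illustration only.
const sample = [
  { product: "X", run: 1, arms: ["A", "B"], winner: "A", clean: true },
  { product: "X", run: 2, arms: ["A", "B"], winner: "B", clean: true },
  { product: "X", run: 3, arms: ["A", "B"], winner: "A", clean: false },
  { product: "Y", run: 1, arms: ["A", "C"], winner: "A", clean: true },
  { product: "Y", run: 2, arms: ["B", "C"], winner: "C", clean: true },
];

const flags = miscalibrationFlags(sample);
console.log(pairwiseWinRate(sample, "A", "B")); // 0.5 (the unclean run is skipped)
console.log(flags.length); // 1
console.log(flags.length > 0 ? 2 : 0); // assumed exit code when flags exist
```

Because the functions only filter and count, the same input always yields the same output, which is the property the word "deterministic" suggests for the scorecard step.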