
feat(regen): hard canvas-size gate with delta-aware repair messages #7517

Merged
MarkusNeusinger merged 5 commits into main from feat/canvas-size-hard-gate on May 20, 2026


@MarkusNeusinger
Owner

Summary

  • Hard canvas-size enforcement for the regen pipeline: every saved PNG must land on 3200×1800 (landscape) or 2400×2400 (square), within a tolerance of ±16 px (0.5 % of the 3200 px width).
  • Three-layer defense: STEP-0 contract in impl-generate-claude.md, "Canvas — hard rule" sections in all 10 library prompts, and a pre-AI-review pixel gate in impl-review.yml that routes drift into the existing repair cascade.
  • Delta-aware repair feedback: when the gate fails, the synthetic weakness names the actual dims, the closest target, the signed delta, and a library-specific likely cause — so the repair model knows which knob to turn and which direction, not just "use exact size."
  • A monitoring script plus 12 unit tests lock the ::notice::canvas_gate log contract, so we can spot libraries whose prompt is still leaking before they burn compute in steady-state repair.
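
The classification the gate performs can be sketched as follows. This is a minimal illustration, not the code from impl-review.yml: the function name `classify`, the returned dict shape, and the choice of "closest target" as the smallest summed pixel distance are all assumptions. Only the two canonical pairs and the ±16 px tolerance come from the summary above.

```python
# Hypothetical sketch of the canvas gate's core decision.
# TARGETS and TOLERANCE_PX come from the PR text; everything else is assumed.
TARGETS = {"landscape": (3200, 1800), "square": (2400, 2400)}
TOLERANCE_PX = 16


def classify(width: int, height: int) -> dict:
    """Pick the closest canonical target and report the signed delta to it."""
    # Assumption: "closest" means smallest |dw| + |dh|.
    orientation, (tw, th) = min(
        TARGETS.items(),
        key=lambda item: abs(width - item[1][0]) + abs(height - item[1][1]),
    )
    dw, dh = width - tw, height - th  # negative means the PNG is too small
    within = abs(dw) <= TOLERANCE_PX and abs(dh) <= TOLERANCE_PX
    return {
        "status": "pass" if within else "fail",
        "orientation": orientation,
        "target": (tw, th),
        "delta": (dw, dh),
    }


if __name__ == "__main__":
    # The highcharts/bokeh landscape drift reported in the Background section.
    print(classify(3200, 1661))
```

Keeping the delta signed matters for repair: a value of -139 on the height says the canvas lost rows, which points at a toolbar or browser frame rather than at an oversized figure.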

Background

After yesterday's per-library sizing change (1b1f279a6), last night's daily-regen produced systematic canvas drift across 7 of 10 libraries. See the /tmp/regen-compare/overview.html audit from earlier today:

| Lib | Last night's outcome |
| --- | --- |
| ggplot2, pygal | 8/8 on target (ggplot2 wins because it's generated fresh, with no prior impl to anchor on) |
| highcharts, bokeh | 8/8 stuck at 3200×1661 / 2400×2261 (Chrome chrome + bokeh toolbar) |
| matplotlib, seaborn | 8/8 trimmed by bbox_inches='tight' to ~3160×1740 / ~2200×2336 |
| altair | every render a different size (vl-convert pads outside width/height) |
| plotly, plotnine | dims OK but wrong aspect 2/4 times |

Two prior soft tightenings (13cf81342, 7f4a78a04) added "Base style ALWAYS wins" to the regen prompt. Both were advisory only and kept losing to the in-context anchor of the previous impl's figsize/width/height.

Changes

Layer 1 — prompts/workflow-prompts/impl-generate-claude.md

  • New "Step 0: Canvas dimensions (HARD CONTRACT)" placed above the previous-impl reference. Names the two canonical pixel pairs, declares prior values historical, warns about the post-render gate.
  • New Step 3b PIL self-check that fails fast inside Claude's existing 3-attempt loop.
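
A self-check of this kind could look like the sketch below. The PR describes a PIL-based check; to keep the example dependency-free, this version swaps in a direct read of the PNG IHDR header with the standard library. The function names and the reuse of the ±16 px tolerance inside the self-check are assumptions.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Canonical pairs and tolerance as stated in this PR.
TARGETS = ((3200, 1800), (2400, 2400))
TOLERANCE_PX = 16


def read_png_size(path: str) -> tuple[int, int]:
    """Read width and height from the IHDR chunk of a PNG file."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", width, height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def self_check(path: str) -> tuple[int, int]:
    """Exit non-zero if the PNG is not within tolerance of a canonical pair."""
    width, height = read_png_size(path)
    for tw, th in TARGETS:
        if abs(width - tw) <= TOLERANCE_PX and abs(height - th) <= TOLERANCE_PX:
            return width, height
    raise SystemExit(
        f"canvas self-check failed: {path} is {width}x{height}, "
        "expected 3200x1800 or 2400x2400"
    )
```

Raising `SystemExit` makes the generated script fail on its own, so the drift surfaces inside the generation retry loop instead of at the post-render gate.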

Layer 2 — prompts/library/*.md

  • matplotlib, seaborn: removed bbox_inches='tight' from all savefig examples (this was the documented cause of ~40 px trim).
  • altair: explicit configure_view(continuousWidth, continuousHeight) + zero padding + mandatory in-impl PIL crop/pad to exact target (escape valve so the gate never deadlocks altair structurally).
  • plotly: forbid autosize=True; require explicit margin=dict(...).
  • highcharts: concrete Chrome workaround — Emulation.setDeviceMetricsOverride via CDP + --hide-scrollbars (the --window-size value isn't authoritative in headless mode).
  • bokeh, pygal, plotnine, letsplot, ggplot2: re-labeled the canvas section "Canvas — hard rule, no deviation"; documented both canonical pairs.
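
The arithmetic behind "both canonical pairs" is the same for every library: output pixels equal the logical size times a dpi or scale factor. The helpers below are a hypothetical illustration of that relationship. The factors used in the demo (dpi=400 for ggplot2's ggsave, scale=4 for letsplot's ggsave) are the ones quoted in the review snippet of impl-review.yml further down; the function names are invented here.

```python
def logical_size(target_px: tuple[int, int], factor: float) -> tuple[float, float]:
    """Logical (width, height) that renders to target_px at the given dpi/scale."""
    return target_px[0] / factor, target_px[1] / factor


def rendered_px(logical: tuple[float, float], factor: float) -> tuple[int, int]:
    """Pixel size produced by a logical size at the given dpi/scale."""
    return round(logical[0] * factor), round(logical[1] * factor)


if __name__ == "__main__":
    # Inches at dpi=400.
    print(logical_size((3200, 1800), 400))  # landscape
    print(logical_size((2400, 2400), 400))  # square
    # Logical pixels at scale=4.
    print(logical_size((3200, 1800), 4))
    print(logical_size((2400, 2400), 4))
```

Any width/height pair copied from a previous implementation that does not multiply out to one of the two targets will fail the gate, which is why the prompts now declare prior values historical.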

Layer 3 — .github/workflows/impl-review.yml + prompts/workflow-prompts/ai-quality-review.md

  • New "Canvas dimension gate" step after the "Verify both theme renders exist" step. Reads plot-light.png dims, emits a structured ::notice::canvas_gate library=… status=… actual=… target=… delta=… attempt=… line, and on drift writes a synthetic weakness to /tmp/anyplot-canvas-gate.txt naming actual, closest target, signed delta, direction, optional aspect hint, and a library-specific likely cause.
  • ai-quality-review.md now reads that file in a new "5c2" step and (a) copies the weakness verbatim as the first item in its output, (b) forces VQ-05 to 0/4 — which drags quality_score below the repair threshold so the existing label-driven cascade routes the PR into impl-repair.yml. No raw exit 1 (would bypass repair).
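
A sketch of how the notice line and the synthetic weakness might be assembled from already-computed values. The key names in the notice line follow the format quoted above; the value encodings (`3200x1661`, `+0,-139`), the wording of the weakness, and the cause strings are assumptions paraphrased from this PR. The fallback cause uses an f-string so the message names the actual library.

```python
# Hypothetical cause table; wording paraphrases the drift sources named in this PR.
LIKELY_CAUSES = {
    "matplotlib": "bbox_inches='tight' trims the saved figure.",
    "seaborn": "bbox_inches='tight' trims the saved figure.",
    "bokeh": "The toolbar changes the exported height.",
    "altair": "vl-convert pads outside width/height.",
}


def format_notice(library, status, actual, target, delta, attempt) -> str:
    """Build the structured log line (value encodings are assumed)."""
    return (
        "::notice::canvas_gate "
        f"library={library} status={status} "
        f"actual={actual[0]}x{actual[1]} target={target[0]}x{target[1]} "
        f"delta={delta[0]:+d},{delta[1]:+d} attempt={attempt}"
    )


def describe_direction(delta) -> str:
    """Turn a signed delta into a plain-language direction."""
    dw, dh = delta
    parts = []
    if dw:
        parts.append(f"{abs(dw)} px too {'wide' if dw > 0 else 'narrow'}")
    if dh:
        parts.append(f"{abs(dh)} px too {'tall' if dh > 0 else 'short'}")
    return " and ".join(parts) or "on target"


def build_weakness(library, actual, target, delta) -> str:
    """Compose the synthetic weakness handed to the AI review step."""
    cause = LIKELY_CAUSES.get(
        library,
        f"Review `prompts/library/{library}.md` 'Canvas — hard rule' section.",
    )
    return (
        f"Canvas is {actual[0]}x{actual[1]}; closest target is "
        f"{target[0]}x{target[1]} (delta {delta[0]:+d},{delta[1]:+d}: "
        f"{describe_direction(delta)}). Likely cause: {cause}"
    )
```

In the workflow described here, the weakness text would be written to /tmp/anyplot-canvas-gate.txt only on a failing status, so the review prompt can treat the file's presence as the drift signal.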

Monitoring

  • automation/scripts/canvas_gate_report.py parses canvas_gate notice lines from recent impl-review runs and prints per-library first-attempt fail rates. Decision rule: anything > 20 % over 14 days means the library prompt is still leaking and warrants another tightening pass.
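
The aggregation and the 20 % decision rule can be sketched as below. This is not the code of canvas_gate_report.py: the regular expression, the function names, and the assumption that the first attempt is logged as `attempt=1` are all illustrative.

```python
import re
from collections import defaultdict

# Assumed shape of the notice line; only library, status and attempt are needed here.
NOTICE_RE = re.compile(
    r"::notice::canvas_gate\s+library=(?P<library>\S+)\s+"
    r"status=(?P<status>\w+)\b.*?\battempt=(?P<attempt>\d+)"
)


def first_attempt_fail_rates(lines) -> dict[str, float]:
    """Per-library share of first attempts whose gate status was 'fail'."""
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for line in lines:
        match = NOTICE_RE.search(line)
        if match is None or int(match["attempt"]) != 1:
            continue  # ignore unrelated log lines and repair attempts
        totals[match["library"]] += 1
        if match["status"] == "fail":
            fails[match["library"]] += 1
    return {lib: fails[lib] / totals[lib] for lib in totals}


def leaking_libraries(rates: dict[str, float], threshold: float = 0.20) -> list[str]:
    """Libraries whose first-attempt fail rate exceeds the decision threshold."""
    return sorted(lib for lib, rate in rates.items() if rate > threshold)
```

Counting only first attempts keeps the metric about prompt quality: a library that always repairs to pass on attempt 2 still shows up as leaking.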

Test plan

  • Local replay against last night's 79 PNGs: gate classifies 38 pass / 41 fail; messages match the manual audit (correct actual dims, signed delta, library-specific cause).
  • Inline gate snippet smoke-test: emits the exact ::notice:: format the parser expects.
  • Unit tests: 12 new tests lock the log-line contract between the gate step and the monitor script; full unit suite (1485 tests) passes.
  • Ruff lint + format clean on new files.
  • End-to-end run on one spec: after merge, dispatch bulk-generate.yml -f specification_id=sn-curve-basic (highcharts/bokeh cropped the X-axis title last night → real reproducer). Expectations: ggplot2/pygal/letsplot/plotnine/plotly pass first attempt; matplotlib/seaborn/altair/highcharts/bokeh either pass or repair to pass on attempt 2 with the named delta as feedback. No PR auto-closed.
  • Two-week watch: run uv run automation/scripts/canvas_gate_report.py daily for 3 days, then weekly. Flag any library > 20 % first-attempt fail rate.

🤖 Generated with Claude Code

MarkusNeusinger and others added 4 commits May 20, 2026 14:29
Tag chips on the spec/impl detail page and on the /stats tag-distribution
block built URLs as /?<param>=<value>. Since the page split, / is the
LandingPage and no longer accepts filter query params — only /plots does.
The links resolved to the landing page with stray query params and lost
the filter intent.

Fixes:
- SpecTabs.handleTagClick: navigate("/plots?…") instead of "/?…". One
  call site covers both spec-overview and impl-detail since SpecPage
  mounts <SpecTabs> for both router modes.
- StatsPage tag links: target /plots, encodeURIComponent the value, and
  fire `tag_click` with `source: 'stats'` (StatsPage was silent on this
  event — only spec_detail was tracked before).
- docs/reference/plausible.md: list StatsPage.tsx as a tag_click source
  and document the `source ∈ spec_detail, stats` enum.

Drive-by: add `crypto` to the eslint browser-globals list so
FeedbackWidget.tsx (uses crypto.randomUUID / getRandomValues) stops
failing `no-undef`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Copilot review feedback on #7516: the StatsPage tag-link
behavior changed in the parent commit had no test coverage (also flagged
by codecov/patch/frontend). Adds one test that locks in:
- the link's href is /plots?plot=scatter (regression guard for the
  pre-fix /?plot=scatter that silently dropped the filter intent), and
- clicking fires trackEvent('tag_click', { param, value, source: 'stats' }).

The useAnalytics mock is hoisted into a module-level mockTrackEvent so
the assertion can see the call. Cleared per test via mockClear() because
vi.restoreAllMocks() does not reset hoisted vi.fn() instances on its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior test asserted href === '/plots?plot=scatter', which proves
the /plots routing but does not exercise encodeURIComponent — 'scatter'
has no special chars, so the assertion would still pass if the encoding
were removed.

Copilot's review explicitly called out '/plots?{param}={encodedTag}',
so add a 'time series' tag (space in the value) under data_type. The
new assertion locks in href === '/plots?data=time%20series', which
fails the moment encodeURIComponent is dropped from the source. Also
clicks both links and asserts trackEvent receives the raw (unencoded)
value, matching the documented plausible.md contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Yesterday's per-library sizing change (1b1f279) shipped, but last night's
daily-regen exposed systematic canvas drift across 7 of 10 libraries: ggplot2
(generated fresh, no prior impl to anchor on) and pygal hit the 3200×1800 /
2400×2400 target exactly, while highcharts/bokeh ended ~140 px short,
matplotlib/seaborn were trimmed by bbox_inches='tight', and altair had no
canvas control at all. The two prior soft tightenings of "Base style ALWAYS
wins" couldn't beat the in-context anchor of the previous implementation's
figsize/width/height values.

This change enforces canvas dimensions through three coordinated layers:

- impl-generate-claude.md: STEP 0 canvas contract above the previous-impl
  reference, with a PIL self-check sub-step in Step 3 so Claude catches drift
  inside its own retry loop before pushing.

- per-library prompts: explicit "Canvas — hard rule" sections naming both
  canonical pixel pairs. Surgical fixes for the actual offenders — drop
  bbox_inches='tight' from matplotlib/seaborn save calls; add a PIL crop/pad
  normalizer to altair (vl-convert pads outside width/height); forbid
  autosize=True + pin margin in plotly; use Chrome CDP setDeviceMetricsOverride
  in highcharts (--window-size alone gets eaten by Chrome chrome).

- impl-review.yml: pre-AI-review canvas gate that emits a structured ::notice::
  log line per evaluation and, on drift, writes /tmp/anyplot-canvas-gate.txt
  with the actual dimensions, closest target, signed delta, and a library-specific
  likely cause. ai-quality-review.md picks the file up, copies the synthetic
  weakness verbatim, and forces VQ-05 to 0/4 — which drags quality_score below
  the repair threshold and routes the PR through the existing 5-review/
  4-repair cascade. Routing through review (not a raw step failure) preserves
  attempt counting and PR cleanup; impl-repair is dispatched only by the
  ai-rejected label, never by a hard step exit.

The repair feedback names the *actual* drift, not just the target — so the
repair model knows which knob to turn and which direction, rather than
guessing on the next attempt.

automation/scripts/canvas_gate_report.py aggregates the structured notice
lines across recent impl-review runs and prints per-library first-attempt
fail rates; anything above 20 % over 14 days means the library prompt is
still leaking and gets another tightening pass rather than burning compute
in steady-state repair.

Verified by replaying the gate logic against the 79 PNGs from last night's
regen (/tmp/regen-compare/): 38 pass (ggplot2 + pygal + most plotly/plotnine/
letsplot), 41 fail with delta + cause messages matching the manual audit.
1485 unit tests pass; 12 new tests lock the ::notice:: log contract between
the gate step and the report parser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 20, 2026 13:43
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens the regeneration pipeline against canvas-size drift by introducing a pre-AI-review pixel gate, updating generation/review prompts to treat canvas size as a hard contract (with delta-aware repair feedback), and adding a monitoring script + tests to track first-attempt failure rates. It also fixes frontend tag-chip navigation to route to /plots (and tracks tag_click from StatsPage).

Changes:

  • Add a “Canvas dimension gate” to impl-review.yml that emits structured ::notice::canvas_gate … lines and writes delta-aware repair guidance for the AI review step.
  • Tighten regen + library prompts to treat canvas size as a non-negotiable contract and provide library-specific guidance (including a local PIL self-check).
  • Add canvas_gate_report.py + unit tests to monitor gate outcomes from workflow logs; update frontend tag routing + analytics docs/tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/unit/automation/scripts/test_canvas_gate_report.py | Unit tests locking the structured canvas_gate log-line regex/aggregation contract. |
| automation/scripts/canvas_gate_report.py | New CLI script to scan impl-review logs and report per-library first-attempt fail rates. |
| .github/workflows/impl-review.yml | Adds the canvas dimension gate step and synthetic weakness output for repair routing. |
| prompts/workflow-prompts/impl-generate-claude.md | Adds Step 0 hard canvas contract + Step 3b PIL self-check instructions. |
| prompts/workflow-prompts/ai-quality-review.md | Requires ingesting /tmp/anyplot-canvas-gate.txt (if present) and forcing VQ-05 to 0/4 on drift. |
| prompts/library/matplotlib.md | Adds “Canvas — hard rule” guidance; removes bbox_inches='tight' in save examples. |
| prompts/library/seaborn.md | Adds “Canvas — hard rule” guidance; removes bbox_inches='tight' in save examples. |
| prompts/library/altair.md | Adds view/padding constraints plus post-save PIL normalization guidance for exact PNG dimensions. |
| prompts/library/plotly.md | Adds hard canvas sizing pairs; forbids autosize=True and pins margins. |
| prompts/library/highcharts.md | Adds CDP viewport override and PIL normalization guidance; documents multi-place sizing sync. |
| prompts/library/bokeh.md | Adds hard canvas sizing section and warns about toolbar-induced drift. |
| prompts/library/pygal.md | Adds hard canvas sizing section. |
| prompts/library/plotnine.md | Adds hard canvas sizing section and guidance consistent with matplotlib backend. |
| prompts/library/letsplot.md | Adds hard canvas sizing section and canonical size pairs. |
| prompts/library/ggplot2.md | Adds hard canvas sizing section for ggsave + ragg. |
| app/src/pages/StatsPage.tsx | Routes tag links to /plots?... and emits tag_click analytics from StatsPage. |
| app/src/pages/StatsPage.test.tsx | Adds regression test for /plots routing and tag_click tracking (including URL encoding). |
| app/src/components/SpecTabs.tsx | Updates tag navigation to /plots?... from spec detail. |
| docs/reference/plausible.md | Updates tag_click event documentation to include StatsPage.tsx and source=stats. |
| app/eslint.config.js | Declares crypto as a readonly global for ESLint. |

Comment thread .github/workflows/impl-review.yml Outdated
"letsplot": "Check `ggsize(W, H)` and `ggsave(..., scale=4)` pair — only the two canonical pairs land on target.",
"ggplot2": "Check `ggsave(width=…, height=…, units='in', dpi=400)` with `ragg::agg_png`.",
}
cause = causes.get(LIBRARY, "Review `prompts/library/${LIBRARY}.md` 'Canvas — hard rule' section.")
Comment thread automation/scripts/canvas_gate_report.py
Comment on lines +74 to +76
return datetime.fromisoformat(value).astimezone(timezone.utc)


Comment thread prompts/library/altair.md Outdated
Comment on lines +29 to +35
TARGET = (3200, 1800)  # or (2400, 2400) for square

def normalize_to_target(path: str, target: tuple[int, int], page_bg: str) -> None:
    """Crop or pad the PNG so its dimensions exactly equal `target`."""
    img = Image.open(path).convert("RGB")
    tw, th = target
    w, h = img.size
@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 61.20690% with 45 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
automation/scripts/canvas_gate_report.py 60.17% 45 Missing ⚠️


Three fixes for sensible review comments on #7517:

1. impl-review.yml: the fallback `cause` string used shell-style `${LIBRARY}`
   inside a Python string, so it would be written literally into the synthetic
   weakness. Switch to an f-string so the message names the actual library.

2. canvas_gate_report.py parse_since(): `datetime.fromisoformat` accepts a
   trailing `Z` only on Python 3.11 and later. The test suite documents 3.13+
   as the floor, but a naive `--since 2026-05-20` would still be interpreted
   in the host timezone. Normalize a trailing `Z` to `+00:00` and treat naive
   datetimes as UTC explicitly, so `--since 2026-05-20` is unambiguous.

3. altair.md: the `def normalize_to_target(...)` helper would have violated
   CQ-01 (KISS: no functions/classes in plot impls) and cost points in the
   review. Replace it with an inline crop/pad block that runs immediately
   after `chart.save(...)`. The PNG save sites now reference the inline
   block instead of calling a helper.
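
For the second fix, a minimal sketch of the described behavior is shown below. The name `parse_since` and the normalization rules come from the commit message above; the exact body in canvas_gate_report.py may differ.

```python
from datetime import datetime, timezone


def parse_since(value: str) -> datetime:
    """Parse an ISO-8601 --since value into an aware UTC datetime."""
    # Normalize a trailing "Z" so parsing does not depend on the interpreter version.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Naive input is declared to be UTC rather than host-local time.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


if __name__ == "__main__":
    print(parse_since("2026-05-20"))
    print(parse_since("2026-05-20T12:00:00Z"))
```

The key difference from calling `.astimezone(timezone.utc)` unconditionally is the naive branch: `astimezone` on a naive datetime assumes the host's local timezone, while `replace(tzinfo=...)` pins the value to UTC without shifting it.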

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MarkusNeusinger MarkusNeusinger merged commit ace2673 into main May 20, 2026
8 of 9 checks passed
@MarkusNeusinger MarkusNeusinger deleted the feat/canvas-size-hard-gate branch May 20, 2026 14:04
MarkusNeusinger added a commit that referenced this pull request May 20, 2026
…-reject (#7528)

## Summary

Follow-up to #7517. The canvas gate worked perfectly on yesterday's
`sn-curve-basic` regen (17/17 PASS at exact 3200×1800), but altair's PIL
center-crop fallback silently chopped the title at the top edge and the
first digit of every y-axis tick label off the left edge. AI review
scored that 84/100, which cleared the attempt-2 ≥80 threshold, and a
visibly broken altair chart shipped to the gallery.

Two coordinated fixes:

1. **Altair PAD-only:** shrink inner-view defaults so vl-convert's
external title/legend/axis padding still fits within target, and replace
the crop-or-pad with PAD-only (raise `SystemExit` instead of cropping if
vl-convert still overshoots — repair will pick that up).
2. **AR-09 EDGE_CLIPPING:** new hard auto-reject (Score=0, REJECTED) in
the AI review. Fires when any text touches or extends past the canvas
border. Distinct from the soft VQ-05 "no overflow" check — that one only
deducts; AR-09 rejects outright.

## Changes

### `prompts/library/altair.md`
- Landscape inner view: `(800, 450)` → `(620, 320)` (leaves ~500 px /
580 px of room for title + legend + axes inside 3200×1800)
- Square inner view: `(600, 600)` → `(500, 460)` (similarly for
2400×2400)
- Normalization: PAD-only with `Image.new(...) + paste(...)`. If
`vl-convert` still produces an oversized PNG, raise `SystemExit` with
the actual dims — impl-repair triggers and the next attempt shrinks
further.
- Removed the destructive center-crop branch entirely.
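
A PAD-only normalization along these lines might look like the sketch below. It is written as two functions only to keep the geometry testable; per CQ-01 the real prompt uses an inline block. The function names, the centered placement, and the default `page_bg` value are assumptions. The Pillow calls (`Image.open`, `Image.new`, `paste`, `save`) and the `SystemExit` on overshoot follow the description above.

```python
def pad_offsets(actual: tuple[int, int], target: tuple[int, int]) -> tuple[int, int]:
    """Top-left paste offset that centers `actual` inside `target`; never crops."""
    aw, ah = actual
    tw, th = target
    if aw > tw or ah > th:
        # Overshoot: fail loudly so repair shrinks the inner view instead of cropping.
        raise SystemExit(
            f"altair PNG is {aw}x{ah}, larger than target {tw}x{th}; "
            "shrink the inner view"
        )
    return (tw - aw) // 2, (th - ah) // 2


def pad_to_target(path: str, target: tuple[int, int], page_bg: str = "#ffffff") -> None:
    """Pad the PNG at `path` up to `target` with the page background color."""
    from PIL import Image  # Pillow, imported lazily so pad_offsets stays stdlib-only

    img = Image.open(path).convert("RGB")
    offset = pad_offsets(img.size, target)
    canvas = Image.new("RGB", target, page_bg)
    canvas.paste(img, offset)
    canvas.save(path)
```

Because the overshoot case exits instead of cropping, a chart whose title or tick labels need more room than the padding budget fails visibly rather than shipping with clipped text.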

### `prompts/workflow-prompts/ai-quality-review.md`
- Expanded section 6 from "Check for Auto-Reject (AR-08)" to "(AR-08,
AR-09)".
- AR-09 lists the five concrete clipping patterns ("title cropped at
top", "y-tick missing leftmost digit", "x-axis label cut at bottom",
"legend behind border", "annotation bounding box partially outside
PNG").
- False-positive guard so the rule does not over-trigger on tooltips,
decorative borders, or text that overflows its axis but stays within the
canvas.

### `prompts/quality-evaluator.md` + `prompts/quality-criteria.md`
- AR-09 added to both AR catalogues with description, trigger
conditions, exceptions, and check order updated to AR-01 → … → AR-09.

## Why now

The canvas-size hard gate (#7517) enforces *dimensions* but cannot see
*what is at those edges*. The combination of "exact 3200×1800 PNG" +
"title pixels chopped off" is the worst possible outcome because it
gives a green check while shipping a broken chart. AR-09 closes that
gap.

## Test plan

- [x] Local diff inspection — altair example block uses (620, 320),
normalization is PAD-only with `SystemExit` on overshoot.
- [ ] After merge, dispatch `bulk-generate.yml -f
specification_id=sn-curve-basic` (where altair's title-clip was
reproducible) and confirm:
  - altair first-attempt PNG has the full title visible at the top
  - all y-axis tick labels show their leading digit
  - if AR-09 fires on any other library, score is 0 and repair triggers
- [ ] Watch the `canvas_gate_report.py` output over the next few
daily-regen runs — altair should now consistently hit target without
padding ugliness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants