Visual regression CI with semantic VLM triage layer #19578
7b11236 to 606982d
Adds a durable, self-perpetuating visual-regression system for the console: a gated Playwright screenshot suite with committed baselines, plus a semantic triage layer that classifies each pixel diff (regression / intended change / noise) and drives an auto-issue → AI-fix → close-on-green loop.

What it does
- Gated GitHub Actions workflow (runs only on UI-path changes) builds the app, screenshots core routes/components across viewports with a determinism `settle` helper, and compares against committed Linux/Chromium baseline PNGs. Compare and generate-baseline modes are separated.
- On a pixel diff, a semantic triage layer (scripts/visual-diff-triage.py) classifies the diff; cheap cases (0-pixel / sub-threshold / full-page) resolve with no model call, the rest go to a VLM. It fails closed to human review on model error / low confidence / high-risk paths.
- On failure the system auto-creates a structured GitHub issue (the failure "schema" lives in visual-regression-failure-issue.yml) with confidence-gated labels so the AI fixer (Hive) can pick up confident regressions; close-on-green auto-closes the issue and writes a resolution verdict back to an append-only ledger for accuracy metrics.

Triggers & flow
- pull_request on web/src/** or web/e2e/visual/** (and the workflow files) → compare mode.
- Diff → triage → CI red on regression/human_review; auto-issue with `triage/accepted`+`ai-fix-requested` (confident regression) or `kind/bug`+`needs-triage` (otherwise). Green again → issue auto-closed.

Refreshing baselines
- See web/e2e/visual/BASELINES.md. Use the workflow's generate-baseline mode (workflow_dispatch `generate_baselines`) to regenerate the Linux baselines as an artifact, then commit them.

Config knobs (.github/visual-triage-config.json)
- confidence_cutoff (CI-fail bar), auto_accept_min_confidence (auto-fix bar), high_risk_globs, per-run token/call budget, eval_min_accuracy, target_regression_precision, min_samples.

Accuracy gate & metrics
- web/e2e/visual/triage-eval/ holds a curated eval set; visual-triage-eval.yml runs the same pipeline (real VLM when VISUAL_TRIAGE_API_KEY is set, else a deterministic mock smoke test) and fails below eval_min_accuracy. A metrics workflow publishes a regression-precision badge.

This is meant to keep running as standing infrastructure, not a one-off check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Default operation without a VLM key: the workflow runs in detect-only mode, in which a real visual change is routed to human review, fails the check, and files a tracking issue (update the baseline if intended, else fix). Configuring a vision-capable model via VISUAL_TRIAGE_API_KEY upgrades this to automated semantic classification.

Signed-off-by: Mengping Zhang <mengping.zhang@bytedance.com>
606982d to 8eb9068
Adds a durable, self-perpetuating visual-regression system for the console: a gated
Playwright screenshot suite with committed baselines, plus a semantic triage layer
that classifies each pixel diff (regression / intended change / noise) and drives an
auto-issue → AI-fix → close-on-green loop.
What it does
- Gated GitHub Actions workflow (runs only on UI-path changes) builds the app, screenshots
  core routes/components across viewports with a determinism `settle` helper, and compares
  against committed Linux/Chromium baseline PNGs. Compare and generate-baseline modes are separated.
- On a pixel diff, a semantic triage layer (scripts/visual-diff-triage.py) classifies the diff;
  cheap cases (0-pixel / sub-threshold / full-page) resolve with no model call, the rest go to a
  VLM. It fails closed to human review on model error / low confidence / high-risk paths.
- On failure the system auto-creates a structured GitHub issue (the failure "schema" lives in
  visual-regression-failure-issue.yml) with confidence-gated labels so the AI fixer (Hive) can
  pick up confident regressions; close-on-green auto-closes the issue and writes a resolution
  verdict back to an append-only ledger for accuracy metrics.
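
For reviewers who want the triage order spelled out, here is a minimal Python sketch of the "cheap cases first, fail closed otherwise" flow described above. It is not the code in scripts/visual-diff-triage.py: `PixelDiff`, `pre_classify`, `triage`, `call_model`, `min_confidence`, and the ratio thresholds are hypothetical names and values, and the verdicts assigned to the cheap cases (noise for 0-pixel and sub-threshold diffs, human review for full-page diffs) are assumptions.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Optional, Sequence, Tuple

MODEL_VERDICTS = {"regression", "intended_change", "noise"}


@dataclass
class PixelDiff:
    source_path: str      # path tied to the screenshot (assumed field)
    changed_pixels: int
    total_pixels: int     # assumed to be > 0


def pre_classify(diff: PixelDiff, noise_ratio: float, full_page_ratio: float,
                 high_risk_globs: Sequence[str]) -> Optional[str]:
    """Resolve cheap cases without a model call; None means 'ask the VLM'."""
    if diff.changed_pixels == 0:
        return "noise"
    # High-risk paths always go to a human (assumption: checked before size).
    if any(fnmatch(diff.source_path, g) for g in high_risk_globs):
        return "human_review"
    ratio = diff.changed_pixels / diff.total_pixels
    if ratio < noise_ratio:
        return "noise"
    if ratio >= full_page_ratio:
        return "human_review"
    return None


def triage(diff: PixelDiff,
           call_model: Callable[[PixelDiff], Tuple[str, float]],
           min_confidence: float = 0.8,
           noise_ratio: float = 0.001,
           full_page_ratio: float = 0.9,
           high_risk_globs: Sequence[str] = ()) -> str:
    verdict = pre_classify(diff, noise_ratio, full_page_ratio, high_risk_globs)
    if verdict is not None:
        return verdict
    try:
        label, confidence = call_model(diff)
    except Exception:
        return "human_review"          # fail closed on model error
    if label not in MODEL_VERDICTS or confidence < min_confidence:
        return "human_review"          # fail closed on low confidence
    return label
```

With a stub such as `lambda d: ("regression", 0.95)` passed as `call_model`, a mid-sized diff is classified by the stub, while a zero-pixel diff never reaches it.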
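Triggers & flow
- pull_request on web/src/** or web/e2e/visual/** (and the workflow files) → compare mode.
- Diff → triage → CI red on regression/human_review; auto-issue with
  `triage/accepted`+`ai-fix-requested` (confident regression) or
  `kind/bug`+`needs-triage` (otherwise). Green again → issue auto-closed.
Refreshing baselines
- See web/e2e/visual/BASELINES.md. Use the workflow's generate-baseline mode (workflow_dispatch
  `generate_baselines`) to regenerate the Linux baselines as an artifact, then commit them.
Config knobs (.github/visual-triage-config.json)
- confidence_cutoff (CI-fail bar), auto_accept_min_confidence (auto-fix bar), high_risk_globs,
  per-run token/call budget, eval_min_accuracy, target_regression_precision, min_samples.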
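
The label routing under "Triggers & flow", combined with the auto_accept_min_confidence knob, can be sketched roughly as follows. This is an illustration only: `ci_fails` and `issue_labels` are hypothetical helper names, and the 0.9 used in the usage note is a made-up value, not the one in .github/visual-triage-config.json.

```python
from typing import List


def ci_fails(verdict: str) -> bool:
    """CI goes red on a regression or on anything routed to human review."""
    return verdict in {"regression", "human_review"}


def issue_labels(verdict: str, confidence: float,
                 auto_accept_min_confidence: float) -> List[str]:
    """Pick labels for the auto-created issue.

    Only a regression at or above the auto-fix bar is handed to the AI fixer;
    everything else is filed as an ordinary bug awaiting triage.
    """
    if verdict == "regression" and confidence >= auto_accept_min_confidence:
        return ["triage/accepted", "ai-fix-requested"]
    return ["kind/bug", "needs-triage"]
```

For example, `issue_labels("regression", 0.95, 0.9)` yields the AI-fix labels, whereas a `human_review` verdict at any confidence falls through to `kind/bug` and `needs-triage`.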
Accuracy gate & metrics
- web/e2e/visual/triage-eval/ holds a curated eval set; visual-triage-eval.yml runs the same
  pipeline (real VLM when VISUAL_TRIAGE_API_KEY is set, else a deterministic mock smoke test) and fails
  below eval_min_accuracy. A metrics workflow publishes a regression-precision badge.
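
To make the two numbers above concrete, here is a small Python sketch of how an accuracy gate and a regression-precision figure could be computed. It is an assumption-laden illustration, not the PR's implementation: the function names, the ledger entry keys (`predicted`, `resolved`), and the reading of min_samples as "minimum flagged regressions before a precision is reported" are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of eval cases where the triage verdict matches the curated label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal length")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def passes_gate(predicted: Sequence[str], expected: Sequence[str],
                eval_min_accuracy: float) -> bool:
    """The eval workflow would fail when accuracy drops below the configured bar."""
    return accuracy(predicted, expected) >= eval_min_accuracy


def regression_precision(ledger: List[Dict[str, str]],
                         min_samples: int) -> Optional[float]:
    """Of the diffs triaged as regressions, the share later confirmed as such.

    Returns None when fewer than min_samples regressions were flagged, so a
    badge is not computed from too little data (assumed behaviour).
    """
    flagged = [e for e in ledger if e["predicted"] == "regression"]
    if len(flagged) < min_samples:
        return None
    confirmed = sum(e["resolved"] == "regression" for e in flagged)
    return confirmed / len(flagged)
```

Under these assumptions, three of four matching eval verdicts give an accuracy of 0.75, which passes a 0.7 bar and fails a 0.8 bar.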
This is meant to keep running as standing infrastructure, not a one-off check.
Files: 99 changed (+3171/−87): the 5 workflows, the visual specs + 75 Linux baselines + eval set under
web/e2e/visual/, the `visual-settle.ts` determinism helper, `scripts/visual-diff-triage.py` + `merge_ledger.py`, `.github/visual-triage-config.json`, and small touches to `web/e2e/helpers/setup.ts`, `app-visual.config.ts`, and `docs/security/SECURITY-AI.md`. No unrelated files.