Visual regression CI with semantic VLM triage layer#19578

Draft
MengpingZhang wants to merge 1 commit into
kubestellar:mainfrom
DavidDiaz0317:visual-regression-system

Conversation

@MengpingZhang

@MengpingZhang MengpingZhang commented Jun 25, 2026

Runs without a VLM by default. This PR needs no API key. With no VISUAL_TRIAGE_API_KEY configured, the workflow runs in detect-only mode: a real visual change is routed to human review, fails the check, and files a tracking issue (update the baseline if the change is intended; otherwise fix it). The cheap pixel rules (identical / sub-noise / full-page) still resolve the obvious cases without a model. Setting VISUAL_TRIAGE_API_KEY for a vision-capable model later upgrades this to automated semantic classification, with no other change needed. The auto-fix half of the loop already reuses the repo's Hive automation (org-credentialed), so fixes require no per-contributor key either.
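
A minimal sketch of the mode switch described above, for illustration only. The environment variable name comes from this description; the function names, the returned dictionary shape, and the treatment of a blank key as "not configured" are assumptions, not the code in scripts/visual-diff-triage.py.

```python
import os
from typing import Mapping, Optional

API_KEY_VAR = "VISUAL_TRIAGE_API_KEY"  # name taken from the PR description


def select_triage_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "semantic" when an API key is configured, else "detect-only"."""
    env = os.environ if env is None else env
    # Assumption: a blank or whitespace-only value counts as "not configured".
    return "semantic" if env.get(API_KEY_VAR, "").strip() else "detect-only"


def route_real_change(mode: str) -> dict:
    """Decide what happens to a diff that the cheap pixel rules did not resolve."""
    if mode == "detect-only":
        # No model available: send to a human, fail the check, file an issue.
        return {"verdict": "human_review", "check_passes": False, "file_issue": True}
    # With a key configured the diff goes to the model; the outcome is decided later.
    return {"verdict": None, "needs_model_call": True}
```

Passing the environment as a mapping keeps the switch testable without touching the real process environment.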

Draft for review (Mengping/VSIP). Baselines are app-version-specific, so screenshots may not match upstream main yet — see CI note at the bottom.

Adds a durable, self-perpetuating visual-regression system for the console: a gated
Playwright screenshot suite with committed baselines, plus a semantic triage layer
that classifies each pixel diff (regression / intended change / noise) and drives an
auto-issue → AI-fix → close-on-green loop.

What it does

  • Gated GitHub Actions workflow (runs only on UI-path changes) builds the app, screenshots
    core routes/components across viewports with a determinism settle helper, and compares
    against committed Linux/Chromium baseline PNGs. The compare and generate-baseline modes are separate.
  • On a pixel diff, a semantic triage layer (scripts/visual-diff-triage.py) classifies the diff;
    cheap cases (0-pixel / sub-threshold / full-page) resolve with no model call, the rest go to a
    VLM. It fails closed to human review on model error / low confidence / high-risk paths.
  • On failure the system auto-creates a structured GitHub issue (the failure "schema" lives in
    visual-regression-failure-issue.yml) with confidence-gated labels so the AI fixer (Hive) can
    pick up confident regressions; close-on-green auto-closes the issue and writes a resolution
    verdict back to an append-only ledger for accuracy metrics.
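
The triage cascade in the second bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of scripts/visual-diff-triage.py: the threshold values, the rule names, the `call_model` callback, the check order, and the result dictionaries are all hypothetical. Only the overall shape (cheap pixel rules first, then a model, failing closed to human review) comes from the description.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical thresholds; the PR description does not state the real values.
NOISE_MAX_RATIO = 0.0005    # below this share of changed pixels: sub-threshold
FULL_PAGE_MIN_RATIO = 0.90  # at or above this share: treated as a full-page diff


@dataclass
class DiffStats:
    changed_pixels: int
    total_pixels: int


def cheap_rule(stats: DiffStats) -> Optional[str]:
    """Return the name of the pixel rule that fires, or None if a model is needed."""
    if stats.total_pixels <= 0:
        raise ValueError("total_pixels must be positive")
    if stats.changed_pixels == 0:
        return "identical"
    ratio = stats.changed_pixels / stats.total_pixels
    if ratio < NOISE_MAX_RATIO:
        return "sub_threshold"
    if ratio >= FULL_PAGE_MIN_RATIO:
        return "full_page"
    return None


def human_review(reason: str) -> dict:
    return {"verdict": "human_review", "source": "fail_closed", "reason": reason}


def triage(
    stats: DiffStats,
    path: str,
    call_model: Callable[[str], Tuple[str, float]],  # returns (label, confidence)
    confidence_cutoff: float = 0.8,  # hypothetical default
    high_risk_globs: Sequence[str] = (),
) -> dict:
    rule = cheap_rule(stats)
    if rule is not None:
        return {"verdict": rule, "source": "pixel_rule"}  # no model call
    # Assumption: high-risk paths are checked before spending a model call.
    if any(fnmatch(path, glob) for glob in high_risk_globs):
        return human_review("high-risk path")
    try:
        label, confidence = call_model(path)
    except Exception as exc:  # any model failure fails closed
        return human_review(f"model error: {exc}")
    if confidence < confidence_cutoff:
        return human_review("low confidence")
    return {"verdict": label, "source": "vlm", "confidence": confidence}
```

Every branch that cannot produce a confident answer returns `human_review`, so a broken or unsure model can never turn the check green on its own.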

Triggers & flow

  • pull_request on web/src/** or web/e2e/visual/** (and the workflow files) → compare mode.
  • Diff → triage → CI red on regression/human_review; auto-issue with triage/accepted+ai-fix-requested
    (confident regression) or kind/bug+needs-triage (otherwise). Green again → issue auto-closed.
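
The flow above maps a triage verdict to a CI status and a label set. A small sketch, assuming verdict strings such as "regression" and "human_review" and assuming that auto_accept_min_confidence is the threshold that separates "confident regression" from everything else; the function names and the default value are hypothetical.

```python
from typing import List

FAILING_VERDICTS = {"regression", "human_review"}  # verdicts that turn CI red


def ci_is_red(verdict: str) -> bool:
    return verdict in FAILING_VERDICTS


def issue_labels(verdict: str, confidence: float,
                 auto_accept_min_confidence: float) -> List[str]:
    """Pick labels for the auto-created issue."""
    if verdict == "regression" and confidence >= auto_accept_min_confidence:
        # Confident regression: eligible for pickup by the AI fixer.
        return ["triage/accepted", "ai-fix-requested"]
    # Anything else that failed the check waits for a human.
    return ["kind/bug", "needs-triage"]


def plan(verdict: str, confidence: float,
         auto_accept_min_confidence: float = 0.9) -> dict:  # hypothetical default
    """Combine CI status and labels; an issue is filed only when CI is red."""
    red = ci_is_red(verdict)
    labels = issue_labels(verdict, confidence, auto_accept_min_confidence) if red else []
    return {"ci": "red" if red else "green", "file_issue": red, "labels": labels}
```

For example, `plan("regression", 0.95)` yields the AI-fix labels, while `plan("human_review", 0.95)` still fails CI but falls back to `kind/bug` and `needs-triage`.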

Refreshing baselines

  • See web/e2e/visual/BASELINES.md. Use the workflow's generate-baseline mode (workflow_dispatch
    generate_baselines) to regenerate the Linux baselines as an artifact, then commit them.

Config knobs (.github/visual-triage-config.json)

  • confidence_cutoff (CI-fail bar), auto_accept_min_confidence (auto-fix bar), high_risk_globs,
    per-run token/call budget, eval_min_accuracy, target_regression_precision, min_samples.
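
To make the knobs concrete, here is a hedged sketch of loading and validating such a config. The key names confidence_cutoff, auto_accept_min_confidence, high_risk_globs, eval_min_accuracy, target_regression_precision, and min_samples come from the list above. The budget key names (`max_calls_per_run`, `max_tokens_per_run`), every value, and the example glob are invented for illustration and may not match .github/visual-triage-config.json.

```python
import json
from dataclasses import dataclass
from typing import Tuple

# Illustrative values only; none of these numbers come from the PR.
EXAMPLE_CONFIG = """
{
  "confidence_cutoff": 0.8,
  "auto_accept_min_confidence": 0.9,
  "high_risk_globs": ["web/src/auth/**"],
  "max_calls_per_run": 20,
  "max_tokens_per_run": 50000,
  "eval_min_accuracy": 0.85,
  "target_regression_precision": 0.9,
  "min_samples": 20
}
"""

RATIO_KEYS = ("confidence_cutoff", "auto_accept_min_confidence",
              "eval_min_accuracy", "target_regression_precision")


@dataclass(frozen=True)
class TriageConfig:
    confidence_cutoff: float            # CI-fail bar
    auto_accept_min_confidence: float   # auto-fix bar
    high_risk_globs: Tuple[str, ...]
    max_calls_per_run: int              # hypothetical key name
    max_tokens_per_run: int             # hypothetical key name
    eval_min_accuracy: float
    target_regression_precision: float
    min_samples: int


def load_config(text: str) -> TriageConfig:
    raw = json.loads(text)
    raw["high_risk_globs"] = tuple(raw["high_risk_globs"])
    cfg = TriageConfig(**raw)  # unknown or missing keys raise TypeError
    for key in RATIO_KEYS:
        value = getattr(cfg, key)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must be within [0, 1], got {value}")
    return cfg
```

Validating at load time means a typo in a threshold fails the run immediately instead of silently changing which diffs reach the AI fixer.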

Accuracy gate & metrics

  • web/e2e/visual/triage-eval/ holds a curated eval set; visual-triage-eval.yml runs the same
    pipeline (real VLM when VISUAL_TRIAGE_API_KEY is set, otherwise a deterministic mock smoke test) and fails
    when accuracy falls below eval_min_accuracy. A metrics workflow publishes a regression-precision badge.
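
The two metrics named above can be computed as in this sketch. It assumes the eval set reduces to (predicted, expected) label pairs and uses the standard definitions of accuracy and precision. The label strings and function names are hypothetical, and the PR may compute its badge differently.

```python
from typing import Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (predicted label, expected label)


def accuracy(pairs: Iterable[Pair]) -> float:
    """Share of eval cases where the predicted label equals the expected one."""
    items: List[Pair] = list(pairs)
    if not items:
        raise ValueError("eval set is empty")
    return sum(pred == exp for pred, exp in items) / len(items)


def regression_precision(pairs: Iterable[Pair]) -> Optional[float]:
    """Of the cases predicted as "regression", the share that really were.

    Returns None when nothing was predicted as a regression.
    """
    flagged = [exp for pred, exp in pairs if pred == "regression"]
    if not flagged:
        return None
    return sum(exp == "regression" for exp in flagged) / len(flagged)


def passes_eval_gate(pairs: Iterable[Pair], eval_min_accuracy: float) -> bool:
    """True when accuracy meets the configured minimum; the workflow fails otherwise."""
    return accuracy(pairs) >= eval_min_accuracy
```

Precision on the regression class is the number that matters for the auto-fix loop: a false "regression" sends the AI fixer after a change that was intended or harmless.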

This is meant to keep running as standing infrastructure, not a one-off check.


Files: 99 changed (+3171/−87) — the 5 workflows, the visual specs + 75 Linux baselines + eval set under web/e2e/visual/, the visual-settle.ts determinism helper, scripts/visual-diff-triage.py + merge_ledger.py, .github/visual-triage-config.json, and small touches to web/e2e/helpers/setup.ts, app-visual.config.ts, and docs/security/SECURITY-AI.md. No unrelated files.

@kubestellar-prow kubestellar-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026
@kubestellar-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mikespreitzer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow Bot added the dco-signoff: no Indicates the PR's author has not signed the DCO. label Jun 25, 2026
@netlify

netlify Bot commented Jun 25, 2026

Deploy Preview for kubestellarconsole canceled.

Built without sensitive environment variables

Name Link
🔨 Latest commit 8eb9068
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/6a3d41d201433400088bb782

@github-actions

👋 Welcome to the KubeStellar community! 💖

Thanks and congrats 🎉 for opening your first PR here! We're excited to have you contributing.

Before merge, please ensure:

  • DCO Sign-off — All commits signed with git commit -s (DCO)
  • PR Title — Starts with an emoji: ✨ feature | 🐛 bug fix | 📖 docs | 🌱 infra/tests | ⚠️ breaking

📬 If you're using KubeStellar in your organization, please add your name to our Adopters list. 🙏 It really helps the project gain momentum and credibility — a small contribution back with a big impact.

A maintainer will review your PR soon. Hope you have a great time here!

🌟 ~~~~~~~~~~ 🌟

📬 If you like KubeStellar, please ⭐ star ⭐ our repo to support it!

🙏 It really helps the project gain momentum and credibility — a small contribution back with a big impact.

@kubestellar-prow kubestellar-prow Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 25, 2026
@MengpingZhang MengpingZhang force-pushed the visual-regression-system branch from 7b11236 to 606982d Compare June 25, 2026 14:28
@kubestellar-prow kubestellar-prow Bot added dco-signoff: yes Indicates the PR's author has signed the DCO. and removed dco-signoff: no Indicates the PR's author has not signed the DCO. labels Jun 25, 2026
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Default operation without a VLM key: the workflow runs in detect-only mode — a real
visual change is routed to human review, fails the check, and files a tracking issue
(update the baseline if intended, else fix). Configuring a vision-capable model via
VISUAL_TRIAGE_API_KEY upgrades this to automated semantic classification.

Signed-off-by: Mengping Zhang <mengping.zhang@bytedance.com>
@MengpingZhang MengpingZhang force-pushed the visual-regression-system branch from 606982d to 8eb9068 Compare June 25, 2026 14:57
Labels

dco-signoff: yes Indicates the PR's author has signed the DCO. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.