diff --git a/domains/pr-workflow/skills/pr-validate/references/claim-extraction.md b/domains/pr-workflow/skills/pr-validate/references/claim-extraction.md new file mode 100644 index 0000000..8eabb94 --- /dev/null +++ b/domains/pr-workflow/skills/pr-validate/references/claim-extraction.md @@ -0,0 +1,61 @@ +# Claim extraction + +Before choosing any lane, turn the PR into a **falsifiable, surface-specific claim**. Every lane is only as good as the claim it tests. A vague claim ("improves perf", "fixes the bug") can't be proven or refuted; a sharp claim names the precondition, action, observable outcome, and what would disprove it. + +## Read these, in order + +1. **PR body** — Description (what/why), `Fixes #N`, Manual testing steps, the Before/After intent. +2. **Linked issue(s)** — the bug report / acceptance criteria; "Steps to reproduce" and "Expected vs actual" are the claim in the reporter's words. +3. **The diff** (`gh pr diff`) — what actually changed: which surfaces, controllers, modules. Anchor the claim to what the code can do, not only what the body promises. +4. **Labels / type** — bug vs feat vs perf vs refactor changes the claim shape (see special cases). + +## Extraction steps + +1. **Asserted change** — what does the PR say it does? (body + issue) +2. **Anchor to the diff** — which surface/module changed? Reconcile intent with the diff. If the body promises X but the diff can't deliver X, **flag the drift** — that's a finding, not a claim. +3. **Phrase as falsifiable** — `Given , when , then .` The outcome must be observable and checkable. Replace vague verbs (improve / fix / handle / support) with the concrete observable. +4. **Pin the surface + reachability** — exact screen / API / metric. Reachable in the default fixture, or does it need state seeding, a feature flag, or a fallback surface? +5. **Classify the type** → routes to lanes via the matching guide: visible UI · non-visible perf · telemetry · persisted-state · build-output · behavior-no-UI. +6. **Decompose mixed claims** — a PR that changes UI *and* shifts a metric is two claims; validate each. +7. **Name the falsifier** — what observation would prove it FALSE? (the negative). The trustworthiness anchor. +8. **Set the baseline** — before/after needs a "before": the base ref, a before-window (telemetry), or a test that **fails on `main`** (bug fixes). + +## Claim Card (output) + +``` +Claim: Given , when , then . +Surface: (reachable? seed / flag / fallback: …) +Type: → lanes <…> +Falsifier: +Baseline: +``` + +One card per claim. For a refactor, the claim is a **negation** (see below). + +## Claim quality bar + +A good claim is **falsifiable** (observable outcome + clear falsifier), **surface-specific** (names the exact screen/API/metric, not "the app"), **diff-anchored** (the changed code can plausibly produce it), **bounded** (one behavior, one precondition), and **measurable** where quantitative (a number + threshold, not "faster"). + +## Anti-patterns → refinements + +| Vague claim | Refined | +|---|---| +| "Improves performance" | "Opening the Activity tab: TBT drops below 200ms (was >600ms)" — name the interaction, metric, threshold | +| "Fixes the bug" | "With privacy mode on, the Perps tab balance is masked" — observable behavior + precondition + surface | +| "Refactor, no behavior change" | Negation claim: "behavior of `` is unchanged" → prove via a regression test staying green / snapshot / identical output, **not** a screenshot | +| "Adds a null check" (restates the diff) | "No crash when `` is null on ``" — the behavior, not the code | +| Body promises X, diff does Y | Not a claim — **flag the drift** to the author | + +## Special cases + +- **Refactor / no-op:** the claim is "nothing observable changed." Falsifier = any behavior/output diff. Prove via a regression test staying green, an empty snapshot diff, identical bundle/output, or a benchmark within noise. A passing screenshot proves nothing here. +- **Bug fix:** the strongest claim form ships its own falsifier — a test that fails on `main` and passes on the branch. Extract the claim straight from the issue's "Expected vs actual." +- **Perf:** always quantify — metric + interaction + threshold + baseline. Without a number it isn't falsifiable. +- **Persisted-state / migration:** claim = "upgrading from `` preserves `` and applies ``." Falsifier = corrupted/lost state. Baseline = a profile from the prior version. +- **Flag-gated:** two claims, one per flag state. + +## Worked examples + +- **Visible:** body "privacy mode doesn't hide the Perps balance"; issue: expected masked, actual visible; diff touches the Perps balance component. → **Claim:** *Given privacy mode on, when I open the Perps tab, the balance is masked.* **Surface:** Perps tab (gated → fallback: Shield entry modal). **Type:** visible. **Falsifier:** balance digits visible under privacy mode. **Baseline:** same flow on base reproduces the bug. +- **Perf:** body "defer Rive wasm at startup"; diff: dynamic `import()` of the Rive runtime. → **Claim:** *On cold start of the home view, the Rive wasm chunk is not requested until the animation surface mounts.* **Surface:** startup network + chunk graph. **Type:** perf. **Falsifier:** the chunk appears in the cold-start waterfall. **Baseline:** base requests it at startup. +- **Migration:** diff adds a state migration. → **Claim:** *Loading a profile from `` applies the migration; `changedKeys` covers only the touched controllers; all other state intact.* **Type:** state. **Falsifier:** an untouched controller mutated, or migrated state malformed. **Baseline:** a prior-version profile. diff --git a/domains/pr-workflow/skills/pr-validate/references/evidence-catalog.md b/domains/pr-workflow/skills/pr-validate/references/evidence-catalog.md new file mode 100644 index 0000000..41586e7 --- /dev/null +++ b/domains/pr-workflow/skills/pr-validate/references/evidence-catalog.md @@ -0,0 +1,68 @@ +# Evidence catalog — extension + +The menu of evidence kinds for validating a `metamask-extension` PR, with **what each proves** and **which skill or tool captures it**. This is a matching guide, not a capture cookbook — for capture mechanics it defers to the sibling skills (`visual-testing`, `performance-testing`, `e2e-test`, `e2e-flakiness-patterns`, `component-view-test`, `test-i18n-usage`, `unit-testing`, `ab-testing`) and to AEP. + +Pick the evidence that would **falsify the claim if it were false**. Prefer a lane that yields an artifact a reviewer can independently re-check (a link, image, number, or replayable trace) over prose. Match, then capture — don't run the whole menu. Verify any `yarn` script name against the repo's `package.json`; they drift. + +## Matching guide (claim → lanes) + +| The PR claims… | Lead with | Corroborate | +|---|---|---| +| a visible UI behavior | visual (`visual-testing` / AEP `visual_validation`) | recording for motion; a11y | +| a fixed bug (any) | **falsifying test** — fails on `main`, passes on the branch | visual if visible; Sentry if it errored | +| preload / no-double-fetch / lazy-load | AEP `perf_validation` | DevTools CDP netlog, chunk membership | +| a render / over-render fix | WDYR + React DevTools | startup traces | +| interaction responsiveness | INP, long-task TBT | DevTools profile | +| startup / load timing | benchmark A/B (paired) | startup phase traces, FCP/LCP | +| smaller/cleaner bundle | bundle-size diff | chunk membership | +| a memory leak fixed | heap-over-a-flow | benchmark over time | +| an error/crash fixed | Sentry rate→0 | falsifying test, visual | +| a dependency change is safe | LavaMoat policy + manifest diff | bundle-size | +| persisted-state change | **state migration** (`changedKeys`, old→new) | vault round-trip | +| tx / confirmation behavior | transaction simulation | e2e trace | +| dapp / provider behavior | provider connectivity (EIP-1193/6963) | e2e trace | +| flag-gated behavior | feature-flag matrix (on/off) | visual per state | +| snap behavior | snaps execution | distributed trace | +| copy / localization | i18n usage | visual | + +## Lanes by family + +**Behavior & flow** +- **Visual before/after** — UI on a real headed build with controlled state/network. Capture via `visual-testing` (`yarn build:test:webpack` → `dist/chrome`; `mm launch` / `describe-screen` / `screenshot`; `mm mock-network` for degraded paths). Or AEP `visual_validation` (autonomous: deterministic seed + agent navigation). +- **E2E trace + video** — a replayable full-flow proof. Playwright: `yarn playwright test ` (trace `on` by default; view `yarn test:e2e:pw:report`). Selenium: `yarn test:e2e:single --browser chrome|firefox [--retries n]` (screenshots auto on failure). See `e2e-test`. +- **Falsifying regression test** — the strongest single proof a fix targets the bug: a new test that **fails on `main`, passes on the branch**. Show both runs. Reach for it on every bug fix. +- **Component / Storybook** — a component across states in isolation: `yarn storybook`, `yarn test-storybook` (visual + a11y via the Storybook a11y addon). See `component-view-test`. +- **Flaky-stability rerun** — run N× to prove non-flakiness (Playwright `--retries`, Selenium `--retries n`). See `e2e-flakiness-patterns`. + +**Performance & render** +- **Startup / custom traces** — which phase moved: `TraceName` spans in `shared/lib/trace.ts`, read via `window.stateHooks.getCustomTraces()` (test/debug). LCP fallback mark `mm-hero-painted`. +- **Web vitals** — `window.stateHooks.getWebVitalsMetrics()` → INP/FCP/LCP/CLS (`ui/helpers/utils/web-vitals.ts`, attribution build). Note: INP fires on all pages; **FCP/LCP/CLS don't fire on popup pages** (sidepanel/E2E only). +- **Long-task / TBT** — main-thread blocking: `window.stateHooks.getLongTaskMetricsWithTBT()` (`ui/helpers/utils/performance-observers.ts`). TBT lives here, not in the web-vitals lane. +- **React render & selector** — WDYR (`ENABLE_WHY_DID_YOU_RENDER`, wired in `app/scripts/development/wdyr.ts`) for unnecessary re-renders; `yarn devtools:react` for the flame graph. Selectors use `reselect`; no built-in call counter — prove via WDYR. +- **Benchmark A/B** — `yarn test:e2e:benchmark` (presets in `shared/constants/benchmarks.ts`). The rolling baseline sits behind a `continue-on-error` step and can silently freeze — verify it's current and prefer a **paired A/B** (build both refs, compare directly). See `performance-testing` / `ab-testing`. +- **DevTools / CDP** *(manual)* — flame chart, network waterfall, JS coverage, heap snapshots over a flow (memory leaks), CPU throttling (`Emulation.setCPUThrottlingRate`), animation FPS. No repo helper — DevTools or `mm cdp`. + +**Build output** +- **Bundle-size diff** — measured grow/shrink (bundle-size CI or local build comparison). +- **Chunk membership** — a module moved to the intended lazy chunk (webpack build; source-map membership). +- **LavaMoat policy diff** — no new global/builtin/package capability: regenerate `lavamoat/webpack/build/policy.json` (`yarn webpack:lavamoat:policy:build`) and `git diff`; new entries need justification. +- **Manifest permissions diff** — no scope creep: `git diff app/manifest/v{2,3}/_base.json`. +- **Build-variant matrix** — works across types: `yarn build:test:flask` / `:beta` / `:mv2`. + +**Production telemetry** +- **Sentry query links** — before/after error-rate / transaction / latency, scoped to the release. For PRs that *add/change span instrumentation* (volume), use the analytics span-quota skill instead. +- **Distributed traces** — a span/transaction now appears / is shaped correctly. + +**Extension integrity** +- **State migration** — a persisted-state change doesn't corrupt existing users: `app/scripts/migrations/NNN.test.js` asserts `meta.version` and that `changedKeys` covers only mutated controllers. Scaffold with `./development/generate-migration.sh NNN`. +- **Vault / keyring round-trip** — lossless encrypt→decrypt (`app/scripts/lib/encryptor-factory.ts`; `test/e2e/tests/vault-corruption/`). +- **Transaction simulation / gas** — balance-changes/gas correct before submit (`app/scripts/lib/transaction/containers/enforced-simulations.ts`; `test/e2e/tests/simulation-details/`). +- **Provider / dapp connectivity** — injection + connect + requests: `yarn dapp` (test-dapp on :8080); EIP-6963 `test/e2e/provider/eip-6963.spec.js`. +- **Feature-flag matrix** — correct in both remote-flag states: flags come from the remote client-config API (not local config); mock the response in e2e to force each state (`test/e2e/tests/remote-feature-flag/`). +- **Snaps / multichain** — snap behavior (e.g. `snap_startTrace`): `test/e2e/snaps/`, flask build. +- **i18n usage** — no hardcoded strings; locales resolve: `yarn verify-locales` (`app/_locales/`). + +**CI, review & process** +- **CI check links** (`gh pr checks`), **coverage delta** (`yarn test:unit:coverage`; `codecov.yml`), automated-reviewer output, and **manual reproduction steps** (populates the PR template's Manual testing steps). + +Run the cheapest lane that yields an independently re-checkable artifact, confirm the claim holds, then escalate. Don't over-instrument a one-line copy fix; don't under-prove a startup-latency or migration claim with a single screenshot. diff --git a/domains/pr-workflow/skills/pr-validate/references/evidence-trustworthiness.md b/domains/pr-workflow/skills/pr-validate/references/evidence-trustworthiness.md new file mode 100644 index 0000000..f2cdd2d --- /dev/null +++ b/domains/pr-workflow/skills/pr-validate/references/evidence-trustworthiness.md @@ -0,0 +1,25 @@ +# Evidence trustworthiness (anti-reward-hacking) + +A green result is not proof. An agent — or an eager run — can produce evidence that *looks* like it validates the claim but doesn't. Before believing or publishing any lane, run it through this gate. The Claim Card's **Falsifier** is the anchor: trustworthy evidence is evidence that *could* have shown the falsifier and didn't. + +## The gate (per lane, before publish) + +1. **Non-empty & expected media** — the run produced artifacts of the expected kind. A "pass" with zero artifacts (a skipped or errored step) is not a pass. +2. **Shows the claimed surface** — the screenshot/recording is the Claim Card surface in the asserted state — not a loading spinner, an error toast, the wrong screen, or a pre-action frame. Eyeball it. +3. **Exercises the changed code** — the test/flow actually hits the diff. For a test: it **fails on `main`**. For a flow: the changed component/route is on the path. A green test that never imports the changed module proves nothing. +4. **Signal exceeds noise** — a perf delta is beyond run-to-run variance (paired A/B, multiple iterations). A 3% move on a noisy metric is not evidence. +5. **Could have failed** — the assertion has a reachable failure mode. Always-true assertions (`expect(true)`, a screenshot with no assertion, a query with no time bound) can't falsify anything. +6. **Right baseline** — "before" is the actual base ref / prior version / pre-window, not a stale or mismatched comparison. + +## Lane-specific traps + +- **Visual:** spinner/skeleton mistaken for the loaded state; the toggle (privacy/redaction) not actually flipped; a cached screenshot from a prior run; a fallback surface shown without saying so. +- **Perf / benchmark:** a frozen/stale baseline (verify it's current; prefer paired A/B); single sample; warm-vs-cold mismatch; measuring a different interaction than the claim. +- **Test:** a snapshot regenerated to match the bug (`--updateSnapshot` masking a regression); the test mocks out the changed path; it passes on `main` too (so it isn't a regression test). +- **Telemetry:** query window excludes the release; the error regrouped under a different fingerprint; sample-rate makes "0 events" meaningless. +- **Migration:** only the happy path asserted; `changedKeys` not checked against actual mutations; no real prior-version fixture. +- **Coverage:** a line covered ≠ a behavior asserted (executed but never checked). + +## When evidence fails the gate + +Don't publish it. Either re-capture correctly, **downgrade the verdict to inconclusive** and name what's missing, or — if the evidence shows the claim is false — report the refutation (the validation succeeded; the change didn't). Never round a weak pass up to "proven." diff --git a/domains/pr-workflow/skills/pr-validate/references/worked-examples.md b/domains/pr-workflow/skills/pr-validate/references/worked-examples.md new file mode 100644 index 0000000..70368fa --- /dev/null +++ b/domains/pr-workflow/skills/pr-validate/references/worked-examples.md @@ -0,0 +1,30 @@ +# Worked examples (end-to-end) + +Full runs: claim → lanes → capture → trust-gate → publish. The visual case is in the skill's main instructions; these cover the non-visual claim shapes. + +## Perf — "defer Rive wasm at startup" +- **Claim:** on cold start of the home view, the Rive wasm chunk isn't requested until the animation surface mounts. **Surface:** startup network + chunk graph. **Falsifier:** the chunk appears in the cold-start waterfall. **Baseline:** base requests it at startup. +- **Lanes:** AEP perf validation (primary) → chunk membership + CDP netlog (corroborate). +- **Capture:** paired build of base vs head (`yarn webpack --test`); CDP netlog over cold start for each; source-map chunk membership of the Rive runtime. +- **Trust gate:** cold-vs-cold (not warm); the chunk truly absent (not deferred by a few ms); the netlog covers the whole startup window. +- **Publish:** before/after request list + a chunk-membership table in the PR body. No screenshot needed. + +## Migration — "add migration NNN" +- **Claim:** loading a profile from `` applies NNN; `changedKeys = {X, Y}`; all other state intact. **Falsifier:** an untouched controller mutated / malformed state. **Baseline:** a prior-version profile. +- **Lanes:** migration test (primary) → vault round-trip (if the vault is touched). +- **Capture:** run `NNN.test.js` (old-state-in → new-state-out); assert `changedKeys`; load a real prior-version profile and confirm boot. +- **Trust gate:** the test asserts more than the happy path; `changedKeys` matches the actual mutations; the fixture is a real prior profile, not synthetic. +- **Publish:** the `changedKeys` assertion + before/after state shape; link the test run. + +## Flag-gated — "Perps banner behind a remote feature flag" +- **Claim (×2):** flag on → banner shows; flag off → banner absent. **Surface:** home/Perps. **Falsifier:** banner state ≠ flag state. **Baseline:** each flag state is its own baseline. +- **Lanes:** flag matrix → visual per state. +- **Capture:** mock the client-config response for each flag state; screenshot each. +- **Trust gate:** the flag is actually toggled (read `remoteFeatureFlags`); two distinct states are shown, not the same frame twice. +- **Publish:** a two-up before/after (flag off / flag on) in the PR body. + +## Refactor / no-op — "extract a hook, no behavior change" +- **Claim (negation):** behavior of `` is unchanged. **Falsifier:** any output/behavior diff. **Baseline:** base behavior. +- **Lanes:** regression suite stays green + snapshot diff empty + bundle size within noise. +- **Trust gate:** snapshots were *not* regenerated to hide a diff; the tests actually cover the surface; bundle delta is within noise, not "small but real". +- **Publish:** "no behavior change — regression suite green, snapshots unchanged, bundle ±0"; link CI. A passing screenshot is not evidence here. diff --git a/domains/pr-workflow/skills/pr-validate/repos/metamask-extension.md b/domains/pr-workflow/skills/pr-validate/repos/metamask-extension.md new file mode 100644 index 0000000..0fea137 --- /dev/null +++ b/domains/pr-workflow/skills/pr-validate/repos/metamask-extension.md @@ -0,0 +1,77 @@ +--- +repo: metamask-extension +parent: pr-validate +metadata: + type: pr-validation +--- + +# PR Validation — Extension + +Prove a PR does what it claims with **objective, reviewer-grade evidence**, then attach that evidence to the PR. This is the orchestration layer above the capture skills: it decides *what* evidence a PR needs and *assembles* it. For capture mechanics it defers to siblings — `visual-testing` (mm CLI screenshots/flows), `performance-testing`, `pr-manual-testing`, `pr-description` — and to the MetaMask Autonomous Engineering Platform (AEP) for autonomous runs. + +Complements `pr-readiness-check` (which checks that tests and guidelines are *present*); this skill proves the *behavior* is correct. + +## When to Use + +- Before marking a PR ready for review, or after a force-push, to produce the before/after a reviewer expects. +- To back a perf, telemetry, or UI claim with an artifact a reviewer can independently re-check. +- To assemble and publish an evidence bundle into the PR's Screenshots/Recordings section. + +Not for code-correctness review. + +## The core move: match evidence to the claim + +A PR makes a **falsifiable claim** ("privacy mode now hides the Perps balance"; "hovering the asset row preloads the chart with no double-fetch"; "this cuts startup latency"). Validation = pick the evidence that would **falsify the claim if it were false**, then capture it. Don't run a fixed checklist. + +First, write a **Claim Card** (`Given , when , then ` + surface, falsifier, baseline) from the PR body, the linked issue, and the diff — the linchpin; every lane is only as good as the claim. Full rubric, anti-patterns, and special cases (refactor/no-op, bug-fix, perf, migration, flag-gated): `references/claim-extraction.md`. Then choose lanes: + +| Claim shape | Lead with | Capture via | +|---|---|---| +| Visible UI change (layout, copy, show/hide, theme) | before/after screenshots | `visual-testing` (mm CLI), or AEP `visual_validation` | +| Motion / multi-step flow | screen recording → GIF | `visual-testing` + ffmpeg | +| Non-visible perf behavior (preload, no double-fetch, lazy-load, chunk membership) | falsifiable network/static assertions | DevTools/CDP netlog, AEP `perf_validation` | +| Latency / startup timing | benchmark numbers | `performance-testing`, DevTools profile | +| Telemetry / error-rate / latency in prod | Sentry query link (before/after window) | Sentry MCP / dashboard | +| Bundle / build output | bundle-size diff, chunk membership | build + source-map analysis | +| Behavior with no UI | targeted tests + repro | `pr-manual-testing`, CI checks | + +A PR that mixes claims (a UI fix that also shifts a metric) needs more than one lane, assembled into one bundle. Full lane menu with capture pointers: `references/evidence-catalog.md`. + +## AEP — autonomous validation + +The MetaMask Autonomous Engineering Platform (`MetaMask/metamask-autonomous-engineering-platform`) can validate a PR autonomously. Submit a task for the PR head with `taskClass: visual_validation` (visible behavior) or `perf_validation` (falsifiable network/static/smoke assertions). It checks out the PR, **deterministically seeds** extension state, has an agent **navigate** and capture, and returns an `evidenceBundle` of artifacts (screenshots / proven assertions) that it can publish to the PR body. + +The split that makes the evidence trustworthy: **seeding is deterministic (the platform), navigation is the agent.** That separation is why a screenshot proves the change instead of reward-hacking a loading screen. + +Output modes: `validation_only` (the default for these task classes), plus `pull_request` / `report_only` / `evidence_only`. See the AEP repo's `docs/` for how to run it. + +## Vacuous-pass discipline + +A green result is real only if the evidence bundle is **non-empty** and contains the expected media. An agent chain can "pass" by skipping with zero artifacts. Always confirm artifacts exist — and show the claimed surface — before believing a pass or publishing. Never upgrade a zero-artifact run to "proven". + +## Publishing evidence to the PR body + +Put the evidence where a reviewer expects it: the PR template's **Screenshots/Recordings → Before / After** section (see `pr-description`). + +- Host images somewhere GitHub renders inline (an asset store the PR can reach) — a `localhost` or local file path will not render. +- Convert recordings to GIF (e.g. `ffmpeg` two-pass palette); webm/mp4 don't render inline in PR bodies. +- Scrub local paths, usernames, and internal URLs from any narrative before posting. + +## Reporting + +Lead with the verdict and the claim it tests: + +``` +PR # +Claim: <the falsifiable behavior under test> +Verdict: ✅ proven / ❌ refuted / ⚠️ inconclusive +Evidence: <lane> — <artifact or link>, ... +``` + +If a lane comes back inconclusive, say so and name what's missing. + +## Boundaries + +- Orchestration + methodology; defers capture to `visual-testing` / `performance-testing` / AEP. +- Proves behavior, not code quality (pair with review skills) and not readiness *presence* (`pr-readiness-check`). +- Confirm before writing to a public PR body. diff --git a/domains/pr-workflow/skills/pr-validate/skill.md b/domains/pr-workflow/skills/pr-validate/skill.md new file mode 100644 index 0000000..e696849 --- /dev/null +++ b/domains/pr-workflow/skills/pr-validate/skill.md @@ -0,0 +1,4 @@ +--- +name: pr-validate +description: Validate a PR with objective evidence that proves its behavioral claim — match evidence to the claim, capture it (AEP autonomous validation, visual-testing, performance-testing, Sentry, DevTools), and publish before/after into the PR body. Use when asked to validate or prove a PR, capture before/after evidence, attach screenshots/recordings/Sentry links to a PR, or confirm a change actually works before review. Trigger phrases include "validate this PR", "prove the fix", "before/after evidence", and "build an evidence bundle". +---