MetaMask · MajorLift · Jun 26, 2026 · Jun 26, 2026 · Jun 26, 2026 · Jun 26, 2026
diff --git a/domains/pr-workflow/skills/pr-validate/references/claim-extraction.md b/domains/pr-workflow/skills/pr-validate/references/claim-extraction.md
@@ -0,0 +1,61 @@
+# Claim extraction
+
+Before choosing any lane, turn the PR into a **falsifiable, surface-specific claim**. Every lane is only as good as the claim it tests. A vague claim ("improves perf", "fixes the bug") can't be proven or refuted; a sharp claim names the precondition, action, observable outcome, and what would disprove it.
+
+## Read these, in order
+
+1. **PR body** — Description (what/why), `Fixes #N`, Manual testing steps, the Before/After intent.
+2. **Linked issue(s)** — the bug report / acceptance criteria; "Steps to reproduce" and "Expected vs actual" are the claim in the reporter's words.
+3. **The diff** (`gh pr diff`) — what actually changed: which surfaces, controllers, modules. Anchor the claim to what the code can do, not only what the body promises.
+4. **Labels / type** — bug vs feat vs perf vs refactor changes the claim shape (see special cases).
+
+## Extraction steps
+
+1. **Asserted change** — what does the PR say it does? (body + issue)
+2. **Anchor to the diff** — which surface/module changed? Reconcile intent with the diff. If the body promises X but the diff can't deliver X, **flag the drift** — that's a finding, not a claim.
+3. **Phrase as falsifiable** — `Given <precondition>, when <action>, then <observable outcome>.` The outcome must be observable and checkable. Replace vague verbs (improve / fix / handle / support) with the concrete observable.
+4. **Pin the surface + reachability** — exact screen / API / metric. Reachable in the default fixture, or does it need state seeding, a feature flag, or a fallback surface?
+5. **Classify the type** → routes to lanes via the matching guide: visible UI · non-visible perf · telemetry · persisted-state · build-output · behavior-no-UI.
+6. **Decompose mixed claims** — a PR that changes UI *and* shifts a metric is two claims; validate each.
+7. **Name the falsifier** — what observation would prove it FALSE? (the negative). The trustworthiness anchor.
+8. **Set the baseline** — before/after needs a "before": the base ref, a before-window (telemetry), or a test that **fails on `main`** (bug fixes).
+
+## Claim Card (output)
+
+```
+Claim:     Given <precondition>, when <action>, then <observable>.
+Surface:   <screen / API / metric>  (reachable? seed / flag / fallback: …)
+Type:      <visible | perf | telemetry | state | build | behavior>  → lanes <…>
+Falsifier: <observation that would disprove the claim>
+Baseline:  <base ref | fails-on-main test | before-window>
+```
+
+One card per claim. For a refactor, the claim is a **negation** (see below).
+
+## Claim quality bar
+
+A good claim is **falsifiable** (observable outcome + clear falsifier), **surface-specific** (names the exact screen/API/metric, not "the app"), **diff-anchored** (the changed code can plausibly produce it), **bounded** (one behavior, one precondition), and **measurable** where quantitative (a number + threshold, not "faster").
+
+## Anti-patterns → refinements
+
+| Vague claim | Refined |
+|---|---|
+| "Improves performance" | "Opening the Activity tab: TBT drops below 200ms (was >600ms)" — name the interaction, metric, threshold |
+| "Fixes the bug" | "With privacy mode on, the Perps tab balance is masked" — observable behavior + precondition + surface |
+| "Refactor, no behavior change" | Negation claim: "behavior of `<surface>` is unchanged" → prove via a regression test staying green / snapshot / identical output, **not** a screenshot |
+| "Adds a null check" (restates the diff) | "No crash when `<field>` is null on `<surface>`" — the behavior, not the code |
+| Body promises X, diff does Y | Not a claim — **flag the drift** to the author |
+
+## Special cases
+
+- **Refactor / no-op:** the claim is "nothing observable changed." Falsifier = any behavior/output diff. Prove via a regression test staying green, an empty snapshot diff, identical bundle/output, or a benchmark within noise. A passing screenshot proves nothing here.
+- **Bug fix:** the strongest claim form ships its own falsifier — a test that fails on `main` and passes on the branch. Extract the claim straight from the issue's "Expected vs actual."
+- **Perf:** always quantify — metric + interaction + threshold + baseline. Without a number it isn't falsifiable.
+- **Persisted-state / migration:** claim = "upgrading from `<prior version>` preserves `<state>` and applies `<change>`." Falsifier = corrupted/lost state. Baseline = a profile from the prior version.
+- **Flag-gated:** two claims, one per flag state.
+
+## Worked examples
+
+- **Visible:** body "privacy mode doesn't hide the Perps balance"; issue: expected masked, actual visible; diff touches the Perps balance component. → **Claim:** *Given privacy mode on, when I open the Perps tab, the balance is masked.* **Surface:** Perps tab (gated → fallback: Shield entry modal). **Type:** visible. **Falsifier:** balance digits visible under privacy mode. **Baseline:** same flow on base reproduces the bug.
+- **Perf:** body "defer Rive wasm at startup"; diff: dynamic `import()` of the Rive runtime. → **Claim:** *On cold start of the home view, the Rive wasm chunk is not requested until the animation surface mounts.* **Surface:** startup network + chunk graph. **Type:** perf. **Falsifier:** the chunk appears in the cold-start waterfall. **Baseline:** base requests it at startup.
+- **Migration:** diff adds a state migration. → **Claim:** *Loading a profile from `<prior>` applies the migration; `changedKeys` covers only the touched controllers; all other state intact.* **Type:** state. **Falsifier:** an untouched controller mutated, or migrated state malformed. **Baseline:** a prior-version profile.
diff --git a/domains/pr-workflow/skills/pr-validate/references/evidence-catalog.md b/domains/pr-workflow/skills/pr-validate/references/evidence-catalog.md
@@ -0,0 +1,68 @@
+# Evidence catalog — extension
+
+The menu of evidence kinds for validating a `metamask-extension` PR, with **what each proves** and **which skill or tool captures it**. This is a matching guide, not a capture cookbook — for capture mechanics it defers to the sibling skills (`visual-testing`, `performance-testing`, `e2e-test`, `e2e-flakiness-patterns`, `component-view-test`, `test-i18n-usage`, `unit-testing`, `ab-testing`) and to AEP.
+
+Pick the evidence that would **falsify the claim if it were false**. Prefer a lane that yields an artifact a reviewer can independently re-check (a link, image, number, or replayable trace) over prose. Match, then capture — don't run the whole menu. Verify any `yarn` script name against the repo's `package.json`; they drift.
+
+## Matching guide (claim → lanes)
+
+| The PR claims… | Lead with | Corroborate |
+|---|---|---|
+| a visible UI behavior | visual (`visual-testing` / AEP `visual_validation`) | recording for motion; a11y |
+| a fixed bug (any) | **falsifying test** — fails on `main`, passes on the branch | visual if visible; Sentry if it errored |
+| preload / no-double-fetch / lazy-load | AEP `perf_validation` | DevTools CDP netlog, chunk membership |
+| a render / over-render fix | WDYR + React DevTools | startup traces |
+| interaction responsiveness | INP, long-task TBT | DevTools profile |
+| startup / load timing | benchmark A/B (paired) | startup phase traces, FCP/LCP |
+| smaller/cleaner bundle | bundle-size diff | chunk membership |
+| a memory leak fixed | heap-over-a-flow | benchmark over time |
+| an error/crash fixed | Sentry rate→0 | falsifying test, visual |
+| a dependency change is safe | LavaMoat policy + manifest diff | bundle-size |
+| persisted-state change | **state migration** (`changedKeys`, old→new) | vault round-trip |
+| tx / confirmation behavior | transaction simulation | e2e trace |
+| dapp / provider behavior | provider connectivity (EIP-1193/6963) | e2e trace |
+| flag-gated behavior | feature-flag matrix (on/off) | visual per state |
+| snap behavior | snaps execution | distributed trace |
+| copy / localization | i18n usage | visual |
+
+## Lanes by family
+
+**Behavior & flow**
+- **Visual before/after** — UI on a real headed build with controlled state/network. Capture via `visual-testing` (`yarn build:test:webpack` → `dist/chrome`; `mm launch` / `describe-screen` / `screenshot`; `mm mock-network` for degraded paths). Or AEP `visual_validation` (autonomous: deterministic seed + agent navigation).
+- **E2E trace + video** — a replayable full-flow proof. Playwright: `yarn playwright test <spec>` (trace `on` by default; view `yarn test:e2e:pw:report`). Selenium: `yarn test:e2e:single <path> --browser chrome|firefox [--retries n]` (screenshots auto on failure). See `e2e-test`.
+- **Falsifying regression test** — the strongest single proof a fix targets the bug: a new test that **fails on `main`, passes on the branch**. Show both runs. Reach for it on every bug fix.
+- **Component / Storybook** — a component across states in isolation: `yarn storybook`, `yarn test-storybook` (visual + a11y via the Storybook a11y addon). See `component-view-test`.
+- **Flaky-stability rerun** — run N× to prove non-flakiness (Playwright `--retries`, Selenium `--retries n`). See `e2e-flakiness-patterns`.
+
+**Performance & render**
+- **Startup / custom traces** — which phase moved: `TraceName` spans in `shared/lib/trace.ts`, read via `window.stateHooks.getCustomTraces()` (test/debug). LCP fallback mark `mm-hero-painted`.
+- **Web vitals** — `window.stateHooks.getWebVitalsMetrics()` → INP/FCP/LCP/CLS (`ui/helpers/utils/web-vitals.ts`, attribution build). Note: INP fires on all pages; **FCP/LCP/CLS don't fire on popup pages** (sidepanel/E2E only).
+- **Long-task / TBT** — main-thread blocking: `window.stateHooks.getLongTaskMetricsWithTBT()` (`ui/helpers/utils/performance-observers.ts`). TBT lives here, not in the web-vitals lane.
+- **React render & selector** — WDYR (`ENABLE_WHY_DID_YOU_RENDER`, wired in `app/scripts/development/wdyr.ts`) for unnecessary re-renders; `yarn devtools:react` for the flame graph. Selectors use `reselect`; no built-in call counter — prove via WDYR.
+- **Benchmark A/B** — `yarn test:e2e:benchmark` (presets in `shared/constants/benchmarks.ts`). The rolling baseline sits behind a `continue-on-error` step and can silently freeze — verify it's current and prefer a **paired A/B** (build both refs, compare directly). See `performance-testing` / `ab-testing`.
+- **DevTools / CDP** *(manual)* — flame chart, network waterfall, JS coverage, heap snapshots over a flow (memory leaks), CPU throttling (`Emulation.setCPUThrottlingRate`), animation FPS. No repo helper — DevTools or `mm cdp`.
+
+**Build output**
+- **Bundle-size diff** — measured grow/shrink (bundle-size CI or local build comparison).
+- **Chunk membership** — a module moved to the intended lazy chunk (webpack build; source-map membership).
+- **LavaMoat policy diff** — no new global/builtin/package capability: regenerate `lavamoat/webpack/build/policy.json` (`yarn webpack:lavamoat:policy:build`) and `git diff`; new entries need justification.
+- **Manifest permissions diff** — no scope creep: `git diff app/manifest/v{2,3}/_base.json`.
+- **Build-variant matrix** — works across types: `yarn build:test:flask` / `:beta` / `:mv2`.
+
+**Production telemetry**
+- **Sentry query links** — before/after error-rate / transaction / latency, scoped to the release. For PRs that *add/change span instrumentation* (volume), use the analytics span-quota skill instead.
+- **Distributed traces** — a span/transaction now appears / is shaped correctly.
+
+**Extension integrity**
+- **State migration** — a persisted-state change doesn't corrupt existing users: `app/scripts/migrations/NNN.test.js` asserts `meta.version` and that `changedKeys` covers only mutated controllers. Scaffold with `./development/generate-migration.sh NNN`.
+- **Vault / keyring round-trip** — lossless encrypt→decrypt (`app/scripts/lib/encryptor-factory.ts`; `test/e2e/tests/vault-corruption/`).
+- **Transaction simulation / gas** — balance-changes/gas correct before submit (`app/scripts/lib/transaction/containers/enforced-simulations.ts`; `test/e2e/tests/simulation-details/`).
+- **Provider / dapp connectivity** — injection + connect + requests: `yarn dapp` (test-dapp on :8080); EIP-6963 `test/e2e/provider/eip-6963.spec.js`.
+- **Feature-flag matrix** — correct in both remote-flag states: flags come from the remote client-config API (not local config); mock the response in e2e to force each state (`test/e2e/tests/remote-feature-flag/`).
+- **Snaps / multichain** — snap behavior (e.g. `snap_startTrace`): `test/e2e/snaps/`, flask build.
+- **i18n usage** — no hardcoded strings; locales resolve: `yarn verify-locales` (`app/_locales/`).
+
+**CI, review & process**
+- **CI check links** (`gh pr checks`), **coverage delta** (`yarn test:unit:coverage`; `codecov.yml`), automated-reviewer output, and **manual reproduction steps** (populates the PR template's Manual testing steps).
+
+Run the cheapest lane that yields an independently re-checkable artifact, confirm the claim holds, then escalate. Don't over-instrument a one-line copy fix; don't under-prove a startup-latency or migration claim with a single screenshot.
diff --git a/domains/pr-workflow/skills/pr-validate/references/evidence-trustworthiness.md b/domains/pr-workflow/skills/pr-validate/references/evidence-trustworthiness.md
@@ -0,0 +1,25 @@
+# Evidence trustworthiness (anti-reward-hacking)
+
+A green result is not proof. An agent — or an eager run — can produce evidence that *looks* like it validates the claim but doesn't. Before believing or publishing any lane, run it through this gate. The Claim Card's **Falsifier** is the anchor: trustworthy evidence is evidence that *could* have shown the falsifier and didn't.
+
+## The gate (per lane, before publish)
+
+1. **Non-empty & expected media** — the run produced artifacts of the expected kind. A "pass" with zero artifacts (a skipped or errored step) is not a pass.
+2. **Shows the claimed surface** — the screenshot/recording is the Claim Card surface in the asserted state — not a loading spinner, an error toast, the wrong screen, or a pre-action frame. Eyeball it.
+3. **Exercises the changed code** — the test/flow actually hits the diff. For a test: it **fails on `main`**. For a flow: the changed component/route is on the path. A green test that never imports the changed module proves nothing.
+4. **Signal exceeds noise** — a perf delta is beyond run-to-run variance (paired A/B, multiple iterations). A 3% move on a noisy metric is not evidence.
+5. **Could have failed** — the assertion has a reachable failure mode. Always-true assertions (`expect(true)`, a screenshot with no assertion, a query with no time bound) can't falsify anything.
+6. **Right baseline** — "before" is the actual base ref / prior version / pre-window, not a stale or mismatched comparison.
+
+## Lane-specific traps
+
+- **Visual:** spinner/skeleton mistaken for the loaded state; the toggle (privacy/redaction) not actually flipped; a cached screenshot from a prior run; a fallback surface shown without saying so.
+- **Perf / benchmark:** a frozen/stale baseline (verify it's current; prefer paired A/B); single sample; warm-vs-cold mismatch; measuring a different interaction than the claim.
+- **Test:** a snapshot regenerated to match the bug (`--updateSnapshot` masking a regression); the test mocks out the changed path; it passes on `main` too (so it isn't a regression test).
+- **Telemetry:** query window excludes the release; the error regrouped under a different fingerprint; sample-rate makes "0 events" meaningless.
+- **Migration:** only the happy path asserted; `changedKeys` not checked against actual mutations; no real prior-version fixture.
+- **Coverage:** a line covered ≠ a behavior asserted (executed but never checked).
+
+## When evidence fails the gate
+
+Don't publish it. Either re-capture correctly, **downgrade the verdict to inconclusive** and name what's missing, or — if the evidence shows the claim is false — report the refutation (the validation succeeded; the change didn't). Never round a weak pass up to "proven."
diff --git a/domains/pr-workflow/skills/pr-validate/references/worked-examples.md b/domains/pr-workflow/skills/pr-validate/references/worked-examples.md
@@ -0,0 +1,30 @@
+# Worked examples (end-to-end)
+
+Full runs: claim → lanes → capture → trust-gate → publish. The visual case is in the skill's main instructions; these cover the non-visual claim shapes.
+
+## Perf — "defer Rive wasm at startup"
+- **Claim:** on cold start of the home view, the Rive wasm chunk isn't requested until the animation surface mounts. **Surface:** startup network + chunk graph. **Falsifier:** the chunk appears in the cold-start waterfall. **Baseline:** base requests it at startup.
+- **Lanes:** AEP perf validation (primary) → chunk membership + CDP netlog (corroborate).
+- **Capture:** paired build of base vs head (`yarn webpack --test`); CDP netlog over cold start for each; source-map chunk membership of the Rive runtime.
+- **Trust gate:** cold-vs-cold (not warm); the chunk truly absent (not deferred by a few ms); the netlog covers the whole startup window.
+- **Publish:** before/after request list + a chunk-membership table in the PR body. No screenshot needed.
+
+## Migration — "add migration NNN"
+- **Claim:** loading a profile from `<prior>` applies NNN; `changedKeys = {X, Y}`; all other state intact. **Falsifier:** an untouched controller mutated / malformed state. **Baseline:** a prior-version profile.
+- **Lanes:** migration test (primary) → vault round-trip (if the vault is touched).
+- **Capture:** run `NNN.test.js` (old-state-in → new-state-out); assert `changedKeys`; load a real prior-version profile and confirm boot.
+- **Trust gate:** the test asserts more than the happy path; `changedKeys` matches the actual mutations; the fixture is a real prior profile, not synthetic.
+- **Publish:** the `changedKeys` assertion + before/after state shape; link the test run.
+
+## Flag-gated — "Perps banner behind a remote feature flag"
+- **Claim (×2):** flag on → banner shows; flag off → banner absent. **Surface:** home/Perps. **Falsifier:** banner state ≠ flag state. **Baseline:** each flag state is its own baseline.
+- **Lanes:** flag matrix → visual per state.
+- **Capture:** mock the client-config response for each flag state; screenshot each.
+- **Trust gate:** the flag is actually toggled (read `remoteFeatureFlags`); two distinct states are shown, not the same frame twice.
+- **Publish:** a two-up before/after (flag off / flag on) in the PR body.
+
+## Refactor / no-op — "extract a hook, no behavior change"
+- **Claim (negation):** behavior of `<surface>` is unchanged. **Falsifier:** any output/behavior diff. **Baseline:** base behavior.
+- **Lanes:** regression suite stays green + snapshot diff empty + bundle size within noise.
+- **Trust gate:** snapshots were *not* regenerated to hide a diff; the tests actually cover the surface; bundle delta is within noise, not "small but real".
+- **Publish:** "no behavior change — regression suite green, snapshots unchanged, bundle ±0"; link CI. A passing screenshot is not evidence here.