feat(ci): bs-lml-gate coordinator workflow by jakebromberg · Pull Request #171 · WXYC/wxyc-shared

jakebromberg · 2026-06-05T05:08:14Z

Closes #166. Part of WXYC/wiki#80 phase 5.

Summary

Adds .github/workflows/bs-lml-gate.yml — the coordinator that promotes Backend-Service and library-metadata-lookup prod branches together after their staging deploys + E2E + canary smoke all pass. Phase 5a (wxyc-canary#43) merged the CLI shim this gate calls, so this PR is the next domino.

Helpers live at scripts/bs-lml-gate/ (not the top-level scripts/ dir) so the gate's blast radius is one directory listing. Each helper takes inputs via env vars, writes outputs to \$GITHUB_OUTPUT, and has bats tests in scripts/__tests__/bs-lml-gate/ (31 tests):

resolve-shas.sh — fetches both repos' main + prod SHAs via gh api
poll-staging-health.sh — polls a URL until 200 or timeout (default 5min)
verify-version.sh — compares a service's /version to an expected SHA
push-prod.sh — advances a repo's prod ref via the GitHub refs API (no local clone — keeps the self-hosted runner clean)
log-bypass.sh — appends an audit comment to the tracker issue

Plus .github/actionlint.yaml declaring the e2e-runner self-hosted label (also used by e2e-runner-bootstrap) so future actionlint runs don't flag it.

Design decisions

Re-query main on every run. client_payload.source_sha is informational only. The gate always re-resolves both main SHAs after acquiring the concurrency lock, so a stale dispatch arriving after a newer merge cannot regress prod.
cancel-in-progress: false. A cancellation mid-push would leave one repo's prod advanced and the other not — splits the staging-gate invariant.
Fine-grained PATs, not GITHUB_TOKEN. Per-repo PATs (PROD_PUSH_TOKEN_BS, PROD_PUSH_TOKEN_LML) scoped to Contents: write on prod only. The audit log shows the promoter identity rather than the generic GHA bot.
Bypass cannot be silent. workflow_dispatch with skip_gate: true requires a non-empty justification and posts an audit comment to the pinned tracker issue before any prod push. If the audit append fails, the workflow exits non-zero before touching prod.
Half-success is recoverable, not prevented. If BS push succeeds and LML push fails, the operator manually fast-forwards the laggard (runbook in wiki#81). Preventing this requires a two-phase commit pattern not justified for our volume.

Blocked operationally by (but not blocked for merge)

This workflow is the orchestrator — the dispatchers and staging envs come from peer phases:

wxyc-shared#165 — e2e-runner provisioning
library-metadata-lookup#496 — LML staging env + dispatch
Backend-Service#1335 — BS staging containers + dispatch
wiki#81 — runbook + bypass-path documentation

Until those land, the workflow exists but no dispatches reach it. Merging the gate now lets phases 3 + 4 author dispatchers against a real workflow file.

Setup checklist (documented in README under "Staging gate (bs-lml-gate)")

Fine-grained PATs PROD_PUSH_TOKEN_BS, PROD_PUSH_TOKEN_LML as repo secrets.
Repo variables BS_STAGING_BASE_URL, BS_STAGING_AUTH_URL, LML_STAGING_BASE_URL (placeholder until phases 3, 4 stand up envs).
Pinned bypass-tracker issue; vars.GATE_BYPASS_ISSUE_NUMBER set to its number.
LML_API_KEY_STAGING secret (from phase 3).

Test plan

npm run test:bs-lml-gate — 31/31 bats green
npm run lint — clean
npm test — 460/460 green
actionlint .github/workflows/*.yml — clean
shellcheck scripts/bs-lml-gate/*.sh — clean
Post-merge: manual workflow_dispatch with skip_gate: true against current main SHAs, expecting a no-op (since prod doesn't exist yet, this will exercise the seed POST path of push-prod.sh). Defer until secrets are in place.
First real gate pass: end-to-end test once phases 3 + 4 land (covered by wiki#81 runbook).

Coordinator workflow for BS + LML prod promotion (wiki#80 phase 5). On repository_dispatch from either repo's staging deploy, re-resolves both main SHAs, polls both staging /health, verifies each /version matches main, runs E2E + canary smoke, and on pass fast-forwards each repo's prod branch. Helpers live at scripts/bs-lml-gate/ (not the top-level scripts/) so the gate's blast radius is one directory listing. Each has bats tests in scripts/__tests__/bs-lml-gate/ (31 tests). Trigger payload is informational only — the gate always re-queries main so a stale dispatch can't regress prod. Concurrency cancel-in-progress=false so a cancelled mid-push never leaves the two prod branches split. Prod pushes use repo-scoped fine-grained PATs (not GITHUB_TOKEN) so the audit log shows the promoter identity. Bypass path via workflow_dispatch with required justification; audit comment posted to a pinned tracker issue *before* any prod push, with failure-to-audit blocking the bypass.

Twelve findings from the iter-2 code review. Canary contract corrections (would have made the gate red on first dispatch): pass --base-url, --auth-url, --lml-url to `wxyc-canary check` (only --suite was set before); send Origin via CANARY_ORIGIN_URL (the var the CLI reads — ORIGIN_URL was being ignored); bump CANARY_PIN to the post-prepare-script wxyc-canary commit so `npm install -g github:...` actually builds dist/cli.js, then smoke-test the install with `wxyc-canary --help` before the real invocation. Audit-then-push invariant hardening: make justification `required: true` so the form-submit boundary enforces what log-bypass.sh already requires at runtime; add noop guard to the audit step so a bypass-against-already-promoted-prods doesn't emit a misleading "SHA promoted" comment; make `success() &&` explicit on the BS push step with a comment explaining the load-bearing role; track outcome of each push step (`id: push_bs`, `id: push_lml`) — LML push runs only if BS wasn't skipped, and the job summary reads the actual outcomes instead of inferring from SHA diffs (split-promotion gets a `:rotating_light:` callout pointing at the runbook). Helper-script defensiveness in log-bypass.sh: iterate the validation loop over actual env-var names so the error message tells the operator exactly which variable to set (`BYPASS_ISSUE_NUMBER`, not `BYPASS_ISSUE`); wrap user-supplied justification in a fenced code block to neutralize @-mentions, links, and headings (fence auto-flips between ``` and ~~~ if the input contains the chosen delimiter, embedded fence sequences get broken up as belt-and-suspenders). Helper-script defensiveness in resolve-shas.sh: distinguish 404 (branch missing → empty SHA → seed path) from every other gh failure (transient 5xx, expired PAT, rate-limit → propagate). The previous collapse-all-errors-to-empty behavior would let a 503 on a `prod` lookup mis-trigger the seed/POST path on an existing branch. Surfaces gh's stderr on the propagated path. Also capture `$?` via `|| rc=$?` instead of `if cmd; then ... fi` (which resets $? to 0 by the time the else branch reads it — silent bug in the previous iteration). Documentation: add header comment on the /version verify steps pointing at the phase 3/4 prereqs (LML#496 / BS#1335) that add the /version routes. verify-version steps will 404 until those land, which is the expected state since the staging envs don't exist yet either. Tests: add resolve-shas tests for 5xx and 401 paths (must fail-fast, not seed); add log-bypass tests for the renamed error message and the markdown-escape behavior; update the resolve-shas fake `gh` to emit 404-style stderr by default and a 5xx-style stderr for explicit failure fixtures.

…ction defense) Twelve findings from iter-3, two of them critical. CRITICAL: log-bypass.sh used `gh api -f body=@${body_file}` after the iter-2 cleanup. Per `gh api --help`, only `-F/--field` supports `@<path>` for file reads; `-f/--raw-field` adds a literal string parameter. With `-f`, every audit comment posted to the tracker issue is the seven-character literal string `@/tmp/tmp.XXXXXX` instead of the rendered actor/SHAs/justification block. The fake `gh` in the bats suite read `body=@<path>` regardless of flag, so all 9 log-bypass tests stayed green while the production behavior was broken. Reverts to `-F`; fixes the fake `gh` to mirror real flag semantics; adds a regression test that asserts the script uses `-F` and the body that landed is the rendered markdown (not the literal path). CRITICAL: LML push iter-2 gating was `if: always() && steps.push_bs.outcome != 'skipped' && ...`, which ran LML when BS push FAILED (failure != skipped) — producing exactly the split-prod state the concurrency comment is supposed to prevent. The transient-5xx-on-BS-push scenario is the load-bearing failure mode this gate exists to defend against. Reverted to `success() && steps.noop.outputs.noop != 'true'` (same shape as BS push), so any failure between resolve and BS-push skips LML cleanly. `required: true` reverted to `required: false` on `inputs.justification`: forcing a justification on every workflow_dispatch (including normal non-bypass re-runs) was UX friction that conflicted with the field description ("Required when skip_gate=true; ignored otherwise"). log-bypass.sh's empty-string check is the load-bearing enforcement; the form description is the contract. `resolve-shas.sh` 404 grep tightened: was `grep -qiE 'HTTP 404|Not Found'`, now `grep -q 'HTTP 404'` (case-sensitive, anchored). Old regex matched gh stderr like `host not found` (DNS) and `object not found via redirect (HTTP 503)`, mis-classifying real failures as branch-missing → seed and driving push-prod into the POST path on an existing branch. `push-prod.sh` now writes `did_push=true|false` to `$GITHUB_OUTPUT` so the workflow summary can distinguish "API call landed" from "short-circuited at target==current". Summary's `render_outcome` now reports `unchanged (already at main)` for no-op pushes instead of misleadingly labeling them `advanced`. Split-promotion warning now triggers on the actual ref-movement state (did_push) rather than inferring from step outcome, and a symmetric warning catches the (now-impossible-via-gate) inverted case as a workflow invariant violation. Summary also adds an explicit `cancelled mid-push` label. Heredoc EOF-injection defense in log-bypass.sh: the audit body now builds via `printf '%s\n' ...` instead of `cat <<EOF`, so a justification line of exactly `EOF` can't terminate the heredoc early and corrupt the comment. Adds a regression test with a payload containing a standalone `EOF` line. Whitespace-only justification rejection: `[[ -z "${BYPASS_JUSTIFICATION//[[:space:]]/}" ]]` catches payloads like `' \n\t '` that previously passed the `-z` length check while contributing no audit content. Shell-injection defense via env vars: canary-check + dispatch-summary steps moved all attacker-controlled values (`vars.*`, `github.event.client_payload.*`) into the step `env:` block rather than direct `${{ }}` template-expansion inside the run script. A compromised admin (or a malicious repository_dispatch payload) can't break out of shell quoting to execute commands on the self-hosted runner. `vars.*` write privilege is lower than `secrets.*`, so this tightens the lower-bar attack surface. Tests: fence-neutralization test rewritten to assert the actual invariant (outer fence is not escapable by input content) rather than a stricter property that doesn't hold; new `did_push` assertions on push-prod tests; new `-F`-flag regression test; new EOF + whitespace-only justification tests. Documentation: comment on `CANARY_ORIGIN_URL` flagging that phase 4 (BS#1335) needs to configure BS staging's `BETTER_AUTH_TRUSTED_ORIGINS` to either include the API origin (current workflow value) or expose a `CANARY_STAGING_ORIGIN` repo var pointing at the dj-site staging URL. Revisit when phase 4 wiring lands.

Three confirmed findings from iter-3 review. Two are diagnostic-quality bugs in the job summary; one is a defense against `did_push` going missing. Bypass-summary footer lied on the noop+skip_gate path. When the operator dispatched workflow_dispatch with skip_gate=true while both prods were already at main, the noop early-exit fired, the audit step skipped per its `if: skip_gate == true && noop != 'true'` guard (correctly — there was nothing to audit), the push steps skipped, and the job summary still printed `**Bypass:** ... Justification logged at <tracker>#<issue>.` Following the link took the operator to a tracker issue that had no comment for the run. Now the summary inspects `steps.audit.outcome` and the noop flag explicitly: the "Justification logged" footer fires only when the audit step actually ran and succeeded; the noop+bypass and audit-failure cases each get a distinct, accurate footer. `render_outcome` conflated noop-skip with failure-skip. When `noop=true`, push steps skip with outcome=`skipped`; the previous label "skipped (no-op or prior step failed)" was technically true but unreadable as success. Now the function takes the noop flag and renders "unchanged (both prods already at main)" for the noop case vs "skipped (prior step failed)" otherwise. `emit_output` in push-prod.sh now fails loud on write failure to `$GITHUB_OUTPUT` (full disk, read-only fs, vanished dir). The workflow's split-promotion detector tests `did_push == "true"` exactly, so a silent emit-failure after a successful push would set did_push="" → the inverted-split warning would fire spuriously (claiming LML advanced without BS). Now the script exits non-zero on write failure; the workflow's `success() &&` gating then correctly skips downstream pushes. Tests: new regression test points `$GITHUB_OUTPUT` at a directory (not a file) to force a predictable write failure; asserts the script exits non-zero with a clear stderr.

iter-4's `emit_output` fail-loud closes the silent-loss case but introduces a residual window: if `gh api PATCH refs/heads/prod` succeeds (BS prod actually moves) and the subsequent `printf >>$GITHUB_OUTPUT` fails (disk full, FS hiccup), the script exits 1 → step outcome=failure → `did_push=""`. The summary's split-detector tested `did_push == "true"` exactly, so the warning didn't fire even though prod was split. Adds a third branch to the split-detector: when `BS_OUTCOME==failure` and LML didn't push, render a `:warning: Possible split promotion` callout pointing at the runbook. Most BS push failures are pre-API (gh refused before sending) so this branch will fire false-alarm-but-cheap most of the time; the alternative (missing a real split) is unrecoverable without manual reconciliation. Corrects the comment block that claimed `emit_output` fail-loud guarantees "no false negatives" — that's true for the `did_push=true` invariant (a recorded did_push is always real) but not for split detection (a real push can still produce empty did_push). The new comment names both layers honestly. Also updates the iter-3 comment that implied the inverted-split case is the only residual concern; the BS-failure-with-possibly-moved-ref is the load-bearing case the runbook needs to address.

Cancellation mid-PATCH leaves the same ambiguity as failure mid-PATCH: the gh API call may have landed before the runner caught SIGTERM, so BS prod may have moved. Widens the iter-5 branch from `BS_OUTCOME == failure` to `failure || cancelled` and surfaces the actual outcome in the warning text so the operator knows which terminal state they're looking at. This is the cancellation analog of the iter-5 emit_output fail-loud gap. The first branch (did_push=true on BS, did_push!=true on LML) already covers the case where emit_output completed before cancellation; this catches the case where cancellation arrived between the gh API call and the emit_output write.

This was referenced Jun 5, 2026

Phase 5: bs-lml-gate coordinator workflow #166

Closed

fix: prepare script so npm install from git URL builds dist WXYC/wxyc-canary#46

Closed

fix(pkg): prepare script so npm install from git URL builds dist WXYC/wxyc-canary#47

Merged

jakebromberg added 5 commits June 5, 2026 06:13

jakebromberg merged commit 9dc9e0f into main Jun 5, 2026
1 check passed

jakebromberg deleted the feat/166-bs-lml-gate branch June 5, 2026 14:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci): bs-lml-gate coordinator workflow#171

feat(ci): bs-lml-gate coordinator workflow#171
jakebromberg merged 6 commits into
mainfrom
feat/166-bs-lml-gate

jakebromberg commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jakebromberg commented Jun 5, 2026

Summary

Design decisions

Blocked operationally by (but not blocked for merge)

Setup checklist (documented in README under "Staging gate (bs-lml-gate)")

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant