feat(ci): bs-lml-gate coordinator workflow#171
Merged
Conversation
Coordinator workflow for BS + LML prod promotion (wiki#80 phase 5). On repository_dispatch from either repo's staging deploy, re-resolves both main SHAs, polls both staging /health, verifies each /version matches main, runs E2E + canary smoke, and on pass fast-forwards each repo's prod branch. Helpers live at scripts/bs-lml-gate/ (not the top-level scripts/) so the gate's blast radius is one directory listing. Each has bats tests in scripts/__tests__/bs-lml-gate/ (31 tests). Trigger payload is informational only — the gate always re-queries main so a stale dispatch can't regress prod. Concurrency cancel-in-progress=false so a cancelled mid-push never leaves the two prod branches split. Prod pushes use repo-scoped fine-grained PATs (not GITHUB_TOKEN) so the audit log shows the promoter identity. Bypass path via workflow_dispatch with required justification; audit comment posted to a pinned tracker issue *before* any prod push, with failure-to-audit blocking the bypass.
This was referenced Jun 5, 2026
Twelve findings from the iter-2 code review. Canary contract corrections (would have made the gate red on first dispatch): pass --base-url, --auth-url, --lml-url to `wxyc-canary check` (only --suite was set before); send Origin via CANARY_ORIGIN_URL (the var the CLI reads — ORIGIN_URL was being ignored); bump CANARY_PIN to the post-prepare-script wxyc-canary commit so `npm install -g github:...` actually builds dist/cli.js, then smoke-test the install with `wxyc-canary --help` before the real invocation. Audit-then-push invariant hardening: make justification `required: true` so the form-submit boundary enforces what log-bypass.sh already requires at runtime; add noop guard to the audit step so a bypass-against-already-promoted-prods doesn't emit a misleading "SHA promoted" comment; make `success() &&` explicit on the BS push step with a comment explaining the load-bearing role; track outcome of each push step (`id: push_bs`, `id: push_lml`) — LML push runs only if BS wasn't skipped, and the job summary reads the actual outcomes instead of inferring from SHA diffs (split-promotion gets a `:rotating_light:` callout pointing at the runbook). Helper-script defensiveness in log-bypass.sh: iterate the validation loop over actual env-var names so the error message tells the operator exactly which variable to set (`BYPASS_ISSUE_NUMBER`, not `BYPASS_ISSUE`); wrap user-supplied justification in a fenced code block to neutralize @-mentions, links, and headings (fence auto-flips between ``` and ~~~ if the input contains the chosen delimiter, embedded fence sequences get broken up as belt-and-suspenders). Helper-script defensiveness in resolve-shas.sh: distinguish 404 (branch missing → empty SHA → seed path) from every other gh failure (transient 5xx, expired PAT, rate-limit → propagate). The previous collapse-all-errors-to-empty behavior would let a 503 on a `prod` lookup mis-trigger the seed/POST path on an existing branch. Surfaces gh's stderr on the propagated path. Also capture `$?` via `|| rc=$?` instead of `if cmd; then ... fi` (which resets $? to 0 by the time the else branch reads it — silent bug in the previous iteration). Documentation: add header comment on the /version verify steps pointing at the phase 3/4 prereqs (LML#496 / BS#1335) that add the /version routes. verify-version steps will 404 until those land, which is the expected state since the staging envs don't exist yet either. Tests: add resolve-shas tests for 5xx and 401 paths (must fail-fast, not seed); add log-bypass tests for the renamed error message and the markdown-escape behavior; update the resolve-shas fake `gh` to emit 404-style stderr by default and a 5xx-style stderr for explicit failure fixtures.
…ction defense)
Twelve findings from iter-3, two of them critical.
CRITICAL: log-bypass.sh used `gh api -f body=@${body_file}` after the iter-2 cleanup. Per `gh api --help`, only `-F/--field` supports `@<path>` for file reads; `-f/--raw-field` adds a literal string parameter. With `-f`, every audit comment posted to the tracker issue is the seven-character literal string `@/tmp/tmp.XXXXXX` instead of the rendered actor/SHAs/justification block. The fake `gh` in the bats suite read `body=@<path>` regardless of flag, so all 9 log-bypass tests stayed green while the production behavior was broken. Reverts to `-F`; fixes the fake `gh` to mirror real flag semantics; adds a regression test that asserts the script uses `-F` and the body that landed is the rendered markdown (not the literal path).
CRITICAL: LML push iter-2 gating was `if: always() && steps.push_bs.outcome != 'skipped' && ...`, which ran LML when BS push FAILED (failure != skipped) — producing exactly the split-prod state the concurrency comment is supposed to prevent. The transient-5xx-on-BS-push scenario is the load-bearing failure mode this gate exists to defend against. Reverted to `success() && steps.noop.outputs.noop != 'true'` (same shape as BS push), so any failure between resolve and BS-push skips LML cleanly.
`required: true` reverted to `required: false` on `inputs.justification`: forcing a justification on every workflow_dispatch (including normal non-bypass re-runs) was UX friction that conflicted with the field description ("Required when skip_gate=true; ignored otherwise"). log-bypass.sh's empty-string check is the load-bearing enforcement; the form description is the contract.
`resolve-shas.sh` 404 grep tightened: was `grep -qiE 'HTTP 404|Not Found'`, now `grep -q 'HTTP 404'` (case-sensitive, anchored). Old regex matched gh stderr like `host not found` (DNS) and `object not found via redirect (HTTP 503)`, mis-classifying real failures as branch-missing → seed and driving push-prod into the POST path on an existing branch.
`push-prod.sh` now writes `did_push=true|false` to `$GITHUB_OUTPUT` so the workflow summary can distinguish "API call landed" from "short-circuited at target==current". Summary's `render_outcome` now reports `unchanged (already at main)` for no-op pushes instead of misleadingly labeling them `advanced`. Split-promotion warning now triggers on the actual ref-movement state (did_push) rather than inferring from step outcome, and a symmetric warning catches the (now-impossible-via-gate) inverted case as a workflow invariant violation. Summary also adds an explicit `cancelled mid-push` label.
Heredoc EOF-injection defense in log-bypass.sh: the audit body now builds via `printf '%s\n' ...` instead of `cat <<EOF`, so a justification line of exactly `EOF` can't terminate the heredoc early and corrupt the comment. Adds a regression test with a payload containing a standalone `EOF` line.
Whitespace-only justification rejection: `[[ -z "${BYPASS_JUSTIFICATION//[[:space:]]/}" ]]` catches payloads like `' \n\t '` that previously passed the `-z` length check while contributing no audit content.
Shell-injection defense via env vars: canary-check + dispatch-summary steps moved all attacker-controlled values (`vars.*`, `github.event.client_payload.*`) into the step `env:` block rather than direct `${{ }}` template-expansion inside the run script. A compromised admin (or a malicious repository_dispatch payload) can't break out of shell quoting to execute commands on the self-hosted runner. `vars.*` write privilege is lower than `secrets.*`, so this tightens the lower-bar attack surface.
Tests: fence-neutralization test rewritten to assert the actual invariant (outer fence is not escapable by input content) rather than a stricter property that doesn't hold; new `did_push` assertions on push-prod tests; new `-F`-flag regression test; new EOF + whitespace-only justification tests.
Documentation: comment on `CANARY_ORIGIN_URL` flagging that phase 4 (BS#1335) needs to configure BS staging's `BETTER_AUTH_TRUSTED_ORIGINS` to either include the API origin (current workflow value) or expose a `CANARY_STAGING_ORIGIN` repo var pointing at the dj-site staging URL. Revisit when phase 4 wiring lands.
Three confirmed findings from iter-3 review. Two are diagnostic-quality bugs in the job summary; one is a defense against `did_push` going missing. Bypass-summary footer lied on the noop+skip_gate path. When the operator dispatched workflow_dispatch with skip_gate=true while both prods were already at main, the noop early-exit fired, the audit step skipped per its `if: skip_gate == true && noop != 'true'` guard (correctly — there was nothing to audit), the push steps skipped, and the job summary still printed `**Bypass:** ... Justification logged at <tracker>#<issue>.` Following the link took the operator to a tracker issue that had no comment for the run. Now the summary inspects `steps.audit.outcome` and the noop flag explicitly: the "Justification logged" footer fires only when the audit step actually ran and succeeded; the noop+bypass and audit-failure cases each get a distinct, accurate footer. `render_outcome` conflated noop-skip with failure-skip. When `noop=true`, push steps skip with outcome=`skipped`; the previous label "skipped (no-op or prior step failed)" was technically true but unreadable as success. Now the function takes the noop flag and renders "unchanged (both prods already at main)" for the noop case vs "skipped (prior step failed)" otherwise. `emit_output` in push-prod.sh now fails loud on write failure to `$GITHUB_OUTPUT` (full disk, read-only fs, vanished dir). The workflow's split-promotion detector tests `did_push == "true"` exactly, so a silent emit-failure after a successful push would set did_push="" → the inverted-split warning would fire spuriously (claiming LML advanced without BS). Now the script exits non-zero on write failure; the workflow's `success() &&` gating then correctly skips downstream pushes. Tests: new regression test points `$GITHUB_OUTPUT` at a directory (not a file) to force a predictable write failure; asserts the script exits non-zero with a clear stderr.
iter-4's `emit_output` fail-loud closes the silent-loss case but introduces a residual window: if `gh api PATCH refs/heads/prod` succeeds (BS prod actually moves) and the subsequent `printf >>$GITHUB_OUTPUT` fails (disk full, FS hiccup), the script exits 1 → step outcome=failure → `did_push=""`. The summary's split-detector tested `did_push == "true"` exactly, so the warning didn't fire even though prod was split. Adds a third branch to the split-detector: when `BS_OUTCOME==failure` and LML didn't push, render a `:warning: Possible split promotion` callout pointing at the runbook. Most BS push failures are pre-API (gh refused before sending) so this branch will fire false-alarm-but-cheap most of the time; the alternative (missing a real split) is unrecoverable without manual reconciliation. Corrects the comment block that claimed `emit_output` fail-loud guarantees "no false negatives" — that's true for the `did_push=true` invariant (a recorded did_push is always real) but not for split detection (a real push can still produce empty did_push). The new comment names both layers honestly. Also updates the iter-3 comment that implied the inverted-split case is the only residual concern; the BS-failure-with-possibly-moved-ref is the load-bearing case the runbook needs to address.
Cancellation mid-PATCH leaves the same ambiguity as failure mid-PATCH: the gh API call may have landed before the runner caught SIGTERM, so BS prod may have moved. Widens the iter-5 branch from `BS_OUTCOME == failure` to `failure || cancelled` and surfaces the actual outcome in the warning text so the operator knows which terminal state they're looking at. This is the cancellation analog of the iter-5 emit_output fail-loud gap. The first branch (did_push=true on BS, did_push!=true on LML) already covers the case where emit_output completed before cancellation; this catches the case where cancellation arrived between the gh API call and the emit_output write.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #166. Part of WXYC/wiki#80 phase 5.
Summary
Adds
.github/workflows/bs-lml-gate.yml— the coordinator that promotes Backend-Service and library-metadata-lookupprodbranches together after their staging deploys + E2E + canary smoke all pass. Phase 5a (wxyc-canary#43) merged the CLI shim this gate calls, so this PR is the next domino.Helpers live at
scripts/bs-lml-gate/(not the top-levelscripts/dir) so the gate's blast radius is one directory listing. Each helper takes inputs via env vars, writes outputs to\$GITHUB_OUTPUT, and has bats tests inscripts/__tests__/bs-lml-gate/(31 tests):resolve-shas.sh— fetches both repos' main + prod SHAs viagh apipoll-staging-health.sh— polls a URL until 200 or timeout (default 5min)verify-version.sh— compares a service's/versionto an expected SHApush-prod.sh— advances a repo'sprodref via the GitHub refs API (no local clone — keeps the self-hosted runner clean)log-bypass.sh— appends an audit comment to the tracker issuePlus
.github/actionlint.yamldeclaring thee2e-runnerself-hosted label (also used by e2e-runner-bootstrap) so future actionlint runs don't flag it.Design decisions
client_payload.source_shais informational only. The gate always re-resolves bothmainSHAs after acquiring the concurrency lock, so a stale dispatch arriving after a newer merge cannot regressprod.cancel-in-progress: false. A cancellation mid-push would leave one repo'sprodadvanced and the other not — splits the staging-gate invariant.PROD_PUSH_TOKEN_BS,PROD_PUSH_TOKEN_LML) scoped toContents: writeonprodonly. The audit log shows the promoter identity rather than the generic GHA bot.workflow_dispatchwithskip_gate: truerequires a non-emptyjustificationand posts an audit comment to the pinned tracker issue before any prod push. If the audit append fails, the workflow exits non-zero before touching prod.Blocked operationally by (but not blocked for merge)
This workflow is the orchestrator — the dispatchers and staging envs come from peer phases:
Until those land, the workflow exists but no dispatches reach it. Merging the gate now lets phases 3 + 4 author dispatchers against a real workflow file.
Setup checklist (documented in README under "Staging gate (bs-lml-gate)")
PROD_PUSH_TOKEN_BS,PROD_PUSH_TOKEN_LMLas repo secrets.BS_STAGING_BASE_URL,BS_STAGING_AUTH_URL,LML_STAGING_BASE_URL(placeholder until phases 3, 4 stand up envs).vars.GATE_BYPASS_ISSUE_NUMBERset to its number.LML_API_KEY_STAGINGsecret (from phase 3).Test plan
npm run test:bs-lml-gate— 31/31 bats greennpm run lint— cleannpm test— 460/460 greenactionlint .github/workflows/*.yml— cleanshellcheck scripts/bs-lml-gate/*.sh— cleanworkflow_dispatchwithskip_gate: trueagainst currentmainSHAs, expecting a no-op (sinceproddoesn't exist yet, this will exercise the seed POST path ofpush-prod.sh). Defer until secrets are in place.