fix(major-upgrade): bump SND template + settle before binary swap #390
…teNodeImage

Per-child UpdateNodeImage patches each child SeiNode spec.image while the parent SND template stays at the old image. The SND controller re-asserts the child image from its template every reconcile (ensureSeiNode), so the child image flip-flops, the StatefulSet churns, updateRevision never settles, observe-image never completes, mark-ready never fires, and the seid sidecar gate keeps the node from signing -> quorum collapse -> await fails.

Replace early-upgrade-node-0 + wait-for-target-height + per-child upgrade-nodes-1-2-3 with one bump-snd-image step that `kubectl patch seinodedeployment spec.template.spec.image`. The SND template is the single source of child image; the controller rolls all validators onto the new binary. await-post-upgrade-progress fans out over all four nodes' POST_UPGRADE_HEIGHT predicate.

- runner Role: add patch/update on seinodedeployments for bump-snd-image.
- inplace_rollout envtest: assert child generations hold steady post-rollout (the no-flip-flop guarantee the scenario now depends on).

Does not preserve staggered per-node rollout (deliberate -- see README limitation #3).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
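The flip-flop described in this commit can be illustrated with a toy model. This is a minimal sketch, not the controller's code: `Child`, `reconcile`, and both loop functions are hypothetical names, and it assumes the scenario step re-applies its per-child patch on each round while the controller copies the template image onto the child on each reconcile.

```python
from dataclasses import dataclass


@dataclass
class Child:
    """Toy stand-in for a child SeiNode; only spec.image is modeled."""
    image: str
    writes: int = 0  # number of times the image value actually changed

    def set_image(self, image: str) -> None:
        if image != self.image:
            self.image = image
            self.writes += 1


def reconcile(template_image: str, child: Child) -> None:
    """Hypothetical stand-in for ensureSeiNode: re-assert the template image."""
    child.set_image(template_image)


def per_child_patch_loop(rounds: int) -> Child:
    """Old flow: the template stays 'old' while the child is patched to 'new'."""
    template_image = "old"
    child = Child(image="old")
    for _ in range(rounds):
        child.set_image("new")            # per-child UpdateNodeImage (assumed retried)
        reconcile(template_image, child)  # controller reverts to the template
    return child


def template_patch_loop(rounds: int) -> Child:
    """New flow: the template is bumped once; only the controller writes the child."""
    template_image = "new"
    child = Child(image="old")
    for _ in range(rounds):
        reconcile(template_image, child)
    return child
```

In this model the old flow changes the child image twice per round and always ends on the old image, while the template bump changes it exactly once and then holds steady.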
…inary swap

The major-upgrade entry is Serial; bump-snd-image followed wait-for-proposal-to-pass with nothing between them, so the SND template image was swapped to the post-upgrade build while the chain was still ~100 blocks below UPGRADE_HEIGHT. The new binary then processed a pre-upgrade block and panicked "BINARY UPDATED BEFORE TRIGGER" (sei-cosmos x/upgrade abci.go) -- node-0's observed failure.

Insert a settle-into-halt step (fixed 150s sleep) between proposal-pass and the bump. UPGRADE_HEIGHT is current + 200 blocks at compute-target-height, but the proposal flow (~60s voting + tally) burns most of that budget first, leaving ~100 blocks/~60s once the proposal passes; 150s is ~2.5x that remainder.

The upgrade height can't be polled: all validators halt together at UPGRADE_HEIGHT and stop serving RPC exactly when the predicate would be satisfiable, so the gate is time-based. Over-waiting is free (the chain sits halted until the swap); only too-short fails, so the wait errs generous.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
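The "~2.5x" figure follows from simple arithmetic on the numbers quoted in this commit. The sketch below is illustrative only: the function names are hypothetical, and the 0.6 s/block rate is an assumption derived from the stated "~100 blocks/~60s", not a measured value.

```python
def seconds_until_halt(blocks_remaining: int, seconds_per_block: float) -> float:
    """Estimated time for the chain to reach UPGRADE_HEIGHT and halt."""
    return blocks_remaining * seconds_per_block


def settle_margin(wait_seconds: float, blocks_remaining: int,
                  seconds_per_block: float) -> float:
    """Ratio of the fixed wait to the estimated time until halt.

    A value above 1.0 means the wait outlasts the remaining blocks, so the
    chain should already be halted on the old binary when the image is bumped.
    """
    return wait_seconds / seconds_until_halt(blocks_remaining, seconds_per_block)


# Figures from the commit message: ~100 blocks left, ~60s to produce them.
ASSUMED_SECONDS_PER_BLOCK = 60 / 100  # assumption: derived, not measured
margin = settle_margin(150, 100, ASSUMED_SECONDS_PER_BLOCK)
```

The margin depends directly on the block rate, so a slower chain or a faster proposal flow shrinks it; the ratio is only as good as the two estimates fed into it.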
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 17000d4.
- envtest: assert child spec.image stays pinned to the template image after rollout (the real no-flip-flop invariant) instead of metadata.generation, which takes one benign bump when the revision podLabel resyncs post-rollout.
- runner/rbac.yaml: drop unused `update` on seinodedeployments; bump-snd-image only `kubectl patch`es, so `patch` is sufficient.
- major-upgrade.yaml: make the settle-wait comment's block-rate assumption explicit (~1s/block breaks the 150s wait) instead of the misleading "2.5x".
- README: drop stale PANIC_BOUNDARY from compute-target-height outputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
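Since bump-snd-image only issues a patch, the request it makes can be sketched as a single merge patch on `spec.template.spec.image`. This is a hedged illustration, not the scenario's actual step definition: `build_patch_argv` is a hypothetical helper, and the resource name, namespace, and image reference in the usage line are placeholders. The `kubectl patch` flags used (`-n`, `--type merge`, `-p`) are standard.

```python
import json


def build_patch_argv(name: str, namespace: str, image: str) -> list[str]:
    """Build (but do not run) a kubectl merge patch of the SND template image."""
    # Merge-patch body touching only spec.template.spec.image.
    patch = {"spec": {"template": {"spec": {"image": image}}}}
    return [
        "kubectl", "patch", "seinodedeployment", name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


# Placeholder values for illustration only.
argv = build_patch_argv("example-snd", "example-ns", "example.invalid/seid:v6.5.0")
```

Because the command only needs the `patch` verb on seinodedeployments, a Role granting `patch` alone is enough to run it.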
Problem
The nightly `major-upgrade` scenario fails before completing the upgrade, for two independent reasons:

- The scenario patched each child `SeiNode.spec.image` directly (per-child `UpdateNodeImage`). Those SeiNodes are owned by the SeiNodeDeployment, whose controller re-asserts the child image from the template on every reconcile (`ensureSeiNode`). The per-child patch and the SND controller fight → child `spec.image` flip-flops → StatefulSet churn → `observe-image` never settles → deadlock.
- The major-upgrade entry is `Serial`; the image swap followed `wait-for-proposal-to-pass` with nothing between, so the post-upgrade binary landed while the chain was still ~100 blocks below `UPGRADE_HEIGHT`. The new binary then processed a pre-upgrade block and panicked `BINARY UPDATED BEFORE TRIGGER` (`sei-cosmos/x/upgrade/abci.go`).

Fix
- `bump-snd-image` replaces the per-child fan-out with a single patch of the SND template image. The controller is the sole source of truth for child image, so there's no fight and the rollout settles. (`runner/rbac.yaml` gains `seinodedeployments: patch,update`; an envtest asserts child generation stays stable.)
- `settle-into-halt` (fixed 150s wait) sits between proposal-pass and the bump. The upgrade height can't be polled: all validators halt together at `UPGRADE_HEIGHT` and stop serving RPC exactly when the predicate would be true, so the gate is time-based. Over-waiting is free (the chain sits halted until the swap); only too-short fails, so it errs generous (~2.5× the ~60s remaining after the proposal passes).

Validation
Ran end-to-end on harbor against this branch (`SCENARIO_REF`, unchanged `SEITASK_IMAGE`): genesis → proposal → votes → `settle-into-halt` (chain halted on the old binary, observed) → `bump-snd-image` → all 4 validators rolled onto v6.5.0, applied the upgrade at the height, and advanced past `POST_UPGRADE_HEIGHT` (`await-post-upgrade-progress` ✓ for all four). No `BINARY UPDATED BEFORE TRIGGER`, no observe-image deadlock.

Out of scope (deliberate)
`cosmos_only`. The `memiavl_only` override the release/load scenarios use is only valid on the newer `$SEID_IMAGE` they run; it panics on v6.4.4.

Platform follow-up (separate PR)
`clusters/harbor/nightly/major-upgrade/`: bump `SCENARIO_REF` to this merge commit, and mirror the `seinodedeployments: patch,update` grant into `rbac.yaml`.

🤖 Generated with Claude Code