Add unbounded-nightly cluster (nightly build+deploy of main) #312
Open
plombardi89 wants to merge 13 commits into
Rename the Orca integration-cluster deploy tooling so it no longer bakes in
the 'stable' channel, since it is now shared by both unbounded-stable and
the new unbounded-nightly cluster:
- hack/orca/deploy-stable.sh -> hack/orca/deploy-integration.sh
- hack/orca/smoke-stable.sh -> hack/orca/smoke-integration.sh
- hack/orca/stable/ -> hack/orca/integration/
Neutralize the Garage template labels (part-of: orca-stable -> orca) and
admin-token default, and update all references (release-upgrade.yaml,
create-credentials-secret.sh, and the manifests render test). The Garage
labels are not used as selectors and the admin token is internal to the
template, so existing clusters are unaffected on re-apply.
Introduce an unbounded-nightly integration cluster that mirrors
unbounded-stable but deploys a from-source snapshot of main HEAD every
morning at 06:00 UTC instead of a published release.
The new .github/workflows/nightly.yaml:
- resolves the snapshot commit and derives a nightly-<shortsha> tag
- builds and pushes amd64 images (net-controller, net-node, machina,
orca) for that tag, without the release path's cosign/SBOM/Trivy
gating (the nightly cluster is a throwaway soak target)
- renders manifests against the nightly tags and deploys them to the
unbounded-nightly GitHub Environment (init on first bootstrap,
upgrade-apply thereafter), reusing the #235 machina-config merge
- deploys Orca via the shared deploy-integration.sh
- runs the shared hack/release/smoke suite
The target cluster is configured via the unbounded-nightly Environment
using the existing hack/scripts/setup-deploy-environment.sh (no change
needed there); the workflow header documents the one-time setup.
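The tag derivation in the first step above can be sketched in shell as follows. This is a minimal sketch, not the workflow's actual code: the function name nightly_tag and the 7-character short-SHA length are assumptions; only the nightly-<shortsha> shape comes from the description above.

```shell
# Hypothetical helper: derive a nightly-<shortsha> tag from a full commit SHA.
# The 7-character prefix length is an assumption, not taken from nightly.yaml.
nightly_tag() {
  sha="$1"
  # Reject anything that is not 40 lowercase hex characters (e.g. a branch name).
  case "$sha" in
    ''|*[!0-9a-f]*)
      echo "not a full commit SHA: $sha" >&2
      return 1
      ;;
  esac
  if [ "${#sha}" -ne 40 ]; then
    echo "not a full commit SHA: $sha" >&2
    return 1
  fi
  # %.7s truncates the string argument to its first 7 characters.
  printf 'nightly-%.7s\n' "$sha"
}
```

Locally this could be called as nightly_tag "$(git rev-parse HEAD)". Pinning every later job to the full resolved SHA, rather than to the tag or a branch name, keeps a run consistent even if main moves while it is in flight.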
Automate the full unbounded-nightly operator runbook in a single
idempotent script:
- builds forge and runs 'forge cluster create' (which also labels the
gateway pool unbounded-cloud.io/unbounded-net-gateway=true,
opens WireGuard ports, lays down the bootstrap token, and writes a
kubeconfig); parses forge's JSON stdout for the RG / node RG /
subscription / kubeconfig path
- auto-detects the cluster node/pod CIDRs from AKS (mirroring
aks-quickstart.sh:detect_cluster_cidrs); site CIDRs default to
constants
- creates the Orca origin storage account (default ub<site>01) +
container and reads its key
- configures the unbounded-nightly Environment via
setup-deploy-environment.sh
- creates the unbounded-kube namespace + orca-credentials Secret
- triggers nightly.yaml with force_init=true and watches it
Point the nightly.yaml header's first-time-setup notes at the script
(forge handles gateway labeling, CIDRs are auto-detected).
workflow_dispatch and schedule only run from the default branch, so the nightly workflow could not be exercised before merge. Add a TEMPORARY push trigger on the nightly-deploy branch (marked for removal before merge) so pushing the branch runs its own workflow file end-to-end (build -> init deploy -> Orca -> smoke) against the unbounded-nightly cluster, giving reviewers a real run to inspect.
Also fix the resolve job to snapshot github.sha instead of the default branch, so a push-triggered run builds the pushed commit rather than main. This is a correctness fix and stays after merge.
setup-nightly-cluster.sh gains --no-trigger: provision the cluster, origin, Environment, and Secret without dispatching, then push the branch to fire the run. The default-branch workflow preflight is skipped in that mode.
set_var ran 'gh variable set --body ""' for empty values (e.g. a blank
ORCA_AZURE_ENDPOINT, which means 'use the default *.blob.core.windows.net').
gh treats an empty --body as no value on a TTY and prompts interactively
('Paste your variable'), hanging non-interactive callers like
setup-nightly-cluster.sh. Pipe the value via stdin instead; stdin is never
a TTY here, so empty values are stored without prompting.
GitHub Actions variables cannot be empty: 'gh variable set' with an empty value returns HTTP 422 (missing required key 'value'). A blank ORCA_AZURE_ENDPOINT (meaning 'use the default *.blob.core.windows.net') therefore broke setup. Skip empty values in set_var; an unset variable already resolves to "" in the workflow, which is the intended behavior. Supersedes the earlier stdin approach (which avoided the interactive prompt but still sent an empty value and hit the 422).
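A set_var helper with the skip-empty behavior described above might look like the sketch below. The real function in setup-deploy-environment.sh may take different arguments; the GH_ENVIRONMENT_NAME variable and the two-argument signature here are assumptions made for illustration.

```shell
# Sketch of a set_var helper that skips empty values instead of sending them.
# GH_ENVIRONMENT_NAME is a hypothetical variable naming the target Environment.
set_var() {
  name="$1"
  value="$2"
  if [ -z "$value" ]; then
    # GitHub rejects empty variable values with HTTP 422, and an unset
    # variable already resolves to "" in the workflow, so do nothing.
    echo "skip ${name}: empty value"
    return 0
  fi
  gh variable set "$name" \
    --env "${GH_ENVIRONMENT_NAME:-unbounded-nightly}" \
    --body "$value"
}
```

With this shape, set_var ORCA_AZURE_ENDPOINT "" returns without calling gh at all, so it can neither prompt on a TTY nor hit the 422.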
The 'Configured secrets'/'Configured variables' summaries call 'gh secret list' / 'gh variable list', which page their table output through an interactive pager when stdout is a TTY, dropping the caller (including setup-nightly-cluster.sh) into a pager that needs manual 'q' to exit. Export GH_PAGER=cat so the whole script runs without paging.
The closing 'Next steps' hint was stable/release-specific and partly wrong: it told operators to label a gateway node (forge already labels the gwmain pool on both stable and nightly clusters) and to trigger release-upgrade.yaml with a tag (nightly deploys a snapshot of main via nightly.yaml on a schedule, not tags). Add --channel stable|nightly (default stable, so stable output is unchanged apart from removing the obsolete gateway-labeling step) and print channel-appropriate trigger guidance. setup-nightly-cluster.sh passes --channel nightly.
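The channel-dependent hint can be sketched as a small case statement. The function name print_next_steps and the exact wording of the messages are hypothetical; the gh workflow run invocation is the one quoted in the PR description.

```shell
# Hypothetical sketch: print trigger guidance appropriate to the channel.
# Defaults to stable, matching the --channel default described above.
print_next_steps() {
  channel="${1:-stable}"
  case "$channel" in
    stable)
      echo "Next: trigger release-upgrade.yaml with a release tag."
      ;;
    nightly)
      echo "Next: nightly.yaml runs on its schedule; to bootstrap now, run:"
      echo "  gh workflow run nightly.yaml -f force_init=true"
      ;;
    *)
      echo "unknown --channel '$channel' (expected stable|nightly)" >&2
      return 2
      ;;
  esac
}
```

Rejecting unknown channels with a non-zero status, rather than falling back to stable, means a typo in a calling script surfaces immediately.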
main moved actions/checkout from v6.0.3 to v7.0.0 repo-wide (incl. release.yaml and release-upgrade.yaml). Bump the 7 pins in nightly.yaml to stay in lockstep. v7 is compatible with the ref/fetch-depth usage here; no behavior change.
e64c461 to
c30f82c
Compare
The default forge cluster has a 2-node system pool (the gateway pool is tainted), which cannot fit Orca's default 3 replicas alongside Garage and the net/machina workloads, so deploy-orca timed out waiting for the orca rollout. unbounded-nightly is a soak target, not HA, so run a single Orca replica via deploy-integration.sh --replicas 1.
Orca failed at startup with a Garage 403 on GetBucketVersioning
('Operation is not allowed for this key'). Root cause: the one-shot
re-ran create-credentials-secret.sh on every invocation, which mints
fresh Garage S3 keys, while bootstrap-garage.sh imports each under the
same name 'orca' and grants the bucket via --key orca. With several
keys sharing that name the grant became ambiguous and landed on a stale
key, leaving the key Orca actually uses unauthorized.
Fixes:
- bootstrap-garage.sh grants (key allow / bucket allow) by the unique
access key id instead of the name, so the current Secret's key always
gets authorized (self-heals clusters that already drifted on the next
deploy-orca run). Benefits stable too.
- setup-nightly-cluster.sh leaves an existing orca-credentials Secret
untouched instead of regenerating its keys; delete the Secret to
rotate.
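Both fixes can be sketched in shell. This is an illustration under stated assumptions, not the contents of bootstrap-garage.sh or setup-nightly-cluster.sh: it assumes a garage command is reachable as plain garage, that read/write/owner are the permissions wanted, and it uses create_credentials_secret as a hypothetical stand-in for whatever mints the keys and writes the Secret.

```shell
# Grant bucket access by the unique access key id rather than the shared
# key name, so the grant cannot land on a stale key with the same name.
# The permission flags chosen here are an assumption.
grant_bucket_by_key_id() {
  bucket="$1"
  access_key_id="$2"
  garage bucket allow --read --write --owner "$bucket" --key "$access_key_id"
}

# Leave an existing Secret untouched; only create it when it is absent.
# create_credentials_secret is a hypothetical stand-in, not a real script API.
ensure_credentials_secret() {
  namespace="$1"
  secret="$2"
  if kubectl -n "$namespace" get secret "$secret" >/dev/null 2>&1; then
    echo "secret ${namespace}/${secret} exists; leaving it untouched"
    return 0
  fi
  create_credentials_secret "$namespace" "$secret"
}
```

Because the guard never regenerates keys for an existing Secret, rotation becomes an explicit operator action: delete the Secret, then re-run the provisioner.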
Pre-merge testing from the nightly-deploy branch is complete, so drop the temporary push trigger. The workflow now runs only on its 06:00 UTC schedule (against the default-branch head) and via manual workflow_dispatch. It does NOT run on merges/pushes to main. Keeps the github.sha resolve fix and the --no-trigger provisioning support.
Summary
Introduces an unbounded-nightly integration cluster - the nightly sibling of unbounded-stable. Where stable deploys a published, signed release, nightly builds a from-source snapshot of main HEAD every morning at 06:00 UTC and deploys it, so the tip of the tree gets the same soak treatment. Also adds a one-shot provisioning script and makes the shared Orca deploy tooling deployment-neutral.
1. Deployment-neutral Orca deploy tooling
The Orca integration tooling is now shared by both clusters, so it no longer bakes in the stable channel: deploy-stable.sh -> deploy-integration.sh, smoke-stable.sh -> smoke-integration.sh, hack/orca/stable/ -> hack/orca/integration/, and the Garage template labels/admin-token are neutralized. Existing clusters are unaffected on re-apply.
2. .github/workflows/nightly.yaml
- Resolves the snapshot commit (github.sha, or inputs.ref); derives a nightly-<shortsha> tag (all jobs pin to the resolved SHA).
- Builds and pushes amd64 images (unbounded-net-controller, unbounded-net-node, machina, orca) for that tag. No cosign/SBOM/Trivy gating - nightly is a throwaway soak target.
- Deploys to the cluster (environment: unbounded-nightly): init on first bootstrap (plugin built from the snapshot, embedded manifests stamped with the nightly tags) or upgrade-apply thereafter; reuses the #235 machina-config merge (#235: "release-upgrade silently wipes machina-config apiServerEndpoint, crashing the controller"). Mirrors release-upgrade.yaml (applies net/ + machina/).
- Deploys Orca via deploy-integration.sh, single Orca replica (soak target, fits the forge cluster).
- Runs hack/release/smoke/*.sh.
Triggers: schedule (06:00 UTC daily) + manual workflow_dispatch. It does not run on merges/pushes to main.
3. hack/scripts/setup-nightly-cluster.sh (one-shot provisioner)
Idempotently: builds forge + forge cluster create (gateway pool pre-labeled, WireGuard ports, bootstrap token, kubeconfig); auto-detects cluster CIDRs from AKS; creates the Orca origin storage account (ub<site>01) + container; configures the unbounded-nightly Environment via setup-deploy-environment.sh; creates the unbounded-kube namespace + orca-credentials Secret; triggers nightly.yaml -f force_init=true and watches (or --no-trigger). Includes setup-deploy-environment.sh hardening (skip empty vars, no pager, --channel stable|nightly, drop the obsolete gateway-label step since forge labels the pool) and a Garage key-grant fix (grant by access-key id so re-runs/regenerated keys can't strand Orca's key).
Scheduling note
GitHub Actions cron is UTC-only and ignores DST. 0 6 * * * lands at 01:00 ET (EST) / 02:00 ET (EDT) - early on purpose so the from-source build + deploy + smoke finishes before the US working day.
Tested end-to-end before merge
A temporary push: trigger (now removed) let the branch's own workflow run pre-merge. The full pipeline passed: run 28070792438 (commit 92370af) - resolve -> net-images -> component-images -> deploy (net + machina + site init) -> deploy-orca (1 Orca replica) -> smoke (core-namespaces-ready), all green against the live unbounded-nightly cluster. The temporary trigger has since been removed, so the merged workflow runs only on the schedule + dispatch.
Operator setup (post-merge)
hack/scripts/setup-nightly-cluster.sh --subscription <sub-id> (see --help). Needs az/gh logged in and admin on the repo. The workflow is gated on github.repository == 'Azure/unbounded', so it's inert until the Environment + cluster exist. First bootstrap is handled by the script (or gh workflow run nightly.yaml -f force_init=true); thereafter it runs nightly automatically.
Verification
actionlint clean on nightly.yaml and release-upgrade.yaml; YAML valid; no em-dashes. bash -n clean on all scripts; go test ./internal/orca/manifests/... passes; gofumpt clean.