Add unbounded-nightly cluster (nightly build+deploy of main) by plombardi89 · Pull Request #312 · Azure/unbounded

plombardi89 · 2026-06-22T20:30:43Z

Summary

Introduces an unbounded-nightly integration cluster - the nightly sibling of unbounded-stable. Where stable deploys a published, signed release, nightly builds a from-source snapshot of main HEAD every morning at 06:00 UTC and deploys it, so the tip of the tree gets the same soak treatment. Also adds a one-shot provisioning script and makes the shared Orca deploy tooling deployment-neutral.

1. Deployment-neutral Orca deploy tooling

The Orca integration tooling is now shared by both clusters, so it no longer bakes in the stable channel: deploy-stable.sh->deploy-integration.sh, smoke-stable.sh->smoke-integration.sh, hack/orca/stable/->hack/orca/integration/, and the Garage template labels/admin-token are neutralized. Existing clusters are unaffected on re-apply.

2. `.github/workflows/nightly.yaml`

- resolve: snapshots the default-branch head (github.sha, or inputs.ref) and derives a nightly-<shortsha> tag. All jobs pin to the resolved SHA.
- net-images / component-images: build+push amd64 images (unbounded-net-controller, unbounded-net-node, machina, orca) for that tag. No cosign/SBOM/Trivy gating - nightly is a throwaway soak target.
- deploy (environment: unbounded-nightly): init on first bootstrap (plugin built from the snapshot, embedded manifests stamped with the nightly tags) or upgrade-apply thereafter; reuses the #235 machina-config merge (#235: "release-upgrade silently wipes machina-config apiServerEndpoint, crashing the controller"). Mirrors release-upgrade.yaml (applies net/ + machina/).
- deploy-orca: shared deploy-integration.sh, single Orca replica (soak target, fits the forge cluster).
- smoke-discover / smoke-tests: shared hack/release/smoke/*.sh.
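The resolve job's snapshot-and-tag logic can be sketched in shell roughly as follows. This is a minimal sketch, not the workflow's actual step: the function name `resolve_nightly_tag`, its arguments, and the 7-character short SHA are assumptions (the length matches the `92370af` form quoted in this PR), and the override is assumed to already be a full commit SHA.

```shell
# Hypothetical helper: pick the commit to snapshot and derive its image tag.
#   $1 = default SHA (github.sha in the workflow)
#   $2 = optional override (inputs.ref, assumed to already be a full SHA)
resolve_nightly_tag() {
  local default_sha="$1"
  local override="${2:-}"
  local sha="${override:-$default_sha}"

  # Assumption: the short SHA is the first 7 characters.
  printf 'nightly-%.7s\n' "$sha"
}

# Placeholder SHA, not a real commit.
resolve_nightly_tag 0123456789abcdef0123456789abcdef01234567   # prints nightly-0123456
```

Every later job would then take the resolved SHA and tag as inputs instead of re-reading the branch head, which is what keeps a run consistent if main moves mid-build.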

Triggers: schedule (06:00 UTC daily) + manual workflow_dispatch. It does not run on merges/pushes to main.

3. `hack/scripts/setup-nightly-cluster.sh` (one-shot provisioner)

Idempotently:

- builds forge and runs forge cluster create (gateway pool pre-labeled, WireGuard ports, bootstrap token, kubeconfig);
- auto-detects cluster CIDRs from AKS;
- creates the Orca origin storage account (ub<site>01) + container;
- configures the unbounded-nightly Environment via setup-deploy-environment.sh;
- creates the unbounded-kube namespace + orca-credentials Secret;
- triggers nightly.yaml -f force_init=true and watches the run (pass --no-trigger to skip the dispatch).

Includes setup-deploy-environment.sh hardening (skip empty vars, no pager, --channel stable|nightly, drop the obsolete gateway-label step since forge labels the pool) and a Garage key-grant fix (grant by access-key id so re-runs/regenerated keys can't strand Orca's key).
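One common way to make the namespace + Secret step safe to re-run is a get-before-create guard. A minimal sketch, assuming `kubectl` is on PATH and already pointed at the cluster; the function name `ensure_orca_secret` and the `example-key` Secret key are hypothetical, and the real Secret contents in this PR come from create-credentials-secret.sh.

```shell
# Hypothetical sketch: create the namespace and Secret only when missing,
# so re-running the provisioner never regenerates existing credentials.
#   $1 = placeholder Secret value
ensure_orca_secret() {
  local ns="unbounded-kube"
  local secret="orca-credentials"

  # Create the namespace only if it does not exist yet.
  if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
    kubectl create namespace "$ns"
  fi

  # Leave an existing Secret untouched.
  if kubectl -n "$ns" get secret "$secret" >/dev/null 2>&1; then
    echo "secret $ns/$secret exists, leaving it untouched"
    return 0
  fi

  # Stand-in for the real creation step; the key name is illustrative only.
  kubectl -n "$ns" create secret generic "$secret" \
    --from-literal=example-key="$1"
}
```

With a guard like this, rotating credentials becomes an explicit operator action (delete the Secret, then re-run) rather than a side effect of every invocation.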

Scheduling note

GitHub Actions cron is UTC-only and ignores DST. 0 6 * * * lands at 01:00 ET (EST) / 02:00 ET (EDT) - early on purpose so the from-source build + deploy + smoke finishes before the US working day.
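The local times follow from fixed UTC offsets: EST is UTC-5 and EDT is UTC-4. A small arithmetic sketch (the helper name `utc_hour_to_local` is made up for illustration):

```shell
# Convert an hour in UTC to local time for a fixed whole-hour offset.
#   $1 = hour in UTC (0-23, no leading zero)
#   $2 = offset from UTC in hours (e.g. -5 for EST)
utc_hour_to_local() {
  local utc_hour="$1"
  local offset="$2"
  # Add 24 before the modulo so a negative intermediate value wraps correctly.
  printf '%02d:00\n' $(( (utc_hour + offset + 24) % 24 ))
}

utc_hour_to_local 6 -5   # EST: prints 01:00
utc_hour_to_local 6 -4   # EDT: prints 02:00
```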

Tested end-to-end before merge

A temporary push: trigger let the branch's own workflow run pre-merge. The full pipeline passed: run 28070792438 (commit 92370af) - resolve -> net-images -> component-images -> deploy (net + machina + site init) -> deploy-orca (1 Orca replica) -> smoke (core-namespaces-ready), all green against the live unbounded-nightly cluster. The temporary trigger has since been removed, so the merged workflow runs only on the schedule + dispatch.

Operator setup (post-merge)

hack/scripts/setup-nightly-cluster.sh --subscription <sub-id> (see --help). Needs az/gh logged in and admin on the repo. The workflow is gated on github.repository == 'Azure/unbounded', so it's inert until the Environment + cluster exist. First bootstrap is handled by the script (or gh workflow run nightly.yaml -f force_init=true); thereafter it runs nightly automatically.

Verification

actionlint clean on nightly.yaml and release-upgrade.yaml; YAML valid; no em-dashes.
bash -n clean on all scripts; go test ./internal/orca/manifests/... passes; gofumpt clean.
Full green pre-merge run (28070792438) against the real cluster.

Rename the Orca integration-cluster deploy tooling so it no longer bakes in the 'stable' channel, since it is now shared by both unbounded-stable and the new unbounded-nightly cluster:

  hack/orca/deploy-stable.sh -> hack/orca/deploy-integration.sh
  hack/orca/smoke-stable.sh -> hack/orca/smoke-integration.sh
  hack/orca/stable/ -> hack/orca/integration/

Neutralize the Garage template labels (part-of: orca-stable -> orca) and admin-token default, and update all references (release-upgrade.yaml, create-credentials-secret.sh, and the manifests render test). The Garage labels are not used as selectors and the admin token is internal to the template, so existing clusters are unaffected on re-apply.

Introduce an unbounded-nightly integration cluster that mirrors unbounded-stable but deploys a from-source snapshot of main HEAD every morning at 06:00 UTC instead of a published release.

The new .github/workflows/nightly.yaml:

- resolves the snapshot commit and derives a nightly-<shortsha> tag
- builds and pushes amd64 images (net-controller, net-node, machina, orca) for that tag, without the release path's cosign/SBOM/Trivy gating (the nightly cluster is a throwaway soak target)
- renders manifests against the nightly tags and deploys them to the unbounded-nightly GitHub Environment (init on first bootstrap, upgrade-apply thereafter), reusing the #235 machina-config merge
- deploys Orca via the shared deploy-integration.sh
- runs the shared hack/release/smoke suite

The target cluster is configured via the unbounded-nightly Environment using the existing hack/scripts/setup-deploy-environment.sh (no change needed there); the workflow header documents the one-time setup.

Automate the full unbounded-nightly operator runbook in a single idempotent script:

- builds forge and runs 'forge cluster create' (which also labels the gateway pool with unbounded-cloud.io/unbounded-net-gateway=true, opens WireGuard ports, lays down the bootstrap token, and writes a kubeconfig); parses forge's JSON stdout for the RG / node RG / subscription / kubeconfig path
- auto-detects the cluster node/pod CIDRs from AKS (mirroring aks-quickstart.sh:detect_cluster_cidrs); site CIDRs default to constants
- creates the Orca origin storage account (default ub<site>01) + container and reads its key
- configures the unbounded-nightly Environment via setup-deploy-environment.sh
- creates the unbounded-kube namespace + orca-credentials Secret
- triggers nightly.yaml with force_init=true and watches it

Point the nightly.yaml header's first-time-setup notes at the script (forge handles gateway labeling, CIDRs are auto-detected).

workflow_dispatch and schedule only run from the default branch, so the nightly workflow could not be exercised before merge. Add a TEMPORARY push trigger on the nightly-deploy branch (marked for removal before merge) so pushing the branch runs its own workflow file end-to-end (build -> init deploy -> Orca -> smoke) against the unbounded-nightly cluster, giving reviewers a real run to inspect.

Also fix the resolve job to snapshot github.sha instead of the default branch, so a push-triggered run builds the pushed commit rather than main. This is a correctness fix and stays after merge.

setup-nightly-cluster.sh gains --no-trigger: provision the cluster, origin, Environment, and Secret without dispatching, then push the branch to fire the run. The default-branch workflow preflight is skipped in that mode.

set_var ran 'gh variable set --body ""' for empty values (e.g. a blank ORCA_AZURE_ENDPOINT, which means 'use the default *.blob.core.windows.net'). gh treats an empty --body as no value on a TTY and prompts interactively ('Paste your variable'), hanging non-interactive callers like setup-nightly-cluster.sh. Pipe the value via stdin instead; stdin is never a TTY here, so empty values are stored without prompting.

GitHub Actions variables cannot be empty: 'gh variable set' with an empty value returns HTTP 422 (missing required key 'value'). A blank ORCA_AZURE_ENDPOINT (meaning 'use the default *.blob.core.windows.net') therefore broke setup. Skip empty values in set_var; an unset variable already resolves to "" in the workflow, which is the intended behavior. Supersedes the earlier stdin approach (which avoided the interactive prompt but still sent an empty value and hit the 422).
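The skip-empty behavior can be sketched as below. This is an illustrative reconstruction, not the script's actual `set_var`: the argument order and the `DEPLOY_ENV` variable are assumptions, while `gh variable set NAME --env ENV --body VALUE` is the standard gh CLI form.

```shell
# Environment to write variables into (assumed name for this sketch).
DEPLOY_ENV="${DEPLOY_ENV:-unbounded-nightly}"

# Set a GitHub Actions environment variable, skipping empty values.
#   $1 = variable name, $2 = value (may be empty)
set_var() {
  local name="$1"
  local value="${2:-}"

  if [ -z "$value" ]; then
    # An empty value is rejected with HTTP 422, and an unset variable already
    # resolves to "" in the workflow, so skipping gives the same result.
    echo "skip $name (empty)"
    return 0
  fi

  gh variable set "$name" --env "$DEPLOY_ENV" --body "$value"
}
```

Called as `set_var ORCA_AZURE_ENDPOINT ""`, the sketch prints a skip message and never invokes gh, so a blank endpoint falls through to the default in the workflow.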

The 'Configured secrets'/'Configured variables' summaries call 'gh secret list' / 'gh variable list', which page their table output through an interactive pager when stdout is a TTY, dropping the caller (including setup-nightly-cluster.sh) into a pager that needs manual 'q' to exit. Export GH_PAGER=cat so the whole script runs without paging.

The closing 'Next steps' hint was stable/release-specific and partly wrong: it told operators to label a gateway node (forge already labels the gwmain pool on both stable and nightly clusters) and to trigger release-upgrade.yaml with a tag (nightly deploys a snapshot of main via nightly.yaml on a schedule, not tags). Add --channel stable|nightly (default stable, so stable output is unchanged apart from removing the obsolete gateway-labeling step) and print channel-appropriate trigger guidance. setup-nightly-cluster.sh passes --channel nightly.
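The flag handling could be sketched as follows. The function names and guidance strings are illustrative assumptions, not copied from setup-deploy-environment.sh; only the flag name, its two values, and the stable default come from the description above.

```shell
# Parse --channel stable|nightly (default: stable) and print the result.
parse_channel() {
  local channel="stable"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --channel)
        [ "$#" -ge 2 ] || { echo "--channel needs a value" >&2; return 2; }
        channel="$2"
        shift 2
        ;;
      --channel=*)
        channel="${1#--channel=}"
        shift
        ;;
      *)
        shift   # other flags are ignored in this sketch
        ;;
    esac
  done

  case "$channel" in
    stable|nightly) printf '%s\n' "$channel" ;;
    *) echo "unknown channel: $channel" >&2; return 2 ;;
  esac
}

# Print channel-appropriate trigger guidance (wording is illustrative).
next_steps() {
  case "$1" in
    stable)  echo "Trigger release-upgrade.yaml with a release tag." ;;
    nightly) echo "nightly.yaml runs on its schedule; to run now: gh workflow run nightly.yaml" ;;
  esac
}

next_steps "$(parse_channel --channel nightly)"
```

Defaulting to stable keeps existing callers unchanged, while the nightly provisioner opts in explicitly.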

main moved actions/checkout from v6.0.3 to v7.0.0 repo-wide (incl. release.yaml and release-upgrade.yaml). Bump the 7 pins in nightly.yaml to stay in lockstep. v7 is compatible with the ref/fetch-depth usage here; no behavior change.

The default forge cluster has a 2-node system pool (the gateway pool is tainted), which cannot fit Orca's default 3 replicas alongside Garage and the net/machina workloads, so deploy-orca timed out waiting for the orca rollout. unbounded-nightly is a soak target, not HA, so run a single Orca replica via deploy-integration.sh --replicas 1.

Orca failed at startup with a Garage 403 on GetBucketVersioning ('Operation is not allowed for this key'). Root cause: the one-shot re-ran create-credentials-secret.sh on every invocation, which mints fresh Garage S3 keys, while bootstrap-garage.sh imports each under the same name 'orca' and grants the bucket via --key orca. With several keys sharing that name the grant became ambiguous and landed on a stale key, leaving the key Orca actually uses unauthorized.

Fixes:

- bootstrap-garage.sh grants (key allow / bucket allow) by the unique access key id instead of the name, so the current Secret's key always gets authorized (self-heals clusters that already drifted on the next deploy-orca run). Benefits stable too.
- setup-nightly-cluster.sh leaves an existing orca-credentials Secret untouched instead of regenerating its keys; delete the Secret to rotate.
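Why a name-based grant becomes ambiguous can be shown with a small simulation. The key table and helper names below are invented for illustration, and nothing here calls Garage itself.

```shell
# Simulated key table, one "<access-key-id> <name>" pair per line. Two keys
# share the name "orca", as happens when credentials are regenerated and
# re-imported under the same name. The ids are made-up placeholders.
keys='GKold111 orca
GKnew222 orca'

# Count keys matching a name (ambiguous once several keys share it).
count_by_name() {
  printf '%s\n' "$keys" | awk -v n="$1" '$2 == n' | wc -l | tr -d ' '
}

# Count keys matching an access key id (unique per key).
count_by_id() {
  printf '%s\n' "$keys" | awk -v id="$1" '$1 == id' | wc -l | tr -d ' '
}

echo "by name orca: $(count_by_name orca) match(es)"      # 2 matches
echo "by id GKnew222: $(count_by_id GKnew222) match(es)"  # 1 match
```

Selecting by the access key id read from the current Secret leaves exactly one candidate, so the grant cannot land on a stale key.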

plombardi89 · 2026-06-24T03:10:21Z

https://github.com/Azure/unbounded/actions/runs/28070792438

Pre-merge testing from the nightly-deploy branch is complete, so drop the temporary push trigger. The workflow now runs only on its 06:00 UTC schedule (against the default-branch head) and via manual workflow_dispatch. It does NOT run on merges/pushes to main. Keeps the github.sha resolve fix and the --no-trigger provisioning support.

plombardi89 requested a review from a team June 22, 2026 20:30

plombardi89 had a problem deploying to unbounded-nightly June 23, 2026 05:10 — with GitHub Actions Failure

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 05:30 — with GitHub Actions Inactive

plombardi89 had a problem deploying to unbounded-nightly June 23, 2026 05:31 — with GitHub Actions Failure

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:04 — with GitHub Actions Inactive

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:05 — with GitHub Actions Inactive

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:06 — with GitHub Actions Inactive

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:10 — with GitHub Actions Inactive

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:11 — with GitHub Actions Inactive

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 16:08 — with GitHub Actions Inactive

plombardi89 had a problem deploying to unbounded-nightly June 23, 2026 16:08 — with GitHub Actions Failure

plombardi89 added 9 commits June 23, 2026 19:20

plombardi89 force-pushed the nightly-deploy branch from e64c461 to c30f82c Compare June 23, 2026 23:21

plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 23:26 — with GitHub Actions Inactive

plombardi89 had a problem deploying to unbounded-nightly June 23, 2026 23:27 — with GitHub Actions Failure

plombardi89 temporarily deployed to unbounded-nightly June 24, 2026 00:50 — with GitHub Actions Inactive

plombardi89 had a problem deploying to unbounded-nightly June 24, 2026 00:51 — with GitHub Actions Failure

plombardi89 temporarily deployed to unbounded-nightly June 24, 2026 02:27 — with GitHub Actions Inactive

plombardi89 temporarily deployed to unbounded-nightly June 24, 2026 02:28 — with GitHub Actions Inactive

plombardi89 deployed to unbounded-nightly June 24, 2026 02:29 — with GitHub Actions Active

plombardi89 added 2 commits June 23, 2026 23:12

Merge branch 'main' into nightly-deploy

3f91c0c

plombardi89 wants to merge 13 commits into main from nightly-deploy
