Add unbounded-nightly cluster (nightly build+deploy of main) #312

Open
plombardi89 wants to merge 13 commits into main from nightly-deploy


@plombardi89

@plombardi89 plombardi89 commented Jun 22, 2026

Collaborator

Summary

Introduces an unbounded-nightly integration cluster - the nightly sibling of unbounded-stable. Where stable deploys a published, signed release, nightly builds a from-source snapshot of main HEAD every morning at 06:00 UTC and deploys it, so the tip of the tree gets the same soak treatment. Also adds a one-shot provisioning script and makes the shared Orca deploy tooling deployment-neutral.

1. Deployment-neutral Orca deploy tooling

The Orca integration tooling is now shared by both clusters, so it no longer bakes in the stable channel: deploy-stable.sh->deploy-integration.sh, smoke-stable.sh->smoke-integration.sh, hack/orca/stable/->hack/orca/integration/, and the Garage template labels/admin-token are neutralized. Existing clusters are unaffected on re-apply.

2. .github/workflows/nightly.yaml

  • resolve: snapshots the default-branch head (github.sha, or inputs.ref); derives a nightly-<shortsha> tag (all jobs pin to the resolved SHA).
  • net-images / component-images: build+push amd64 images (unbounded-net-controller, unbounded-net-node, machina, orca) for that tag. No cosign/SBOM/Trivy gating - nightly is a throwaway soak target.
  • deploy (environment: unbounded-nightly): init on first bootstrap (plugin built from the snapshot, embedded manifests stamped with the nightly tags) or upgrade-apply thereafter; reuses the machina-config merge from #235 (release-upgrade silently wipes machina-config apiServerEndpoint, crashing the controller). Mirrors release-upgrade.yaml (applies net/ + machina/).
  • deploy-orca: shared deploy-integration.sh, single Orca replica (soak target, fits the forge cluster).
  • smoke-discover / smoke-tests: shared hack/release/smoke/*.sh.

Triggers: schedule (06:00 UTC daily) + manual workflow_dispatch. It does not run on merges/pushes to main.
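
As a rough illustration of the resolve step described above, the sketch below picks the commit to build (an explicit ref if one was supplied, otherwise the triggering SHA) and derives a nightly-<shortsha> tag from it. The function names, the 7-character short-SHA length, and the assumption that the ref has already been resolved to a commit SHA are all illustrative; the actual job in nightly.yaml may do this differently.

```shell
# Hypothetical sketch of the resolve step; names and short-SHA length are assumptions.

# pick_commit <input_ref> <event_sha>: prefer an explicit ref, else the event SHA.
pick_commit() {
  if [ -n "${1-}" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "${2-}"
  fi
}

# nightly_tag <full_sha>: derive nightly-<shortsha> (first 7 characters assumed).
nightly_tag() {
  printf 'nightly-%s\n' "$(printf '%s' "$1" | cut -c1-7)"
}
```

Pinning every later job to the output of a single resolve step means the images, manifests, and deploy all refer to the same commit even if main moves while the run is in progress.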

3. hack/scripts/setup-nightly-cluster.sh (one-shot provisioner)

Idempotently: builds forge + forge cluster create (gateway pool pre-labeled, WireGuard ports, bootstrap token, kubeconfig); auto-detects cluster CIDRs from AKS; creates the Orca origin storage account (ub<site>01) + container; configures the unbounded-nightly Environment via setup-deploy-environment.sh; creates the unbounded-kube namespace + orca-credentials Secret; triggers nightly.yaml -f force_init=true and watches (or --no-trigger). Includes setup-deploy-environment.sh hardening (skip empty vars, no pager, --channel stable|nightly, drop the obsolete gateway-label step since forge labels the pool) and a Garage key-grant fix (grant by access-key id so re-runs/regenerated keys can't strand Orca's key).
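
One way to read "idempotently" for the Secret step: check whether the Secret already exists and only run the creation command when it does not. This is a minimal sketch of that pattern, not the script's actual code. The function name and the KUBECTL override (used so the sketch can be dry-run without a cluster) are assumptions, and the creation command is passed in by the caller rather than guessed.

```shell
# Hypothetical sketch; KUBECTL is an assumed override for dry-running without a cluster.

# ensure_secret <namespace> <name> <create-command...>
# Runs the create command only when the Secret does not already exist.
ensure_secret() {
  es_ns="$1"
  es_name="$2"
  shift 2
  if "${KUBECTL:-kubectl}" get secret "$es_name" -n "$es_ns" >/dev/null 2>&1; then
    echo "secret ${es_ns}/${es_name} already exists; leaving it untouched"
    return 0
  fi
  "$@"
}
```

A caller would pass the real creation command after the Secret name, for example `ensure_secret unbounded-kube orca-credentials <create-command>`. Re-running the provisioner then never regenerates keys behind an existing Secret.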

Scheduling note

GitHub Actions cron is UTC-only and ignores DST. 0 6 * * * lands at 01:00 ET (EST) / 02:00 ET (EDT) - early on purpose so the from-source build + deploy + smoke finishes before the US working day.
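
The offset arithmetic behind those two local times can be checked with a small helper. The function name is made up for illustration; the offsets are the standard UTC-5 (EST) and UTC-4 (EDT).

```shell
# utc_to_local_hour <utc_hour> <utc_offset_hours>: print the local hour, zero-padded.
utc_to_local_hour() {
  printf '%02d\n' $(( ($1 + $2 + 24) % 24 ))
}

utc_to_local_hour 6 -5   # EST (UTC-5): prints 01
utc_to_local_hour 6 -4   # EDT (UTC-4): prints 02
```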

Tested end-to-end before merge

A temporary push: trigger (now removed) let the branch's own workflow run pre-merge. The full pipeline passed: run 28070792438 (commit 92370af) - resolve -> net-images -> component-images -> deploy (net + machina + site init) -> deploy-orca (1 Orca replica) -> smoke (core-namespaces-ready), all green against the live unbounded-nightly cluster. The temporary trigger has since been removed, so the merged workflow runs only on the schedule + dispatch.

Operator setup (post-merge)

hack/scripts/setup-nightly-cluster.sh --subscription <sub-id> (see --help). Needs az/gh logged in and admin on the repo. The workflow is gated on github.repository == 'Azure/unbounded', so it's inert until the Environment + cluster exist. First bootstrap is handled by the script (or gh workflow run nightly.yaml -f force_init=true); thereafter it runs nightly automatically.

Verification

  • actionlint clean on nightly.yaml and release-upgrade.yaml; YAML valid; no em-dashes.
  • bash -n clean on all scripts; go test ./internal/orca/manifests/... passes; gofumpt clean.
  • Full green pre-merge run (28070792438) against the real cluster.

@plombardi89 plombardi89 requested a review from a team June 22, 2026 20:30
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 05:30 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:04 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:05 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:06 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:10 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:10 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 06:11 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 23, 2026 16:08 — with GitHub Actions Inactive
Rename the Orca integration-cluster deploy tooling so it no longer bakes
in the 'stable' channel, since it is now shared by both unbounded-stable
and the new unbounded-nightly cluster:

  hack/orca/deploy-stable.sh   -> hack/orca/deploy-integration.sh
  hack/orca/smoke-stable.sh    -> hack/orca/smoke-integration.sh
  hack/orca/stable/            -> hack/orca/integration/

Neutralize the Garage template labels (part-of: orca-stable -> orca) and
admin-token default, and update all references (release-upgrade.yaml,
create-credentials-secret.sh, and the manifests render test). The Garage
labels are not used as selectors and the admin token is internal to the
template, so existing clusters are unaffected on re-apply.

Introduce an unbounded-nightly integration cluster that mirrors
unbounded-stable but deploys a from-source snapshot of main HEAD every
morning at 06:00 UTC instead of a published release.

The new .github/workflows/nightly.yaml:
  - resolves the snapshot commit and derives a nightly-<shortsha> tag
  - builds and pushes amd64 images (net-controller, net-node, machina,
    orca) for that tag, without the release path's cosign/SBOM/Trivy
    gating (the nightly cluster is a throwaway soak target)
  - renders manifests against the nightly tags and deploys them to the
    unbounded-nightly GitHub Environment (init on first bootstrap,
    upgrade-apply thereafter), reusing the #235 machina-config merge
  - deploys Orca via the shared deploy-integration.sh
  - runs the shared hack/release/smoke suite

The target cluster is configured via the unbounded-nightly Environment
using the existing hack/scripts/setup-deploy-environment.sh (no change
needed there); the workflow header documents the one-time setup.

Automate the full unbounded-nightly operator runbook in a single
idempotent script:

  - builds forge and runs 'forge cluster create' (which also labels the
    gateway pool unbounded-cloud.io/unbounded-net-gateway=true, opens
    WireGuard ports, lays down the bootstrap token, and writes a
    kubeconfig); parses forge's JSON stdout for the RG / node RG /
    subscription / kubeconfig path
  - auto-detects the cluster node/pod CIDRs from AKS (mirroring
    aks-quickstart.sh:detect_cluster_cidrs); site CIDRs default to
    constants
  - creates the Orca origin storage account (default ub<site>01) +
    container and reads its key
  - configures the unbounded-nightly Environment via
    setup-deploy-environment.sh
  - creates the unbounded-kube namespace + orca-credentials Secret
  - triggers nightly.yaml with force_init=true and watches it
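
For the origin storage account step in the list above, a sketch of how the default ub<site>01 name could be derived and sanity-checked before calling Azure. Both function names and the example site value are hypothetical. The validation reflects Azure's storage account naming rule (3 to 24 characters, lowercase letters and digits only); whether the script performs such a check is not stated in this PR.

```shell
# Hypothetical helpers; not taken from setup-nightly-cluster.sh.

# origin_account_name <site>: default account name ub<site>01.
origin_account_name() {
  printf 'ub%s01\n' "$1"
}

# valid_storage_account_name <name>: 3-24 characters, lowercase letters and digits only.
valid_storage_account_name() {
  case "$1" in
    *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]
}
```

With a made-up site of `nightly`, `origin_account_name nightly` yields `ubnightly01`, which passes the check. A site name containing hyphens or uppercase letters would be rejected before any Azure call is made.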

Point the nightly.yaml header's first-time-setup notes at the script
(forge handles gateway labeling, CIDRs are auto-detected).

workflow_dispatch and schedule only run from the default branch, so the
nightly workflow could not be exercised before merge. Add a TEMPORARY
push trigger on the nightly-deploy branch (marked for removal before
merge) so pushing the branch runs its own workflow file end-to-end
(build -> init deploy -> Orca -> smoke) against the unbounded-nightly
cluster, giving reviewers a real run to inspect.

Also fix the resolve job to snapshot github.sha instead of the default
branch, so a push-triggered run builds the pushed commit rather than
main. This is a correctness fix and stays after merge.

setup-nightly-cluster.sh gains --no-trigger: provision the cluster,
origin, Environment, and Secret without dispatching, then push the
branch to fire the run. The default-branch workflow preflight is skipped
in that mode.

set_var ran 'gh variable set --body ""' for empty values (e.g. a blank
ORCA_AZURE_ENDPOINT, which means 'use the default *.blob.core.windows.net').
gh treats an empty --body as no value on a TTY and prompts interactively
('Paste your variable'), hanging non-interactive callers like
setup-nightly-cluster.sh. Pipe the value via stdin instead; stdin is never
a TTY here, so empty values are stored without prompting.

GitHub Actions variables cannot be empty: 'gh variable set' with an empty
value returns HTTP 422 (missing required key 'value'). A blank
ORCA_AZURE_ENDPOINT (meaning 'use the default *.blob.core.windows.net')
therefore broke setup. Skip empty values in set_var; an unset variable
already resolves to "" in the workflow, which is the intended behavior.
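
A minimal sketch of the skip-empty behavior described above, assuming set_var takes a name and a value. The real function's signature and any extra flags it passes (such as the target Environment) are not shown in this PR, and GH_BIN is a hypothetical override added only so the sketch can be dry-run.

```shell
# Hypothetical sketch of set_var; GH_BIN is an assumed override for dry runs.

# set_var <name> <value>: store a variable, skipping empty values.
set_var() {
  sv_name="$1"
  sv_value="${2-}"
  if [ -z "$sv_value" ]; then
    # An unset variable already reads as "" in the workflow, so nothing is lost.
    echo "skipping ${sv_name}: empty value (left unset)"
    return 0
  fi
  "${GH_BIN:-gh}" variable set "$sv_name" --body "$sv_value"
}
```

Running with `GH_BIN=echo` prints the command that would be issued instead of calling GitHub, which makes the skip path easy to confirm for a blank ORCA_AZURE_ENDPOINT.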

Supersedes the earlier stdin approach (which avoided the interactive
prompt but still sent an empty value and hit the 422).

The 'Configured secrets'/'Configured variables' summaries call
'gh secret list' / 'gh variable list', which page their table output
through an interactive pager when stdout is a TTY, dropping the caller
(including setup-nightly-cluster.sh) into a pager that needs manual 'q'
to exit. Export GH_PAGER=cat so the whole script runs without paging.

The closing 'Next steps' hint was stable/release-specific and partly
wrong: it told operators to label a gateway node (forge already labels
the gwmain pool on both stable and nightly clusters) and to trigger
release-upgrade.yaml with a tag (nightly deploys a snapshot of main via
nightly.yaml on a schedule, not tags).

Add --channel stable|nightly (default stable, so stable output is
unchanged apart from removing the obsolete gateway-labeling step) and
print channel-appropriate trigger guidance. setup-nightly-cluster.sh
passes --channel nightly.
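
A sketch of how a --channel stable|nightly flag with a stable default might be parsed and validated. The function name, the error messages, and the exit code are assumptions; only the flag name, its two values, and the default come from the description above.

```shell
# Hypothetical sketch; only the flag name, its values, and the default come from the PR.

# parse_channel [args...]: print the selected channel (default: stable).
parse_channel() {
  pc_channel="stable"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --channel)
        [ "$#" -ge 2 ] || { echo "--channel requires a value" >&2; return 2; }
        pc_channel="$2"
        shift 2
        ;;
      *)
        # Other arguments are ignored in this sketch.
        shift
        ;;
    esac
  done
  case "$pc_channel" in
    stable|nightly) printf '%s\n' "$pc_channel" ;;
    *) echo "unknown channel: ${pc_channel} (expected stable|nightly)" >&2; return 2 ;;
  esac
}
```

Defaulting to stable keeps existing callers' output unchanged, while setup-nightly-cluster.sh opts in explicitly with `--channel nightly`.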
main moved actions/checkout from v6.0.3 to v7.0.0 repo-wide (incl.
release.yaml and release-upgrade.yaml). Bump the 7 pins in nightly.yaml
to stay in lockstep. v7 is compatible with the ref/fetch-depth usage
here; no behavior change.

The default forge cluster has a 2-node system pool (the gateway pool is
tainted), which cannot fit Orca's default 3 replicas alongside Garage and
the net/machina workloads, so deploy-orca timed out waiting for the orca
rollout. unbounded-nightly is a soak target, not HA, so run a single Orca
replica via deploy-integration.sh --replicas 1.

Orca failed at startup with a Garage 403 on GetBucketVersioning
('Operation is not allowed for this key'). Root cause: the one-shot
re-ran create-credentials-secret.sh on every invocation, which mints
fresh Garage S3 keys, while bootstrap-garage.sh imports each under the
same name 'orca' and grants the bucket via --key orca. With several
keys sharing that name the grant became ambiguous and landed on a stale
key, leaving the key Orca actually uses unauthorized.

Fixes:
- bootstrap-garage.sh grants (key allow / bucket allow) by the unique
  access key id instead of the name, so the current Secret's key always
  gets authorized (self-heals clusters that already drifted on the next
  deploy-orca run). Benefits stable too.
- setup-nightly-cluster.sh leaves an existing orca-credentials Secret
  untouched instead of regenerating its keys; delete the Secret to
  rotate.
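
The ambiguity behind the root cause can be shown without Garage at all. In the toy listing below, three keys share the name 'orca' (the ids are made up), so a lookup by name matches all of them, while a lookup by access key id matches exactly one. This only models the selection problem; it does not reproduce bootstrap-garage.sh or the Garage CLI.

```shell
# Hypothetical key listing: "<access-key-id> <name>" per line; the ids are made up.
KEYS='GKaaaa orca
GKbbbb orca
GKcccc orca'

# ids_by_name <name>: every key id carrying that name (may be several).
ids_by_name() {
  printf '%s\n' "$KEYS" | awk -v n="$1" '$2 == n { print $1 }'
}

# ids_by_id <access-key-id>: the single key with that id, if present.
ids_by_id() {
  printf '%s\n' "$KEYS" | awk -v id="$1" '$1 == id { print $1 }'
}
```

Here `ids_by_name orca` returns three candidates, so a grant addressed by name can land on a stale key. `ids_by_id GKcccc` returns only the key currently stored in the Secret, which is why granting by id is unambiguous.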
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 24, 2026 02:27 — with GitHub Actions Inactive
@plombardi89 plombardi89 temporarily deployed to unbounded-nightly June 24, 2026 02:28 — with GitHub Actions Inactive
@plombardi89 plombardi89 deployed to unbounded-nightly June 24, 2026 02:29 — with GitHub Actions Active
@plombardi89

Collaborator Author

Pre-merge testing from the nightly-deploy branch is complete, so drop
the temporary push trigger. The workflow now runs only on its 06:00 UTC
schedule (against the default-branch head) and via manual
workflow_dispatch. It does NOT run on merges/pushes to main. Keeps the
github.sha resolve fix and the --no-trigger provisioning support.