fix(planner): Use start script for planner deployment by usmanmani1122 · Pull Request #12694 · Agoric/agoric-sdk

usmanmani1122 · 2026-06-01T12:15:18Z

refs: #XXXX

Description

Google is shutting down the Compute Engine container startup agent (konlet) and the gce-container-declaration instance metadata it relies on: workflows that depend on them stop working on 2026-07-31, with support fully ending 2027-07-31.

Our ymax-planner deployment was built entirely on that mechanism — gcloud compute instances update-container (sets gce-container-declaration, konlet runs the container) plus a digest check that reads the same metadata key. This PR migrates that pipeline to a supported approach: a Docker container driven by a VM startup script, with no dependency on konlet.

The deploy scripts keep the same CLI signatures. Both workflows that call them (deploy-ymax1-planner.yml, docker.yml) are updated only to pass the env file into the digest check (check_digest.sh … "$GITHUB_OUTPUT" "./.env.gcp") so the new env-aware gate compares against the same .env.gcp that gets deployed.

Most critical files to review:

.github/scripts/startup-script.sh (new) — installed as the VM startup-script metadata. On every boot it reads the target image (ymax-container-image) and env (ymax-container-env) from metadata and runs the container with docker run --detach --restart always --network host --volume /var/lib/kv-store:/db_data. This restores the container after reboot/recreation (the role konlet used to play). It also installs a small systemd-timer watchdog that restarts the container if it stops producing logs for 2 minutes — covering the "alive but wedged" case that --restart always does not.
.github/scripts/deploy_vm.sh — drops update-container. Now removes the deprecated gce-container-declaration key, writes ymax-container-image + the startup-script, writes the env to ymax-container-env, and applies via a graceful stop/start (re-runs the startup script; avoids the SQLite-corruption risk of a hard reset).
.github/scripts/check_digest.sh — reads ymax-container-image instead of gce-container-declaration, and now also compares ymax-container-env so an env-only change triggers a redeploy (previously only the image was checked).

Security Considerations

The container env (sourced from the YMAX*_PLANNER_ENV GitHub secrets) is stored in the ymax-container-env instance metadata and written to /run/ymax-planner.env on the VM. This is readable by any principal with compute.instances.get on the project. This is equivalent to the prior exposure: the old gce-container-declaration also embedded the container env in instance metadata. No new external authority is introduced.
The deploy service account now needs compute.instances.stop / compute.instances.start (in addition to the setMetadata it already used). No SSH/IAP access to the VM is required.
The startup script and watchdog run as root on the VM (unchanged from how COS runs startup scripts).

Scaling Considerations

No meaningful change in resource consumption. A deploy now performs a full VM stop/start (brief downtime) rather than a container reset; update-container likewise restarted the instance, so this is comparable.
The watchdog runs once per minute and only inspects logs/docker inspect — negligible overhead.

Documentation Considerations

Backwards compatibility / data: existing konlet-backed VMs migrate on the first run of the new deploy_vm.sh (the gce-container-declaration key is removed and the startup script takes over). The persistent SQLite DB is preserved via the /var/lib/kv-store -> /db_data host-path mount, matching the previous konlet volume.
Any env vars that were baked into the original container declaration must now be present in the YMAX*_PLANNER_ENV secret, since the declaration is removed.

Testing Considerations

Validated all three scripts and the generated watchdog script with bash -n. Verified the startup-script heredoc renders the watchdog with values baked in (CONTAINER, STALE, /dev/null) while keeping runtime expansions literal.
Manually exercised deploy_vm.sh against a throwaway VM (ymax-test-planner): metadata set, deprecated key removed, graceful stop/start succeeded.
Both workflows pass the env file to check_digest.sh (… "$GITHUB_OUTPUT" "./.env.gcp"), matching the repo-root .env.gcp they write and the file deploy_vm.sh stores to metadata, so the env-aware skip/deploy decision is accurate.
Still to verify end-to-end: container comes up healthy with the correct env on ymax0, then ymax1; the watchdog restarts a deliberately-silenced container.

Upgrade Considerations

Run the migration deploy once per VM before 2026-07-31: ymax0 and ymax1
Before cutover, confirm YMAX0_PLANNER_ENV / YMAX1_PLANNER_ENV contain the complete env set, since env is now a full replace via metadata, not a per-key patch
Confirm each VM's DB mount path matches DB_HOST_PATH/DB_MOUNT_PATH in startup-script.sh
Verify the deploy SA has compute.instances.stop / start

Muneeb147

Thanks @usmanmani1122 for taking this up. It was pending since a while as google plans to deprecate update-container soon..

Left some comments..and also we'd need to update the runbook:

Runbook:
https://docs.google.com/document/d/1Mcv8RMh9ni_tH-bsE7TJg-xL35pLRc5nhlPUH-
NVs88/edit?tab=t.0

Sections to update:

Manual Deployment through gcloud (as Fallback)
Updating or Adding Environment Variables

Muneeb147 · 2026-06-03T08:50:04Z

+WATCHDOG_TIMER_SERVICE_NAME="container-watchdog.timer"
+WATCHDOG_STALE="2m"
+
+ENV_FILE="/run/${CONTAINER_NAME}.env"


Is deploy_vm is making sure to put relevant env variables inside /run/ymax-planner.env file already?

No my bad, I see get_metadata "$ENV_NAME_ATTRIBUTE" > "$ENV_FILE" is doing the needful..

Muneeb147 · 2026-06-03T08:57:26Z

+CURRENT_ENV="$(
+  printf '%s' "$METADATA" | \
+  jq '.metadata.items[]? | select(.key=="ymax-container-env") | .value' --raw-output


On the first run, this will be null/empty right? As there is no metadata named ymax-container-env in the VM..

Muneeb147 · 2026-06-03T09:06:28Z

+if test -f "$ENV_FILE"
+then
+  gcloud compute instances add-metadata "$GCE_INSTANCE" \
+    --metadata-from-file "$ENV_NAME_ATTRIBUTE=$ENV_FILE" \


Previously, we used to deliberately skip # --container-env-file "$ENV_FILE" as /.env.gcp is stale and not updated with actual envs of planner-vm.
.env.gcp gets populated from GH secret (which is stale). Now after this change, it'll deploy the stale envs by reading from the file..

So either we keep the GH secret updated with the env.. OR we keep the old behavior of ignoring the .env.gcp file and only rely on the metadata.. And for that, we might need to first set metadata ymax-container-env one time with currently deployed envs..

updates

58ddd5a

usmanmani1122 requested a review from Muneeb147 June 1, 2026 12:15

usmanmani1122 self-assigned this Jun 1, 2026

Muneeb147 reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(planner): Use start script for planner deployment#12694

fix(planner): Use start script for planner deployment#12694
usmanmani1122 wants to merge 1 commit into
masterfrom
use-container-start-script-for-planner-deployment

usmanmani1122 commented Jun 1, 2026

Uh oh!

Muneeb147 left a comment

Uh oh!

Muneeb147 Jun 3, 2026

Uh oh!

Muneeb147 Jun 3, 2026

Uh oh!

Muneeb147 Jun 3, 2026

Uh oh!

Muneeb147 Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

usmanmani1122 commented Jun 1, 2026

Description

Security Considerations

Scaling Considerations

Documentation Considerations

Testing Considerations

Upgrade Considerations

Uh oh!

Muneeb147 left a comment

Choose a reason for hiding this comment

Uh oh!

Muneeb147 Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Muneeb147 Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Muneeb147 Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Muneeb147 Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants