fix(planner): Use start script for planner deployment #12694
usmanmani1122 wants to merge 1 commit into
Muneeb147 left a comment
Thanks @usmanmani1122 for taking this up. It has been pending for a while, as Google plans to deprecate update-container soon.
Left some comments. We'd also need to update the runbook:
Runbook:
https://docs.google.com/document/d/1Mcv8RMh9ni_tH-bsE7TJg-xL35pLRc5nhlPUH-NVs88/edit?tab=t.0
Sections to update:
- Manual Deployment through gcloud (as Fallback)
- Updating or Adding Environment Variables
    WATCHDOG_TIMER_SERVICE_NAME="container-watchdog.timer"
    WATCHDOG_STALE="2m"

    ENV_FILE="/run/${CONTAINER_NAME}.env"
Is deploy_vm already making sure to put the relevant env variables inside the /run/ymax-planner.env file?
No, my bad: I see that get_metadata "$ENV_NAME_ATTRIBUTE" > "$ENV_FILE" takes care of it.
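As a rough illustration of the redirect discussed above, the sketch below fetches one custom attribute from the GCE metadata server and writes it to the env file. It is not the PR's code: the body of get_metadata is an assumed stand-in (only its name appears in the diff), and write_env_file and METADATA_BASE are hypothetical names introduced here. The metadata server URL and the Metadata-Flavor header are the standard GCE ones.

```bash
#!/usr/bin/env bash
set -eu

# Base URL for custom instance attributes on the GCE metadata server.
METADATA_BASE="${METADATA_BASE:-http://metadata.google.internal/computeMetadata/v1/instance/attributes}"

# Assumed stand-in for the PR's get_metadata helper: print one attribute's value.
get_metadata() {
  curl --fail --silent --show-error \
    --header 'Metadata-Flavor: Google' \
    "$METADATA_BASE/$1"
}

# Hypothetical wrapper: write the attribute to env_file via a temp file,
# so a failed fetch leaves any previous env file untouched.
write_env_file() {
  local attribute="$1" env_file="$2" tmp
  tmp="$(mktemp "${env_file}.XXXXXX")"
  if get_metadata "$attribute" > "$tmp"; then
    chmod 600 "$tmp"   # the env may hold secrets, so restrict permissions
    mv "$tmp" "$env_file"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

On the VM this would be called as write_env_file ymax-container-env /run/ymax-planner.env before the container is started. Writing through a temp file is a design choice of this sketch, not something the diff shows.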
    CURRENT_ENV="$(
      printf '%s' "$METADATA" | \
      jq '.metadata.items[]? | select(.key=="ymax-container-env") | .value' --raw-output
On the first run this will be null/empty, right? There is no metadata named ymax-container-env on the VM yet.
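To make the first-run case concrete: with this particular filter, a missing key produces empty output rather than the string "null", because select emits nothing when no item matches and the trailing ? on .items[]? suppresses the error when items is absent. The sketch below wraps the same filter in a function. current_env and env_is_unset are hypothetical names, and it assumes $METADATA holds JSON shaped like the output of gcloud compute instances describe --format=json, which the diff does not show.

```bash
#!/usr/bin/env bash
set -eu

# Read instance JSON on stdin and print the stored env, or nothing if the key is absent.
current_env() {
  jq --raw-output '.metadata.items[]? | select(.key=="ymax-container-env") | .value'
}

# Succeeds when the instance has no ymax-container-env metadata yet (first run).
env_is_unset() {
  [ -z "$(current_env)" ]
}
```

A caller could then branch with printf '%s' "$METADATA" | env_is_unset to decide whether this is the first deploy under the new scheme.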
    if test -f "$ENV_FILE"
    then
      gcloud compute instances add-metadata "$GCE_INSTANCE" \
        --metadata-from-file "$ENV_NAME_ATTRIBUTE=$ENV_FILE" \
Previously, we deliberately skipped --container-env-file "$ENV_FILE" (the line was commented out) because .env.gcp is stale and not updated with the actual envs of planner-vm.
.env.gcp gets populated from the GH secret (which is stale). After this change, the deploy will read that file and push the stale envs.
So either we keep the GH secret updated with the current env, or we keep the old behavior of ignoring the .env.gcp file and relying only on the metadata. For the latter, we might first need to set the ymax-container-env metadata once with the currently deployed envs.
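The second option (metadata wins, the file is only a first-run seed) could look roughly like the sketch below. This is only an illustration of the reviewer's suggestion, not behavior the PR implements; select_env is a hypothetical helper, and its first argument is assumed to be the value already read from the ymax-container-env metadata.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the env to deploy.
# An env already stored in metadata takes priority; the file is used only
# when no env has been stored yet. Fails if neither source is available.
select_env() {
  local current_env="$1" env_file="$2"
  if [ -n "$current_env" ]; then
    printf '%s\n' "$current_env"
  elif [ -f "$env_file" ]; then
    cat "$env_file"
  else
    return 1
  fi
}
```

For example, select_env "$CURRENT_ENV" ./.env.gcp > "$ENV_FILE" would leave a live env untouched and fall back to the secret-derived file only on the very first deploy.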
refs: #XXXX
Description
Google is shutting down the Compute Engine container startup agent (konlet) and the gce-container-declaration instance metadata it relies on: workflows that depend on them stop working on 2026-07-31, with support fully ending 2027-07-31.
Our ymax-planner deployment was built entirely on that mechanism: gcloud compute instances update-container (sets gce-container-declaration, konlet runs the container) plus a digest check that reads the same metadata key. This PR migrates that pipeline to a supported approach: a Docker container driven by a VM startup script, with no dependency on konlet.
The deploy scripts keep the same CLI signatures. Both workflows that call them (deploy-ymax1-planner.yml, docker.yml) are updated only to pass the env file into the digest check (check_digest.sh … "$GITHUB_OUTPUT" "./.env.gcp") so the new env-aware gate compares against the same .env.gcp that gets deployed.
Most critical files to review:
- .github/scripts/startup-script.sh (new): installed as the VM startup-script metadata. On every boot it reads the target image (ymax-container-image) and env (ymax-container-env) from metadata and runs the container with docker run --detach --restart always --network host --volume /var/lib/kv-store:/db_data. This restores the container after reboot/recreation (the role konlet used to play). It also installs a small systemd-timer watchdog that restarts the container if it stops producing logs for 2 minutes, covering the "alive but wedged" case that --restart always does not.
- .github/scripts/deploy_vm.sh: drops update-container. Now removes the deprecated gce-container-declaration key, writes ymax-container-image + the startup-script, writes the env to ymax-container-env, and applies via a graceful stop/start (re-runs the startup script; avoids the SQLite-corruption risk of a hard reset).
- .github/scripts/check_digest.sh: reads ymax-container-image instead of gce-container-declaration, and now also compares ymax-container-env so an env-only change triggers a redeploy (previously only the image was checked).
Security Considerations
- [...] (YMAX*_PLANNER_ENV GitHub secrets) is stored in the ymax-container-env instance metadata and written to /run/ymax-planner.env on the VM. This is readable by any principal with compute.instances.get on the project. This is equivalent to the prior exposure: the old gce-container-declaration also embedded the container env in instance metadata. No new external authority is introduced.
- [...] compute.instances.stop / compute.instances.start (in addition to the setMetadata it already used). No SSH/IAP access to the VM is required.
Scaling Considerations
- [...] stop/start (brief downtime) rather than a container reset; update-container likewise restarted the instance, so this is comparable.
- [...] docker inspect: negligible overhead.
Documentation Considerations
- [...] deploy_vm.sh (the gce-container-declaration key is removed and the startup script takes over). The persistent SQLite DB is preserved via the /var/lib/kv-store -> /db_data host-path mount, matching the previous konlet volume.
- [...] YMAX*_PLANNER_ENV secret, since the declaration is removed.
Testing Considerations
- [...] bash -n. Verified the startup-script heredoc renders the watchdog with values baked in (CONTAINER, STALE, /dev/null) while keeping runtime expansions literal.
- [...] deploy_vm.sh against a throwaway VM (ymax-test-planner): metadata set, deprecated key removed, graceful stop/start succeeded.
- [...] check_digest.sh (… "$GITHUB_OUTPUT" "./.env.gcp"), matching the repo-root .env.gcp they write and the file deploy_vm.sh stores to metadata, so the env-aware skip/deploy decision is accurate.
Upgrade Considerations
- [...] YMAX0_PLANNER_ENV / YMAX1_PLANNER_ENV contain the complete env set, since env is now a full replace via metadata, not a per-key patch
- [...] DB_HOST_PATH / DB_MOUNT_PATH in startup-script.sh
- [...] compute.instances.stop/start
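The env-aware gate described for check_digest.sh can be pictured with the sketch below: a deploy is needed when either the image reference or the env differs from what the instance metadata currently holds. This is a minimal illustration, not the script itself. needs_deploy and report_decision are hypothetical names, and the deploy=true/false output key is an assumption, since the description does not say what check_digest.sh writes to "$GITHUB_OUTPUT".

```bash
#!/usr/bin/env bash
set -eu

# Succeeds (exit 0) when a deploy is needed: the image or the env changed.
# deployed_* are the values read from ymax-container-image / ymax-container-env.
needs_deploy() {
  local deployed_image="$1" wanted_image="$2" deployed_env="$3" env_file="$4"
  local wanted_env=""
  if [ -f "$env_file" ]; then
    # $(...) strips trailing newlines, as it does for values captured from jq,
    # so both sides of the comparison are normalized the same way.
    wanted_env="$(cat "$env_file")"
  fi
  if [ "$deployed_image" != "$wanted_image" ]; then
    return 0
  fi
  if [ "$deployed_env" != "$wanted_env" ]; then
    return 0
  fi
  return 1
}

# Append the decision to an output file (for example "$GITHUB_OUTPUT").
# The key name "deploy" is an assumption made for this sketch.
report_decision() {
  local output_file="$1"
  shift
  if needs_deploy "$@"; then
    echo "deploy=true" >> "$output_file"
  else
    echo "deploy=false" >> "$output_file"
  fi
}
```

With this shape, an env-only change (same image digest, different .env.gcp) yields deploy=true, which is the case the previous image-only check missed.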