Skip to content

fix(planner): Use start script for planner deployment#12694

Open
usmanmani1122 wants to merge 1 commit into
masterfrom
use-container-start-script-for-planner-deployment
Open

fix(planner): Use start script for planner deployment#12694
usmanmani1122 wants to merge 1 commit into
masterfrom
use-container-start-script-for-planner-deployment

Conversation

@usmanmani1122

Copy link
Copy Markdown
Contributor

refs: #XXXX

Description

Google is shutting down the Compute Engine container startup agent (konlet) and the gce-container-declaration instance metadata it relies on: workflows that depend on them stop working on 2026-07-31, with support fully ending 2027-07-31.

Our ymax-planner deployment was built entirely on that mechanism — gcloud compute instances update-container (sets gce-container-declaration, konlet runs the container) plus a digest check that reads the same metadata key. This PR migrates that pipeline to a supported approach: a Docker container driven by a VM startup script, with no dependency on konlet.

The deploy scripts keep the same CLI signatures. Both workflows that call them (deploy-ymax1-planner.yml, docker.yml) are updated only to pass the env file into the digest check (check_digest.sh … "$GITHUB_OUTPUT" "./.env.gcp") so the new env-aware gate compares against the same .env.gcp that gets deployed.

Most critical files to review:

  • .github/scripts/startup-script.sh (new) — installed as the VM startup-script metadata. On every boot it reads the target image (ymax-container-image) and env (ymax-container-env) from metadata and runs the container with docker run --detach --restart always --network host --volume /var/lib/kv-store:/db_data. This restores the container after reboot/recreation (the role konlet used to play). It also installs a small systemd-timer watchdog that restarts the container if it stops producing logs for 2 minutes — covering the "alive but wedged" case that --restart always does not.
  • .github/scripts/deploy_vm.sh — drops update-container. Now removes the deprecated gce-container-declaration key, writes ymax-container-image + the startup-script, writes the env to ymax-container-env, and applies via a graceful stop/start (re-runs the startup script; avoids the SQLite-corruption risk of a hard reset).
  • .github/scripts/check_digest.sh — reads ymax-container-image instead of gce-container-declaration, and now also compares ymax-container-env so an env-only change triggers a redeploy (previously only the image was checked).

Security Considerations

  • The container env (sourced from the YMAX*_PLANNER_ENV GitHub secrets) is stored in the ymax-container-env instance metadata and written to /run/ymax-planner.env on the VM. This is readable by any principal with compute.instances.get on the project. This is equivalent to the prior exposure: the old gce-container-declaration also embedded the container env in instance metadata. No new external authority is introduced.
  • The deploy service account now needs compute.instances.stop / compute.instances.start (in addition to the setMetadata it already used). No SSH/IAP access to the VM is required.
  • The startup script and watchdog run as root on the VM (unchanged from how COS runs startup scripts).

Scaling Considerations

  • No meaningful change in resource consumption. A deploy now performs a full VM stop/start (brief downtime) rather than a container reset; update-container likewise restarted the instance, so this is comparable.
  • The watchdog runs once per minute and only inspects logs/docker inspect — negligible overhead.

Documentation Considerations

  • Backwards compatibility / data: existing konlet-backed VMs migrate on the first run of the new deploy_vm.sh (the gce-container-declaration key is removed and the startup script takes over). The persistent SQLite DB is preserved via the /var/lib/kv-store -> /db_data host-path mount, matching the previous konlet volume.
  • Any env vars that were baked into the original container declaration must now be present in the YMAX*_PLANNER_ENV secret, since the declaration is removed.

Testing Considerations

  • Validated all three scripts and the generated watchdog script with bash -n. Verified the startup-script heredoc renders the watchdog with values baked in (CONTAINER, STALE, /dev/null) while keeping runtime expansions literal.
  • Manually exercised deploy_vm.sh against a throwaway VM (ymax-test-planner): metadata set, deprecated key removed, graceful stop/start succeeded.
  • Both workflows pass the env file to check_digest.sh (… "$GITHUB_OUTPUT" "./.env.gcp"), matching the repo-root .env.gcp they write and the file deploy_vm.sh stores to metadata, so the env-aware skip/deploy decision is accurate.
  • Still to verify end-to-end: container comes up healthy with the correct env on ymax0, then ymax1; the watchdog restarts a deliberately-silenced container.

Upgrade Considerations

  • Run the migration deploy once per VM before 2026-07-31: ymax0 and ymax1
  • Before cutover, confirm YMAX0_PLANNER_ENV / YMAX1_PLANNER_ENV contain the complete env set, since env is now a full replace via metadata, not a per-key patch
  • Confirm each VM's DB mount path matches DB_HOST_PATH/DB_MOUNT_PATH in startup-script.sh
  • Verify the deploy SA has compute.instances.stop / start

@usmanmani1122 usmanmani1122 requested a review from Muneeb147 June 1, 2026 12:15
@usmanmani1122 usmanmani1122 self-assigned this Jun 1, 2026

@Muneeb147 Muneeb147 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @usmanmani1122 for taking this up. It was pending since a while as google plans to deprecate update-container soon..

Left some comments..and also we'd need to update the runbook:

Runbook:
https://docs.google.com/document/d/1Mcv8RMh9ni_tH-bsE7TJg-xL35pLRc5nhlPUH-
NVs88/edit?tab=t.0

Sections to update:

  1. Manual Deployment through gcloud (as Fallback)
  2. Updating or Adding Environment Variables

WATCHDOG_TIMER_SERVICE_NAME="container-watchdog.timer"
WATCHDOG_STALE="2m"

ENV_FILE="/run/${CONTAINER_NAME}.env"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is deploy_vm is making sure to put relevant env variables inside /run/ymax-planner.env file already?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No my bad, I see get_metadata "$ENV_NAME_ATTRIBUTE" > "$ENV_FILE" is doing the needful..

Comment on lines +23 to +25
CURRENT_ENV="$(
printf '%s' "$METADATA" | \
jq '.metadata.items[]? | select(.key=="ymax-container-env") | .value' --raw-output

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the first run, this will be null/empty right? As there is no metadata named ymax-container-env in the VM..

if test -f "$ENV_FILE"
then
gcloud compute instances add-metadata "$GCE_INSTANCE" \
--metadata-from-file "$ENV_NAME_ATTRIBUTE=$ENV_FILE" \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, we used to deliberately skip # --container-env-file "$ENV_FILE" as /.env.gcp is stale and not updated with actual envs of planner-vm.
.env.gcp gets populated from GH secret (which is stale). Now after this change, it'll deploy the stale envs by reading from the file..

So either we keep the GH secret updated with the env.. OR we keep the old behavior of ignoring the .env.gcp file and only rely on the metadata.. And for that, we might need to first set metadata ymax-container-env one time with currently deployed envs..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants