fix(proxy): filter on-demand wake/sleep by node_id + stop leaking detail in 503 bodies#129

Merged
dviejokfs merged 1 commit into main from fix/on-demand-wake-followups
Jun 11, 2026
@dviejokfs

Two follow-ups from PR #124 (on-demand first-request 503 fix). Closes #126, closes #127.

#126 — multi-node wake/sleep started/stopped ALL containers locally

OnDemandManager::do_wake and sleep_environment loaded every deployment_containers row for the deployment with no node_id filter, then started/stopped each one via the local Docker ContainerLifecycle. On a multi-node cluster, containers whose node_id points at a remote worker don't exist on the local daemon → start_container fails → the partial-wake revert fires → the request gets the wake_failed 503. So scale-to-zero wake only worked correctly on single-node deployments.

Fix:

  • OnDemandManager now carries local_node_id: Option<i32>. A container is local iff node_id IS NULL (the deploy pipeline only stamps a node_id for remote workers, so control-plane-local containers carry none) or node_id == local_node_id.
  • do_wake partitions containers and starts only the local set. Remote-owned containers are skipped with a warn! (the worker-side wake RPC doesn't exist yet). A fully-remote environment has nothing this proxy can start, so it reverts sleeping=false and returns an error instead of falsely reporting a successful wake.
  • sleep_environment is symmetric: it stops only local containers and leaves remote ones for their own node's idle sweep.
  • The control plane passes local_node_id = None (it has no self node row; its locally-deployed containers carry node_id = NULL).
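The locality rule in the bullets above can be sketched as a small pure function. This is only an illustration of the stated rule (local iff node_id IS NULL or node_id == local_node_id); the function name is borrowed from the test name test_is_local_container_node_filter, and the signature is an assumption, not the actual code in temps-proxy.

```rust
/// Sketch of the locality predicate described in the PR text.
/// The signature is hypothetical; only the rule itself comes from the PR.
pub fn is_local_container(node_id: Option<i32>, local_node_id: Option<i32>) -> bool {
    match node_id {
        // No node_id stamped: the deploy pipeline placed it on the control plane.
        None => true,
        // Stamped: local only if this process runs on that exact node.
        Some(id) => local_node_id == Some(id),
    }
}
```

Under this rule a control plane that passes local_node_id = None treats every stamped container as remote, which matches the last bullet.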

Note: remote containers are not woken by this change — the worker-side wake RPC is a separate, larger multi-node feature. This fix makes the local path correct and honest (warn + skip / error) instead of silently breaking every multi-node wake.
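A minimal sketch of the partition-then-start flow, assuming simplified types: ContainerRow, WakeError, and plan_wake are hypothetical names for illustration, and eprintln! stands in for the warn! call. The real do_wake also talks to Docker and reverts the sleeping flag, which is omitted here. An empty container list is not addressed by the PR text; this sketch simply returns an empty local set for it.

```rust
/// Hypothetical stand-in for a `deployment_containers` row.
#[derive(Debug, Clone, PartialEq)]
pub struct ContainerRow {
    pub container_id: String,
    pub node_id: Option<i32>,
}

/// Hypothetical error type; the real OnDemandError is not shown in the PR.
#[derive(Debug, PartialEq)]
pub enum WakeError {
    /// Every container belongs to a remote node, so nothing can be started here.
    AllContainersRemote { remote_count: usize },
}

/// Returns the containers this node should start, or an error when the
/// environment is fully remote.
pub fn plan_wake(
    containers: Vec<ContainerRow>,
    local_node_id: Option<i32>,
) -> Result<Vec<ContainerRow>, WakeError> {
    // Split rows by the locality rule: NULL node_id or a match on local_node_id.
    let (local, remote): (Vec<ContainerRow>, Vec<ContainerRow>) =
        containers.into_iter().partition(|c| match c.node_id {
            None => true,
            Some(id) => local_node_id == Some(id),
        });

    // Remote-owned containers are skipped with a warning, not started.
    for c in &remote {
        eprintln!(
            "skipping remote container {} (node_id = {:?})",
            c.container_id, c.node_id
        );
    }

    // A fully-remote environment must not be reported as a successful wake.
    if local.is_empty() && !remote.is_empty() {
        return Err(WakeError::AllContainersRemote {
            remote_count: remote.len(),
        });
    }
    Ok(local)
}
```

In this sketch the caller would start only the returned rows and, on AllContainersRemote, clear the sleeping flag before surfacing the error.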

#127 — 503 bodies leaked environment_id and internal error strings

The on-demand wake 503 responses in proxy.rs are served to unauthenticated clients (a sleeping env has no auth context yet) and disclosed internal detail.

Fix:

  • Dropped environment_id from all three 503 bodies (wake_throttled, wake_pending, wake_failed) — clients key retries off Retry-After, not the id.
  • wake_failed no longer interpolates e.to_string() (the OnDemandError Display string can carry container/deployment context). It returns a static message; the detailed error is still logged server-side via error!(... error = %e ...). This also removes the fragile .replace('"', ...) JSON escaping.
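The wake_failed change can be illustrated with a framework-free sketch. WakeResponse and wake_failed_503 are hypothetical names, the JSON shape and message text are assumptions (only the wake_failed code and the use of Retry-After come from the PR), and eprintln! stands in for the error! log call.

```rust
use std::fmt::Display;

/// Hypothetical, framework-agnostic view of the response the proxy would send.
#[derive(Debug, PartialEq)]
pub struct WakeResponse {
    pub status: u16,
    /// Value for the Retry-After header, in seconds.
    pub retry_after_secs: u64,
    /// Static body: no environment_id and no interpolated error text.
    pub body: &'static str,
}

/// Builds the wake_failed 503. The detailed error goes to the server log only.
pub fn wake_failed_503<E: Display>(err: &E, retry_after_secs: u64) -> WakeResponse {
    // Server-side detail; never echoed back to the unauthenticated client.
    eprintln!("on-demand wake failed: error = {}", err);

    WakeResponse {
        status: 503,
        retry_after_secs,
        // Assumed message text. Being a literal, it needs no JSON escaping.
        body: r#"{"error":"wake_failed","message":"The service could not be started. Please retry."}"#,
    }
}
```

Because the body is a constant, nothing from the error's Display output can reach the client, and the manual quote escaping becomes unnecessary.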

Tests

  • Reworked the two pre-existing tests that asserted the buggy "start/stop all" behavior (test_wake_multiple_containers_multi_node, test_sleep_stops_multiple_containers) plus the lifecycle test, to assert local-only behavior.
  • Added: test_is_local_container_node_filter, test_wake_starts_only_local_containers_skips_remote, test_wake_all_remote_containers_errors_and_reverts, test_sleep_stops_only_local_containers_skips_remote.
  • cargo test --lib -p temps-proxy — 311 passed. cargo check --bin temps clean.

…ail in 503 bodies

Two follow-ups from PR #124.

#126 — do_wake/sleep_environment loaded ALL deployment_containers with no
node_id filter and started/stopped every one via the local Docker daemon.
On a multi-node cluster, remote-owned containers don't exist locally, so the
start failed and the whole wake reverted — scale-to-zero only worked on
single-node deployments. OnDemandManager now carries a local_node_id; a
container is local iff node_id IS NULL or == local_node_id. Wake starts only
local containers (skips remote with a warning); a fully-remote env errors and
reverts instead of falsely reporting success. Sleep is symmetric. Control
plane passes None (its local containers carry node_id=NULL).

#127 — the on-demand 503 bodies are served to unauthenticated clients. Dropped
environment_id from all three (wake_throttled/wake_pending/wake_failed) and
stopped interpolating the OnDemandError Display string into wake_failed; the
detail is logged server-side. Clients key retries off Retry-After.

Reworked the two pre-existing multi-node tests that asserted the buggy
"start/stop all" behavior; added is_local_container, local-only wake/sleep,
and all-remote-errors tests. 81 on_demand tests pass.
@dviejokfs dviejokfs merged commit fa33353 into main Jun 11, 2026
10 of 11 checks passed