fix(proxy): filter on-demand wake/sleep by node_id + stop leaking detail in 503 bodies#129
Merged
Merged
Conversation
…ail in 503 bodies Two follow-ups from PR #124. #126 — do_wake/sleep_environment loaded ALL deployment_containers with no node_id filter and started/stopped every one via the local Docker daemon. On a multi-node cluster, remote-owned containers don't exist locally, so the start failed and the whole wake reverted — scale-to-zero only worked on single-node deployments. OnDemandManager now carries a local_node_id; a container is local iff node_id IS NULL or == local_node_id. Wake starts only local containers (skips remote with a warning); a fully-remote env errors and reverts instead of falsely reporting success. Sleep is symmetric. Control plane passes None (its local containers carry node_id=NULL). #127 — the on-demand 503 bodies are served to unauthenticated clients. Dropped environment_id from all three (wake_throttled/wake_pending/wake_failed) and stopped interpolating the OnDemandError Display string into wake_failed; the detail is logged server-side. Clients key retries off Retry-After. Reworked the two pre-existing multi-node tests that asserted the buggy "start/stop all" behavior; added is_local_container, local-only wake/sleep, and all-remote-errors tests. 81 on_demand tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two follow-ups from PR #124 (on-demand first-request 503 fix). Closes #126, closes #127.
#126 — multi-node wake/sleep started/stopped ALL containers locally
OnDemandManager::do_wakeandsleep_environmentloaded everydeployment_containersrow for the deployment with nonode_idfilter, then started/stopped each one via the local DockerContainerLifecycle. On a multi-node cluster, containers whosenode_idpoints at a remote worker don't exist on the local daemon →start_containerfails → the partial-wake revert fires → the request gets thewake_failed503. So scale-to-zero wake only worked correctly on single-node deployments.Fix:
OnDemandManagernow carrieslocal_node_id: Option<i32>. A container is local iffnode_id IS NULL(the deploy pipeline only stamps anode_idfor remote workers, so control-plane-local containers carry none) ornode_id == local_node_id.do_wakepartitions containers and starts only the local set. Remote-owned containers are skipped with awarn!(the worker-side wake RPC doesn't exist yet). A fully-remote environment has nothing this proxy can start, so it revertssleeping=falseand returns an error instead of falsely reporting a successful wake.sleep_environmentis symmetric: it stops only local containers and leaves remote ones for their own node's idle sweep.local_node_id = None(it has no self node row; its locally-deployed containers carrynode_id = NULL).#127 — 503 bodies leaked environment_id and internal error strings
The on-demand wake 503 responses in
proxy.rsare served to unauthenticated clients (a sleeping env has no auth context yet) and disclosed internal detail.Fix:
environment_idfrom all three 503 bodies (wake_throttled,wake_pending,wake_failed) — clients key retries offRetry-After, not the id.wake_failedno longer interpolatese.to_string()(theOnDemandErrorDisplay string can carry container/deployment context). It returns a static message; the detailed error is still logged server-side viaerror!(... error = %e ...). This also removes the fragile.replace('"', ...)JSON escaping.Tests
test_wake_multiple_containers_multi_node,test_sleep_stops_multiple_containers) plus the lifecycle test, to assert local-only behavior.test_is_local_container_node_filter,test_wake_starts_only_local_containers_skips_remote,test_wake_all_remote_containers_errors_and_reverts,test_sleep_stops_only_local_containers_skips_remote.cargo test --lib -p temps-proxy— 311 passed.cargo check --bin tempsclean.