fix: gate SSH dispatch to Docker-capable workers #74
Open
kaiitunnz wants to merge 12 commits into
Carries the task capabilities a worker advertises to the dispatcher. Defaults are restrictive so an unreporting worker is excluded rather than assumed capable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Add a single docker_client()/docker_available() entry point and route the SSH executor through it, so the executor's reachability check and the startup capability probe share one definition. The "Docker is not available" guard now points operators at ENABLE_SSH_BY_DEFAULT. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The worker pings Docker at startup and reports a WorkerCapabilities to the supervisor at registration, so the dispatcher can tell which workers can actually service SSH tasks rather than inferring it from resource caps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Carry the worker-reported WorkerCapabilities through the registry and exclude workers that do not advertise SSH from SSH-task selection, so a misconfigured node no longer fails the task deep in the executor with a Docker error. An unreporting worker parses as incapable (strict). When no SSH-capable worker exists the task fails fast with an SSH-specific message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The supervisor WorkerInfo models containers the supervisor provisions; the worker self-report flows through Redis into the registry Worker instead, so this field was never populated or read. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
docker_available() opened a client purely to ping the daemon; closing it releases the HTTP connection pool so the helper is safe to call repeatedly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Mirror the server's WorkerCapabilities so SDK and CLI consumers can see which workers are SSH-capable, matching how ssh_limits is already surfaced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The SSH executor reaches Docker through the shared docker_client() helper now, so the tests patch that instead of the removed module-level docker import. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Clears the strict server pip-audit: cryptography 47 (GHSA-537c-gmf6-5ccf) and python-multipart 0.0.27 (GHSA-v9pg-7xvm-68hf, GHSA-5rvq-cxj2-64vf, GHSA-6jv3-5f52-599m). Both are server-only; worker requirements are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Frame the capability mechanism as a general routing contract with SSH as an example, and keep the SSH-specific mechanics in the worker README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Order capabilities ahead of ssh_limits consistently across the model, signatures, and wire payload, trim the schema docstrings, and make the no-worker dispatch message capability-agnostic rather than SSH-specific. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
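For readers following the commits above, here is a minimal sketch of what a shared docker_client()/docker_available() pair might look like. It is not the code from this branch. The client_factory parameter is a hypothetical addition so the probe can be exercised without a daemon. docker.from_env(), ping() and close() are the Docker SDK for Python calls assumed here.

```python
from contextlib import closing
from typing import Any, Callable


def docker_client() -> Any:
    """Build a Docker client from the environment (assumed helper shape)."""
    # Imported lazily so this module still loads where the SDK is not installed.
    import docker

    return docker.from_env()


def docker_available(client_factory: Callable[[], Any] = docker_client) -> bool:
    """Report whether the Docker daemon answers a ping.

    Any failure (SDK missing, socket not mounted, daemon down) counts as
    unavailable. client_factory is a hypothetical injection point for tests.
    """
    try:
        client = client_factory()
    except Exception:
        return False
    # closing() calls client.close(), releasing the HTTP connection pool,
    # whether or not the ping succeeds.
    with closing(client):
        try:
            client.ping()
        except Exception:
            return False
    return True
```

Because the client is closed on every path, a probe like this can be called repeatedly (for example at each registration) without accumulating open connections.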
Purpose
An SSH task could be dispatched to a worker that cannot run it, then fail deep in the executor with:

Docker is not available; cannot run SSH executor: ... FileNotFoundError(2, 'No such file or directory')

Two gaps combined: (1) the dispatcher filtered candidates only by hardware fit (hw_satisfies), never by whether a worker can actually service SSH; and (2) ssh_limits is None both when SSH is disabled and when SSH is enabled but uncapped, so it cannot stand in for an SSH-capable signal. In a multi-node cluster, a node missing ENABLE_SSH_BY_DEFAULT=true brings up workers without the Docker socket mounted, yet the scheduler still routes SSH tasks to them (non-deterministically, depending on which worker it picks), and the SSH executor raises a controlled (non-retryable) error, so the task fails terminally even when an SSH-capable worker exists.

Changes
- The worker reports WorkerCapabilities(ssh) at registration; the SSH executor and the probe now share a single Docker-connection helper.
- The server carries WorkerCapabilities on each registered worker and gates SSH-task selection on it (capability_satisfies, alongside the existing hardware check), failing fast with an SSH-specific message when no capable worker exists.
- WorkerCapabilities schema, also surfaced on the SDK Worker model so flowmesh worker list shows which workers are SSH-capable.

Design
Capability is reported by the entity that actually runs SSH, the worker process, and reflects ground truth (Docker reachable) rather than echoing a server config flag, so it catches every cause of the original failure the same way: unmounted socket (the per-node ENABLE_SSH_BY_DEFAULT miss), dead daemon, or a missing docker-group gid. Capability and hardware stay separate concerns: capability_satisfies gates which kinds of task a worker can take, hw_satisfies gates how big. A worker that reports no capabilities is treated as incapable (strict default): during a rolling upgrade an SSH task waits for an upgraded worker rather than risking a wrong dispatch, which is the safe direction. WorkerCapabilities is a model rather than a bare boolean so future specialized executors can gate the same way without another wire-format change.

Test Plan
End-to-end against a local single root + CPU worker on freshly built images carrying this branch (2 scenarios / 10 assertions). It targets what the unit suite cannot reach, the full worker-probe → registration → registry → dispatch path:

(A) with only an SSH-incapable worker (enable_ssh=false, no Docker socket mounted), the worker is excluded from selection and the SSH task fails fast instead of being dispatched and erroring inside the executor;
(B) on a mixed fleet, the SSH task is routed to the SSH-capable worker, spawns a session container, and runs to completion.

Test Result
End-to-end suite: 2 scenarios, 10/10 assertions passed. The incapable-only fleet failed fast, and the mixed fleet routed the task to the capable worker and ran it to DONE.

Pre-submission Checklist
- pre-commit run --all-files and fixed any issues.
- uv run pytest tests/ passes locally.
- uv sync --all-packages --group ci --frozen).
- [BREAKING] and described migration steps above.
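As an illustration of the Design section, the sketch below shows one way the two gates could compose at selection time, with an unreporting worker parsing as incapable. Only WorkerCapabilities, its ssh field, capability_satisfies and hw_satisfies are names from this PR. The Worker and Task shapes, parse_capabilities, select_worker, the cpus field and the error messages are hypothetical stand-ins, and plain dataclasses are used here without claiming that is what the branch uses.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Sequence


@dataclass(frozen=True)
class WorkerCapabilities:
    ssh: bool = False  # restrictive default: incapable unless reported


@dataclass(frozen=True)
class Worker:  # hypothetical registry entry
    name: str
    cpus: int
    capabilities: WorkerCapabilities = field(default_factory=WorkerCapabilities)


@dataclass(frozen=True)
class Task:  # hypothetical task shape
    cpus: int
    needs_ssh: bool = False


def parse_capabilities(payload: Optional[Mapping[str, Any]]) -> WorkerCapabilities:
    """A missing or malformed self-report parses as incapable (strict)."""
    if not payload:
        return WorkerCapabilities()
    return WorkerCapabilities(ssh=payload.get("ssh") is True)


def capability_satisfies(worker: Worker, task: Task) -> bool:
    """Gate on which kinds of task the worker can take."""
    return worker.capabilities.ssh or not task.needs_ssh


def hw_satisfies(worker: Worker, task: Task) -> bool:
    """Gate on how big a task the worker can take."""
    return worker.cpus >= task.cpus


def select_worker(workers: Sequence[Worker], task: Task) -> Worker:
    capable = [w for w in workers if capability_satisfies(w, task)]
    if not capable:
        # Fail fast at dispatch instead of erroring inside the executor.
        raise LookupError("no registered worker advertises the required capabilities")
    fitting = [w for w in capable if hw_satisfies(w, task)]
    if not fitting:
        raise LookupError("no capable worker satisfies the hardware request")
    return fitting[0]
```

Keeping the two predicates separate means a future executor type would only need a new field on WorkerCapabilities and a matching clause in capability_satisfies, leaving the hardware check untouched.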