fix: gate SSH dispatch to Docker-capable workers #74

Open
kaiitunnz wants to merge 12 commits into main from kaiitunnz/fix/ssh-dispatch-capability-guard

kaiitunnz (Collaborator) commented Jun 15, 2026

Purpose

An SSH task could be dispatched to a worker that cannot run it, and then fail deep in the executor with "Docker is not available; cannot run SSH executor: ... FileNotFoundError(2, 'No such file or directory')". Two gaps combined: (1) the dispatcher filtered candidates only by hardware fit (hw_satisfies), never by whether a worker can actually service SSH; and (2) ssh_limits is None both when SSH is disabled and when SSH is enabled but uncapped, so it cannot stand in for an SSH-capable signal. In a multi-node cluster, a node missing ENABLE_SSH_BY_DEFAULT=true brings up workers without the Docker socket mounted, yet the scheduler still routes SSH tasks to them. The failure is non-deterministic because it depends on which worker the scheduler picks. The SSH executor then raises a controlled (non-retryable) error, so the task fails terminally even when an SSH-capable worker exists.
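
Gap (2) can be illustrated with a small sketch. The SSHLimits type, its max_sessions field, and the ssh_limits_for helper are hypothetical names invented for this illustration and are not taken from the codebase; the only point borrowed from the description is that both states collapse to None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SSHLimits:
    """Hypothetical stand-in for a per-worker SSH resource cap."""

    max_sessions: int


def ssh_limits_for(ssh_enabled: bool, session_cap: Optional[int]) -> Optional[SSHLimits]:
    """Return the limits a worker would report (illustrative logic only)."""
    if not ssh_enabled:
        return None  # SSH disabled: nothing to report
    if session_cap is None:
        return None  # SSH enabled but uncapped: also None
    return SSHLimits(max_sessions=session_cap)


disabled = ssh_limits_for(ssh_enabled=False, session_cap=None)
uncapped = ssh_limits_for(ssh_enabled=True, session_cap=None)
capped = ssh_limits_for(ssh_enabled=True, session_cap=4)

# The dispatcher cannot tell these two workers apart from ssh_limits alone.
print(disabled is None and uncapped is None)
```

Under these assumptions, a dispatcher that looks only at ssh_limits sees the same value for a worker that must never receive SSH tasks and for one that can take any number of them, which is why a separate capability signal is needed.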

Changes

  • Worker — probes the Docker daemon once at startup and advertises a WorkerCapabilities (ssh) at registration; the SSH executor and the probe now share a single Docker-connection helper.
  • Server — carries WorkerCapabilities on each registered worker and gates SSH-task selection on it (capability_satisfies, alongside the existing hardware check), failing fast with an SSH-specific message when no capable worker exists.
  • Shared / SDK — a WorkerCapabilities schema, also surfaced on the SDK Worker model so flowmesh worker list shows which workers are SSH-capable.
  • Docs — capability-gated SSH dispatch noted in the architecture and worker docs.
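
The server-side gate described above might look roughly like the following sketch. It uses plain dataclasses so it runs on its own; the real schema is likely defined differently, and Worker, Task, cpu_cores, NoCapableWorkerError, and select_worker are assumptions made for illustration. Only WorkerCapabilities, its ssh flag, capability_satisfies, and hw_satisfies are names that appear in this PR, and their bodies here are guesses.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass(frozen=True)
class WorkerCapabilities:
    # Restrictive default: a worker that reports nothing is not SSH-capable.
    ssh: bool = False


@dataclass(frozen=True)
class Worker:  # hypothetical registry entry
    worker_id: str
    cpu_cores: int
    capabilities: WorkerCapabilities = field(default_factory=WorkerCapabilities)


@dataclass(frozen=True)
class Task:  # hypothetical task shape
    kind: str  # e.g. "ssh" or "batch"
    cpu_cores: int


class NoCapableWorkerError(RuntimeError):
    """Hypothetical error raised when no worker advertises the capability."""


def capability_satisfies(worker: Worker, task: Task) -> bool:
    """Gate on which kinds of task a worker can take."""
    if task.kind == "ssh":
        return worker.capabilities.ssh
    return True  # other task kinds need no special capability in this sketch


def hw_satisfies(worker: Worker, task: Task) -> bool:
    """Gate on how big a task a worker can take (CPU only, for brevity)."""
    return worker.cpu_cores >= task.cpu_cores


def select_worker(workers: Sequence[Worker], task: Task) -> Optional[Worker]:
    capable = [w for w in workers if capability_satisfies(w, task)]
    if not capable:
        # Fail fast instead of dispatching and erroring inside the executor.
        raise NoCapableWorkerError(f"no registered worker can run {task.kind!r} tasks")
    for worker in capable:
        if hw_satisfies(worker, task):
            return worker
    return None  # capable workers exist, but none is large enough right now


fleet = [
    Worker("cpu-1", cpu_cores=8),  # reports no capabilities
    Worker("cpu-2", cpu_cores=4, capabilities=WorkerCapabilities(ssh=True)),
]
chosen = select_worker(fleet, Task(kind="ssh", cpu_cores=2))
print(chosen.worker_id)
```

In this sketch the capability check runs before the hardware check, so an SSH task skips cpu-1 even though it has more cores, and a fleet with no SSH-capable worker raises immediately rather than returning a candidate.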

Design

Capability is reported by the entity that actually runs SSH — the worker process — and reflects ground truth (Docker reachable) rather than echoing a server config flag, so it catches every cause of the original failure the same way: unmounted socket (the per-node ENABLE_SSH_BY_DEFAULT miss), dead daemon, or a missing docker-group gid. Capability and hardware stay separate concerns — capability_satisfies gates which kinds of task a worker can take, hw_satisfies gates how big. A worker that reports no capabilities is treated as incapable (strict default): during a rolling upgrade an SSH task waits for an upgraded worker rather than risking a wrong dispatch, which is the safe direction. WorkerCapabilities is a model rather than a bare boolean so future specialized executors can gate the same way without another wire-format change.
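
A minimal sketch of the two design points above, ground-truth probing and the strict default, is shown below. The function names probe_ssh_capability and parse_capabilities, the payload shape, and the injected ping_docker callable are all assumptions for illustration; in the actual worker the ping presumably goes through the shared Docker-connection helper.

```python
from collections.abc import Callable, Mapping
from typing import Any


def probe_ssh_capability(ping_docker: Callable[[], Any]) -> bool:
    """Report SSH capability from whether Docker is actually reachable.

    ping_docker is injected so the sketch runs without a Docker daemon;
    any exception (missing socket, dead daemon, permission error) means
    the worker is not SSH-capable.
    """
    try:
        ping_docker()
    except Exception:
        return False
    return True


def parse_capabilities(payload: Mapping[str, Any]) -> dict[str, bool]:
    """Parse a registration payload with a strict default.

    A worker that omits the field, or sends something malformed, is
    treated as incapable rather than assumed capable.
    """
    raw = payload.get("capabilities")
    if not isinstance(raw, Mapping):
        return {"ssh": False}
    return {"ssh": raw.get("ssh") is True}


def socket_missing() -> None:
    # Simulates the unmounted-socket case from the original failure.
    raise FileNotFoundError(2, "No such file or directory")


misconfigured = {"worker_id": "w1", "capabilities": {"ssh": probe_ssh_capability(socket_missing)}}
healthy = {"worker_id": "w2", "capabilities": {"ssh": probe_ssh_capability(lambda: True)}}
legacy = {"worker_id": "w0"}  # pre-upgrade worker that reports nothing

for registration in (misconfigured, healthy, legacy):
    print(registration["worker_id"], parse_capabilities(registration))
```

Because the probe reacts to any failure to reach Docker, the same code path covers every cause listed above, and the strict parse means a pre-upgrade worker is simply skipped for SSH tasks during a rolling upgrade.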

Test Plan

uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
uv run pre-commit run --all-files

End-to-end against a local single root + CPU worker on freshly built images carrying this branch (2 scenarios / 10 assertions). It targets what the unit suite cannot reach — the full worker-probe → registration → registry → dispatch path: (A) with only an SSH-incapable worker (enable_ssh=false, no Docker socket mounted), the SSH task is excluded from selection and fails fast instead of being dispatched and erroring inside the executor; (B) on a mixed fleet, the SSH task is routed to the SSH-capable worker, spawns a session container, and runs to completion.

Test Result

$ uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py   # All passed
$ uv run pre-commit run --all-files  # isort / black / ruff / mypy / codespell / gitleaks passed

End-to-end suite: 2 scenarios, 10/10 assertions passed — the incapable-only fleet failed fast, and the mixed fleet routed the task to the capable worker and ran it to DONE.


Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

kaiitunnz and others added 12 commits June 16, 2026 04:47
Carries the task capabilities a worker advertises to the dispatcher.
Defaults are restrictive so an unreporting worker is excluded rather than
assumed capable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Add a single docker_client()/docker_available() entry point and route the
SSH executor through it, so the executor's reachability check and the startup
capability probe share one definition. The "Docker is not available" guard now
points operators at ENABLE_SSH_BY_DEFAULT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The worker pings Docker at startup and reports a WorkerCapabilities to the
supervisor at registration, so the dispatcher can tell which workers can
actually service SSH tasks rather than inferring it from resource caps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Carry the worker-reported WorkerCapabilities through the registry and exclude
workers that do not advertise SSH from SSH-task selection, so a misconfigured
node no longer fails the task deep in the executor with a Docker error. An
unreporting worker parses as incapable (strict). When no SSH-capable worker
exists the task fails fast with an SSH-specific message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The supervisor WorkerInfo models containers the supervisor provisions; the
worker self-report flows through Redis into the registry Worker instead, so
this field was never populated or read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
docker_available() opened a client purely to ping the daemon; closing it
releases the HTTP connection pool so the helper is safe to call repeatedly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Mirror the server's WorkerCapabilities so SDK and CLI consumers can see which
workers are SSH-capable, matching how ssh_limits is already surfaced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The SSH executor reaches Docker through the shared docker_client() helper now,
so the tests patch that instead of the removed module-level docker import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Clears the strict server pip-audit: cryptography 47 (GHSA-537c-gmf6-5ccf)
and python-multipart 0.0.27 (GHSA-v9pg-7xvm-68hf, GHSA-5rvq-cxj2-64vf,
GHSA-6jv3-5f52-599m). Both are server-only; worker requirements are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Frame the capability mechanism as a general routing contract with SSH as an
example, and keep the SSH-specific mechanics in the worker README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Order capabilities ahead of ssh_limits consistently across the model,
signatures, and wire payload, trim the schema docstrings, and make the
no-worker dispatch message capability-agnostic rather than SSH-specific.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
kaiitunnz marked this pull request as ready for review June 16, 2026 04:48
kaiitunnz requested a review from timzsu as a code owner June 16, 2026 04:48