fix(slurm): resolve compute NodeAddrs on startup so batch jobs run on the allocated node #692
Merged
Experiments failed during execution: environment setup and job submission
succeeded and the job went active, but once the job finished, both the job and
the monitoring task flipped to FAILED.
Root cause is in the devstack SLURM cluster, not the orchestrator. slurmctld
resolves each node's NodeAddr exactly once, when it reads slurm.conf at startup.
In the compose stack the controller can come up before the compute containers
are DNS-resolvable (or while a prior run's container IPs are still cached), so it
latches onto a stale/crossed NodeName -> IP mapping and keeps it for its whole
lifetime. Batch-job launches for a job allocated to c1 were then delivered to
c2's slurmd, which rejects them ("Host c2 not in hostlist c1"); slurmctld reads
that as a node failure and requeues the job until it is held
(JobHoldMaxRequeue), so jobs never actually run to completion. (srun and getent
looked fine because they resolve live; only the controller's cached batch path
was affected.) A single `scontrol reconfigure` re-resolves every NodeAddr and
restores correct routing.
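A minimal sketch for spotting this failure mode, assuming the rejection message has the form quoted above and that held jobs are listed as `<jobid> <reason>` pairs (for example from `squeue -h -o '%i %r'`). The function names and the log path in the usage comment are hypothetical, not part of this change.

```shell
# Hypothetical helpers for diagnosing misrouted batch launches.

# Reads slurmd log lines on stdin and prints "<receiving-host> <allocated-hostlist>"
# for every line that contains the "Host X not in hostlist Y" rejection.
find_misroutes() {
  sed -n 's/.*Host \([^ ]*\) not in hostlist \([^ ]*\).*/\1 \2/p'
}

# Reads "<jobid> <reason>" lines on stdin and prints the ids of jobs that
# were held after exhausting their requeues.
held_by_requeue() {
  awk '$2 == "JobHoldMaxRequeue" { print $1 }'
}

# Usage (log path is an assumption; use wherever the compute node's slurmd logs go):
#   find_misroutes < /path/to/slurmd.log
#   squeue -h -o '%i %r' | held_by_requeue
```

Any output from `find_misroutes` means a launch reached a node other than the one the job was allocated to, which is the symptom described above.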
Changes:
- conf/slurm/docker-entrypoint.sh: the slurmctld role now backgrounds a small
reconciler that waits for the controller and every configured compute node to
be resolvable, then issues one `scontrol reconfigure` to re-resolve all
NodeAddrs against current DNS. Self-heals the startup race on every boot;
harmless when addresses were already correct.
- MonitoringTask: distinguish an infrastructure failure that SLURM requeues
(NODE_FAIL / BOOT_FAIL) from a genuinely terminal failure, and ride out a
short grace window before treating a sustained absence as failure, so a
transient infra hiccup is tracked through the job lifecycle instead of
aborting the experiment. The JOB_SUBMISSION -> MONITORING task structure is
unchanged.
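The reconciler described in the first item could look roughly like the sketch below. This is not the actual `docker-entrypoint.sh` code: the function names, the retry count and delay, and the extra `scontrol ping` check before reconfiguring are all assumptions made for illustration. The commands are held in variables so the logic can be exercised without a running cluster.

```shell
# Hypothetical reconciler sketch; the real entrypoint code is not shown in this PR text.
RESOLVE_CMD=${RESOLVE_CMD:-"getent hosts"}
PING_CMD=${PING_CMD:-"scontrol ping"}
RECONFIGURE_CMD=${RECONFIGURE_CMD:-"scontrol reconfigure"}

# Succeeds only if every host name passed as an argument resolves right now.
all_resolvable() {
  for _host in "$@"; do
    $RESOLVE_CMD "$_host" >/dev/null 2>&1 || return 1
  done
}

# Polls until the controller answers and all hosts resolve, then issues a
# single reconfigure. Usage: reconcile_node_addrs MAX_TRIES DELAY_SECONDS HOST...
reconcile_node_addrs() {
  _tries=$1
  _delay=$2
  shift 2
  _i=0
  while [ "$_i" -lt "$_tries" ]; do
    if $PING_CMD >/dev/null 2>&1 && all_resolvable "$@"; then
      echo "reconciler: re-resolving NodeAddrs for: $*"
      $RECONFIGURE_CMD
      return
    fi
    sleep "$_delay"
    _i=$((_i + 1))
  done
  echo "reconciler: gave up waiting for: $*" >&2
  return 1
}
```

In an entrypoint this would be started in the background before the controller is launched, for example `reconcile_node_addrs 60 2 slurmctld c1 c2 &`, where `slurmctld` stands in for the controller's host name and the numeric values are illustrative. Issuing the reconfigure once, after everything resolves, is what makes it harmless when the addresses were already correct.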
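The decision rule in the MonitoringTask item can be illustrated with a small shell sketch. The real MonitoringTask code is not shown here, so everything below is an assumption: the function name, the result labels, the 120-second grace window, and the exact set of states treated as terminal. Only the NODE_FAIL / BOOT_FAIL distinction and the idea of a grace window come from the description above.

```shell
# Hypothetical classification rule; not the actual MonitoringTask implementation.
GRACE_SECONDS=${GRACE_SECONDS:-120}   # assumed grace window

# Usage: classify_job STATE [SECONDS_ABSENT]
# STATE is a SLURM job state, or empty if the job is currently not visible.
# Prints one of: completed, requeued, failed, waiting, running.
classify_job() {
  case "$1" in
    COMPLETED)
      echo completed ;;
    NODE_FAIL|BOOT_FAIL)
      # Infrastructure failure that SLURM requeues: keep tracking the job.
      echo requeued ;;
    FAILED|CANCELLED*|TIMEOUT|OUT_OF_MEMORY)
      # Treated as genuinely terminal in this sketch.
      echo failed ;;
    "")
      # Job not visible: only a sustained absence counts as failure.
      if [ "${2:-0}" -ge "$GRACE_SECONDS" ]; then
        echo failed
      else
        echo waiting
      fi ;;
    *)
      echo running ;;
  esac
}
```

For example, `classify_job NODE_FAIL` prints `requeued`, so a caller would keep polling instead of aborting the experiment, while `classify_job "" 500` prints `failed` because the absence has outlasted the assumed grace window.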
Test plan:
- Cold start (full slurm image rebuild + container recreate): the reconciler
runs and logs the re-resolve; fresh batch jobs route correctly (c1 -> c1,
c2 -> c2) with zero wrong-node errors.
- End-to-end Echo experiment via the portal completes (environment setup ->
job submission -> monitoring -> data staging) with output staged back; the
SLURM job reaches COMPLETED on its allocated node.
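One way to check the last point by hand is sketched below, assuming the job is still visible to `scontrol show job` (finished jobs are only retained for a limited time) and that `Restarts=0` is an acceptable proxy for "no requeues". The function name is hypothetical and this check is not part of the change itself.

```shell
# Hypothetical check: reads `scontrol show job <id>` output on stdin and
# succeeds only if the job completed, was never restarted, and was allocated
# the expected node. Usage: scontrol show job -o <id> | job_clean_on c1
job_clean_on() {
  # Flatten multi-line output so each KEY=VALUE field is space-delimited.
  _fields=" $(tr '\n' ' ') "
  for _want in "JobState=COMPLETED" "Restarts=0" "NodeList=$1"; do
    case "$_fields" in
      *" $_want "*) ;;
      *) echo "missing: $_want" >&2; return 1 ;;
    esac
  done
}
```

A job caught by the original bug would fail this check, since it was requeued repeatedly and ended up held instead of reaching COMPLETED.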
Problem
Experiments failed during execution: environment setup and job submission succeeded and the job went active, but once the job finished, both the job and the monitoring task flipped to FAILED.
Root cause
The failure is in the devstack SLURM cluster, not the orchestrator. `slurmctld` resolves each node's `NodeAddr` exactly once, when it reads `slurm.conf` at startup. In the compose stack the controller can come up before the compute containers are DNS-resolvable (or while a prior run's container IPs are still cached), so it latches onto a stale/crossed `NodeName -> IP` mapping and keeps it for its whole lifetime.
Batch-job launches for a job allocated to `c1` were then delivered to `c2`'s slurmd, which rejects them ("Host c2 not in hostlist c1"). `slurmctld` reads this as a node failure and requeues the job until it is held (`JobHoldMaxRequeue`), so jobs never actually run to completion. (`srun` and `getent` looked fine because they resolve live; only the controller's cached batch-launch path was affected.) A single `scontrol reconfigure` re-resolves every `NodeAddr` and restores correct routing.
Changes
- `conf/slurm/docker-entrypoint.sh` (primary fix): the `slurmctld` role now backgrounds a small reconciler that waits for the controller and every configured compute node to be resolvable, then issues one `scontrol reconfigure` to re-resolve all `NodeAddr`s against current DNS. This self-heals the startup race on every boot and is a harmless no-op when the addresses were already correct.
- `MonitoringTask` (complementary hardening): distinguish an infrastructure failure that SLURM requeues (`NODE_FAIL` / `BOOT_FAIL`) from a genuinely terminal failure, and ride out a short grace window before treating a sustained absence as failure, so a transient infra hiccup is tracked through the job lifecycle instead of aborting the experiment. The `JOB_SUBMISSION -> MONITORING` task structure is unchanged.
Test plan
- Cold start (full slurm image rebuild + container recreate): the reconciler runs and logs the re-resolve; fresh batch jobs route correctly (`c1 -> c1`, `c2 -> c2`) with zero wrong-node errors.
- End-to-end Echo experiment via the portal completes (environment setup -> job submission -> monitoring -> data staging) with output staged back; the SLURM job reaches `COMPLETED` on its allocated node.