fix(slurm): resolve compute NodeAddrs on startup so batch jobs run on the allocated node #692

Merged
yasithdev merged 1 commit into apache:master from yasithdev:fix-experiment-execution on Jun 14, 2026

@yasithdev

Contributor

Problem

Experiments failed during execution: environment setup and job submission succeeded and the job went active, but once the job finished, both the job and the monitoring task flipped to FAILED.

Root cause

The failure is in the devstack SLURM cluster, not the orchestrator. slurmctld resolves each node's NodeAddr exactly once, when it reads slurm.conf at startup. In the compose stack the controller can come up before the compute containers are DNS-resolvable (or while a prior run's container IPs are still cached), so it latches onto a stale/crossed NodeName -> IP mapping and keeps it for its whole lifetime.

Batch-job launches for a job allocated to c1 were then delivered to c2's slurmd, which rejects them:

[ctld] Allocate JobId=N NodeList=c1
[ctld] error: Batch completion for JobId=N sent from wrong node (c2 rather than c1)
[c2]   Launching batch job N ... error: Host c2 not in hostlist c1

slurmctld reads this as a node failure and requeues the job until it is held (JobHoldMaxRequeue), so jobs never actually run to completion. (srun and getent looked fine because they resolve live; only the controller's cached batch-launch path was affected.) A single scontrol reconfigure re-resolves every NodeAddr and restores correct routing.
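
The misrouting can be spotted directly in the controller log. The following is a minimal sketch, assuming the log lines keep the shape quoted above; the function name `wrong_node_errors` is hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): read slurmctld log text on stdin
# and print "JOBID ACTUAL_NODE EXPECTED_NODE" for every wrong-node error.
# Assumes the message format quoted in the PR description:
#   error: Batch completion for JobId=N sent from wrong node (c2 rather than c1)
wrong_node_errors() {
    sed -n 's/.*Batch completion for JobId=\([^ ]*\) sent from wrong node (\([^ ]*\) rather than \([^)]*\)).*/\1 \2 \3/p'
}
```

Piping the controller's log into `wrong_node_errors` should print nothing on a healthy cluster; any output names the job, the node that answered, and the node that was allocated.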

Changes

  • conf/slurm/docker-entrypoint.sh (primary fix): the slurmctld role now backgrounds a small reconciler that waits for the controller and every configured compute node to be resolvable, then issues one scontrol reconfigure to re-resolve all NodeAddrs against current DNS. This self-heals the startup race on every boot and is a harmless no-op when the addresses were already correct.
  • MonitoringTask (complementary hardening): distinguish an infrastructure failure that SLURM requeues (NODE_FAIL / BOOT_FAIL) from a genuinely terminal failure, and ride out a short grace window before treating a sustained absence as failure, so a transient infra hiccup is tracked through the job lifecycle instead of aborting the experiment. The JOB_SUBMISSION -> MONITORING task structure is unchanged.
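
For readers who want the shape of the reconciler without opening the diff, here is a rough sketch of the wait-then-reconfigure idea. It is not the code from `conf/slurm/docker-entrypoint.sh`: the function names, the `RECONCILE_TRIES` and `RECONCILE_DELAY` knobs, and the log text are all assumptions made for illustration. Only `getent hosts` and `scontrol reconfigure` are real commands.

```shell
#!/bin/sh
# Illustrative sketch only; names and defaults below are assumptions.

# Succeed only when every host name given as an argument resolves right now.
all_resolvable() {
    for host in "$@"; do
        getent hosts "$host" >/dev/null 2>&1 || return 1
    done
    return 0
}

# Poll until all hosts resolve, then issue exactly one `scontrol reconfigure`
# so slurmctld re-resolves its NodeAddrs. Gives up after RECONCILE_TRIES polls.
reconcile_node_addrs() {
    tries=${RECONCILE_TRIES:-60}   # assumed default: number of polls
    delay=${RECONCILE_DELAY:-2}    # assumed default: seconds between polls
    while [ "$tries" -gt 0 ]; do
        if all_resolvable "$@"; then
            scontrol reconfigure && echo "reconciler: re-resolved NodeAddrs for: $*"
            return $?
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    echo "reconciler: gave up waiting for: $*" >&2
    return 1
}
```

In an entrypoint, the slurmctld role could call something like `reconcile_node_addrs <controller> c1 c2 &` before starting the daemon in the foreground, where the host list is whatever the stack configures. A real version would likely also wait for the controller to accept `scontrol` requests before reconfiguring.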

Test plan

  • Cold start (full slurm image rebuild + container recreate): the reconciler runs and logs the re-resolve; fresh batch jobs route correctly (c1 -> c1, c2 -> c2) with zero wrong-node errors.
  • End-to-end Echo experiment via the portal completes (environment setup -> job submission -> monitoring -> data staging) with output staged back; the SLURM job reaches COMPLETED on its allocated node.
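
The "COMPLETED on its allocated node" check can be scripted. This is a sketch under assumptions: `job_completed_on` is a hypothetical helper, and it relies on the `JobState=` and `BatchHost=` fields of `scontrol -o show job`, which are only available while slurmctld still holds the job record.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR).
# Usage: job_completed_on JOBID NODE
# Succeeds only if the job's state is COMPLETED and its batch host is NODE.
job_completed_on() {
    # -o prints the whole job record on one line as space-separated KEY=VALUE pairs.
    line=$(scontrol -o show job "$1") || return 1
    case " $line " in
        *" JobState=COMPLETED "*) ;;
        *) return 1 ;;
    esac
    case " $line " in
        *" BatchHost=$2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `job_completed_on "$jobid" c1` exits non-zero if the job ran elsewhere, is still running, or failed.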

… the allocated node

Experiments failed during execution: environment setup and job submission
succeeded and the job went active, but once it finished both the job and the
monitoring task flipped to FAILED.

Root cause is in the devstack SLURM cluster, not the orchestrator. slurmctld
resolves each node's NodeAddr exactly once, when it reads slurm.conf at startup.
In the compose stack the controller can come up before the compute containers
are DNS-resolvable (or while a prior run's container IPs are still cached), so it
latches onto a stale/crossed NodeName -> IP mapping and keeps it for its whole
lifetime. Batch-job launches for a job allocated to c1 were then delivered to
c2's slurmd, which rejects them ("Host c2 not in hostlist c1"); slurmctld reads
that as a node failure and requeues the job until it is held
(JobHoldMaxRequeue), so jobs never actually run to completion. (srun and getent
looked fine because they resolve live; only the controller's cached batch path
was affected.) A single `scontrol reconfigure` re-resolves every NodeAddr and
restores correct routing.

Changes:

- conf/slurm/docker-entrypoint.sh: the slurmctld role now backgrounds a small
  reconciler that waits for the controller and every configured compute node to
  be resolvable, then issues one `scontrol reconfigure` to re-resolve all
  NodeAddrs against current DNS. Self-heals the startup race on every boot;
  harmless when addresses were already correct.

- MonitoringTask: distinguish an infrastructure failure that SLURM requeues
  (NODE_FAIL / BOOT_FAIL) from a genuinely terminal failure, and ride out a
  short grace window before treating a sustained absence as failure, so a
  transient infra hiccup is tracked through the job lifecycle instead of
  aborting the experiment. The JOB_SUBMISSION -> MONITORING task structure is
  unchanged.
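
The MonitoringTask distinction can be pictured as a small decision table. The sketch below is written in shell purely for illustration and is not the MonitoringTask code; the function names, the decision labels, and the exact set of states treated as terminal are assumptions. The state names themselves are standard SLURM job states.

```shell
#!/bin/sh
# Illustrative only; labels and groupings are assumptions, not the PR's code.

# Map a SLURM job state to a monitoring decision.
classify_job_state() {
    case "$1" in
        NODE_FAIL|BOOT_FAIL) echo requeue ;;    # infra failure that SLURM requeues: keep tracking
        COMPLETED) echo done ;;
        FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY|DEADLINE) echo terminal ;;
        *) echo active ;;                        # PENDING, RUNNING, and anything else
    esac
}

# Decide whether a job that has gone missing from the scheduler counts as failed.
# $1 = seconds since the job was last seen, $2 = grace window in seconds (assumed knob).
absence_is_failure() {
    [ "$1" -gt "$2" ]
}
```

With this split, a `NODE_FAIL` keeps the monitor attached to the job while SLURM requeues it, and a brief disappearance shorter than the grace window is not reported as a failure.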

Test plan:

- Cold start (full slurm image rebuild + container recreate): the reconciler
  runs and logs the re-resolve; fresh batch jobs route correctly (c1 -> c1,
  c2 -> c2) with zero wrong-node errors.
- End-to-end Echo experiment via the portal completes (environment setup ->
  job submission -> monitoring -> data staging) with output staged back; the
  SLURM job reaches COMPLETED on its allocated node.

@yasithdev merged commit 49dcf41 into apache:master on Jun 14, 2026
6 of 7 checks passed