fix(slurm): resolve compute NodeAddrs on startup so batch jobs run on the allocated node by yasithdev · Pull Request #692 · apache/airavata

yasithdev · 2026-06-14T17:06:09Z

Problem

Experiments failed during execution: environment setup and job submission succeeded and the job went active, but once the job finished, both the job and the monitoring task flipped to FAILED.

Root cause

The failure is in the devstack SLURM cluster, not the orchestrator. slurmctld resolves each node's NodeAddr exactly once, when it reads slurm.conf at startup. In the compose stack the controller can come up before the compute containers are DNS-resolvable (or while a prior run's container IPs are still cached), so it latches onto a stale/crossed NodeName -> IP mapping and keeps it for its whole lifetime.

Batch-job launches for a job allocated to c1 were then delivered to c2's slurmd, which rejects them:

[ctld] Allocate JobId=N NodeList=c1
[ctld] error: Batch completion for JobId=N sent from wrong node (c2 rather than c1)
[c2]   Launching batch job N ... error: Host c2 not in hostlist c1

slurmctld reads this as a node failure and requeues the job until it is held (JobHoldMaxRequeue), so jobs never actually run to completion. (srun and getent looked fine because they resolve live; only the controller's cached batch-launch path was affected.) A single scontrol reconfigure re-resolves every NodeAddr and restores correct routing.

Changes

- conf/slurm/docker-entrypoint.sh (primary fix): the slurmctld role now backgrounds a small reconciler that waits for the controller and every configured compute node to be resolvable, then issues one scontrol reconfigure to re-resolve all NodeAddrs against current DNS. This self-heals the startup race on every boot and is a harmless no-op when the addresses were already correct.
- MonitoringTask (complementary hardening): distinguish an infrastructure failure that SLURM requeues (NODE_FAIL / BOOT_FAIL) from a genuinely terminal failure, and ride out a short grace window before treating a sustained absence as failure, so a transient infra hiccup is tracked through the job lifecycle instead of aborting the experiment. The JOB_SUBMISSION -> MONITORING task structure is unchanged.
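
For readers who want a feel for the reconciler described above, here is a minimal sketch. It is not the code from conf/slurm/docker-entrypoint.sh. The function names, the RESOLVE_CMD / RECONFIGURE_CMD / MAX_TRIES / RETRY_DELAY variables, and the host names in the usage line are all hypothetical. It also treats "waits for the controller" as hostname resolution only, which is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of a NodeAddr reconciler; not the merged entrypoint code.
# The commands are held in variables so they can be swapped out (e.g. in tests).
RESOLVE_CMD=${RESOLVE_CMD:-"getent hosts"}
RECONFIGURE_CMD=${RECONFIGURE_CMD:-"scontrol reconfigure"}

# Poll until the given host name resolves, or give up after MAX_TRIES attempts.
wait_resolvable() {
  host=$1
  tries=${MAX_TRIES:-60}
  while [ "$tries" -gt 0 ]; do
    if $RESOLVE_CMD "$host" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "${RETRY_DELAY:-1}"
  done
  return 1
}

# Wait for every listed host, then ask slurmctld to re-read its configuration once.
reconcile_node_addrs() {
  for h in "$@"; do
    if ! wait_resolvable "$h"; then
      echo "reconciler: $h never became resolvable" >&2
      return 1
    fi
  done
  echo "reconciler: all hosts resolvable, re-resolving NodeAddrs"
  $RECONFIGURE_CMD
}
```

In an entrypoint this would presumably be started in the background just before the controller is launched, for example `reconcile_node_addrs slurmctld c1 c2 &`, where the host list is an assumption about the compose service names. A real version would likely also need to wait until slurmctld accepts RPCs before calling scontrol.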

Test plan

- Cold start (full slurm image rebuild + container recreate): the reconciler runs and logs the re-resolve; fresh batch jobs route correctly (c1 -> c1, c2 -> c2) with zero wrong-node errors.
- End-to-end Echo experiment via the portal completes (environment setup -> job submission -> monitoring -> data staging) with output staged back; the SLURM job reaches COMPLETED on its allocated node.
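
The "zero wrong-node errors" check can be scripted against the controller log. The sketch below is one possible way to do it, keyed on the error text quoted under Root cause. Both function names are hypothetical, and where the slurmctld log lives in the devstack is left to the caller.

```shell
#!/bin/sh
# Hypothetical helpers for scanning a slurmctld log supplied on stdin.

# Print the number of lines reporting a batch completion from the wrong node.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_wrong_node_errors() {
  grep -c 'sent from wrong node' || true
}

# Print "JOBID ACTUAL_NODE EXPECTED_NODE" for each wrong-node error line.
list_wrong_node_errors() {
  sed -n 's/.*JobId=\([^ ]*\) sent from wrong node (\([^ ]*\) rather than \([^ )]*\)).*/\1 \2 \3/p'
}
```

Feeding the controller log into `count_wrong_node_errors` after a cold start should print 0 when routing is correct. `list_wrong_node_errors` shows which job was delivered to which node otherwise.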


yasithdev merged commit 49dcf41 into apache:master Jun 14, 2026
6 of 7 checks passed

yasithdev merged 1 commit into apache:master from yasithdev:fix-experiment-execution
