clusterduck

clusterduck is a hydra launcher plugin for running jobs on a SLURM cluster.

Unlike hydra-submitit, clusterduck also supports "batching" multiple tasks within one job. This may be useful if:

your cluster only allocates exclusive nodes with multiple GPUs, but your tasks only use a single GPU
you have hundreds of small jobs but your cluster imposes a (potentially project-wide) limit on the number of queued jobs

In addition, clusterduck does not wait for your job to finish after submission!

Installation

Install clusterduck by running pip from the root of the repository:

pip install .

Developers should note that Hydra plugins are not compatible with new PEP 660-style editable installs. In order to perform an editable install, either use compatibility mode:

pip install -e . --config-settings editable_mode=compat

or use strict editable mode:

pip install -e . --config-settings editable_mode=strict

Be aware that strict mode installs do not expose new files created in the project until the installation is performed again.

Usage

clusterduck essentially allows you to generate sbatch files programmatically. We generate an sbatch script containing a single srun command that calls python. Since every cluster is different, we do not try to be clever, but instead let the user set the arguments for sbatch and srun transparently. Unless overridden, srun inherits the sbatch arguments that were used to launch the job.

Each hydra override becomes a slurm task. One or more tasks may be grouped into a slurm job, and one or more jobs may be grouped into a slurm job array. By default, we use one task per job and submit a job array if there are multiple tasks to run.
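The task/job/array hierarchy described above can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not clusterduck's actual implementation; the helper name group_tasks_into_jobs and its tasks_per_job parameter are hypothetical.

```python
from typing import List, Sequence


def group_tasks_into_jobs(
    tasks: Sequence[str], tasks_per_job: int = 1
) -> List[List[str]]:
    """Split a flat list of tasks into consecutive jobs (hypothetical helper)."""
    if tasks_per_job < 1:
        raise ValueError("tasks_per_job must be at least 1")
    return [
        list(tasks[start : start + tasks_per_job])
        for start in range(0, len(tasks), tasks_per_job)
    ]


# Four hydra overrides, i.e. four slurm tasks.
overrides = [
    "model=convnet iteration=0",
    "model=convnet iteration=1",
    "model=transformer iteration=0",
    "model=transformer iteration=1",
]

# Default: one task per job, so the job array holds four jobs.
default_jobs = group_tasks_into_jobs(overrides)

# Batched: two tasks per job, so the job array holds two jobs.
batched_jobs = group_tasks_into_jobs(overrides, tasks_per_job=2)

print(len(default_jobs), len(batched_jobs))
```

With batching, the same set of tasks occupies fewer entries in the queue, which is the property that helps on clusters with a limit on queued jobs.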

After clusterduck is installed, you can print out the available configuration options with the following command:

python your_app.py hydra/launcher=clusterduck_slurm --cfg hydra -p hydra.launcher

The majority of these arguments are passed to sbatch. In addition, sbatch_kwargs allows for adding and overriding arbitrary sbatch arguments, while srun_kwargs does the same for srun. To pass a flag that takes no value to sbatch or srun (e.g. --exclusive), set it to True in the config, e.g. exclusive=True. Lastly, setup and teardown allow for arbitrary shell commands to be executed before and after python is called (useful for setting environment variables, etc.).
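As a rough sketch of this convention, a dictionary such as sbatch_kwargs could be turned into command-line arguments as shown here. The helper kwargs_to_cli_args is hypothetical, and the conversion clusterduck actually performs may differ (for example in how underscores, False or None are treated). The point is only that True yields a bare flag while any other value yields --key=value.

```python
from typing import Any, Dict, List


def kwargs_to_cli_args(kwargs: Dict[str, Any]) -> List[str]:
    """Convert a kwargs dict to long-form CLI options (hypothetical helper)."""
    args: List[str] = []
    for key, value in kwargs.items():
        # Assumption: config keys use underscores, CLI options use dashes.
        option = "--" + key.replace("_", "-")
        if value is True:
            # A boolean True becomes a bare flag such as --exclusive.
            args.append(option)
        elif value is False or value is None:
            # Assumption: unset or disabled options are simply omitted.
            continue
        else:
            args.append(f"{option}={value}")
    return args


example_args = kwargs_to_cli_args(
    {"exclusive": True, "cpus_per_task": 4, "qos": None}
)
print(example_args)
```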

See example configs for cluster platforms under example/conf/platform.

Batching Tasks Within Jobs

To batch multiple tasks into a single job, set tasks_per_node>1. (We use tasks_per_node instead of ntasks, because ntasks>1 is not compatible with PyTorch Lightning.) Beyond that, you may need to adjust the config so that GPUs and CPUs are divided correctly among the tasks. Ideally, you can specify all required resources with e.g. cpus_per_task and gpus_per_task, in which case everything is handled automatically. Unfortunately, not all clusters support this. Please see the examples (example/conf/platform) and experiment with your cluster to see what works.

On some clusters, slurm assigns each job its own dedicated temporary local storage at the path $TMP or $TMPDIR. However, when multiple tasks are grouped into the same job, they all share the same $TMP folder. If you write to these folders, use the tmpdir_vars parameter to tell clusterduck to create subfolders in that directory for each task within the job.
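The idea behind tmpdir_vars can be illustrated with the following minimal sketch. It is not clusterduck's code: the helper make_task_tmpdirs and the task_<id> folder naming are assumptions made for illustration. Inside a real job, the SLURM_PROCID variable that srun sets for each task could serve as the task identifier.

```python
import os
import tempfile
from typing import Dict, Iterable, MutableMapping


def make_task_tmpdirs(
    var_names: Iterable[str],
    task_id: str,
    env: MutableMapping[str, str],
) -> Dict[str, str]:
    """Point each set variable at a per-task subfolder (hypothetical helper)."""
    updated: Dict[str, str] = {}
    for name in var_names:
        base = env.get(name)
        if not base:
            continue  # variable is not set on this cluster, nothing to do
        task_dir = os.path.join(base, f"task_{task_id}")
        os.makedirs(task_dir, exist_ok=True)
        env[name] = task_dir  # later reads of the variable see the subfolder
        updated[name] = task_dir
    return updated


# Demonstration with a throwaway directory standing in for the job's $TMP.
with tempfile.TemporaryDirectory() as job_tmp:
    fake_env = {"TMP": job_tmp}
    result = make_task_tmpdirs(["TMP", "TMPDIR"], task_id="0", env=fake_env)
    # TMP is redirected; TMPDIR is skipped because it was never set.
    demo_ok = os.path.isdir(result["TMP"]) and "TMPDIR" not in result
    print(result, demo_ok)
```

In a real task one would pass os.environ as env so that libraries relying on $TMP or $TMPDIR pick up the per-task folder.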

Verbose Logging

To debug resource allocation, please install the optional dependencies with pip install ".[dev]". Then, use hydra's verbose logging feature to activate verbose logging in clusterduck. Either add hydra.verbose=clusterduck to your command or add the following to your config:

hydra:
  verbose: clusterduck

Afterwards, check the slurm logs your job produces to see which GPUs and CPUs and how much memory your job is assigned.

Debugging

We also provide the following non-slurm options for debugging:

use_srun:
If True, the python command will be launched by srun. If False, the python command is run directly inside the job. (default: True)
do_submit:
If False, create the submission file but do not actually submit it. (default: True)
local_debug:
If True, this is a shortcut for use_srun=False and do_submit=False. This generates a script that can be executed locally as a standard shell script. (default: False)
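How the three options interact can be summarized with a small sketch. The DebugOptions class and its resolved method are hypothetical and only mirror the descriptions and defaults listed above; they are not part of clusterduck's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebugOptions:
    """Mirror of the three debugging options (hypothetical container)."""

    use_srun: bool = True
    do_submit: bool = True
    local_debug: bool = False

    def resolved(self) -> "DebugOptions":
        """Apply the local_debug shortcut described above."""
        if self.local_debug:
            # local_debug implies use_srun=False and do_submit=False.
            return DebugOptions(use_srun=False, do_submit=False, local_debug=True)
        return self


defaults = DebugOptions().resolved()
local = DebugOptions(local_debug=True).resolved()
print(defaults)
print(local)
```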

Example

The example script does not require additional dependencies, but they are nice to have. Install them with:

pip install ".[dev]"

To run the example script locally, e.g. looping over both model types twice each, use:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)"

To run the example script with the submitit backend but locally without a cluster, specify the platform like this:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=local_debug

To run the example script on the HoreKa cluster, use:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=horeka

Caveats

Hydra Sweepers

Because the clusterduck launcher does not wait for the jobs to complete, it is not compatible with any sweepers that optimize some returned value.

Reference

Slurm Settings (see https://slurm.schedmd.com/sbatch.html)

timeout_min (int, default: 60): maximum time for the job in minutes
cpus_per_task (Optional[int], default: None): number of cpus to use for each task
gpus_per_node (Optional[int], default: None): number of gpus to use on each node
tasks_per_node (int, default: 1): number of tasks to spawn on each node
mem_gb (Optional[int], default: None): memory to reserve for the job on each node (in GB)
nodes (int, default: 1): number of nodes to use for the job
name (str, default: '${hydra.job.name}'): name of the job
partition (Optional[str], default: None): slurm partition to use on the cluster
qos (Optional[str], default: None)
comment (Optional[str], default: None)
constraint (Optional[str], default: None)
exclude (Optional[str], default: None)
gres (Optional[str], default: None)
cpus_per_gpu (Optional[int], default: None)
gpus_per_task (Optional[int], default: None)
mem_per_gpu (Optional[str], default: None)
mem_per_cpu (Optional[str], default: None)
account (Optional[str], default: None)

Clusterduck Settings

log_folder (str, default: '${hydra.sweep.dir}/slurm'): Folder where the submission script, pickle and slurm logs will be stored.
stderr_to_stdout (bool, default: True): If True, redirect the standard error of the job to the same file as standard output.
array_parallelism (int, default: 256): Throttle array jobs to only have this many jobs running at once
sbatch_kwargs (Dict[str, Any], default: {}): Any additional arguments that should be passed to sbatch
srun_kwargs (Dict[str, Any], default: {}): Any additional arguments that should be passed to srun
setup (Optional[List[str]], default: None): A list of commands to run in sbatch before running srun
teardown (Optional[List[str]], default: None): A list of commands to run in sbatch after running srun
tmpdir_vars (Optional[List[str]], default: ['TMP', 'TMPDIR']): If these environment variables are set and there are multiple tasks, clusterduck will create a subfolder for each task and set the environment variable to point to that subfolder. This is useful for avoiding conflicts between tasks when writing temporary files.
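For orientation on array_parallelism: sbatch itself supports throttling a job array by appending %N to the --array specification, which limits the number of array jobs running simultaneously. The sketch below builds such a specification. Whether clusterduck emits exactly this string is an assumption, and the helper array_spec is hypothetical.

```python
def array_spec(num_jobs: int, parallelism: int = 256) -> str:
    """Build an sbatch --array value with a throttle (hypothetical helper)."""
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # There is no point in allowing more simultaneous jobs than exist.
    limit = min(parallelism, num_jobs)
    return f"0-{num_jobs - 1}%{limit}"


# 100 array jobs, at most 10 running at any one time.
spec = array_spec(100, parallelism=10)
print(f"--array={spec}")
```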

Debugging Settings

use_srun (bool, default: True): If True, the python command will be launched by srun. If False, the python command is run directly inside the job.
do_submit (bool, default: True): If False, create the submission file but do not actually submit it.
local_debug (bool, default: False): If True, this is a shortcut for use_srun=False and do_submit=False. This generates a script that can be executed locally as a standard shell script.
