Slinky on DOKS — Multi-Node GPU Training

Automated deployment of Slinky (Slurm on Kubernetes) on DigitalOcean DOKS, from infrastructure provisioning through running multi-node GPU collective benchmarks over an RDMA fabric.

Supports both NVIDIA (NCCL / CUDA) and AMD (RCCL / ROCm) GPU nodes.

Prefer manual steps? See the Manual Install Guide (MANUAL-INSTALL-GUIDE.md) for step-by-step kubectl/helm commands with explanations.

Support disclaimer: DigitalOcean does not provide direct support for Slinky or Slurm. These instructions are offered as guidance only. While the underlying DigitalOcean services (DOKS, Managed NFS, DBaaS) are fully supported, issues related to Slinky, Slurm, or their configuration are outside the scope of DigitalOcean support.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    DOKS Cluster (VPC)                       │
│                                                             │
│  ┌─────────────────────┐    ┌────────────────────────────┐  │
│  │    mgmt pool (CPU)  │    │     gpu pool (GPU)         │  │
│  │                     │    │  (auto-tainted by DOKS)    │  │
│  │  slurmctld          │    │                            │  │
│  │  slurmdbd           │    │  slurm-worker-slinky-0     │  │
│  │  slurmrestd         │    │  slurm-worker-slinky-1     │  │
│  │  login node         │    │  ...                       │  │
│  │  slurm-operator     │    │                            │  │
│  │  cert-manager       │    │                            │  │
│  │  prometheus/grafana │    │                            │  │
│  └─────────┬───────────┘    └────────────────────────────┘  │
│            │                                                │
│  ┌─────────┴───────────┐    ┌────────────────────────────┐  │
│  │   Managed MySQL     │    │     Managed NFS            │  │
│  │   (accounting)      │    │     (/shared)              │  │
│  └─────────────────────┘    └────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

DOKS automatically applies taints to GPU node pools, so non-GPU workloads (operator, monitoring, Slurm control plane) naturally schedule on the mgmt nodes without explicit nodeSelector rules.
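The scheduling effect described above can be illustrated with a minimal sketch of taint and toleration matching. It is a simplified model, not the Kubernetes scheduler: it ignores taint values and assumes the GPU pool taint uses the NoSchedule effect with the nvidia.com/gpu key.

```python
from typing import Dict, List

# Assumption: the GPU pool taint uses the NoSchedule effect and the
# nvidia.com/gpu key (amd.com/gpu would behave the same way).
GPU_TAINT = {"key": "nvidia.com/gpu", "effect": "NoSchedule"}


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Simplified match: same key, and same effect when the toleration sets one."""
    if toleration.get("key") != taint["key"]:
        return False
    effect = toleration.get("effect")
    return effect is None or effect == taint["effect"]


def can_schedule(node_taints: List[Dict[str, str]],
                 tolerations: List[Dict[str, str]]) -> bool:
    """A pod fits a node only if every taint on that node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in node_taints
    )


mgmt_node_taints: List[Dict[str, str]] = []  # CPU pool: no taints
gpu_node_taints = [GPU_TAINT]                # GPU pool: tainted by DOKS

control_plane_tolerations: List[Dict[str, str]] = []  # e.g. slurmctld, operator
worker_tolerations = [{"key": "nvidia.com/gpu", "operator": "Exists"}]

print(can_schedule(mgmt_node_taints, control_plane_tolerations))  # True
print(can_schedule(gpu_node_taints, control_plane_tolerations))   # False
print(can_schedule(gpu_node_taints, worker_tolerations))          # True
```

Pods without a matching toleration are rejected by the GPU nodes, which is why the control plane ends up on the mgmt pool without any nodeSelector.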

Prerequisites

CLI Tools

doctl configured with your API token
Terraform >= 1.5
Helm >= 3.12
kubectl
Docker (only if building the custom slurmd image locally)

DigitalOcean Account

GPU Droplet access enabled (request via support if needed)
doctl authenticated (doctl auth init)

Container Registry

GPU workers run a custom slurmd image that includes the GPU communication libraries and benchmark binaries. Push this image to a registry accessible by DOKS (e.g., ghcr.io).

DOKS image size limits: Image layers larger than 5 GB, or images larger than 20 GB in total, are not supported until Q2 2026. Both the slurmd-cuda and slurmd-rocm images are designed to stay within these limits.
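If you customize the images, a quick check against these limits before pushing can save a failed pull later. The sketch below assumes decimal gigabytes and takes a plain list of layer sizes in bytes (for example, read from a registry manifest); whether the limits apply to compressed or uncompressed sizes is not stated here, so treat the result as a rough guide.

```python
from typing import List

GB = 1000 ** 3  # assumption: decimal gigabytes (the limits do not say GB vs GiB)
MAX_LAYER_BYTES = 5 * GB
MAX_IMAGE_BYTES = 20 * GB


def limit_violations(layer_sizes: List[int]) -> List[str]:
    """Return a human-readable message for every limit the image exceeds."""
    problems = []
    for index, size in enumerate(layer_sizes):
        if size > MAX_LAYER_BYTES:
            problems.append(f"layer {index} is {size / GB:.1f} GB (limit 5 GB)")
    total = sum(layer_sizes)
    if total > MAX_IMAGE_BYTES:
        problems.append(f"image total is {total / GB:.1f} GB (limit 20 GB)")
    return problems


# Hypothetical layer sizes, for illustration only.
for message in limit_violations([2 * GB, 6 * GB, 3 * GB]):
    print(message)
```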

Environment Variables

Variable	Required	Description
`DIGITALOCEAN_TOKEN`	Yes	DigitalOcean API token (used by Terraform)
`SLURMD_IMAGE`	Yes	Full image reference, e.g. `ghcr.io/your-org/slurmd-cuda:25.11-cuda12.6`
`LOGIN_IMAGE`	No	Custom login image with developer tools, e.g. `ghcr.io/your-org/slurm-login:25.11`. Falls back to the upstream `ghcr.io/slinkyproject/login` image when unset
`REGISTRY_USER`	Yes	Registry username for image pull secret
`REGISTRY_PASSWORD`	Yes	Registry password/token for image pull secret
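A small pre-flight check can catch an unset variable before a long deploy starts. The sketch below is not part of the repository; it encodes the required variables from the table and treats DO_API_TOKEN as an accepted alias for DIGITALOCEAN_TOKEN, as noted later in this README.

```python
import os
from typing import List, Mapping

# Required variables from the table above.
REQUIRED = ["SLURMD_IMAGE", "REGISTRY_USER", "REGISTRY_PASSWORD"]
# Either name satisfies the API token requirement.
TOKEN_NAMES = ["DIGITALOCEAN_TOKEN", "DO_API_TOKEN"]


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    missing = []
    if not any(env.get(name) for name in TOKEN_NAMES):
        missing.append("DIGITALOCEAN_TOKEN")
    missing.extend(name for name in REQUIRED if not env.get(name))
    return missing


missing = missing_variables(os.environ)
if missing:
    print("Missing required variables:", ", ".join(missing))
else:
    print("All required variables are set.")
```

LOGIN_IMAGE is deliberately left out because it is optional.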

Configuration

Terraform Variables

cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set region, GPU vendor, node size/count

GPU Vendor

Set gpu_vendor in terraform.tfvars to match your GPU hardware:

GPU Family	`gpu_vendor`	`gpu_node_size`	Region
NVIDIA B300 (8x)	`nvidia`	`gpu-b300x8-2304gb-fabric-contracted`	`ric1`
NVIDIA H100 (8x)	`nvidia`	`gpu-h100x8-640gb`	`atl1`
AMD MI300X (8x)	`amd`	`gpu-mi300x8-1920gb`	`atl1`

The Makefile derives the correct taint key (nvidia.com/gpu or amd.com/gpu) and node selector label automatically from this value.
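As a rough illustration of that derivation (the Makefile's actual logic may differ), the mapping from gpu_vendor to taint key can be written as a lookup, with a consistency check against the node sizes in the table above. The function names are hypothetical.

```python
# Taint keys named in this README, keyed by gpu_vendor.
TAINT_KEYS = {"nvidia": "nvidia.com/gpu", "amd": "amd.com/gpu"}

# Rows copied from the GPU Vendor table: node size -> expected vendor.
SIZE_TO_VENDOR = {
    "gpu-b300x8-2304gb-fabric-contracted": "nvidia",
    "gpu-h100x8-640gb": "nvidia",
    "gpu-mi300x8-1920gb": "amd",
}


def taint_key_for(gpu_vendor: str) -> str:
    """Return the taint key for a vendor, rejecting unknown values."""
    try:
        return TAINT_KEYS[gpu_vendor]
    except KeyError:
        raise ValueError(f"unsupported gpu_vendor: {gpu_vendor!r}") from None


def vendor_matches_size(gpu_vendor: str, gpu_node_size: str) -> bool:
    """True when the vendor agrees with the table, or the size is not listed."""
    expected = SIZE_TO_VENDOR.get(gpu_node_size)
    return expected is None or expected == gpu_vendor


print(taint_key_for("amd"))                                # amd.com/gpu
print(vendor_matches_size("nvidia", "gpu-mi300x8-1920gb"))  # False
```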

Bring Your Own Cluster

If you already have a DOKS cluster provisioned via the DO console or API, you can skip cluster creation and let Terraform provision only the Managed MySQL and Managed NFS dependencies.

Get your Cluster ID and VPC ID

Run this one command — it prints both IDs at once:

doctl kubernetes cluster get <your-cluster-name> -o json | python3 -c "
import json,sys
d=json.loads(sys.stdin.read())
if isinstance(d,list): d=d[0]
print('cluster_id:', d['id'])
print('vpc_id:', d['vpc_uuid'])
"

Example output:

cluster_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
vpc_id: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy

Don't know your cluster name? Run doctl kubernetes cluster list.
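If you would rather generate the two terraform.tfvars entries (existing_cluster_id and existing_vpc_id) directly, the same JSON can be turned into ready-to-paste lines. This is a sketch, not a script shipped with the repository; it assumes the id and vpc_uuid fields used by the one-liner above, and the function name is hypothetical.

```python
import json
from typing import Any


def byoc_tfvars(doctl_json: str) -> str:
    """Render tfvars lines from `doctl kubernetes cluster get ... -o json` output."""
    data: Any = json.loads(doctl_json)
    if isinstance(data, list):  # doctl may wrap the cluster in a list
        data = data[0]
    return (
        f'existing_cluster_id = "{data["id"]}"\n'
        f'existing_vpc_id     = "{data["vpc_uuid"]}"\n'
    )


# Placeholder IDs, for illustration only.
sample = '[{"id": "xxxxxxxx-xxxx", "vpc_uuid": "yyyyyyyy-yyyy"}]'
print(byoc_tfvars(sample), end="")
```

To use it with real data, pass the captured output of the doctl command as the string argument.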

Setup

1. Configure tfvars with your existing IDs

# terraform/terraform.tfvars
region       = "ric1"          # must match your cluster's region
project_name = "slinky-poc"
gpu_vendor   = "nvidia"

existing_cluster_id = "abc-1234-..."   # your cluster ID
existing_vpc_id     = "def-5678-..."   # your VPC ID

2. Provision MySQL and NFS

make infra/init
make infra/apply   # creates only MySQL + NFS, cluster is untouched

3. Get kubeconfig

make infra/kubeconfig   # auto-detects external cluster and uses doctl

4. Deploy Slinky

export SLURMD_IMAGE=ghcr.io/your-org/slurmd-cuda:25.11-cuda12.6
export REGISTRY_USER=your-registry-user
export REGISTRY_PASSWORD=your-registry-token
make up-from-existing

make up-from-existing is identical to make up but skips the infrastructure provisioning step — it assumes your cluster is already running and kubeconfig is configured.

Note: DO_API_TOKEN is accepted as an alias for DIGITALOCEAN_TOKEN — either env var works.

Custom slurmd Image

GPU workers require a custom slurmd image because the upstream Slinky image does not include GPU communication libraries or benchmark binaries.

NVIDIA (CUDA / NCCL)

The slurmd-cuda image includes:

NCCL runtime libraries (from nvidia/cuda:12.6.3-devel-ubuntu24.04)
Compiled all_reduce_perf, reduce_scatter_perf, all_gather_perf binaries
RDMA userspace tools (libibverbs, rdma-core, perftest)
OpenMPI

Build via GitHub Actions (recommended):

Trigger the Build slurmd-cuda workflow from your repository's Actions tab, or push to a branch that matches the workflow trigger. The image is pushed to ghcr.io/<your-org>/slurmd-cuda:25.11-cuda12.6.

Build locally:

export SLURMD_IMAGE=ghcr.io/your-org/slurmd-cuda:25.11-cuda12.6
make docker/build-slurmd-cuda
make docker/push-slurmd

See docker/slurmd-cuda/Dockerfile for build details.

AMD (ROCm / RCCL)

The slurmd-rocm image includes ROCm runtime libraries, RCCL, and compiled benchmark binaries.

export SLURMD_IMAGE=ghcr.io/your-org/slurmd-rocm:25.11
make docker/build-slurmd
make docker/push-slurmd

See docker/slurmd-rocm/README.md for build details.

Custom Login Image

The upstream Slinky login image ships only the Slurm client commands (sinfo, srun, sbatch, …). It has no editor, git, python3, sudo, curl, or wget — so once logged in, users can't pull a codebase, edit a file, run a script, or install software. The docker/login/Dockerfile layers these core developer tools on top of the upstream image:

vim, nano — editors
git — clone/pull codebases
python3, python3-pip — run and install Python
sudo (passwordless, PoC-grade) — install packages with apt
curl, wget, less — fetch and page files

export LOGIN_IMAGE=ghcr.io/your-org/slurm-login:25.11
make docker/build-login
make docker/push-login

LOGIN_IMAGE is optional — when unset, the Slurm chart uses the upstream ghcr.io/slinkyproject/login image. When set, make slinky/configure wires it into the LoginSet. Keep it in the same registry as SLURMD_IMAGE so the existing pull secret covers both. See docker/login/Dockerfile for build details.
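To confirm that both images sit behind the same pull secret, you can compare their registry hosts. The sketch below follows the usual Docker reference convention (the first path component is a registry only if it contains a dot or a colon, or is localhost; otherwise Docker Hub is implied). The function names are hypothetical.

```python
def registry_of(image_ref: str) -> str:
    """Return the registry host of an image reference (docker.io if implicit)."""
    first, sep, _rest = image_ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def same_registry(image_a: str, image_b: str) -> bool:
    """True when both references resolve to the same registry host."""
    return registry_of(image_a) == registry_of(image_b)


slurmd = "ghcr.io/your-org/slurmd-cuda:25.11-cuda12.6"
login = "ghcr.io/your-org/slurm-login:25.11"
print(registry_of(slurmd))           # ghcr.io
print(same_registry(slurmd, login))  # True
```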

Quick Start

# 1. Set your slurmd image
export SLURMD_IMAGE=ghcr.io/your-org/slurmd-cuda:25.11-cuda12.6   # NVIDIA
# export SLURMD_IMAGE=ghcr.io/your-org/slurmd-rocm:25.11           # AMD
export LOGIN_IMAGE=ghcr.io/your-org/slurm-login:25.11             # optional: login pod with dev tools

# 2. Deploy everything (infra, kubeconfig, prereqs, NFS, fabric, operator, Slurm)
make up

# 3. Discover GPUs and update Slurm config
make gpu/discover-gres
make slinky/update-slurm

# 4. Verify
make status
make slurm/shell   # interactive login node shell

Quick Deploy (B300)

Already have a B300 DOKS cluster? quick-deploy/deploy_b300.py is a one-command alternative to the make-based flow above. It runs against an existing cluster, auto-discovers the GPU hardware from DOKS node labels, sizes the Slurm NodeSet to your Ready B300 nodes, and defaults to the public DO-Solutions worker/login images — no image build or pull secret needed. It's stdlib-only Python (3.8+), idempotent, and fail-loud. The one external prerequisite is an NFS PersistentVolume created beforehand (make nfs/configure, or --skip-nfs to come up without /shared).

# Dry run first — detect the shape, print the plan and rendered values, apply nothing:
python3 quick-deploy/deploy_b300.py --dry-run

# Then deploy for real:
python3 quick-deploy/deploy_b300.py

See quick-deploy/README.md for prerequisites, the NFS PV setup, all flags, and teardown.
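To give a feel for what "sizing the NodeSet to your Ready B300 nodes" involves, here is a minimal stdlib-only sketch that counts Ready nodes in the JSON produced by kubectl get nodes -o json. It is not the logic of deploy_b300.py; the label key and value are hypothetical placeholders, since this README does not say which DOKS labels the script reads.

```python
from typing import Any, Dict


def count_ready_nodes(node_list: Dict[str, Any],
                      label_key: str, label_value: str) -> int:
    """Count nodes carrying the given label whose Ready condition is True."""
    count = 0
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        if labels.get(label_key) != label_value:
            continue
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            count += 1
    return count


# Hypothetical label and node data, for illustration only.
LABEL_KEY, LABEL_VALUE = "example.com/gpu-model", "b300"
nodes = {
    "items": [
        {"metadata": {"labels": {LABEL_KEY: LABEL_VALUE}},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"labels": {LABEL_KEY: LABEL_VALUE}},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"labels": {}},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]
}
print(count_ready_nodes(nodes, LABEL_KEY, LABEL_VALUE))  # 1
```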

Step-by-Step Guide

1. Infrastructure

Provision DOKS cluster, managed MySQL, managed NFS, and VPC:

make infra/apply

Already have a cluster? See the Bring Your Own Cluster section — set existing_cluster_id and existing_vpc_id in terraform.tfvars and terraform apply will only create MySQL and NFS.

2. Kubeconfig

Save the cluster kubeconfig so kubectl and helm can reach the new cluster:

make infra/kubeconfig

Note: make up runs this automatically after infra/apply.

3. Prerequisites

Install cert-manager (required by Slinky operator) and Prometheus/Grafana:

make prereqs/install

4. Storage

Create the NFS PV/PVC from Terraform outputs. The volume is mounted as shared storage (/shared) in the login and worker pods:

make nfs/configure

5. RDMA Fabric

Install Multus CNI and fabric NetworkAttachmentDefinitions for RoCE (RDMA over Converged Ethernet):

make fabric/install

Each B300 GPU node has 16 fabric NICs (fabric0–fabric15, two per GPU). Multus attaches these into worker pods for GPU-to-GPU communication across nodes.
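The NIC arithmetic above (8 GPUs, two NICs each) and the way Multus is asked to attach networks can be sketched as follows. The k8s.v1.cni.cncf.io/networks annotation is the standard Multus mechanism; naming each NetworkAttachmentDefinition after its NIC is an assumption for illustration, and the manifests installed by make fabric/install may use different names.

```python
from typing import Dict, List

GPUS_PER_NODE = 8
NICS_PER_GPU = 2


def fabric_nic_names(gpus: int = GPUS_PER_NODE,
                     nics_per_gpu: int = NICS_PER_GPU) -> List[str]:
    """Return fabric0..fabricN-1 for N = gpus * nics_per_gpu."""
    return [f"fabric{i}" for i in range(gpus * nics_per_gpu)]


def multus_networks_annotation(nad_names: List[str]) -> Dict[str, str]:
    """Build the pod annotation Multus reads to attach extra networks."""
    return {"k8s.v1.cni.cncf.io/networks": ",".join(nad_names)}


names = fabric_nic_names()
print(len(names), names[0], names[-1])  # 16 fabric0 fabric15
print(multus_networks_annotation(names))
```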

6. Slurm Operator

Install the Slinky operator with CRDs:

make slinky/install-operator

7. Slurm Cluster

Create the DB secret and image pull secret, generate the Helm values, and deploy the Slurm cluster:

export SLURMD_IMAGE=ghcr.io/your-org/slurmd-cuda:25.11-cuda12.6   # or your AMD image
export REGISTRY_USER=your-registry-user
export REGISTRY_PASSWORD=your-registry-token
make slinky/install-slurm

8. GPU Discovery

Discover GPU device paths on the GPU nodes and update the Slurm GRes configuration:

make gpu/discover-gres
make slinky/update-slurm

This deploys a probe pod to detect device paths on the GPU node (e.g., /dev/nvidia[0-7] for NVIDIA, /dev/dri/renderD[128,136,...] for AMD) and saves the result to gres.conf.
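The bracketed device expressions follow Slurm's range syntax for the File= field in gres.conf. As a sketch of how a list of discovered paths collapses into that form (the probe's real implementation and the exact gres.conf it writes may differ), consider:

```python
import re
from typing import List


def compress_device_paths(paths: List[str]) -> str:
    """Collapse paths sharing one prefix into a Slurm-style bracket expression."""
    if not paths:
        raise ValueError("no device paths given")
    parsed = []
    for path in paths:
        match = re.fullmatch(r"(.*?)(\d+)", path)
        if match is None:
            raise ValueError(f"path has no trailing number: {path!r}")
        parsed.append((match.group(1), int(match.group(2))))
    prefixes = {prefix for prefix, _ in parsed}
    if len(prefixes) != 1:
        raise ValueError("device paths do not share a common prefix")
    prefix = prefixes.pop()
    numbers = sorted({number for _, number in parsed})
    if len(numbers) == 1:
        return f"{prefix}{numbers[0]}"
    # Group consecutive numbers into (start, end) runs.
    runs = []
    start = prev = numbers[0]
    for number in numbers[1:]:
        if number != prev + 1:
            runs.append((start, prev))
            start = number
        prev = number
    runs.append((start, prev))
    parts = [str(a) if a == b else f"{a}-{b}" for a, b in runs]
    return f"{prefix}[{','.join(parts)}]"


def gres_line(file_expr: str, name: str = "gpu") -> str:
    """Format a minimal gres.conf entry for the given device expression."""
    return f"Name={name} File={file_expr}"


nvidia = [f"/dev/nvidia{i}" for i in range(8)]
print(gres_line(compress_device_paths(nvidia)))  # Name=gpu File=/dev/nvidia[0-7]
```

Non-consecutive device numbers, as on AMD render nodes, come out as a comma-separated list inside the brackets.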

9. Validation

make slurm/info            # sinfo, squeue, partitions
make slurm/test-fabric     # verify fabric NICs and RDMA devices
make status                # full component status

Running GPU Collective Benchmarks

These benchmarks confirm GPU-to-GPU communication is working correctly over the RDMA fabric.

Prerequisites: fabric deployed (make fabric/install), workers running, compute nodes idle (sinfo shows idle).

NVIDIA — NCCL Tests

Single-Node (8 GPUs, intra-node)

make slurm/submit-nccl-1node

Expected: all_reduce_perf bandwidth table with ~300–450 GB/s bus bandwidth across message sizes.
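For context on the "bus bandwidth" column: nccl-tests derives it from the measured algorithm bandwidth (bytes moved divided by elapsed time) by applying a collective-specific factor, which for all-reduce is 2(n-1)/n with n ranks. A short sketch of that arithmetic, using made-up numbers rather than measured results:

```python
def algorithm_bandwidth_gbps(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return size_bytes / seconds / 1e9


def all_reduce_bus_bandwidth(alg_bw_gbps: float, ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n."""
    if ranks < 1:
        raise ValueError("ranks must be at least 1")
    return alg_bw_gbps * 2 * (ranks - 1) / ranks


# Illustrative inputs only: 1 GB reduced in 10 ms across 8 GPUs.
alg_bw = algorithm_bandwidth_gbps(1e9, 0.01)
print(alg_bw)                               # 100.0
print(all_reduce_bus_bandwidth(alg_bw, 8))  # 175.0
```

Because the factor normalizes for rank count, bus bandwidth is the figure to compare between the 8-GPU and 16-GPU runs.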

Multi-Node (16 GPUs, 2 nodes over RoCE)

make slurm/submit-nccl-2node

Expected: bandwidth table with inter-node throughput and NCCL_DEBUG output showing NET/IB RoCE transport selected.

AMD — RCCL Tests

Single-Node (8 GPUs, intra-node)

make slurm/submit-rccl-1node

Expected: all_reduce_perf bandwidth table with ~110 GB/s average bus bandwidth.

Multi-Node (16 GPUs, 2 nodes over RoCE)

make slurm/submit-rccl-2node

Expected: bandwidth table with ~350 GB/s average bus bandwidth and RoCE transport confirmation.

Reading Output

Job output is written to NFS at /shared/output/:

/shared/output/allreduce-1node-<jobid>.out
/shared/output/allreduce-2node-<jobid>.out

To read results from the login pod:

make slurm/shell
ls /shared/output/
cat /shared/output/allreduce-1node-*.out
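To pull the headline numbers out of a result file instead of reading the whole table, a short parser is enough. The sketch below assumes the summary line format printed by nccl-tests and rccl-tests ("# Avg bus bandwidth : <value>") and that NCCL debug lines naming the transport contain both NET/IB and RoCE; the sample text is made up for illustration.

```python
import re
from typing import Optional

# Assumed summary line format: "# Avg bus bandwidth    : 123.456"
AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9.]+)")


def avg_bus_bandwidth(text: str) -> Optional[float]:
    """Return the reported average bus bandwidth in GB/s, or None if absent."""
    match = AVG_BUSBW.search(text)
    return float(match.group(1)) if match else None


def uses_roce(text: str) -> bool:
    """True if any log line mentions both the NET/IB transport and RoCE."""
    return any("NET/IB" in line and "RoCE" in line for line in text.splitlines())


# Made-up sample output, not a measured result.
sample = (
    "node0:1:1 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE\n"
    "# Avg bus bandwidth    : 123.5\n"
)
print(avg_bus_bandwidth(sample))  # 123.5
print(uses_roce(sample))          # True
```

On the login pod, read a file such as /shared/output/allreduce-2node-<jobid>.out and pass its contents to these functions.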

Teardown

make down

Tears down in reverse order: Slurm cluster, fabric, prerequisites, infrastructure.

Make Targets Reference

make help

Target	Description
Lifecycle
`up`	Full deploy: infra, prereqs, NFS, fabric, operator, Slurm
`up-from-existing`	Deploy Slinky on existing DOKS cluster (run `make infra/import-cluster` first)
`down`	Full teardown
`status`	Show status of all components
Infrastructure
`infra/init`	Initialize Terraform providers and backend
`infra/plan`	Preview Terraform changes
`infra/apply`	Provision DOKS, MySQL, NFS, VPC
`infra/kubeconfig`	Save kubeconfig from Terraform to ~/.kube/config
`infra/import-cluster`	Import existing DOKS cluster into Terraform state (set `CLUSTER_NAME`)
`infra/destroy`	Destroy all infrastructure
`infra/output`	Print all Terraform outputs
Prerequisites
`prereqs/install`	Install cert-manager and Prometheus
`prereqs/status`	Check pod status across prerequisite namespaces
`prereqs/uninstall`	Uninstall all prerequisites
NFS
`nfs/configure`	Generate NFS PV/PVC from Terraform outputs
`nfs/test`	Deploy busybox pod to verify NFS read/write
`nfs/status`	Check PV/PVC binding status
Docker
`docker/build-slurmd`	Build custom slurmd image with ROCm/RCCL (AMD)
`docker/build-slurmd-cuda`	Build custom slurmd image with CUDA/NCCL (NVIDIA)
`docker/push-slurmd`	Push slurmd image to registry
`docker/build-login`	Build custom login image with developer tools
`docker/push-login`	Push custom login image to registry
Fabric
`fabric/install`	Install Multus + fabric NADs
`fabric/install-multus`	Install Multus CNI plugin
`fabric/install-nads`	Create fabric NetworkAttachmentDefinitions
`fabric/status`	Check Multus and NAD status
`fabric/uninstall`	Remove fabric NADs and Multus
GPU
`gpu/discover-gres`	Discover GPU device paths and save gres.conf
Slinky / Slurm
`slinky/install-operator`	Install Slinky operator with CRDs
`slinky/configure`	Generate values-slurm.yaml from template
`slinky/install-slurm`	Install Slurm cluster (creates secrets, configures, deploys)
`slinky/update-slurm`	Helm upgrade Slurm with updated values
`slinky/create-db-secret`	Create Slurm DB password secret
`slinky/create-pull-secret`	Create image pull secret
`slinky/status`	Show pods across slinky + slurm namespaces
`slinky/uninstall`	Uninstall Slurm cluster, operator, CRDs
`slinky/logs`	Tail operator and controller logs
Slurm Operations
`slurm/shell`	Interactive shell on the login pod
`slurm/info`	Show sinfo, squeue, partitions
`slurm/test-fabric`	Verify fabric NICs and RDMA devices on workers
`slurm/submit-nccl-1node`	Submit single-node NCCL all-reduce test (NVIDIA)
`slurm/submit-nccl-2node`	Submit multi-node NCCL all-reduce test (NVIDIA)
`slurm/submit-rccl-1node`	Submit single-node RCCL all-reduce test (AMD)
`slurm/submit-rccl-2node`	Submit multi-node RCCL all-reduce test (AMD)
`slurm/submit-test`	Copy job scripts to NFS and submit basic test jobs
`slurm/run-validation`	Run the full validation suite
`slurm/test-restapi`	Test slurmrestd API endpoints
Observability
`obs/dashboard`	Deploy Slurm Grafana dashboard
`obs/grafana`	Port-forward Grafana to localhost:3000
`obs/prometheus`	Port-forward Prometheus to localhost:9090
