Prometheus collector and exporter for metrics extracted from the Slurm workload manager — exposes node, partition, job, CPU, GPU, scheduler, fairshare, reservation, and license data, with ten ready-to-use Grafana dashboards and a starter set of site-neutral alerting rules.
Note
Looking for a next-generation Slurm exporter with native OpenMetrics support (Slurm 25.11+)? Check out my new project: sckyzo/slurm_prometheus_exporter
✨ Features: Native OpenMetrics · Multiple endpoints · Basic Auth & TLS · Global labels · YAML config · Clean Architecture
- ✨ Features
- 🚀 Get started
- ⚙️ Configuration & development
- 📊 Dashboards & alerts
- 📸 Screenshots
- 🔐 Security & supply chain
- 🤖 Automation
- 🤝 Contributing
- 📜 License
- ✅ Wide metric coverage: nodes, partitions, jobs, CPUs, GPUs, scheduler internals (sdiag RPC stats), fairshare, reservations, licenses, per-user/per-account roll-ups.
- ✅ All 14 collectors are optional and toggle via --collector.<name> / --no-collector.<name> flags.
- ✅ GPU metrics per account and user (slurm_account_gpus_running, slurm_user_gpus_running), covering --gres, --gpus, and --gpus-per-node jobs.
- ✅ Per-reservation node state metrics (slurm_reservation_nodes_*).
- ✅ TLS + Basic Authentication via --web.config.file.
- ✅ OpenMetrics format (exemplars, Prometheus 2.x+ features).
- ✅ Per-collector health metrics (slurm_exporter_collector_success, slurm_exporter_collector_duration_seconds).
- ✅ Liveness probe at /healthz for Kubernetes / systemd orchestration.
- ✅ Ten ready-to-use Grafana dashboards + site-neutral Prometheus alerting rules.
- ✅ Multi-arch Docker images (linux/amd64 + linux/arm64), signed with cosign keyless, CycloneDX SBOM per release.
- ✅ Goreportcard A+ (100% across gofmt, go vet, gocyclo, ineffassign, misspell, license).
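The --web.config.file flag name follows the Prometheus exporter-toolkit convention. Assuming this exporter accepts that toolkit's YAML schema (not confirmed here), a minimal TLS + Basic Auth file could be generated as sketched below. The function name, the certificate paths, the user name, and the hash are placeholders, not values from this project.

```shell
# Sketch: print a minimal web config for --web.config.file.
# Assumption: the exporter reads the Prometheus exporter-toolkit schema
# (tls_server_config / basic_auth_users). Paths, user name, and hash are
# hypothetical placeholders.
write_web_config() {
  cat <<'EOF'
tls_server_config:
  cert_file: /etc/slurm_exporter/tls.crt
  key_file: /etc/slurm_exporter/tls.key
basic_auth_users:
  # the value must be a bcrypt hash, for example from: htpasswd -nB prometheus
  prometheus: "<bcrypt-hash-goes-here>"
EOF
}
```

One possible use: `write_web_config > web-config.yml`, replace the placeholders, then start the exporter with `--web.config.file=web-config.yml`.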
The fastest path is the published Docker image, assuming the host has a working slurm-client + munged setup (slurmctld host, login node, or a monitoring VM already enrolled in the cluster):

docker run -d --name slurm_exporter \
  -p 9341:9341 \
  -v /etc/slurm:/etc/slurm:ro \
  -v /var/run/munge:/var/run/munge:ro \
  -v /etc/munge/munge.key:/etc/munge/munge.key:ro \
  sckyzo/slurm-exporter:latest
curl http://localhost:9341/metrics | head

Then point Prometheus at :9341/metrics (sample scrape config in monitoring/).
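Once the exporter answers, the per-collector health metrics give a quick sanity check. The helper below is a minimal sketch that reads the /metrics text from stdin and prints the collectors whose last run failed. It assumes slurm_exporter_collector_success is a 0/1 gauge with a collector="..." label; the label name and the function name are assumptions, so check docs/metrics.md for the real shape.

```shell
# Sketch: list collectors whose last scrape failed, reading Prometheus text
# exposition from stdin. Assumption: slurm_exporter_collector_success is a
# 0/1 gauge carrying a collector="..." label (the label name is a guess).
failed_collectors() {
  awk '
    /^slurm_exporter_collector_success\{/ && $NF + 0 == 0 {
      if (match($0, /collector="[^"]*"/)) {
        # skip the 11 characters of collector=" and drop the closing quote
        print substr($0, RSTART + 11, RLENGTH - 12)
      }
    }
  '
}
```

For example, `curl -fsS http://localhost:9341/healthz` confirms the process is alive, and `curl -s http://localhost:9341/metrics | failed_collectors` prints nothing when every collector succeeded.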
For everything else — compose / Kubernetes / a remote monitoring node, or running the binary directly on a node — pick one of the three paths below.
Two variants, both published as multi-arch manifests (linux/amd64 +
linux/arm64) to Docker Hub (sckyzo/slurm-exporter) and GHCR
(ghcr.io/sckyzo/slurm_exporter):
| Variant | Tag pattern | Base | When |
|---|---|---|---|
| Standard | :vX.Y.Z, :X.Y, :X, :latest | Ubuntu 26.04 + slurm-client 25.11 | Cluster runs Slurm 23.x to 26.x packaged from a distro. Just works. |
| Minimal | :vX.Y.Z-minimal, :X.Y-minimal, :X-minimal, :latest-minimal | distroless/cc-debian12 + libmunge | Slurm built from source / OHPC / outside the 23-26 window. Mount your own slurm-client via --slurm.bin-path. |
Pre-release tags (vX.Y.Z-rc1 etc.) push only the pinned version and
never overwrite the floating aliases.
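To make the tagging rule concrete, here is a small sketch that derives the tags a given release would publish. The function name is hypothetical, and treating any hyphenated version as a pre-release (and appending the variant suffix to it) is an assumption made for illustration, not the project's actual release logic.

```shell
# Sketch: derive the image tags a release publishes, following the tag
# patterns described above. Assumptions: any version containing a hyphen
# is a pre-release, and the variant suffix is appended to it unchanged.
image_tags() {
  version="$1"      # e.g. v1.8.0 or v1.8.0-rc1
  suffix="${2:-}"   # "" for standard, "-minimal" for the minimal variant
  case "$version" in
    *-*)
      # pre-release: only the pinned tag, floating aliases stay untouched
      echo ":${version}${suffix}"
      ;;
    *)
      bare="${version#v}"            # 1.8.0
      echo ":${version}${suffix}"    # :v1.8.0
      echo ":${bare%.*}${suffix}"    # :1.8
      echo ":${bare%%.*}${suffix}"   # :1
      echo ":latest${suffix}"
      ;;
  esac
}
```

Calling `image_tags v1.8.0 -minimal` lists the four minimal-variant tags, while `image_tags v1.8.0-rc1` prints only the pinned tag.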
Full Docker reference — compose, Kubernetes patterns, env-var overrides,
version compatibility, troubleshooting — in
docker/README.md.
Linux, macOS, and Windows binaries (amd64 / 386 / arm64) on the Releases page. Each archive ships with a CycloneDX SBOM and a cosign-verifiable checksum file.
Installing as a systemd service:

# 1. Grab and install the binary
tar -xzf slurm_exporter-*-linux-amd64.tar.gz
sudo mv slurm_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/slurm_exporter
# 2. Install the unit file (adapt User / ExecStart for your environment)
sudo cp systemd/slurm_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now slurm_exporter

Building from source:

git clone https://github.com/sckyzo/slurm_exporter.git
cd slurm_exporter
make build

The binary lands in bin/slurm_exporter. See CONTRIBUTING.md for the full development setup (Go 1.26+, golangci-lint, the containerized make check / make report targets).
| Topic | Where |
|---|---|
| Flags, collectors, Prometheus scrape config | docs/configuration.md |
| All exported metrics, per-collector reference | docs/metrics.md |
| Example /metrics output | docs/metrics-examples.md |
| Build, test, lint, local test cluster | docs/development.md |
| Contribution rules + common pitfalls | CONTRIBUTING.md |
| Release process | docs/release-process.md |
| Project roadmap | docs/roadmap.md |
| Category | Target | What it does |
|---|---|---|
| Build | make build | Compiles bin/slurm_exporter with version ldflags |
| | make clean | Removes build artefacts (bin/, dist/, module cache) |
| Test | make test | Runs the full unit-test suite |
| | make race | Tests with the race detector (containerised) |
| Quality | make check | vet + lint + test, all containerised |
| | make report | Offline equivalent of goreportcard.com; fails below grade B |
| | make report-deps | Tabular dependency status (current / available / patch-minor-major) |
| Docker | make docker-build | Builds the standard image locally as slurm_exporter:dev |
| | make docker-build-minimal | Builds the minimal (distroless) variant |
| | make docker-build-all | Both variants in one go |
| | make docker-run | Starts the compose stack (override IMAGE= for a local tag) |
| | make docker-run-minimal | Same for the minimal compose |
| | make docker-stop | docker compose down on both stacks |
| | make docker-clean | Removes the locally-built images |
| Other | make run | Runs the just-built binary |
| | make tools-image | (Re)builds the slurm_exporter-tools container used by check/report |
make check, make report, and make report-deps run inside a container — contributors only need Docker, no host Go install required.
All monitoring assets live under monitoring/:
monitoring/
├── grafana/dashboards/    10 Grafana dashboards (JSON) + screenshots
└── prometheus/
    ├── alerts.yml         Alerting rules (severity-based, site-neutral)
    └── rules.yml          Recording rules
End-to-end wiring (Prometheus scrape config, rule_files, Alertmanager) in monitoring/README.md.
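As a rough illustration of that wiring, the sketch below prints a minimal prometheus.yml that scrapes the exporter, loads both rule files, and points at an Alertmanager. The function name, the job name, the localhost targets, and the relative rule paths are assumptions; the sample shipped in monitoring/ may be organised differently.

```shell
# Sketch: print a minimal prometheus.yml for the exporter.
# Assumptions: job name, localhost:9341 / localhost:9093 targets, and the
# relative rule_files paths are placeholders to adapt per site.
write_prometheus_config() {
  cat <<'EOF'
rule_files:
  - monitoring/prometheus/alerts.yml
  - monitoring/prometheus/rules.yml
scrape_configs:
  - job_name: slurm_exporter
    static_configs:
      - targets: ["localhost:9341"]
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
EOF
}
```

A possible workflow is `write_prometheus_config > prometheus.yml` followed by `promtool check config prometheus.yml`, which also checks that the referenced rule files load.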
Ten dashboards, Grafana 12+, all using a $datasource template variable for portability.
| # | Dashboard | UID | Description |
|---|---|---|---|
| 01 | Cluster Overview | slurm-overview | Global cluster health: CPU/GPU utilization, node states, job totals, partition summary |
| 02 | Jobs & Queue | slurm-jobs | Job queue details by user, account, partition; pending reasons, top users |
| 03 | Node Detail | slurm-nodes | Per-node CPU & memory table (filtered by partition), scalable to 100k+ nodes |
| 04 | Cluster Usage Statistics | slurm-usage | CPU/GPU utilization gauges, fairshare per account, top users by CPU |
| 05 | Scheduler | slurm-scheduler | slurmctld internals: cycle time, backfill, RPC statistics |
| 06 | Reservations & Licenses | slurm-reservations | Active reservations, node states per reservation, license usage |
| 07 | Accounting | slurm-accounting | User/account consumption, FairShare analysis, top consumers, priority diagnostics |
| 08 | Exporter Health | slurm-health | Collector OK/FAIL status, scrape duration history, Slurm binary versions |
| 09 | Exporter Performance | slurm-exporter-perf | Command durations, cache freshness, error rates, scrape health (new in v1.8.0) |
| 10 | All Metrics Reference | slurm-all-metrics | Exhaustive reference panel for every exported metric |
Import via Grafana UI, provisioning, or API — see monitoring/grafana/dashboards/README.md for the three options.
Scale note: On 100k+ node clusters, always pick a specific partition on the Node Detail dashboard via the $partition variable. The partition summary and the Down/Drain panels are always O(partitions).
Starter set in monitoring/prometheus/alerts.yml (severity-based, site-neutral): node down/drain/maint, partition nodes down, pending-job queue backlog (warn/crit), job failure rate (warn/crit), slurmctld cycle slowness, SlurmDBD queue backlog, GPU saturation. One supporting recording rule in monitoring/prometheus/rules.yml.
Threshold table, calibration guidance, and validation recipes in monitoring/prometheus/README.md.
# Validate before deploying
promtool check rules monitoring/prometheus/alerts.yml monitoring/prometheus/rules.yml

Site-specific labels (team, runbook_url, dashboard_url) are intentionally omitted; add them via Prometheus external_labels or Alertmanager routing.
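For the external_labels route, a minimal sketch of the prometheus.yml fragment follows. Prometheus attaches external_labels to every alert it sends to Alertmanager, so a team label set here becomes available for routing. The function name and the default value are invented placeholders.

```shell
# Sketch: print a prometheus.yml fragment that adds a site-specific label.
# Assumption: "hpc-ops" is a hypothetical team name; pass your own as $1.
write_external_labels() {
  team="${1:-hpc-ops}"
  cat <<EOF
global:
  external_labels:
    team: "${team}"
EOF
}
```

Merge the printed global block into the existing prometheus.yml rather than appending it, since the file may only contain one global section.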
Screenshots taken on a 20-node test cluster (alice/bob/carol/dave/eve/frank, multiple accounts and partitions). Click any thumbnail to open the full-size image. See monitoring/grafana/dashboards/README.md for the full dashboard documentation.
Found a vulnerability? See SECURITY.md for how to report it privately.
Every published artifact carries verifiable provenance and is scanned for known vulnerabilities before release.
- 🖋️ Signed container images: every Docker manifest is signed via cosign keyless (Sigstore / Fulcio). The signing identity is the GitHub Actions workflow itself, attested by the runner's OIDC token. Verify with:
  cosign verify sckyzo/slurm-exporter:latest \
    --certificate-identity-regexp 'https://github.com/SckyzO/slurm_exporter/.github/workflows/release.yml@.*' \
    --certificate-oidc-issuer https://token.actions.githubusercontent.com
- 🧾 Signed release checksums: slurm_exporter_checksums.txt ships with .pem (certificate) and .sig (signature) for offline verification of every release archive.
- 📦 CycloneDX SBOMs: one *.sbom.json per release archive lists every Go module compiled in (with versions and PURLs). Suitable for Dependency-Track, Anchore Enterprise, and similar.
- 🛡️ Vulnerability scanning: Trivy scans both Docker variants on every PR that touches Dockerfile*, go.mod, or go.sum. PRs are blocked on HIGH/CRITICAL CVEs that have an upstream fix. A weekly cron re-scans the published images so post-release CVEs surface as workflow failures.
- 👤 Non-root by default: the standard image runs as slurmexporter (uid 9341, gid munge); the minimal image runs as nonroot (uid 65532). Example compose drops all capabilities, mounts read-only, no-new-privileges.
- 🪞 Distroless variant: the :latest-minimal tag runs on gcr.io/distroless/cc-debian12:nonroot: no shell, no package manager, no userland beyond the dynamic loader and libstdc++. Smallest viable attack surface for a binary that has to dlopen libmunge at runtime.
- 🔁 Reproducible build chain: binaries built with pinned Go 1.26.3 in CI; Docker images from pinned ubuntu:26.04 / gcr.io/distroless/cc-debian12:nonroot / debian:13-slim (libmunge extractor). All version bumps go through Dependabot PRs.
Detailed verification recipes (cosign for blobs, SBOM inspection, image labels) in docker/README.md.
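As a complement to those recipes, the sketch below checks downloaded archives against the release checksum file. It assumes slurm_exporter_checksums.txt uses the standard `<sha256>  <filename>` layout that GNU sha256sum reads, which is not confirmed here; the function name is hypothetical. It only proves the archives match the checksum file, so the signature on the checksum file itself still needs to be verified with cosign first.

```shell
# Sketch: verify release archives in a directory against the checksum file.
# Assumptions: sha256sum-style checksum format, GNU coreutils available.
verify_archives() {
  dir="${1:-.}"
  (
    cd "$dir" || exit 1
    # --ignore-missing skips archives listed in the file but not downloaded
    sha256sum --check --ignore-missing slurm_exporter_checksums.txt
  )
}
```

Run `verify_archives ~/Downloads` (or any directory holding the archives and the checksum file); a non-zero exit status means at least one archive does not match, or none could be checked.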
The repo runs a few autonomous workflows so dependencies and images stay fresh without manual babysitting:
- Dependabot weekly: Monday 05:00 Europe/Paris, four ecosystems (Go modules, GitHub Actions, two Docker base images). Related deps grouped (golang.org/x/*, github.com/prometheus/*, etc.).
- make report-deps: on-demand tabular snapshot of every Go module (direct + indirect) with patch/minor/major bump classification. Runs in the containerized toolchain, no host Go required.
- Trivy weekly scan: Monday 06:00 UTC against the published images; CVE regressions show as workflow failures.
- Docker Hub README sync: docker/README.md is mirrored to the Docker Hub repo description on every push to master (and on every release).
- Auto Docker image refresh: every release tag triggers GoReleaser, builds both variants for both architectures, pushes to GHCR + Docker Hub, signs every manifest, and emits SBOMs.
PRs and issues welcome. Before sending a contribution:
- Read CONTRIBUTING.md: it covers the Definition of Done, code conventions (initialisms, collector pattern, test fixtures), and the Common Pitfalls section (truncation gotcha on squeue -O field: / sinfo --Format, multi-arch path differences, etc.).
- Run make check (containerized vet + lint + test) and make report (offline goreportcard, must stay ≥ B) before opening a PR.
- One issue → one branch → one PR. Don't mix refactoring and new features in the same change.
The release process and the validation playbook live in docs/release-process.md and docs/validation-checklist.md.
This project is licensed under the GNU General Public License, version 3 or later.
Fork of cea-hpc/slurm_exporter, itself a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).
Looking ahead: for Slurm 25.11+ deployments with native OpenMetrics support, see the next-generation project at sckyzo/slurm_prometheus_exporter.
