Prometheus Slurm Exporter 🚀

Prometheus collector and exporter for metrics extracted from the Slurm workload manager — exposes node, partition, job, CPU, GPU, scheduler, fairshare, reservation, and license data, with ten ready-to-use Grafana dashboards and a starter set of site-neutral alerting rules.

Note

Looking for a next-generation Slurm exporter with native OpenMetrics support (Slurm 25.11+)? Check out my new project: sckyzo/slurm_prometheus_exporter

✨ Features of that project: Native OpenMetrics · Multiple endpoints · Basic Auth & TLS · Global labels · YAML config · Clean Architecture

✨ Features

✅ Wide metric coverage: nodes, partitions, jobs, CPUs, GPUs, scheduler internals (sdiag RPC stats), fairshare, reservations, licenses, per-user/per-account roll-ups.
✅ All 14 collectors are optional and toggle via --collector.<name> / --no-collector.<name> flags.
✅ GPU metrics per account and user (slurm_account_gpus_running, slurm_user_gpus_running) — covers --gres, --gpus, and --gpus-per-node jobs.
✅ Per-reservation node state metrics (slurm_reservation_nodes_*).
✅ TLS + Basic Authentication via --web.config.file.
✅ OpenMetrics format (exemplars, Prometheus 2.x+ features).
✅ Per-collector health metrics (slurm_exporter_collector_success, slurm_exporter_collector_duration_seconds).
✅ Liveness probe at /healthz for Kubernetes / systemd orchestration.
✅ Ten ready-to-use Grafana dashboards + site-neutral Prometheus alerting rules.
✅ Multi-arch Docker images (linux/amd64 + linux/arm64), signed with cosign keyless, CycloneDX SBOM per release.
✅ Goreportcard A+ (100% across gofmt, go vet, gocyclo, ineffassign, misspell, license).
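
The per-collector health metrics listed above lend themselves to a quick command-line check. The following is a minimal sketch, not part of the project: it assumes `slurm_exporter_collector_success` follows the usual Prometheus convention of reporting 0 for a failed collector, and that samples carry no trailing timestamp. The function name `failed_collectors` is made up for this illustration.

```shell
# Print every slurm_exporter_collector_success series whose value is 0.
# Reads Prometheus text exposition format on stdin.
# Assumption: a value of 0 means the collector's last run failed.
failed_collectors() {
  awk '
    /^#/ { next }   # skip HELP/TYPE comment lines
    $1 ~ /^slurm_exporter_collector_success([{]|$)/ && $NF == 0 { print $1 }
  '
}
```

Possible usage against a running exporter: `curl -s http://localhost:9341/metrics | failed_collectors`. Empty output would mean no collector reported a failure.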

🚀 Get started

The fastest path is the published Docker image — assuming the host has a working slurm-client + munged setup (slurmctld host, login node, or a monitoring VM already enrolled in the cluster):

docker run -d --name slurm_exporter \
  -p 9341:9341 \
  -v /etc/slurm:/etc/slurm:ro \
  -v /var/run/munge:/var/run/munge:ro \
  -v /etc/munge/munge.key:/etc/munge/munge.key:ro \
  sckyzo/slurm-exporter:latest

curl http://localhost:9341/metrics | head

Then point Prometheus at :9341/metrics (sample scrape config in monitoring/).
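
For readers who do not have the repository checked out, a scrape job can be sketched as below. This is an illustration under assumptions, not the sample shipped in monitoring/: the job name `slurm`, the helper name `scrape_job`, and the default target are arbitrary choices.

```shell
# Emit a minimal Prometheus scrape job for one exporter target.
# Usage: scrape_job [HOST:PORT]
# The job name "slurm" is an arbitrary choice for this sketch.
scrape_job() {
  target=${1:-localhost:9341}
  cat <<EOF
scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ['${target}']
EOF
}
```

For example, `scrape_job slurm-head.example.org:9341` prints a fragment that can be merged into an existing `prometheus.yml` (the hostname is a placeholder). Prometheus scrapes the `/metrics` path by default, so no `metrics_path` entry is needed here.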

For everything else — compose / Kubernetes / a remote monitoring node, or running the binary directly on a node — pick one of the three paths below.

🐳 Docker images

Two variants, both published as multi-arch manifests (linux/amd64 + linux/arm64) to Docker Hub (sckyzo/slurm-exporter) and GHCR (ghcr.io/sckyzo/slurm_exporter):

| Variant | Tag pattern | Base | When |
| --- | --- | --- | --- |
| Standard | `:vX.Y.Z`, `:X.Y`, `:X`, `:latest` | Ubuntu 26.04 + slurm-client 25.11 | Cluster runs Slurm 23.x to 26.x packaged from a distro. Just works. |
| Minimal | `:vX.Y.Z-minimal`, `:X.Y-minimal`, `:X-minimal`, `:latest-minimal` | distroless/cc-debian12 + libmunge | Slurm built from source / OHPC / outside the 23-26 window. Mount your own slurm-client via `--slurm.bin-path`. |

Pre-release tags (vX.Y.Z-rc1 etc.) push only the pinned version and never overwrite the floating aliases.
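
The tagging rule above can be illustrated with a small helper. This is a sketch of the rule as described, not the project's release tooling: `release_tags` is a hypothetical name, and the assumption that a minimal pre-release is tagged `vX.Y.Z-rc1-minimal` is not confirmed by this README.

```shell
# Print the image tags a release would receive, one per line.
# A stable vX.Y.Z gets the pinned tag plus the floating aliases;
# a version with a pre-release suffix (e.g. -rc1) gets the pinned tag only.
release_tags() {
  version=$1          # e.g. v1.8.0 or v1.8.0-rc1
  suffix=${2:-}       # "" for the standard image, "-minimal" for the minimal one
  echo "${version}${suffix}"
  bare=${version#v}   # strip the leading "v": 1.8.0
  case $bare in
    *-*) return 0 ;;  # pre-release: leave the floating aliases untouched
  esac
  echo "${bare%.*}${suffix}"    # X.Y
  echo "${bare%%.*}${suffix}"   # X
  echo "latest${suffix}"
}
```

Calling `release_tags v1.8.0 -minimal` lists the four minimal tags, while `release_tags v1.9.0-rc1` lists only the pinned one.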

Full Docker reference — compose, Kubernetes patterns, env-var overrides, version compatibility, troubleshooting — in docker/README.md.

📥 Pre-compiled binary

Linux, macOS, and Windows binaries (amd64 / 386 / arm64) are available on the Releases page. Each archive ships with a CycloneDX SBOM and a cosign-verifiable checksum file.

Installing as a systemd service:

# 1. Grab and install the binary
tar -xzf slurm_exporter-*-linux-amd64.tar.gz
sudo mv slurm_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/slurm_exporter

# 2. Install the unit file (adapt User / ExecStart for your environment)
sudo cp systemd/slurm_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now slurm_exporter
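
After enabling the service, it can help to confirm that the exporter actually answers. The helper below is a sketch under assumptions: it presumes the binary listens on port 9341 (the port used in the Docker example) and that `/healthz` returns a 2xx status when the process is alive. `wait_for_exporter` is a made-up name.

```shell
# Poll the exporter's /healthz endpoint until it answers or the retries run out.
# Usage: wait_for_exporter [BASE_URL] [TRIES]
wait_for_exporter() {
  base_url=${1:-http://localhost:9341}
  tries=${2:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx)
    if curl -fsS --max-time 2 "${base_url}/healthz" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A typical call would be `wait_for_exporter || journalctl -u slurm_exporter -n 50`, which shows recent service logs if the probe never succeeds.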

🔨 From source

git clone https://github.com/sckyzo/slurm_exporter.git
cd slurm_exporter
make build

The binary lands in bin/slurm_exporter. See CONTRIBUTING.md for the full development setup (Go 1.26+, golangci-lint, the containerized make check / make report targets).

⚙️ Configuration & development

| Topic | Where |
| --- | --- |
| Flags, collectors, Prometheus scrape config | `docs/configuration.md` |
| All exported metrics, per-collector reference | `docs/metrics.md` |
| Example `/metrics` output | `docs/metrics-examples.md` |
| Build, test, lint, local test cluster | `docs/development.md` |
| Contribution rules + common pitfalls | `CONTRIBUTING.md` |
| Release process | `docs/release-process.md` |
| Project roadmap | `docs/roadmap.md` |

Make targets

| Category | Target | What it does |
| --- | --- | --- |
| Build | `make build` | Compiles `bin/slurm_exporter` with version ldflags |
| | `make clean` | Removes build artefacts (`bin/`, `dist/`, module cache) |
| Test | `make test` | Runs the full unit-test suite |
| | `make race` | Tests with the race detector (containerised) |
| Quality | `make check` | `vet` + `lint` + `test`, all containerised |
| | `make report` | Offline equivalent of goreportcard.com; fails below grade B |
| | `make report-deps` | Tabular dependency status (current / available / patch-minor-major) |
| Docker | `make docker-build` | Builds the standard image locally as `slurm_exporter:dev` |
| | `make docker-build-minimal` | Builds the minimal (distroless) variant |
| | `make docker-build-all` | Both variants in one go |
| | `make docker-run` | Starts the compose stack (override `IMAGE=` for a local tag) |
| | `make docker-run-minimal` | Same for the minimal compose |
| | `make docker-stop` | `docker compose down` on both stacks |
| | `make docker-clean` | Removes the locally-built images |
| Other | `make run` | Runs the just-built binary |
| | `make tools-image` | (Re)builds the `slurm_exporter-tools` container used by check/report |

make check, make report, and make report-deps run inside a container — contributors only need Docker, no host Go install required.

📊 Dashboards & alerts

All monitoring assets live under monitoring/:

monitoring/
├── grafana/dashboards/    10 Grafana dashboards (JSON) + screenshots
└── prometheus/
    ├── alerts.yml         Alerting rules (severity-based, site-neutral)
    └── rules.yml          Recording rules

End-to-end wiring (Prometheus scrape config, rule_files, Alertmanager) in monitoring/README.md.

Grafana dashboards

Ten dashboards, Grafana 12+, all using a $datasource template variable for portability.

| # | Dashboard | UID | Description |
| --- | --- | --- | --- |
| 01 | Cluster Overview | `slurm-overview` | Global cluster health: CPU/GPU utilization, node states, job totals, partition summary |
| 02 | Jobs & Queue | `slurm-jobs` | Job queue details by user, account, partition: pending reasons, top users |
| 03 | Node Detail | `slurm-nodes` | Per-node CPU & memory table (filtered by partition), scalable to 100k+ nodes |
| 04 | Cluster Usage Statistics | `slurm-usage` | CPU/GPU utilization gauges, fairshare per account, top users by CPU |
| 05 | Scheduler | `slurm-scheduler` | slurmctld internals: cycle time, backfill, RPC statistics |
| 06 | Reservations & Licenses | `slurm-reservations` | Active reservations, node states per reservation, license usage |
| 07 | Accounting | `slurm-accounting` | User/account consumption, FairShare analysis, top consumers, priority diagnostics |
| 08 | Exporter Health | `slurm-health` | Collector OK/FAIL status, scrape duration history, Slurm binary versions |
| 09 | Exporter Performance | `slurm-exporter-perf` | Command durations, cache freshness, error rates, scrape health (new in v1.8.0) |
| 10 | All Metrics Reference | `slurm-all-metrics` | Exhaustive reference panel for every exported metric |

Import via Grafana UI, provisioning, or API — see monitoring/grafana/dashboards/README.md for the three options.

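Scale note: On clusters with 100k+ nodes, always select a specific partition on the Node Detail dashboard via the $partition variable, so that the per-node table is limited to that partition. The partition summary and the Down/Drain panels scale with the number of partitions, O(partitions), not with the number of nodes.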

Prometheus alerts & recording rules

Starter set in monitoring/prometheus/alerts.yml (severity-based, site-neutral): node down/drain/maint, partition nodes down, pending-job queue backlog (warn/crit), job failure rate (warn/crit), slurmctld cycle slowness, SlurmDBD queue backlog, GPU saturation. One supporting recording rule in monitoring/prometheus/rules.yml.

Threshold table, calibration guidance, and validation recipes in monitoring/prometheus/README.md.

# Validate before deploying
promtool check rules monitoring/prometheus/alerts.yml monitoring/prometheus/rules.yml

Site-specific labels (team, runbook_url, dashboard_url) are intentionally omitted — add them via Prometheus external_labels or Alertmanager routing.

📸 Screenshots

Screenshots taken on a 20-node test cluster (alice/bob/carol/dave/eve/frank, multiple accounts and partitions). See monitoring/grafana/dashboards/README.md for the full documentation of all 10 dashboards.

[Screenshots: Cluster Overview · Jobs & Queue · Node Detail (scalable to 100k+ nodes) · Cluster Usage Statistics · Scheduler · Exporter Health · Reservations & Licenses · Accounting · Exporter Performance]

🔐 Security & supply chain

Found a vulnerability? See SECURITY.md for how to report it privately.

Every published artifact carries verifiable provenance and is scanned for known vulnerabilities before release.

🖋️ Signed container images — every Docker manifest is signed via cosign keyless (Sigstore / Fulcio). The signing identity is the GitHub Actions workflow itself, attested by the runner's OIDC token. Verify with:

cosign verify sckyzo/slurm-exporter:latest \
  --certificate-identity-regexp 'https://github.com/SckyzO/slurm_exporter/.github/workflows/release.yml@.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

🧾 Signed release checksums — slurm_exporter_checksums.txt ships with .pem (certificate) and .sig (signature) for offline verification of every release archive.
📦 CycloneDX SBOMs — one *.sbom.json per release archive lists every Go module compiled in (with versions and PURLs). Suitable for Dependency-Track, Anchore Enterprise, and similar.
🛡️ Vulnerability scanning — Trivy scans both Docker variants on every PR that touches Dockerfile*, go.mod, or go.sum. PRs are blocked on HIGH/CRITICAL CVEs that have an upstream fix. A weekly cron re-scans the published images so post-release CVEs surface as workflow failures.
👤 Non-root by default — the standard image runs as slurmexporter (uid 9341, gid munge); the minimal image runs as nonroot (uid 65532). Example compose drops all capabilities, mounts read-only, no-new-privileges.
🪞 Distroless variant — the :latest-minimal tag runs on gcr.io/distroless/cc-debian12:nonroot: no shell, no package manager, no userland beyond the dynamic loader and libstdc++. Smallest viable attack surface for a binary that has to dlopen libmunge at runtime.
🔁 Reproducible build chain — binaries built with pinned Go 1.26.3 in CI; Docker images from pinned ubuntu:26.04 / gcr.io/distroless/cc-debian12:nonroot / debian:13-slim (libmunge extractor). All version bumps go through Dependabot PRs.

Detailed verification recipes (cosign for blobs, SBOM inspection, image labels) in docker/README.md.
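
Once the checksum file itself has been authenticated (for instance with `cosign verify-blob`, following the recipes in docker/README.md), each archive still has to be compared against it. The function below is a sketch of that second step only. It assumes the checksum file uses the `<sha256>  <filename>` layout produced by `sha256sum`, which this README does not state, and `verify_archive` is a hypothetical helper name.

```shell
# Compare one downloaded archive against its entry in the release checksum file.
# Assumption: the file lists "<sha256>  <filename>" pairs, as sha256sum prints them.
# Usage: verify_archive ARCHIVE [CHECKSUM_FILE]
verify_archive() {
  archive=$1
  checksums=${2:-slurm_exporter_checksums.txt}
  name=$(basename "$archive")
  # Look up the expected digest by file name (binary-mode entries carry a "*" prefix).
  expected=$(awk -v n="$name" '$2 == n || $2 == "*" n { print $1 }' "$checksums")
  actual=$(sha256sum "$archive" | awk '{ print $1 }')
  # Fail when the archive is not listed or the digests differ.
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

The function is silent and reports its result through the exit status, so it composes with `&&` and `||`, for example `verify_archive slurm_exporter-*-linux-amd64.tar.gz && echo verified`. On systems without GNU coreutils, `sha256sum` may need to be swapped for `shasum -a 256`.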

🤖 Automation

The repo runs a few autonomous workflows so dependencies and images stay fresh without manual babysitting:

Dependabot weekly — Monday 05:00 Europe/Paris, four ecosystems (Go modules, GitHub Actions, two Docker base images). Related deps grouped (golang.org/x/*, github.com/prometheus/*, etc.).
make report-deps — on-demand tabular snapshot of every Go module (direct + indirect) with patch/minor/major bump classification. Runs in the containerized toolchain, no host Go required.
Trivy weekly scan — Monday 06:00 UTC against the published images; CVE regressions show as workflow failures.
Docker Hub README sync — docker/README.md is mirrored to the Docker Hub repo description on every push to master (and on every release).
Auto Docker image refresh — every release tag triggers GoReleaser, builds both variants for both architectures, pushes to GHCR + Docker Hub, signs every manifest, and emits SBOMs.

🤝 Contributing

PRs and issues welcome. Before sending a contribution:

Read CONTRIBUTING.md — covers the Definition of Done, code conventions (initialisms, collector pattern, test fixtures), and the Common Pitfalls section (truncation gotcha on squeue -O field: / sinfo --Format, multi-arch path differences, etc.).
Run make check (containerized vet + lint + test) and make report (offline goreportcard, must stay ≥ B) before opening a PR.
One issue → one branch → one PR. Don't mix refactoring and new features in the same change.

The release process and the validation playbook live in docs/release-process.md and docs/validation-checklist.md.

📜 License

This project is licensed under the GNU General Public License, version 3 or later.

🍴 About this fork

Fork of cea-hpc/slurm_exporter, itself a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).

Looking ahead: for Slurm 25.11+ deployments with native OpenMetrics support, see the next-generation project at sckyzo/slurm_prometheus_exporter.
