Prometheus Slurm Exporter 🚀

Prometheus collector and exporter for metrics extracted from the Slurm workload manager — exposes node, partition, job, CPU, GPU, scheduler, fairshare, reservation, and license data, with ten ready-to-use Grafana dashboards and a starter set of site-neutral alerting rules.

Note

Looking for a next-generation Slurm exporter with native OpenMetrics support (Slurm 25.11+)? Check out my new project: sckyzo/slurm_prometheus_exporter

✨ Features: Native OpenMetrics · Multiple endpoints · Basic Auth & TLS · Global labels · YAML config · Clean Architecture

✨ Features

  • ✅ Wide metric coverage: nodes, partitions, jobs, CPUs, GPUs, scheduler internals (sdiag RPC stats), fairshare, reservations, licenses, per-user/per-account roll-ups.
  • ✅ All 14 collectors are optional and toggle via --collector.<name> / --no-collector.<name> flags.
  • ✅ GPU metrics per account and user (slurm_account_gpus_running, slurm_user_gpus_running) — covers --gres, --gpus, and --gpus-per-node jobs.
  • ✅ Per-reservation node state metrics (slurm_reservation_nodes_*).
  • ✅ TLS + Basic Authentication via --web.config.file.
  • ✅ OpenMetrics format (exemplars, Prometheus 2.x+ features).
  • ✅ Per-collector health metrics (slurm_exporter_collector_success, slurm_exporter_collector_duration_seconds).
  • ✅ Liveness probe at /healthz for Kubernetes / systemd orchestration.
  • ✅ Ten ready-to-use Grafana dashboards + site-neutral Prometheus alerting rules.
  • ✅ Multi-arch Docker images (linux/amd64 + linux/arm64), signed with cosign keyless, CycloneDX SBOM per release.
  • ✅ Goreportcard A+ (100% across gofmt, go vet, gocyclo, ineffassign, misspell, license).

🚀 Get started

The fastest path is the published Docker image — assuming the host has a working slurm-client + munged setup (slurmctld host, login node, or a monitoring VM already enrolled in the cluster):

docker run -d --name slurm_exporter \
  -p 9341:9341 \
  -v /etc/slurm:/etc/slurm:ro \
  -v /var/run/munge:/var/run/munge:ro \
  -v /etc/munge/munge.key:/etc/munge/munge.key:ro \
  sckyzo/slurm-exporter:latest

curl http://localhost:9341/metrics | head

Then point Prometheus at :9341/metrics (sample scrape config in monitoring/).
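
Once the endpoint answers, the per-collector health metric listed under Features (slurm_exporter_collector_success) makes a quick sanity check. The sketch below is a small shell helper that reads Prometheus text output on stdin and prints the collectors reporting 0. It assumes the metric carries a label named collector and that 0 means failure. Both are assumptions, so check docs/metrics.md for the real label set.

```shell
#!/bin/sh
# failed_collectors: read Prometheus text exposition on stdin and print the
# name of every collector whose slurm_exporter_collector_success sample is 0.
# Assumption: the sample carries a label called "collector" (hypothetical;
# adjust the pattern if the exporter uses a different label name).
failed_collectors() {
  awk '
    /^slurm_exporter_collector_success[{]/ && $NF == 0 {
      # Pull the value of the collector="..." label out of the sample line.
      if (match($0, /collector="[^"]*"/)) {
        print substr($0, RSTART + 11, RLENGTH - 12)
      }
    }
  '
}
```

Typical use would be `curl -fsS http://localhost:9341/metrics | failed_collectors`; empty output means no collector reported a failure on that scrape.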

For everything else — compose / Kubernetes / a remote monitoring node, or running the binary directly on a node — pick one of the three paths below.

🐳 Docker images

Two variants, both published as multi-arch manifests (linux/amd64 + linux/arm64) to Docker Hub (sckyzo/slurm-exporter) and GHCR (ghcr.io/sckyzo/slurm_exporter):

| Variant | Tag pattern | Base | When to use |
|---|---|---|---|
| Standard | :vX.Y.Z, :X.Y, :X, :latest | Ubuntu 26.04 + slurm-client 25.11 | Cluster runs Slurm 23.x to 26.x packaged from a distro. Just works. |
| Minimal | :vX.Y.Z-minimal, :X.Y-minimal, :X-minimal, :latest-minimal | distroless/cc-debian12 + libmunge | Slurm built from source / OHPC / outside the 23-26 window. Mount your own slurm-client via --slurm.bin-path. |

Pre-release tags (vX.Y.Z-rc1 etc.) push only the pinned version and never overwrite the floating aliases.
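
To make the tag scheme concrete, here is a minimal shell sketch that prints the tags a given version would publish under the rules above. It only illustrates the naming pattern and is not the project's release tooling. The function name is hypothetical, and appending the -minimal suffix to a pre-release tag is an assumption.

```shell
#!/bin/sh
# tags_for_release VERSION [SUFFIX]
# Print the image tags a release would publish under the scheme in the table:
# stable releases get the pinned tag plus floating aliases, pre-releases get
# only the pinned tag. Illustrative only; not the project's release tooling.
tags_for_release() {
  version=$1            # e.g. v1.8.0 or v1.8.0-rc1
  suffix=${2:-}         # "" for the standard image, "-minimal" for minimal
  bare=${version#v}     # drop the leading "v": 1.8.0

  case $bare in
    *-*)
      # Pre-release: pinned tag only, floating aliases stay untouched.
      printf '%s\n' "${version}${suffix}"
      return 0
      ;;
  esac

  major=${bare%%.*}     # 1
  rest=${bare#*.}       # 8.0
  minor=${rest%%.*}     # 8

  printf '%s\n' \
    "${version}${suffix}" \
    "${major}.${minor}${suffix}" \
    "${major}${suffix}" \
    "latest${suffix}"
}
```

Calling `tags_for_release v1.8.0 -minimal` lists the four minimal tags, while `tags_for_release v1.8.0-rc1` prints only the pinned tag.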

Full Docker reference — compose, Kubernetes patterns, env-var overrides, version compatibility, troubleshooting — in docker/README.md.

📥 Pre-compiled binary

Linux, macOS, and Windows binaries (amd64 / 386 / arm64) are available on the Releases page. Each archive ships with a CycloneDX SBOM and a cosign-verifiable checksum file.

Installing as a systemd service:

# 1. Grab and install the binary
tar -xzf slurm_exporter-*-linux-amd64.tar.gz
sudo mv slurm_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/slurm_exporter

# 2. Install the unit file (adapt User / ExecStart for your environment)
sudo cp systemd/slurm_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now slurm_exporter

🔨 From source

git clone https://github.com/sckyzo/slurm_exporter.git
cd slurm_exporter
make build

The binary lands in bin/slurm_exporter. See CONTRIBUTING.md for the full development setup (Go 1.26+, golangci-lint, the containerized make check / make report targets).


⚙️ Configuration & development

| Topic | Where |
|---|---|
| Flags, collectors, Prometheus scrape config | docs/configuration.md |
| All exported metrics, per-collector reference | docs/metrics.md |
| Example /metrics output | docs/metrics-examples.md |
| Build, test, lint, local test cluster | docs/development.md |
| Contribution rules + common pitfalls | CONTRIBUTING.md |
| Release process | docs/release-process.md |
| Project roadmap | docs/roadmap.md |

Make targets

| Category | Target | What it does |
|---|---|---|
| Build | make build | Compiles bin/slurm_exporter with version ldflags |
| | make clean | Removes build artefacts (bin/, dist/, module cache) |
| Test | make test | Runs the full unit-test suite |
| | make race | Tests with the race detector (containerised) |
| Quality | make check | vet + lint + test, all containerised |
| | make report | Offline equivalent of goreportcard.com; fails below grade B |
| | make report-deps | Tabular dependency status (current / available / patch-minor-major) |
| Docker | make docker-build | Builds the standard image locally as slurm_exporter:dev |
| | make docker-build-minimal | Builds the minimal (distroless) variant |
| | make docker-build-all | Both variants in one go |
| | make docker-run | Starts the compose stack (override IMAGE= for a local tag) |
| | make docker-run-minimal | Same for the minimal compose |
| | make docker-stop | docker compose down on both stacks |
| | make docker-clean | Removes the locally-built images |
| Other | make run | Runs the just-built binary |
| | make tools-image | (Re)builds the slurm_exporter-tools container used by check/report |

make check, make report, and make report-deps run inside a container — contributors only need Docker, no host Go install required.


📊 Dashboards & alerts

All monitoring assets live under monitoring/:

monitoring/
├── grafana/dashboards/    10 Grafana dashboards (JSON) + screenshots
└── prometheus/
    ├── alerts.yml         Alerting rules (severity-based, site-neutral)
    └── rules.yml          Recording rules

End-to-end wiring (Prometheus scrape config, rule_files, Alertmanager) in monitoring/README.md.

Grafana dashboards

Ten dashboards, Grafana 12+, all using a $datasource template variable for portability.

| # | Dashboard | UID | Description |
|---|---|---|---|
| 01 | Cluster Overview | slurm-overview | Global cluster health: CPU/GPU utilization, node states, job totals, partition summary |
| 02 | Jobs & Queue | slurm-jobs | Job queue details by user, account, partition: pending reasons, top users |
| 03 | Node Detail | slurm-nodes | Per-node CPU & memory table (filtered by partition), scalable to 100k+ nodes |
| 04 | Cluster Usage Statistics | slurm-usage | CPU/GPU utilization gauges, fairshare per account, top users by CPU |
| 05 | Scheduler | slurm-scheduler | slurmctld internals: cycle time, backfill, RPC statistics |
| 06 | Reservations & Licenses | slurm-reservations | Active reservations, node states per reservation, license usage |
| 07 | Accounting | slurm-accounting | User/account consumption, FairShare analysis, top consumers, priority diagnostics |
| 08 | Exporter Health | slurm-health | Collector OK/FAIL status, scrape duration history, Slurm binary versions |
| 09 | Exporter Performance | slurm-exporter-perf | Command durations, cache freshness, error rates, scrape health (new in v1.8.0) |
| 10 | All Metrics Reference | slurm-all-metrics | Exhaustive reference panel for every exported metric |

Import via Grafana UI, provisioning, or API — see monitoring/grafana/dashboards/README.md for the three options.

Scale note: On 100k+ node clusters, always pick a specific partition on the Node Detail dashboard via the $partition variable. The partition summary and the Down/Drain panels are always O(partitions).

Prometheus alerts & recording rules

Starter set in monitoring/prometheus/alerts.yml (severity-based, site-neutral): node down/drain/maint, partition nodes down, pending-job queue backlog (warn/crit), job failure rate (warn/crit), slurmctld cycle slowness, SlurmDBD queue backlog, GPU saturation. One supporting recording rule in monitoring/prometheus/rules.yml.

Threshold table, calibration guidance, and validation recipes in monitoring/prometheus/README.md.

# Validate before deploying
promtool check rules monitoring/prometheus/alerts.yml monitoring/prometheus/rules.yml

Site-specific labels (team, runbook_url, dashboard_url) are intentionally omitted — add them via Prometheus external_labels or Alertmanager routing.


📸 Screenshots

Screenshots taken on a 20-node test cluster (alice/bob/carol/dave/eve/frank, multiple accounts and partitions). Click any thumbnail to open the full-size image. See monitoring/grafana/dashboards/README.md for the full dashboard documentation.

[Screenshot: Cluster Overview]
[Screenshot: Jobs & Queue]
[Screenshot: Node Detail (scalable 100k+ nodes)]
[Screenshot: Cluster Usage Statistics]
[Screenshot: Scheduler]
[Screenshot: Exporter Health]
[Screenshot: Reservations & Licenses]
[Screenshot: Accounting]
[Screenshot: Exporter Performance]

All 10 dashboards documented in monitoring/grafana/dashboards/README.md


🔐 Security & supply chain

Found a vulnerability? See SECURITY.md for how to report it privately.

Every published artifact carries verifiable provenance and is scanned for known vulnerabilities before release.

  • 🖋️ Signed container images — every Docker manifest is signed via cosign keyless (Sigstore / Fulcio). The signing identity is the GitHub Actions workflow itself, attested by the runner's OIDC token. Verify with:
    cosign verify sckyzo/slurm-exporter:latest \
      --certificate-identity-regexp 'https://github.com/SckyzO/slurm_exporter/.github/workflows/release.yml@.*' \
      --certificate-oidc-issuer https://token.actions.githubusercontent.com
  • 🧾 Signed release checksums: slurm_exporter_checksums.txt ships with .pem (certificate) and .sig (signature) for offline verification of every release archive.
  • 📦 CycloneDX SBOMs — one *.sbom.json per release archive lists every Go module compiled in (with versions and PURLs). Suitable for Dependency-Track, Anchore Enterprise, and similar.
  • 🛡️ Vulnerability scanning — Trivy scans both Docker variants on every PR that touches Dockerfile*, go.mod, or go.sum. PRs are blocked on HIGH/CRITICAL CVEs that have an upstream fix. A weekly cron re-scans the published images so post-release CVEs surface as workflow failures.
  • 👤 Non-root by default — the standard image runs as slurmexporter (uid 9341, gid munge); the minimal image runs as nonroot (uid 65532). Example compose drops all capabilities, mounts read-only, no-new-privileges.
  • 🪞 Distroless variant — the :latest-minimal tag runs on gcr.io/distroless/cc-debian12:nonroot: no shell, no package manager, no userland beyond the dynamic loader and libstdc++. Smallest viable attack surface for a binary that has to dlopen libmunge at runtime.
  • 🔁 Reproducible build chain — binaries built with pinned Go 1.26.3 in CI; Docker images from pinned ubuntu:26.04 / gcr.io/distroless/cc-debian12:nonroot / debian:13-slim (libmunge extractor). All version bumps go through Dependabot PRs.

Detailed verification recipes (cosign for blobs, SBOM inspection, image labels) in docker/README.md.
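
As a rough illustration of the checksum verification described above, the sketch below wraps the two steps in shell functions: check the cosign signature on the checksum file, then compare downloaded archives against it. The function names and the .pem / .sig file names are assumptions, as is the use of SHA-256 (the GoReleaser default). The identity regexp and OIDC issuer are copied from the image verification command above. Treat docker/README.md as the authoritative recipe.

```shell
#!/bin/sh
# verify_checksums_signature DIR
# Check the cosign keyless signature on the checksum file.
# Assumption: the certificate and signature sit next to the checksum file as
# slurm_exporter_checksums.txt.pem and slurm_exporter_checksums.txt.sig
# (hypothetical names; use whatever the release page actually ships).
verify_checksums_signature() {
  dir=$1
  cosign verify-blob \
    --certificate "$dir/slurm_exporter_checksums.txt.pem" \
    --signature "$dir/slurm_exporter_checksums.txt.sig" \
    --certificate-identity-regexp 'https://github.com/SckyzO/slurm_exporter/.github/workflows/release.yml@.*' \
    --certificate-oidc-issuer https://token.actions.githubusercontent.com \
    "$dir/slurm_exporter_checksums.txt"
}

# verify_archives DIR
# Compare every downloaded archive in DIR against the checksum file,
# skipping entries for archives that were not downloaded.
# Assumption: the checksum file holds SHA-256 digests in sha256sum format.
verify_archives() {
  ( cd "$1" && sha256sum --check --ignore-missing slurm_exporter_checksums.txt )
}
```

Run `verify_checksums_signature .` first and `verify_archives .` second, so the archives are only compared against a checksum file whose signature has already been checked. Both functions return non-zero on failure.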


🤖 Automation

The repo runs a few autonomous workflows so dependencies and images stay fresh without manual babysitting:

  • Dependabot weekly — Monday 05:00 Europe/Paris, four ecosystems (Go modules, GitHub Actions, two Docker base images). Related deps grouped (golang.org/x/*, github.com/prometheus/*, etc.).
  • make report-deps — on-demand tabular snapshot of every Go module (direct + indirect) with patch/minor/major bump classification. Runs in the containerized toolchain, no host Go required.
  • Trivy weekly scan — Monday 06:00 UTC against the published images; CVE regressions show as workflow failures.
  • Docker Hub README sync: docker/README.md is mirrored to the Docker Hub repo description on every push to master (and on every release).
  • Auto Docker image refresh — every release tag triggers GoReleaser, builds both variants for both architectures, pushes to GHCR + Docker Hub, signs every manifest, and emits SBOMs.

🤝 Contributing

PRs and issues welcome. Before sending a contribution:

  • Read CONTRIBUTING.md — covers the Definition of Done, code conventions (initialisms, collector pattern, test fixtures), and the Common Pitfalls section (truncation gotcha on squeue -O field: / sinfo --Format, multi-arch path differences, etc.).
  • Run make check (containerized vet + lint + test) and make report (offline goreportcard, must stay ≥ B) before opening a PR.
  • One issue → one branch → one PR. Don't mix refactoring and new features in the same change.

The release process and the validation playbook live in docs/release-process.md and docs/validation-checklist.md.


📜 License

This project is licensed under the GNU General Public License, version 3 or later.


🍴 About this fork

Fork of cea-hpc/slurm_exporter, itself a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).

Looking ahead: for Slurm 25.11+ deployments with native OpenMetrics support, see the next-generation project at sckyzo/slurm_prometheus_exporter.
