feat(engine/slinky): discover partitions via the controller pod #362
giuliocalzo wants to merge 1 commit into
A Slinky cluster always runs a controller (slurmctld), but login pods are optional. getPartitionNodes previously execed `scontrol show partition` only in a login pod, so partition discovery failed on clusters installed without login pods.

Prefer the controller pod and fall back to a login pod when one is present. Update the chart RBAC rationale and the Slinky engine docs to match.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
    if err != nil {
        return "", err
    }
API error on controller listing aborts the login fallback
If GetPodsByLabels returns a non-nil error for the controller component (e.g., an RBAC misconfiguration that allows listing login pods but not controller pods), the function returns immediately and never attempts the login fallback. The original code had symmetric behavior for login pods, so this isn't a new class of failure, but the addition of the controller leg means a new RBAC permission (list pods scoped to controller) is now exercised first. A cluster that previously worked with only login-pod discovery will silently break if the service account lacks permission to list controller pods, even though a login pod is available.
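One way to address this, sketched below under stated assumptions, is to record a listing error for a component and continue to the next one instead of returning immediately. The names `podLister`, `findExecPod`, `fakeLister`, and `errForbidden` are hypothetical and are not taken from the PR; the lister is a simplified stand-in for `GetPodsByLabels`, whose real signature is not shown here. `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden simulates an RBAC denial from the API server (illustrative only).
var errForbidden = errors.New("pods is forbidden")

// podLister is a hypothetical stand-in for the engine's pod lookup. It returns
// the names of running pods for one component label value.
type podLister func(component string) ([]string, error)

// fakeLister builds a podLister from fixed data: pods maps a component to its
// running pod names, and denied marks components whose listing fails.
func fakeLister(pods map[string][]string, denied map[string]bool) podLister {
	return func(component string) ([]string, error) {
		if denied[component] {
			return nil, errForbidden
		}
		return pods[component], nil
	}
}

// findExecPod tries each component in order and returns the first running pod.
// A listing error is recorded rather than returned, so later components are
// still attempted. If nothing is found, all collected errors are reported.
func findExecPod(list podLister, namespace string, components []string) (string, error) {
	var errs []error
	for _, c := range components {
		pods, err := list(c)
		if err != nil {
			errs = append(errs, fmt.Errorf("list %s pods: %w", c, err))
			continue
		}
		if len(pods) > 0 {
			return pods[0], nil
		}
	}
	errs = append(errs, fmt.Errorf("no running pods for components %v in namespace %q",
		components, namespace))
	return "", errors.Join(errs...)
}
```

With this shape, a service account that can list login pods but not controller pods still reaches the login fallback, and the controller listing error only surfaces if no pod is found at all. Whether a listing error should be tolerated or treated as fatal is a design choice for the maintainers.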
🌿 Preview your docs: https://nvidia-preview-pull-request-362.docs.buildwithfern.com/topograph
Description
A Slinky cluster always runs a controller (`slurmctld`), but login pods are optional. The Slinky engine's legacy partition discovery (`getPartitionNodes`) execed `scontrol show partition` only in a pod labeled `app.kubernetes.io/component: login`, so discovery failed on clusters installed without login pods.

This changes discovery to prefer the controller pod and fall back to a login pod when one is present:

- `pkg/engines/slinky/engine.go`: iterate `["controller", "login"]` component labels; exec in the first running pod found. Component label key/values extracted into named constants. The not-found error now names both components and the namespace.
- `pkg/engines/slinky/engine_test.go`: new `TestGetPartitionNodes` covers the dynamic-nodes short-circuit, both parameter-validation errors, and the no-running-pods path (the `ExecInPod` success path isn't exercisable with a fake clientset).
- `charts/topograph/templates/rbac.yaml`: `pods/exec` rationale comment updated to reflect controller-first discovery.
- `docs/engines/slinky.md`: partition-discovery fallback row updated; the controller is always present, login pods are optional.

This is backward compatible: clusters that only have login pods (and no controller match) still work via the fallback.
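The selection order described above can be illustrated with a minimal sketch. It is not the code from `engine.go`: the `pod` struct is a simplified stand-in for a Kubernetes pod object, and the constant and function names are assumptions made for illustration. Only the label key `app.kubernetes.io/component` and the values `controller` and `login` come from the description.

```go
package main

import "fmt"

// Label key and values. The description says these were extracted into named
// constants; the identifiers used here are illustrative guesses.
const (
	componentLabelKey   = "app.kubernetes.io/component"
	componentController = "controller"
	componentLogin      = "login"
)

// pod is a simplified stand-in for a Kubernetes pod object.
type pod struct {
	Name    string
	Labels  map[string]string
	Running bool
}

// pickExecPod returns the name of the first running pod, preferring the
// controller component and falling back to login. The error names both
// components and the namespace.
func pickExecPod(pods []pod, namespace string) (string, error) {
	order := []string{componentController, componentLogin}
	for _, component := range order {
		for _, p := range pods {
			if p.Running && p.Labels[componentLabelKey] == component {
				return p.Name, nil
			}
		}
	}
	return "", fmt.Errorf("no running %s or %s pod found in namespace %q",
		componentController, componentLogin, namespace)
}
```

Because the component order is fixed, a running controller pod is chosen even when a login pod appears earlier in the list, and a cluster with only login pods still resolves through the second pass.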
Checklist
git commit -s).