Skip to content

feat(engine/slinky): discover partitions via the controller pod#362

Open
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/slinky-controller-partition-discovery
Open

feat(engine/slinky): discover partitions via the controller pod#362
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/slinky-controller-partition-discovery

Conversation

@giuliocalzo

Copy link
Copy Markdown
Collaborator

Description

A Slinky cluster always runs a controller (slurmctld), but login pods are optional. The Slinky engine's legacy partition discovery (getPartitionNodes) execed scontrol show partition only in a pod labeled app.kubernetes.io/component: login, so discovery failed on clusters installed without login pods.

This changes discovery to prefer the controller pod and fall back to a login pod when one is present:

  • pkg/engines/slinky/engine.go — iterate ["controller", "login"] component labels; exec in the first running pod found. Component label key/values extracted into named constants. The not-found error now names both components and the namespace.
  • pkg/engines/slinky/engine_test.go — new TestGetPartitionNodes covers the dynamic-nodes short-circuit, both parameter-validation errors, and the no-running-pods path (the ExecInPod success path isn't exercisable with a fake clientset).
  • charts/topograph/templates/rbac.yamlpods/exec rationale comment updated to reflect controller-first discovery.
  • docs/engines/slinky.md — partition-discovery fallback row updated; the controller is always present, login pods are optional.

This is backward compatible: clusters that only have login pods (and no controller match) still work via the fallback.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

A Slinky cluster always runs a controller (slurmctld) but login pods are
optional. getPartitionNodes previously execed `scontrol show partition` only
in a login pod, so partition discovery failed on clusters installed without
login pods. Prefer the controller pod and fall back to a login pod when one
is present. Update the chart RBAC rationale and the Slinky engine docs to
match.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes partition discovery in the Slinky engine by preferring the always-present controller pod (slurmctld) over the optional login pod when running scontrol show partition. The change is backward compatible — clusters with only login pods continue to work via fallback.

  • engine.go: getPartitionNodes now iterates [\"controller\", \"login\"] component labels in order, using the first running pod found; component label key/values are extracted into named constants; the not-found error now names both components and the namespace.
  • engine_test.go: New TestGetPartitionNodes covers the dynamic-nodes early return, both parameter-validation errors, and the no-running-pods path.
  • rbac.yaml / slinky.md: Comment and documentation updated to reflect controller-first discovery.

Confidence Score: 4/5

Safe to merge; the controller-first fallback is backward compatible and the logic is straightforward.

The implementation correctly falls back to login pods when no running controller pod is found. One edge case worth watching: if the Kubernetes API returns an error when listing controller pods (rather than an empty list), the login-pod fallback is skipped entirely. This mirrors the original single-component behavior but now applies to a newly exercised code path (controller listing) that didn't exist before.

pkg/engines/slinky/engine.go — specifically the error-return inside the component loop at the GetPodsByLabels call.

Important Files Changed

Filename Overview
pkg/engines/slinky/engine.go Adds controller-first pod discovery in getPartitionNodes; constants extracted for clarity; error message improved. API error on controller listing short-circuits the login fallback.
pkg/engines/slinky/engine_test.go New TestGetPartitionNodes covers dynamic-nodes short-circuit, parameter validation errors, and the no-running-pods path. ExecInPod success path is explicitly noted as not testable with a fake clientset.
charts/topograph/templates/rbac.yaml Comment-only update to pods/exec rationale reflecting controller-first discovery; no functional change.
docs/engines/slinky.md Partition-discovery fallback row updated to describe controller-first with login fallback; accurate and consistent with the implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[getPartitionNodes called] --> B{UseDynamicNodes?}
    B -- yes --> C[return dynamicShowPartitionNodes]
    B -- no --> D{params valid?}
    D -- no --> E[return validation error]
    D -- yes --> F[component = controller]
    F --> G[GetPodsByLabels: component=controller]
    G -- error --> H[return error]
    G -- ok --> I{Running pod found?}
    I -- yes --> J[ExecInPod scontrol show partition]
    J -- error --> K[return error]
    J -- ok --> L[return output]
    I -- no --> M[component = login]
    M --> N[GetPodsByLabels: component=login]
    N -- error --> O[return error]
    N -- ok --> P{Running pod found?}
    P -- yes --> Q[ExecInPod scontrol show partition]
    Q -- error --> R[return error]
    Q -- ok --> S[return output]
    P -- no --> T[return: no running controller or login pods]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[getPartitionNodes called] --> B{UseDynamicNodes?}
    B -- yes --> C[return dynamicShowPartitionNodes]
    B -- no --> D{params valid?}
    D -- no --> E[return validation error]
    D -- yes --> F[component = controller]
    F --> G[GetPodsByLabels: component=controller]
    G -- error --> H[return error]
    G -- ok --> I{Running pod found?}
    I -- yes --> J[ExecInPod scontrol show partition]
    J -- error --> K[return error]
    J -- ok --> L[return output]
    I -- no --> M[component = login]
    M --> N[GetPodsByLabels: component=login]
    N -- error --> O[return error]
    N -- ok --> P{Running pod found?}
    P -- yes --> Q[ExecInPod scontrol show partition]
    Q -- error --> R[return error]
    Q -- ok --> S[return output]
    P -- no --> T[return: no running controller or login pods]
Loading

Reviews (1): Last reviewed commit: "feat(engine/slinky): discover partitions..." | Re-trigger Greptile

Comment on lines 459 to 461
if err != nil {
return "", err
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 API error on controller listing aborts the login fallback

If GetPodsByLabels returns a non-nil error for the controller component (e.g., an RBAC misconfiguration that allows listing login pods but not controller pods), the function returns immediately and never attempts the login fallback. The original code had symmetric behavior for login pods, so this isn't a new class of failure, but the addition of the controller leg means a new RBAC permission (list pods scoped to controller) is now exercised first. A cluster that previously worked with only login-pod discovery will silently break if the service account lacks permission to list controller pods, even though a login pod is available.

@github-actions

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant