Skip to content

feat(node-observer): wait for topograph health in-process#363

Closed
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/node-observer-wait-healthz
Closed

feat(node-observer): wait for topograph health in-process#363
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/node-observer-wait-healthz

Conversation

@giuliocalzo

@giuliocalzo giuliocalzo commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Description

Replace the node-observer chart's curl-based wait init container with an in-process health wait in the node-observer binary, removing the dependency on the curlimages/curl image for this subchart.

  • node-observer waits for the topograph API to become healthy before starting its watch loop. The controller polls /healthz (derived from generateTopologyUrl) every 2s and gives up after 1m, returning an error so the pod restarts if topograph never comes up. Timeout errors report the actual elapsed time rather than the nominal deadline.
  • The wait init container, the waitImage value, and the node-observer.waitImage helper are removed.
  • Helm-unittest coverage and snapshots updated for the slimmer Deployment.

Related follow-up (separate PR): node-data-broker main-container / health-endpoint changes.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

Test plan

  • go test ./pkg/node_observer/...
  • make chart-test

@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces curl-based init-container health-gate patterns in the node-observer and node-data-broker subcharts with in-process health waiting, eliminating the curlimages/curl dependency. For node-observer, a new waitForTopograph poller in pkg/node_observer/health.go blocks Controller.Start() until the topograph /healthz endpoint responds, then exits non-zero (causing a pod restart) after a hardcoded 1-minute timeout.

  • node-observer: Adds health.go with healthCheckURL and waitForTopograph, wires them into Controller.Start(), removes the wait init container and associated waitImage Helm values.
  • Helm / tests: Snapshot and unit tests updated to assert no initContainers block and correct main-container command; node-observer.waitImage helper removed from _helpers.tpl.

Confidence Score: 5/5

Safe to merge; the in-process health gate is a straightforward replacement of the curl init-container with correct timeout and error handling.

The polling loop correctly treats any non-2xx response as a failure (via DoRequest's status-code check), wraps the real context error with elapsed time, and is covered by three focused tests for the ready, timeout, and cancellation paths. Helm templates, snapshots, and unit tests are all updated consistently.

No files require special attention.

Important Files Changed

Filename Overview
pkg/node_observer/health.go New file implementing in-process health polling; logic is correct, error wrapping is accurate, and elapsed-time reporting addresses the previous review thread.
pkg/node_observer/health_test.go Comprehensive tests for URL derivation, retry-until-success, deadline expiry, and context cancellation; all three timeout/cancel cases use ErrorIs for precise error-chain assertion.
pkg/node_observer/controller.go Correctly wires healthCheckURL into NewController and calls waitForTopograph before starting the informer; no regressions observed.
charts/topograph/charts/node-observer/templates/deployment.yaml initContainers block removed cleanly; main container command unchanged; no probe config needed here since health-gate moved in-process.
charts/topograph/charts/node-observer/values.yaml waitImage block removed; no leftover references.
charts/topograph/charts/node-observer/templates/_helpers.tpl node-observer.waitImage helper removed; remaining helpers unaffected.
charts/topograph/tests/subcharts_test.yaml Test correctly updated: asserts no initContainers and validates the main container command value.
charts/topograph/tests/snapshot/render_snapshot_test.yaml.snap All eight snapshot variants updated consistently to remove the wait init-container block.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant K8s as Kubernetes
    participant NO as node-observer pod
    participant C as Controller.Start()
    participant H as waitForTopograph()
    participant T as topograph /healthz

    K8s->>NO: Start container
    NO->>C: Start()
    C->>H: waitForTopograph(ctx, healthURL, 2s, 1m)
    loop Every 2s until ready or 1m timeout
        H->>T: GET /healthz
        alt 2xx response
            T-->>H: 200 OK
            H-->>C: nil (ready)
        else non-2xx or error
            T-->>H: 503 / connection refused
            H->>H: sleep 2s
        end
    end
    alt Timeout (1m elapsed)
        H-->>C: error (DeadlineExceeded)
        C-->>NO: non-zero exit → pod restarts
    else Ready
        C->>C: statusInformer.Start()
        C-->>NO: watch loop running
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant K8s as Kubernetes
    participant NO as node-observer pod
    participant C as Controller.Start()
    participant H as waitForTopograph()
    participant T as topograph /healthz

    K8s->>NO: Start container
    NO->>C: Start()
    C->>H: waitForTopograph(ctx, healthURL, 2s, 1m)
    loop Every 2s until ready or 1m timeout
        H->>T: GET /healthz
        alt 2xx response
            T-->>H: 200 OK
            H-->>C: nil (ready)
        else non-2xx or error
            T-->>H: 503 / connection refused
            H->>H: sleep 2s
        end
    end
    alt Timeout (1m elapsed)
        H-->>C: error (DeadlineExceeded)
        C-->>NO: non-zero exit → pod restarts
    else Ready
        C->>C: statusInformer.Start()
        C-->>NO: watch loop running
    end
Loading

Reviews (6): Last reviewed commit: "feat(node-observer): wait for topograph ..." | Re-trigger Greptile

Comment thread charts/topograph/charts/node-data-broker/templates/daemonset.yaml Outdated
Comment thread pkg/node_observer/health.go
@github-actions

Copy link
Copy Markdown

@giuliocalzo giuliocalzo force-pushed the feat/node-observer-wait-healthz branch from db1e76d to e263cd8 Compare June 25, 2026 20:43
@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Want your agent to iterate on Greptile's feedback? Try greploops.

@giuliocalzo giuliocalzo force-pushed the feat/node-observer-wait-healthz branch from e263cd8 to ccdeadf Compare June 26, 2026 09:58
@dmitsh

dmitsh commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

@giuliocalzo , could you split it into 2 PRs, one for node-observer, one for node-data-broker?
Could you also open an issue for each PR?

Move the topograph readiness wait out of the chart's wait init container and
into the node-observer binary. The controller polls /healthz (derived from
generateTopologyUrl) every 2s and gives up after 1m, reporting the actual
elapsed time on timeout. Remove the wait init container, waitImage value, and
node-observer.waitImage helper.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@giuliocalzo giuliocalzo force-pushed the feat/node-observer-wait-healthz branch from e9a87e7 to 2c938cf Compare June 27, 2026 14:31
@giuliocalzo giuliocalzo changed the title feat: in-process health waits for node-observer and node-data-broker feat(node-observer): wait for topograph health in-process Jun 27, 2026
@giuliocalzo

Copy link
Copy Markdown
Collaborator Author

Closing in favor of #367.

Both PRs remove the node-observer wait init container and move readiness handling into the main process, but they touch the same chart templates and controller wiring:

#367 is the broader maintainer fix for the same problem space, so this PR is superseded. The node-data-broker work continues separately in #368.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants