feat(node-observer): wait for topograph health in-process by giuliocalzo · Pull Request #363 · NVIDIA/topograph

giuliocalzo · 2026-06-25T19:45:56Z

Description

Replace the node-observer chart's curl-based wait init container with an in-process health wait in the node-observer binary, removing the dependency on the curlimages/curl image for this subchart.

node-observer waits for the topograph API to become healthy before starting its watch loop. The controller polls /healthz (derived from generateTopologyUrl) every 2s and gives up after 1m, returning an error so the pod restarts if topograph never comes up. Timeout errors report the actual elapsed time rather than the nominal deadline.
The wait init container, the waitImage value, and the node-observer.waitImage helper are removed.
Helm-unittest coverage and snapshots updated for the slimmer Deployment.

Related follow-up (separate PR): node-data-broker main-container / health-endpoint changes.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
All commits are signed off per DCO (git commit -s).

Test plan

go test ./pkg/node_observer/...
make chart-test

greptile-apps · 2026-06-25T19:51:41Z

Greptile Summary

This PR replaces curl-based init-container health-gate patterns in the node-observer and node-data-broker subcharts with in-process health waiting, eliminating the curlimages/curl dependency. For node-observer, a new waitForTopograph poller in pkg/node_observer/health.go blocks Controller.Start() until the topograph /healthz endpoint responds, then exits non-zero (causing a pod restart) after a hardcoded 1-minute timeout.

node-observer: Adds health.go with healthCheckURL and waitForTopograph, wires them into Controller.Start(), removes the wait init container and associated waitImage Helm values.
Helm / tests: Snapshot and unit tests updated to assert no initContainers block and correct main-container command; node-observer.waitImage helper removed from _helpers.tpl.

Confidence Score: 5/5

Safe to merge; the in-process health gate is a straightforward replacement of the curl init-container with correct timeout and error handling.

The polling loop correctly treats any non-2xx response as a failure (via DoRequest's status-code check), wraps the real context error with elapsed time, and is covered by three focused tests for the ready, timeout, and cancellation paths. Helm templates, snapshots, and unit tests are all updated consistently.

No files require special attention.

Important Files Changed

Filename	Overview
pkg/node_observer/health.go	New file implementing in-process health polling; logic is correct, error wrapping is accurate, and elapsed-time reporting addresses the previous review thread.
pkg/node_observer/health_test.go	Comprehensive tests for URL derivation, retry-until-success, deadline expiry, and context cancellation; all three timeout/cancel cases use ErrorIs for precise error-chain assertion.
pkg/node_observer/controller.go	Correctly wires healthCheckURL into NewController and calls waitForTopograph before starting the informer; no regressions observed.
charts/topograph/charts/node-observer/templates/deployment.yaml	initContainers block removed cleanly; main container command unchanged; no probe config needed here since health-gate moved in-process.
charts/topograph/charts/node-observer/values.yaml	waitImage block removed; no leftover references.
charts/topograph/charts/node-observer/templates/_helpers.tpl	node-observer.waitImage helper removed; remaining helpers unaffected.
charts/topograph/tests/subcharts_test.yaml	Test correctly updated: asserts no initContainers and validates the main container command value.
charts/topograph/tests/snapshot/render_snapshot_test.yaml.snap	All eight snapshot variants updated consistently to remove the wait init-container block.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant K8s as Kubernetes
    participant NO as node-observer pod
    participant C as Controller.Start()
    participant H as waitForTopograph()
    participant T as topograph /healthz

    K8s->>NO: Start container
    NO->>C: Start()
    C->>H: waitForTopograph(ctx, healthURL, 2s, 1m)
    loop Every 2s until ready or 1m timeout
        H->>T: GET /healthz
        alt 2xx response
            T-->>H: 200 OK
            H-->>C: nil (ready)
        else non-2xx or error
            T-->>H: 503 / connection refused
            H->>H: sleep 2s
        end
    end
    alt Timeout (1m elapsed)
        H-->>C: error (DeadlineExceeded)
        C-->>NO: non-zero exit → pod restarts
    else Ready
        C->>C: statusInformer.Start()
        C-->>NO: watch loop running
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant K8s as Kubernetes
    participant NO as node-observer pod
    participant C as Controller.Start()
    participant H as waitForTopograph()
    participant T as topograph /healthz

    K8s->>NO: Start container
    NO->>C: Start()
    C->>H: waitForTopograph(ctx, healthURL, 2s, 1m)
    loop Every 2s until ready or 1m timeout
        H->>T: GET /healthz
        alt 2xx response
            T-->>H: 200 OK
            H-->>C: nil (ready)
        else non-2xx or error
            T-->>H: 503 / connection refused
            H->>H: sleep 2s
        end
    end
    alt Timeout (1m elapsed)
        H-->>C: error (DeadlineExceeded)
        C-->>NO: non-zero exit → pod restarts
    else Ready
        C->>C: statusInformer.Start()
        C-->>NO: watch loop running
    end

_{Reviews (6): Last reviewed commit: "feat(node-observer): wait for topograph ..." | Re-trigger Greptile}

github-actions · 2026-06-25T19:52:22Z

🌿 Preview your docs: https://nvidia-preview-pull-request-363.docs.buildwithfern.com/topograph

greptile-apps · 2026-06-25T20:48:20Z

Want your agent to iterate on Greptile's feedback? Try greploops.

dmitsh · 2026-06-26T18:38:55Z

@giuliocalzo , could you split it into 2 PRs, one for node-observer, one for node-data-broker?
Could you also open an issue for each PR?

Move the topograph readiness wait out of the chart's wait init container and into the node-observer binary. The controller polls /healthz (derived from generateTopologyUrl) every 2s and gives up after 1m, reporting the actual elapsed time on timeout. Remove the wait init container, waitImage value, and node-observer.waitImage helper. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

giuliocalzo · 2026-06-27T14:34:41Z

Closing in favor of #367.

Both PRs remove the node-observer wait init container and move readiness handling into the main process, but they touch the same chart templates and controller wiring:

feat(node-observer): wait for topograph health in-process #363 — polls /healthz at startup, then starts the existing watch loop. Does not re-trigger if the Topograph API restarts later.
fix(node-observer): refresh topology after API restarts #367 — watches the Topograph API pod (Ready + restart count) and triggers regeneration after API restarts (closes [FEA] Refresh topology after API restarts #366).

#367 is the broader maintainer fix for the same problem space, so this PR is superseded. The node-data-broker work continues separately in #368.

giuliocalzo requested review from dmitsh and ravisoundar as code owners June 25, 2026 19:45

greptile-apps Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread charts/topograph/charts/node-data-broker/templates/daemonset.yaml Outdated

Comment thread pkg/node_observer/health.go

giuliocalzo force-pushed the feat/node-observer-wait-healthz branch from db1e76d to e263cd8 Compare June 25, 2026 20:43

giuliocalzo force-pushed the feat/node-observer-wait-healthz branch from e263cd8 to ccdeadf Compare June 26, 2026 09:58

giuliocalzo force-pushed the feat/node-observer-wait-healthz branch from e9a87e7 to 2c938cf Compare June 27, 2026 14:31

giuliocalzo changed the title ~~feat: in-process health waits for node-observer and node-data-broker~~ feat(node-observer): wait for topograph health in-process Jun 27, 2026

giuliocalzo mentioned this pull request Jun 27, 2026

feat(node-data-broker): run broker as main container with health probes #368

Open

6 tasks

giuliocalzo closed this Jun 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(node-observer): wait for topograph health in-process#363

feat(node-observer): wait for topograph health in-process#363
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/node-observer-wait-healthz

giuliocalzo commented Jun 25, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

greptile-apps Bot commented Jun 25, 2026

Uh oh!

dmitsh commented Jun 26, 2026

Uh oh!

giuliocalzo commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

giuliocalzo commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Test plan

Uh oh!

greptile-apps Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

greptile-apps Bot commented Jun 25, 2026

Uh oh!

dmitsh commented Jun 26, 2026

Uh oh!

giuliocalzo commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

giuliocalzo commented Jun 25, 2026 •

edited

Loading

greptile-apps Bot commented Jun 25, 2026 •

edited

Loading