feat(node-observer): wait for topograph health in-process #363
giuliocalzo wants to merge 1 commit into
Greptile Summary
This PR replaces curl-based init-container health-gate patterns in the
Confidence Score: 5/5
Safe to merge; the in-process health gate is a straightforward replacement of the curl init-container with correct timeout and error handling.
The polling loop correctly treats any non-2xx response as a failure (via DoRequest's status-code check), wraps the real context error with elapsed time, and is covered by three focused tests for the ready, timeout, and cancellation paths. Helm templates, snapshots, and unit tests are all updated consistently.
No files require special attention.
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant K8s as Kubernetes
participant NO as node-observer pod
participant C as Controller.Start()
participant H as waitForTopograph()
participant T as topograph /healthz
K8s->>NO: Start container
NO->>C: Start()
C->>H: waitForTopograph(ctx, healthURL, 2s, 1m)
loop Every 2s until ready or 1m timeout
H->>T: GET /healthz
alt 2xx response
T-->>H: 200 OK
H-->>C: nil (ready)
else non-2xx or error
T-->>H: 503 / connection refused
H->>H: sleep 2s
end
end
alt Timeout (1m elapsed)
H-->>C: error (DeadlineExceeded)
C-->>NO: non-zero exit → pod restarts
else Ready
C->>C: statusInformer.Start()
C-->>NO: watch loop running
end
🌿 Preview your docs: https://nvidia-preview-pull-request-363.docs.buildwithfern.com/topograph
db1e76d to e263cd8
e263cd8 to ccdeadf
@giuliocalzo, could you split this into two PRs, one for node-observer and one for node-data-broker?
Move the topograph readiness wait out of the chart's wait init container and into the node-observer binary. The controller polls /healthz (derived from generateTopologyUrl) every 2s and gives up after 1m, reporting the actual elapsed time on timeout.

Remove the wait init container, waitImage value, and node-observer.waitImage helper.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
e9a87e7 to 2c938cf
Closing in favor of #367. Both PRs remove the node-observer
#367 is the broader maintainer fix for the same problem space, so this PR is superseded. The node-data-broker work continues separately in #368.
Description
Replace the node-observer chart's curl-based `wait` init container with an in-process health wait in the `node-observer` binary, removing the dependency on the `curlimages/curl` image for this subchart.
- The controller polls `/healthz` (derived from `generateTopologyUrl`) every 2s and gives up after 1m, returning an error so the pod restarts if topograph never comes up. Timeout errors report the actual elapsed time rather than the nominal deadline.
- The `wait` init container, the `waitImage` value, and the `node-observer.waitImage` helper are removed.

Related follow-up (separate PR): node-data-broker main-container / health-endpoint changes.
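One way the `/healthz` URL could be derived from `generateTopologyUrl` is sketched below. The PR text does not show the derivation, so this is an assumption: `healthURLFrom` is a hypothetical helper, and the example host, port, and `/v1/generate` path are purely illustrative. It keeps the scheme and host of the configured URL and swaps the path for `/healthz`.

```go
package main

import (
	"fmt"
	"net/url"
)

// healthURLFrom (hypothetical) reuses the scheme and host of the configured
// generateTopologyUrl and replaces path, query, and fragment with /healthz.
func healthURLFrom(generateTopologyURL string) (string, error) {
	u, err := url.Parse(generateTopologyURL)
	if err != nil {
		return "", fmt.Errorf("parse generateTopologyUrl: %w", err)
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("generateTopologyUrl %q must be an absolute URL", generateTopologyURL)
	}
	u.Path = "/healthz"
	u.RawPath = ""
	u.RawQuery = ""
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	// Illustrative value only; the real chart value may look different.
	h, err := healthURLFrom("http://topograph.example:49021/v1/generate?dryRun=true")
	if err != nil {
		panic(err)
	}
	fmt.Println(h) // prints "http://topograph.example:49021/healthz"
}
```

Rejecting relative URLs up front turns a misconfigured value into an immediate error rather than a minute of failed polls.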
Checklist
git commit -s).
Test plan
- `go test ./pkg/node_observer/...`
- `make chart-test`