Nvswitch telemetry gaps #2945

Draft
mkoci wants to merge 27 commits into NVIDIA:main from mkoci:nvswitch_telemetry_gaps

Conversation

@mkoci mkoci commented Jun 27, 2026

Description

This PR covers the NVSwitch portion of #2283. It closes the GB200 NVSwitch telemetry gaps across three sources: NVUE gNMI streaming, NVUE REST, and NMX-T.

Type of Change

  • Add - New feature or capability

Related Issues

#2283

Testing

  • Unit tests added/updated
  • Manual testing performed

Additional Notes

TLS verification that was previously skipped unconditionally is now gated behind the dangerously_skip_tls_verification option in the config surface.

  • Source mappings were validated against live GB200 NVSwitch endpoints (gNMI / NVUE-REST / NMX-T).
  • Deployment note: gNMI TLS verification is now strict by default. Lab or self-signed
    NVOS gNMI endpoints must set dangerously_skip_tls_verification = true to connect.
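
A minimal sketch of the strict-by-default gating described above, for reviewers who want the intent in code form. Only the field name dangerously_skip_tls_verification comes from this PR; the GnmiTlsConfig struct, the TlsVerification enum, and the tls_verification function are hypothetical names used for illustration and are not the types in this change.

```rust
/// Hypothetical config struct. Only the field name is taken from the PR.
#[derive(Debug, Clone, Default)]
struct GnmiTlsConfig {
    /// Defaults to false, so verification is strict unless explicitly disabled.
    dangerously_skip_tls_verification: bool,
}

/// Hypothetical description of how the collector should treat server certs.
#[derive(Debug, PartialEq, Eq)]
enum TlsVerification {
    /// Verify the server certificate chain as usual.
    Strict,
    /// Accept any certificate (lab or self-signed endpoints only).
    SkipVerification,
}

/// Map the config flag to a verification mode. The insecure path is only
/// reachable when the operator opts in explicitly.
fn tls_verification(cfg: &GnmiTlsConfig) -> TlsVerification {
    if cfg.dangerously_skip_tls_verification {
        TlsVerification::SkipVerification
    } else {
        TlsVerification::Strict
    }
}
```

With this shape, a config that omits the option resolves to strict verification, and a lab config has to set the flag to true to reach the skip-verification path.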

mkoci added 26 commits June 18, 2026 11:04
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ted mappings

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…VUE REST

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… sources

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…etheus sink

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…el cardinality fixes

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ation changes

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…h_serial label

Compose OTLP metric name as {prefix}_{name}_{metric_type}_{unit} to match the
Prometheus sink, and promote switch_serial/switch_id onto datapoint attributes so
Grafana switch dashboards resolve identically across export paths.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
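
The naming and label promotion described in the commit above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: compose_metric_name and promote_switch_labels are hypothetical names, skipping empty name parts is an assumption, and so is the idea that switch_serial/switch_id start out on a resource-level label map.

```rust
use std::collections::BTreeMap;

/// Compose a metric name as {prefix}_{name}_{metric_type}_{unit}.
/// Assumption: empty parts (for example a unitless metric) are skipped
/// so the result never contains a doubled or trailing underscore.
fn compose_metric_name(prefix: &str, name: &str, metric_type: &str, unit: &str) -> String {
    [prefix, name, metric_type, unit]
        .iter()
        .filter(|part| !part.is_empty())
        .copied()
        .collect::<Vec<&str>>()
        .join("_")
}

/// Labels named in the commit message as promoted onto datapoints.
const PROMOTED_KEYS: [&str; 2] = ["switch_serial", "switch_id"];

/// Copy the switch identity labels from a (hypothetical) resource-level map
/// onto the datapoint attributes, without overwriting values already present.
fn promote_switch_labels(
    resource_attrs: &BTreeMap<String, String>,
    datapoint_attrs: &mut BTreeMap<String, String>,
) {
    for &key in PROMOTED_KEYS.iter() {
        if let Some(value) = resource_attrs.get(key) {
            datapoint_attrs
                .entry(key.to_string())
                .or_insert_with(|| value.clone());
        }
    }
}
```

For example, compose_metric_name("carbide_hardware_health", "nvue_gnmi_fan_speed", "gauge", "rpm") yields carbide_hardware_health_nvue_gnmi_fan_speed_gauge_rpm; the metric name and unit here are made up for the example.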
The NMX-T collector built its reqwest client without danger_accept_invalid_certs,
unlike the sibling NVUE REST collector. On minimal runtime images this fails at
client build time (native-root-CA load) and the switch serves a self-signed cert
anyway, so NMX-T never collected. Match the NVUE REST self-signed handling.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
tonic 0.14 auto-injects a strict system-root TLS verifier for https:// URIs
(Endpoint::from) and layers its own TlsConnector over any custom connector
(channel/service/connector.rs). That silently negated the hand-rolled
hyper-rustls skip-verify connector, so tonic strictly verified and rejected
NVOS's self-signed gNMI cert -- the channel died right after the server
Certificate message (opaque 'transport error', no HTTP/2 frames).

Use Endpoint::tls_config_with_verifier(ClientTlsConfig::new(), <verifier>) so
the AcceptAnyCertVerifier is applied in tonic's own TLS layer; drop the
hand-rolled connector. tls.rs now exposes accept_any_cert_verifier() instead
of self_signed_tls_config().

Validated on gb-nvl-124-switch06: gNMI SAMPLE+ON_CHANGE streams connect and
86 carbide_hardware_health_nvue_gnmi_* metric families flow via the OtlpSink
into VictoriaMetrics.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…nfig

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… config for dev

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
The generated matrix/validation docs were already dropped in
3b0a075 (chore(health): remove temp docs from repo), but the
one-shot generator script was missed. It has no callers, its
required inputs are not in the repo, and its outputs are no longer
tracked, so it cannot run from a clean checkout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
copy-pr-bot Bot commented Jun 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 27, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
@mkoci mkoci force-pushed the nvswitch_telemetry_gaps branch from b4b9e2a to 961fd8c Compare June 28, 2026 02:28
copy-pr-bot Bot commented Jun 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.
