Skip to content

feat(health): enrich machine log metadata#2882

Open
jayzhudev wants to merge 5 commits into
NVIDIA:mainfrom
jayzhudev:health/add-log-metadata
Open

feat(health): enrich machine log metadata#2882
jayzhudev wants to merge 5 commits into
NVIDIA:mainfrom
jayzhudev:health/add-log-metadata

Conversation

@jayzhudev

Copy link
Copy Markdown
Contributor

Propagate machine serial, GPU driver version, NVLink domain, and component type through health endpoint context into log-related sinks, so the downstream consumers can associate XID log records with the correct machine, driver, NVLink domain, and component type.

Driver version is derived from NICo machine discovery when all reported GPUs agree, or from static TOML metadata for static targets.

The following is an example for static configuration:

[[endpoint_sources.static_bmc_endpoints]]
ip = ...
port = ...
...
...
machine = {
  id = "fm100htjtiaehv1n5vh67tbmqq4eabcjdng40f7jupsadbedhruh6rag1l0",
  serial = "MN-001",
  driver_version = "570.82",
  slot_number = 15,
  tray_index = 5,
  nvlink_domain_uuid = "00000000-0000-0000-0000-000000000000"
}

Related issues

Closes #2876

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

@jayzhudev jayzhudev added this to the v2.0 milestone Jun 25, 2026
@jayzhudev jayzhudev self-assigned this Jun 25, 2026
@jayzhudev jayzhudev requested a review from a team as a code owner June 25, 2026 15:08
@jayzhudev jayzhudev changed the title Health/add log metadata feat(health): enrich machine log metadata Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Summary by CodeRabbit

  • New Features
    • Health events and logs now include richer machine metadata, such as machine serial number, GPU driver version, and component type.
    • Machine metadata is now also carried through JSONL output and structured tracing logs.
  • Bug Fixes
    • Empty or whitespace-only driver version values are now ignored.
    • Driver version is only shown when a single consistent value is available.
  • Documentation
    • Configuration now supports an optional machine driver version value.

Walkthrough

The PR propagates machine driver version and component type through health metadata, from discovery and configuration into EventContext, OTLP attributes, and sink outputs.

Changes

Machine metadata propagation

Layer / File(s) Summary
Metadata types and builders
crates/health/src/endpoint/model.rs, crates/health/src/config.rs, crates/health/src/endpoint/sources.rs, crates/health/src/api_client.rs, crates/health/src/discovery/spawn.rs, crates/health/benches/*
MachineData gains driver_version, static config and endpoint sources populate it, EndpointMetadata::component_type() is added, and benchmark fixtures build machine metadata with it unset.
GPU driver version derivation
crates/health/src/api_client.rs
ApiEndpointSource derives a machine driver version from discovery data and stores it on MachineData, with tests covering empty, conflicting, and normalized GPU versions.
EventContext accessors
crates/health/src/sink/events.rs, crates/health/src/processor/health_report.rs, crates/health/src/sink/health_report.rs, crates/health/src/sink/mod.rs, crates/health/src/sink/prometheus.rs
EventContext exposes machine serial, driver version, and component type, and the related test contexts and summaries are updated.
OTLP resource attributes
crates/health/src/otlp/convert.rs
OTLP resource attributes emit machine.serial, driver.version, and component.type when present, with updated machine, switch, and power-shelf tests.
Sink exports
crates/health/src/sink/log_file.rs, crates/health/src/sink/tracing.rs
JSONL records and tracing events carry machine_id, machine_serial, driver_version, component_type, and nvlink_domain_uuid.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 79.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the primary change: propagating machine metadata into health log exports.
Description check ✅ Passed The description is directly aligned with the code changes and accurately describes the new metadata propagation and static config example.
Linked Issues check ✅ Passed The changes satisfy #2876 by exporting machine serial, driver version, NVLink domain, and component type through health log sinks.
Out of Scope Changes check ✅ Passed No clear out-of-scope changes are evident; the added config, derivation, and sink wiring support the stated metadata propagation feature.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
crates/health/src/api_client.rs (2)

621-633: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Cover the machine.id fallback branch of machine_component_type.

The table currently exercises only the Host and Dpu arms. The non-trivial fallback (lines 626-631), where an unparseable/PowerShelf/Unknown Forge type defers to the MachineId-encoded type and ultimately to ComputeNode, is untested. A regression there would silently mislabel telemetry. Adding a couple of rows that set machine_type to Unknown/PowerShelf with and without an id would close the gap.

As per path instructions: prefer findings about "missing tests over style-only comments."

Also applies to: 819-832

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/api_client.rs` around lines 621 - 633, Add test coverage
for the fallback branch in machine_component_type so the MachineId-derived path
is exercised, not just the Host and Dpu cases. Extend the existing table-driven
tests around machine_component_type to include rpc::forge::MachineType::Unknown
and rpc::forge::MachineType::PowerShelf, both with a machine.id present
(verifying ComponentType::from_machine_type is used) and with machine.id absent
(verifying it falls back to ComponentType::ComputeNode). Keep the focus on the
machine_component_type function and its fallback logic so regressions in
telemetry classification are caught.

Source: Path instructions


601-619: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Express the unique-version computation as an iterator pipeline.

The style guide directs us to construct collections through iterators rather than seeding a mutable container inside a for loop, which makes the "done building" point explicit and reads more declaratively.

♻️ Proposed iterator-based rewrite
-fn unique_gpu_driver_version(
-    discovery_info: Option<&rpc::machine_discovery::DiscoveryInfo>,
-) -> Option<String> {
-    let mut versions = HashSet::new();
-    let discovery_info = discovery_info?;
-
-    for gpu in &discovery_info.gpus {
-        let version = gpu.driver_version.trim();
-        if !version.is_empty() {
-            versions.insert(version.to_string());
-        }
-    }
-
-    if versions.len() == 1 {
-        versions.into_iter().next()
-    } else {
-        None
-    }
-}
+fn unique_gpu_driver_version(
+    discovery_info: Option<&rpc::machine_discovery::DiscoveryInfo>,
+) -> Option<String> {
+    let versions: HashSet<&str> = discovery_info?
+        .gpus
+        .iter()
+        .map(|gpu| gpu.driver_version.trim())
+        .filter(|version| !version.is_empty())
+        .collect();
+
+    (versions.len() == 1)
+        .then(|| versions.into_iter().next().map(str::to_string))
+        .flatten()
+}
As per coding guidelines: "When building a Vec or a HashMap, prefer using iterators to building them from a for-loop."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/api_client.rs` around lines 601 - 619, Refactor
unique_gpu_driver_version to use an iterator pipeline instead of seeding a
mutable HashSet in a for loop. Keep the same behavior by starting from
discovery_info.gpus, trimming each gpu.driver_version, filtering out empty
strings, collecting into a HashSet, and then returning the single version only
when the set has exactly one entry. Preserve the existing function signature and
the unique-version semantics while making the construction of versions
declarative.

Source: Coding guidelines

crates/health/src/sink/tracing.rs (1)

66-70: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Debug-formatted optionals dilute logfmt searchability.

Recording these accessors with ? serializes the wrapping Option, so an XID record emits machine_serial=Some("MN-001") and driver_version=None rather than clean key/value pairs. Since these fields exist precisely so downstream analyzers can pivot on machine serial and driver version, the Some(..) wrappers and None noise work against that goal. Consider emitting bare values only when present (mirroring the JSONL sink's skip-if-None treatment in log_file.rs), e.g. via tracing::Span/conditional fields, so the structured output stays directly searchable.

This matches the existing machine_id = ?... style, so it is not a regression — flagging it as an observability refinement for the newly surfaced fields.

As per coding guidelines: "prefer placing common fields as attributes passed to tracing function, instead of using string interpolation" for logfmt searchability.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/sink/tracing.rs` around lines 66 - 70, The tracing fields
in the XID sink are logging Option values with `?`, which produces `Some(...)`
and `None` in logfmt instead of searchable key/value pairs. Update the event
construction in `tracing.rs` to emit these fields only when present, following
the skip-if-None approach used by the JSONL sink in `log_file.rs`, and keep the
common fields as tracing attributes rather than string interpolation. Focus on
the accessors on `context` such as `machine_serial`, `driver_version`,
`component_type`, and `nvlink_domain_uuid` so the output stays clean and
directly searchable.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/health/src/api_client.rs`:
- Around line 621-633: Add test coverage for the fallback branch in
machine_component_type so the MachineId-derived path is exercised, not just the
Host and Dpu cases. Extend the existing table-driven tests around
machine_component_type to include rpc::forge::MachineType::Unknown and
rpc::forge::MachineType::PowerShelf, both with a machine.id present (verifying
ComponentType::from_machine_type is used) and with machine.id absent (verifying
it falls back to ComponentType::ComputeNode). Keep the focus on the
machine_component_type function and its fallback logic so regressions in
telemetry classification are caught.
- Around line 601-619: Refactor unique_gpu_driver_version to use an iterator
pipeline instead of seeding a mutable HashSet in a for loop. Keep the same
behavior by starting from discovery_info.gpus, trimming each gpu.driver_version,
filtering out empty strings, collecting into a HashSet, and then returning the
single version only when the set has exactly one entry. Preserve the existing
function signature and the unique-version semantics while making the
construction of versions declarative.

In `@crates/health/src/sink/tracing.rs`:
- Around line 66-70: The tracing fields in the XID sink are logging Option
values with `?`, which produces `Some(...)` and `None` in logfmt instead of
searchable key/value pairs. Update the event construction in `tracing.rs` to
emit these fields only when present, following the skip-if-None approach used by
the JSONL sink in `log_file.rs`, and keep the common fields as tracing
attributes rather than string interpolation. Focus on the accessors on `context`
such as `machine_serial`, `driver_version`, `component_type`, and
`nvlink_domain_uuid` so the output stays clean and directly searchable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fb9d6dc3-d41b-481e-adbd-665c01dd396e

📥 Commits

Reviewing files that changed from the base of the PR and between 790a82d and bce6f86.

📒 Files selected for processing (17)
  • crates/health/benches/collector_pipeline.rs
  • crates/health/benches/processor_pipeline.rs
  • crates/health/benches/sink_pipeline.rs
  • crates/health/src/api_client.rs
  • crates/health/src/config.rs
  • crates/health/src/discovery/spawn.rs
  • crates/health/src/endpoint/mod.rs
  • crates/health/src/endpoint/model.rs
  • crates/health/src/endpoint/sources.rs
  • crates/health/src/otlp/convert.rs
  • crates/health/src/processor/health_report.rs
  • crates/health/src/sink/events.rs
  • crates/health/src/sink/health_report.rs
  • crates/health/src/sink/log_file.rs
  • crates/health/src/sink/mod.rs
  • crates/health/src/sink/prometheus.rs
  • crates/health/src/sink/tracing.rs

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 285 6 26 102 7 144
machine-validation-runner 744 32 188 267 36 221
machine_validation 744 32 188 267 36 221
machine_validation-aarch64 744 32 188 267 36 221
nvmetal-carbide 744 32 188 267 36 221
TOTAL 3267 134 778 1176 151 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@jayzhudev jayzhudev force-pushed the health/add-log-metadata branch from bce6f86 to 6c6776a Compare June 25, 2026 16:52
Comment thread crates/health/src/endpoint/model.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/health/src/api_client.rs (1)

758-789: 🎯 Functional Correctness | 🟡 Minor

Add the mixed known + unknown case.
["570.82", ""] should resolve to Some("570.82"); empty strings are treated as missing, not conflicting. This locks the exported driver-version label to the current contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/api_client.rs` around lines 758 - 789, The test for
unique_gpu_driver_version is missing the mixed known-plus-unknown case, so add a
scenario in unique_gpu_driver_version_uses_single_non_empty_version covering a
real version alongside an empty string and assert it returns the known version.
Update the value_scenarios set in crates/health/src/api_client.rs to treat empty
or whitespace-only driver versions as missing rather than conflicting, matching
the current contract enforced by unique_gpu_driver_version and
discovery_with_driver_versions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/health/src/api_client.rs`:
- Around line 758-789: The test for unique_gpu_driver_version is missing the
mixed known-plus-unknown case, so add a scenario in
unique_gpu_driver_version_uses_single_non_empty_version covering a real version
alongside an empty string and assert it returns the known version. Update the
value_scenarios set in crates/health/src/api_client.rs to treat empty or
whitespace-only driver versions as missing rather than conflicting, matching the
current contract enforced by unique_gpu_driver_version and
discovery_with_driver_versions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0ba81d83-cab5-4055-92f1-a918556f9a4b

📥 Commits

Reviewing files that changed from the base of the PR and between 6c6776a and 434088f.

📒 Files selected for processing (12)
  • crates/health/src/api_client.rs
  • crates/health/src/discovery/spawn.rs
  • crates/health/src/endpoint/model.rs
  • crates/health/src/endpoint/sources.rs
  • crates/health/src/otlp/convert.rs
  • crates/health/src/processor/health_report.rs
  • crates/health/src/sink/events.rs
  • crates/health/src/sink/health_report.rs
  • crates/health/src/sink/log_file.rs
  • crates/health/src/sink/mod.rs
  • crates/health/src/sink/prometheus.rs
  • crates/health/src/sink/tracing.rs
✅ Files skipped from review due to trivial changes (1)
  • crates/health/src/sink/mod.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/health/src/sink/tracing.rs
  • crates/health/src/sink/log_file.rs
  • crates/health/src/otlp/convert.rs

@jayzhudev jayzhudev force-pushed the health/add-log-metadata branch from 434088f to cd0bcf9 Compare June 25, 2026 21:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/health/src/otlp/convert.rs (1)

417-450: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert component.type remains present for machine endpoints when optional fields are absent.

This case verifies omission of optional machine attributes, but it does not protect the new component.type contract. component.type is derived from EndpointMetadata, so this test should still assert "compute_node" even when machine_serial, driver_version, and nvlink_domain_uuid are None.

Suggested assertion
         assert_eq!(attr_value(&attrs, "machine.serial"), None);
         assert_eq!(attr_value(&attrs, "driver.version"), None);
+        assert_eq!(attr_value(&attrs, "component.type"), Some("compute_node"));
         assert_eq!(attr_value(&attrs, "nvlink.domain.uuid"), None);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/otlp/convert.rs` around lines 417 - 450, The new test in
resource_attributes should also verify the stable EndpointMetadata-derived
attribute contract: keep asserting the optional machine fields are omitted, and
add an assertion that component.type is still emitted as compute_node for
MachineData even when machine_serial, driver_version, and nvlink_domain_uuid are
None. Use the existing resource_attributes and attr_value helpers in this test
to confirm the component.type mapping remains present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/health/src/otlp/convert.rs`:
- Around line 417-450: The new test in resource_attributes should also verify
the stable EndpointMetadata-derived attribute contract: keep asserting the
optional machine fields are omitted, and add an assertion that component.type is
still emitted as compute_node for MachineData even when machine_serial,
driver_version, and nvlink_domain_uuid are None. Use the existing
resource_attributes and attr_value helpers in this test to confirm the
component.type mapping remains present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5f5c41c5-d55d-4062-90f2-b8a1b07fff1f

📥 Commits

Reviewing files that changed from the base of the PR and between 434088f and cd0bcf9.

📒 Files selected for processing (13)
  • crates/health/benches/processor_pipeline.rs
  • crates/health/src/api_client.rs
  • crates/health/src/discovery/spawn.rs
  • crates/health/src/endpoint/model.rs
  • crates/health/src/endpoint/sources.rs
  • crates/health/src/otlp/convert.rs
  • crates/health/src/processor/health_report.rs
  • crates/health/src/sink/events.rs
  • crates/health/src/sink/health_report.rs
  • crates/health/src/sink/log_file.rs
  • crates/health/src/sink/mod.rs
  • crates/health/src/sink/prometheus.rs
  • crates/health/src/sink/tracing.rs
💤 Files with no reviewable changes (1)
  • crates/health/benches/processor_pipeline.rs
✅ Files skipped from review due to trivial changes (5)
  • crates/health/src/sink/prometheus.rs
  • crates/health/src/sink/health_report.rs
  • crates/health/src/discovery/spawn.rs
  • crates/health/src/sink/tracing.rs
  • crates/health/src/sink/log_file.rs
🚧 Files skipped from review as they are similar to previous changes (6)
  • crates/health/src/sink/mod.rs
  • crates/health/src/processor/health_report.rs
  • crates/health/src/endpoint/sources.rs
  • crates/health/src/endpoint/model.rs
  • crates/health/src/api_client.rs
  • crates/health/src/sink/events.rs

@ianderson-nvidia

Copy link
Copy Markdown
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/health/src/endpoint/sources.rs (1)

131-139: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Normalize static driver versions to match the discovery path.

Line 138 copies machine.driver_version verbatim, but unique_gpu_driver_version() in crates/health/src/api_client.rs trims whitespace and drops empty strings before emitting machine metadata. A static config like " 570.82 " or " " will therefore produce a different driver.version contract than the API-backed path, and that value is exported directly by the OTLP/log sinks.

Suggested fix
-                        driver_version: machine.driver_version.clone(),
+                        driver_version: machine
+                            .driver_version
+                            .as_deref()
+                            .map(str::trim)
+                            .filter(|version| !version.is_empty())
+                            .map(str::to_string),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/endpoint/sources.rs` around lines 131 - 139, The machine
metadata path in sources.rs copies machine.driver_version verbatim, which can
diverge from the normalized behavior used by unique_gpu_driver_version() in
api_client.rs. Update the EndpointMetadata::Machine construction to normalize
driver_version the same way as the discovery path by trimming whitespace and
treating empty strings as absent before emitting metadata, so static config and
API-backed values produce the same driver.version contract.
🧹 Nitpick comments (1)
crates/health/src/sink/tracing.rs (1)

85-89: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a tracing-sink test for the new metadata fields.

This is now the only new export surface in the cohort without an assertion that machine_serial, driver_version, component_type, and nvlink_domain_uuid are present in both log branches. A small subscriber-based test here would catch future field-name drift or one-branch-only regressions.

As per path instructions, "prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments."

Also applies to: 99-103

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/sink/tracing.rs` around lines 85 - 89, Add a
subscriber-based test around the tracing sink in tracing.rs that exercises both
log branches and asserts the new metadata fields are emitted: machine_serial,
driver_version, component_type, and nvlink_domain_uuid. Use the existing tracing
sink entrypoints and span/event structure in the tracing sink code to capture
emitted fields, then verify each field appears in both branches so future
field-name drift or one-branch-only regressions are caught.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/health/src/endpoint/sources.rs`:
- Around line 131-139: The machine metadata path in sources.rs copies
machine.driver_version verbatim, which can diverge from the normalized behavior
used by unique_gpu_driver_version() in api_client.rs. Update the
EndpointMetadata::Machine construction to normalize driver_version the same way
as the discovery path by trimming whitespace and treating empty strings as
absent before emitting metadata, so static config and API-backed values produce
the same driver.version contract.

---

Nitpick comments:
In `@crates/health/src/sink/tracing.rs`:
- Around line 85-89: Add a subscriber-based test around the tracing sink in
tracing.rs that exercises both log branches and asserts the new metadata fields
are emitted: machine_serial, driver_version, component_type, and
nvlink_domain_uuid. Use the existing tracing sink entrypoints and span/event
structure in the tracing sink code to capture emitted fields, then verify each
field appears in both branches so future field-name drift or one-branch-only
regressions are caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6efdca6c-cd4b-44b6-8b72-8525abcbf073

📥 Commits

Reviewing files that changed from the base of the PR and between 47e304c and 62aa7b4.

📒 Files selected for processing (16)
  • crates/health/benches/collector_pipeline.rs
  • crates/health/benches/processor_pipeline.rs
  • crates/health/benches/sink_pipeline.rs
  • crates/health/src/api_client.rs
  • crates/health/src/config.rs
  • crates/health/src/discovery/spawn.rs
  • crates/health/src/endpoint/model.rs
  • crates/health/src/endpoint/sources.rs
  • crates/health/src/otlp/convert.rs
  • crates/health/src/processor/health_report.rs
  • crates/health/src/sink/events.rs
  • crates/health/src/sink/health_report.rs
  • crates/health/src/sink/log_file.rs
  • crates/health/src/sink/mod.rs
  • crates/health/src/sink/prometheus.rs
  • crates/health/src/sink/tracing.rs

Propagate machine identity, serial, GPU driver version, NVLink domain,
and component type through health endpoint context into log-related
sinks.

Driver version is derived from NICo machine discovery when all reported
GPUs agree, or from static TOML metadata for static targets.

Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Trim static machine driver_version before storing metadata and
omit it when the value is empty, matching API discovery behavior.

Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
@jayzhudev jayzhudev force-pushed the health/add-log-metadata branch from 62aa7b4 to 95f97c0 Compare June 26, 2026 03:31
@jayzhudev jayzhudev requested a review from yoks June 26, 2026 20:27
@jayzhudev

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
crates/health/src/endpoint/model.rs (1)

124-129: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Update the driver_version contract comment.

This field is no longer populated only from API discovery. StaticEndpointSource::from_config() also fills it from StaticMachineEndpoint.driver_version, so the current doc comment now describes a narrower contract than the code actually implements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/endpoint/model.rs` around lines 124 - 129, Update the
`driver_version` documentation in `Machine-level GPU driver version` to match
the actual contract: it is populated both from API discovery and from
`StaticEndpointSource::from_config()` via
`StaticMachineEndpoint.driver_version`, not only when discovery finds a single
unique version. Adjust the comment near `driver_version` in the model so it
describes all sources and the current absence/presence behavior accurately,
referencing `StaticEndpointSource::from_config` and
`StaticMachineEndpoint.driver_version` in the wording if needed.
crates/health/src/endpoint/sources.rs (1)

383-465: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Collapse the driver-version cases into a table-driven test.

These cases exercise the same operation with different driver_version inputs and expectations. Converting them to value_scenarios!/check_values would make it easier to add more normalization edge cases without duplicating the fixture setup. As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate with scenarios! for fallible operations and value_scenarios! for total operations."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/endpoint/sources.rs` around lines 383 - 465, The two
driver_version tests duplicate the same StaticEndpointSource::from_config and
fetch_bmc_hosts setup, so collapse them into a single table-driven test using
value_scenarios! and check_values. Keep one shared fixture for
StaticBmcEndpoint/StaticMachineEndpoint and vary only the driver_version input
plus expected normalized output (trimmed string vs None). Use the existing
EndpointMetadata::Machine assertions inside the scenario loop so adding more
normalization cases stays simple.

Source: Coding guidelines

crates/health/src/sink/tracing.rs (1)

85-103: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a tracing-sink assertion for the new metadata fields.

This changes the tracing export contract, but I do not see a focused assertion for it comparable to the JSONL and OTLP coverage. A field-name or formatting regression here would silently drop the new context from tracing output while the other sink tests still pass. As per coding guidelines, "Add or update focused tests for bug fixes, shared behavior, API contracts, migrations, and cross-module changes." As per path instructions, "Prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/sink/tracing.rs` around lines 85 - 103, Add a focused test
for the tracing sink export contract to assert the new metadata fields are
emitted. Update the existing tracing sink coverage in
crates/health/src/sink/tracing.rs, using the sink path that builds the “Log
event” record, and verify the output includes machine_id, machine_serial,
driver_version, component_type, and nvlink_domain_uuid with the expected
formatting. Keep the assertion local to the tracing sink so a regression in
field names or formatting is caught even if the JSONL and OTLP tests still pass.

Sources: Coding guidelines, Path instructions

crates/health/src/otlp/convert.rs (1)

417-450: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert component.type remains present for machine endpoints with sparse metadata.

This case only proves optional fields are omitted. component.type is derived from EndpointMetadata::Machine, so a regression that drops compute_node when machine_serial/driver_version are None would still pass.

Suggested assertion
         assert_eq!(attr_value(&attrs, "machine.serial"), None);
         assert_eq!(attr_value(&attrs, "driver.version"), None);
+        assert_eq!(attr_value(&attrs, "component.type"), Some("compute_node"));
         assert_eq!(attr_value(&attrs, "nvlink.domain.uuid"), None);

As per coding guidelines, "Verification should exercise the behavior that changed," and as per path instructions, "Prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/otlp/convert.rs` around lines 417 - 450, The existing test
in resource_attributes should also verify that component.type is still emitted
for EndpointMetadata::Machine when optional MachineData fields are sparse.
Update the resource_attributes_omit_absent_optional_machine_metadata test to
assert the presence and expected value of component.type alongside the current
checks for omitted machine.serial, driver.version, and nvlink.domain.uuid, using
attr_value on the attrs returned by resource_attributes.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/health/src/endpoint/model.rs`:
- Around line 124-129: Update the `driver_version` documentation in
`Machine-level GPU driver version` to match the actual contract: it is populated
both from API discovery and from `StaticEndpointSource::from_config()` via
`StaticMachineEndpoint.driver_version`, not only when discovery finds a single
unique version. Adjust the comment near `driver_version` in the model so it
describes all sources and the current absence/presence behavior accurately,
referencing `StaticEndpointSource::from_config` and
`StaticMachineEndpoint.driver_version` in the wording if needed.

In `@crates/health/src/endpoint/sources.rs`:
- Around line 383-465: The two driver_version tests duplicate the same
StaticEndpointSource::from_config and fetch_bmc_hosts setup, so collapse them
into a single table-driven test using value_scenarios! and check_values. Keep
one shared fixture for StaticBmcEndpoint/StaticMachineEndpoint and vary only the
driver_version input plus expected normalized output (trimmed string vs None).
Use the existing EndpointMetadata::Machine assertions inside the scenario loop
so adding more normalization cases stays simple.

In `@crates/health/src/otlp/convert.rs`:
- Around line 417-450: The existing test in resource_attributes should also
verify that component.type is still emitted for EndpointMetadata::Machine when
optional MachineData fields are sparse. Update the
resource_attributes_omit_absent_optional_machine_metadata test to assert the
presence and expected value of component.type alongside the current checks for
omitted machine.serial, driver.version, and nvlink.domain.uuid, using attr_value
on the attrs returned by resource_attributes.

In `@crates/health/src/sink/tracing.rs`:
- Around line 85-103: Add a focused test for the tracing sink export contract to
assert the new metadata fields are emitted. Update the existing tracing sink
coverage in crates/health/src/sink/tracing.rs, using the sink path that builds
the “Log event” record, and verify the output includes machine_id,
machine_serial, driver_version, component_type, and nvlink_domain_uuid with the
expected formatting. Keep the assertion local to the tracing sink so a
regression in field names or formatting is caught even if the JSONL and OTLP
tests still pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b8908ec-aa48-4de4-984e-3b15d248a76e

📥 Commits

Reviewing files that changed from the base of the PR and between 029a2d1 and 28baa47.

📒 Files selected for processing (16)
  • crates/health/benches/collector_pipeline.rs
  • crates/health/benches/processor_pipeline.rs
  • crates/health/benches/sink_pipeline.rs
  • crates/health/src/api_client.rs
  • crates/health/src/config.rs
  • crates/health/src/discovery/spawn.rs
  • crates/health/src/endpoint/model.rs
  • crates/health/src/endpoint/sources.rs
  • crates/health/src/otlp/convert.rs
  • crates/health/src/processor/health_report.rs
  • crates/health/src/sink/events.rs
  • crates/health/src/sink/health_report.rs
  • crates/health/src/sink/log_file.rs
  • crates/health/src/sink/mod.rs
  • crates/health/src/sink/prometheus.rs
  • crates/health/src/sink/tracing.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add machine metadata to health OTLP log exports

2 participants