feat(health): enrich machine log metadata#2882
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Summary by CodeRabbit
WalkthroughThe PR propagates machine driver version and component type through health metadata, from discovery and configuration into EventContext, OTLP attributes, and sink outputs. ChangesMachine metadata propagation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
crates/health/src/api_client.rs (2)
621-633: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCover the
machine.idfallback branch ofmachine_component_type.The table currently exercises only the
HostandDpuarms. The non-trivial fallback (lines 626-631), where an unparseable/PowerShelf/UnknownForge type defers to theMachineId-encoded type and ultimately toComputeNode, is untested. A regression there would silently mislabel telemetry. Adding a couple of rows that setmachine_typetoUnknown/PowerShelfwith and without anidwould close the gap.As per path instructions: prefer findings about "missing tests over style-only comments."
Also applies to: 819-832
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/api_client.rs` around lines 621 - 633, Add test coverage for the fallback branch in machine_component_type so the MachineId-derived path is exercised, not just the Host and Dpu cases. Extend the existing table-driven tests around machine_component_type to include rpc::forge::MachineType::Unknown and rpc::forge::MachineType::PowerShelf, both with a machine.id present (verifying ComponentType::from_machine_type is used) and with machine.id absent (verifying it falls back to ComponentType::ComputeNode). Keep the focus on the machine_component_type function and its fallback logic so regressions in telemetry classification are caught.Source: Path instructions
601-619: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winExpress the unique-version computation as an iterator pipeline.
The style guide directs us to construct collections through iterators rather than seeding a mutable container inside a
forloop, which makes the "done building" point explicit and reads more declaratively.As per coding guidelines: "When building a Vec or a HashMap, prefer using iterators to building them from a for-loop."♻️ Proposed iterator-based rewrite
-fn unique_gpu_driver_version( - discovery_info: Option<&rpc::machine_discovery::DiscoveryInfo>, -) -> Option<String> { - let mut versions = HashSet::new(); - let discovery_info = discovery_info?; - - for gpu in &discovery_info.gpus { - let version = gpu.driver_version.trim(); - if !version.is_empty() { - versions.insert(version.to_string()); - } - } - - if versions.len() == 1 { - versions.into_iter().next() - } else { - None - } -} +fn unique_gpu_driver_version( + discovery_info: Option<&rpc::machine_discovery::DiscoveryInfo>, +) -> Option<String> { + let versions: HashSet<&str> = discovery_info? + .gpus + .iter() + .map(|gpu| gpu.driver_version.trim()) + .filter(|version| !version.is_empty()) + .collect(); + + (versions.len() == 1) + .then(|| versions.into_iter().next().map(str::to_string)) + .flatten() +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/api_client.rs` around lines 601 - 619, Refactor unique_gpu_driver_version to use an iterator pipeline instead of seeding a mutable HashSet in a for loop. Keep the same behavior by starting from discovery_info.gpus, trimming each gpu.driver_version, filtering out empty strings, collecting into a HashSet, and then returning the single version only when the set has exactly one entry. Preserve the existing function signature and the unique-version semantics while making the construction of versions declarative.Source: Coding guidelines
crates/health/src/sink/tracing.rs (1)
66-70: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueDebug-formatted optionals dilute logfmt searchability.
Recording these accessors with
?serializes the wrappingOption, so an XID record emitsmachine_serial=Some("MN-001")anddriver_version=Nonerather than clean key/value pairs. Since these fields exist precisely so downstream analyzers can pivot on machine serial and driver version, theSome(..)wrappers andNonenoise work against that goal. Consider emitting bare values only when present (mirroring the JSONL sink's skip-if-Nonetreatment inlog_file.rs), e.g. viatracing::Span/conditional fields, so the structured output stays directly searchable.This matches the existing
machine_id = ?...style, so it is not a regression — flagging it as an observability refinement for the newly surfaced fields.As per coding guidelines: "prefer placing common fields as attributes passed to tracing function, instead of using string interpolation" for logfmt searchability.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/tracing.rs` around lines 66 - 70, The tracing fields in the XID sink are logging Option values with `?`, which produces `Some(...)` and `None` in logfmt instead of searchable key/value pairs. Update the event construction in `tracing.rs` to emit these fields only when present, following the skip-if-None approach used by the JSONL sink in `log_file.rs`, and keep the common fields as tracing attributes rather than string interpolation. Focus on the accessors on `context` such as `machine_serial`, `driver_version`, `component_type`, and `nvlink_domain_uuid` so the output stays clean and directly searchable.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/health/src/api_client.rs`:
- Around line 621-633: Add test coverage for the fallback branch in
machine_component_type so the MachineId-derived path is exercised, not just the
Host and Dpu cases. Extend the existing table-driven tests around
machine_component_type to include rpc::forge::MachineType::Unknown and
rpc::forge::MachineType::PowerShelf, both with a machine.id present (verifying
ComponentType::from_machine_type is used) and with machine.id absent (verifying
it falls back to ComponentType::ComputeNode). Keep the focus on the
machine_component_type function and its fallback logic so regressions in
telemetry classification are caught.
- Around line 601-619: Refactor unique_gpu_driver_version to use an iterator
pipeline instead of seeding a mutable HashSet in a for loop. Keep the same
behavior by starting from discovery_info.gpus, trimming each gpu.driver_version,
filtering out empty strings, collecting into a HashSet, and then returning the
single version only when the set has exactly one entry. Preserve the existing
function signature and the unique-version semantics while making the
construction of versions declarative.
In `@crates/health/src/sink/tracing.rs`:
- Around line 66-70: The tracing fields in the XID sink are logging Option
values with `?`, which produces `Some(...)` and `None` in logfmt instead of
searchable key/value pairs. Update the event construction in `tracing.rs` to
emit these fields only when present, following the skip-if-None approach used by
the JSONL sink in `log_file.rs`, and keep the common fields as tracing
attributes rather than string interpolation. Focus on the accessors on `context`
such as `machine_serial`, `driver_version`, `component_type`, and
`nvlink_domain_uuid` so the output stays clean and directly searchable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fb9d6dc3-d41b-481e-adbd-665c01dd396e
📒 Files selected for processing (17)
crates/health/benches/collector_pipeline.rscrates/health/benches/processor_pipeline.rscrates/health/benches/sink_pipeline.rscrates/health/src/api_client.rscrates/health/src/config.rscrates/health/src/discovery/spawn.rscrates/health/src/endpoint/mod.rscrates/health/src/endpoint/model.rscrates/health/src/endpoint/sources.rscrates/health/src/otlp/convert.rscrates/health/src/processor/health_report.rscrates/health/src/sink/events.rscrates/health/src/sink/health_report.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/prometheus.rscrates/health/src/sink/tracing.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
bce6f86 to
6c6776a
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/health/src/api_client.rs (1)
758-789: 🎯 Functional Correctness | 🟡 MinorAdd the mixed known + unknown case.
["570.82", ""]should resolve toSome("570.82"); empty strings are treated as missing, not conflicting. This locks the exported driver-version label to the current contract.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/api_client.rs` around lines 758 - 789, The test for unique_gpu_driver_version is missing the mixed known-plus-unknown case, so add a scenario in unique_gpu_driver_version_uses_single_non_empty_version covering a real version alongside an empty string and assert it returns the known version. Update the value_scenarios set in crates/health/src/api_client.rs to treat empty or whitespace-only driver versions as missing rather than conflicting, matching the current contract enforced by unique_gpu_driver_version and discovery_with_driver_versions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/health/src/api_client.rs`:
- Around line 758-789: The test for unique_gpu_driver_version is missing the
mixed known-plus-unknown case, so add a scenario in
unique_gpu_driver_version_uses_single_non_empty_version covering a real version
alongside an empty string and assert it returns the known version. Update the
value_scenarios set in crates/health/src/api_client.rs to treat empty or
whitespace-only driver versions as missing rather than conflicting, matching the
current contract enforced by unique_gpu_driver_version and
discovery_with_driver_versions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0ba81d83-cab5-4055-92f1-a918556f9a4b
📒 Files selected for processing (12)
crates/health/src/api_client.rscrates/health/src/discovery/spawn.rscrates/health/src/endpoint/model.rscrates/health/src/endpoint/sources.rscrates/health/src/otlp/convert.rscrates/health/src/processor/health_report.rscrates/health/src/sink/events.rscrates/health/src/sink/health_report.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/prometheus.rscrates/health/src/sink/tracing.rs
✅ Files skipped from review due to trivial changes (1)
- crates/health/src/sink/mod.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/health/src/sink/tracing.rs
- crates/health/src/sink/log_file.rs
- crates/health/src/otlp/convert.rs
434088f to
cd0bcf9
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/health/src/otlp/convert.rs (1)
417-450: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winAssert
component.typeremains present for machine endpoints when optional fields are absent.This case verifies omission of optional machine attributes, but it does not protect the new
component.typecontract.component.typeis derived fromEndpointMetadata, so this test should still assert"compute_node"even whenmachine_serial,driver_version, andnvlink_domain_uuidareNone.Suggested assertion
assert_eq!(attr_value(&attrs, "machine.serial"), None); assert_eq!(attr_value(&attrs, "driver.version"), None); + assert_eq!(attr_value(&attrs, "component.type"), Some("compute_node")); assert_eq!(attr_value(&attrs, "nvlink.domain.uuid"), None);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/otlp/convert.rs` around lines 417 - 450, The new test in resource_attributes should also verify the stable EndpointMetadata-derived attribute contract: keep asserting the optional machine fields are omitted, and add an assertion that component.type is still emitted as compute_node for MachineData even when machine_serial, driver_version, and nvlink_domain_uuid are None. Use the existing resource_attributes and attr_value helpers in this test to confirm the component.type mapping remains present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/health/src/otlp/convert.rs`:
- Around line 417-450: The new test in resource_attributes should also verify
the stable EndpointMetadata-derived attribute contract: keep asserting the
optional machine fields are omitted, and add an assertion that component.type is
still emitted as compute_node for MachineData even when machine_serial,
driver_version, and nvlink_domain_uuid are None. Use the existing
resource_attributes and attr_value helpers in this test to confirm the
component.type mapping remains present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5f5c41c5-d55d-4062-90f2-b8a1b07fff1f
📒 Files selected for processing (13)
crates/health/benches/processor_pipeline.rscrates/health/src/api_client.rscrates/health/src/discovery/spawn.rscrates/health/src/endpoint/model.rscrates/health/src/endpoint/sources.rscrates/health/src/otlp/convert.rscrates/health/src/processor/health_report.rscrates/health/src/sink/events.rscrates/health/src/sink/health_report.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/prometheus.rscrates/health/src/sink/tracing.rs
💤 Files with no reviewable changes (1)
- crates/health/benches/processor_pipeline.rs
✅ Files skipped from review due to trivial changes (5)
- crates/health/src/sink/prometheus.rs
- crates/health/src/sink/health_report.rs
- crates/health/src/discovery/spawn.rs
- crates/health/src/sink/tracing.rs
- crates/health/src/sink/log_file.rs
🚧 Files skipped from review as they are similar to previous changes (6)
- crates/health/src/sink/mod.rs
- crates/health/src/processor/health_report.rs
- crates/health/src/endpoint/sources.rs
- crates/health/src/endpoint/model.rs
- crates/health/src/api_client.rs
- crates/health/src/sink/events.rs
|
@coderabbitai full review |
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/health/src/endpoint/sources.rs (1)
131-139: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick winNormalize static driver versions to match the discovery path.
Line 138 copies
machine.driver_versionverbatim, butunique_gpu_driver_version()incrates/health/src/api_client.rstrims whitespace and drops empty strings before emitting machine metadata. A static config like" 570.82 "or" "will therefore produce a differentdriver.versioncontract than the API-backed path, and that value is exported directly by the OTLP/log sinks.Suggested fix
- driver_version: machine.driver_version.clone(), + driver_version: machine + .driver_version + .as_deref() + .map(str::trim) + .filter(|version| !version.is_empty()) + .map(str::to_string),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/endpoint/sources.rs` around lines 131 - 139, The machine metadata path in sources.rs copies machine.driver_version verbatim, which can diverge from the normalized behavior used by unique_gpu_driver_version() in api_client.rs. Update the EndpointMetadata::Machine construction to normalize driver_version the same way as the discovery path by trimming whitespace and treating empty strings as absent before emitting metadata, so static config and API-backed values produce the same driver.version contract.
🧹 Nitpick comments (1)
crates/health/src/sink/tracing.rs (1)
85-89: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a tracing-sink test for the new metadata fields.
This is now the only new export surface in the cohort without an assertion that
machine_serial,driver_version,component_type, andnvlink_domain_uuidare present in both log branches. A small subscriber-based test here would catch future field-name drift or one-branch-only regressions.As per path instructions, "prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments."
Also applies to: 99-103
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/tracing.rs` around lines 85 - 89, Add a subscriber-based test around the tracing sink in tracing.rs that exercises both log branches and asserts the new metadata fields are emitted: machine_serial, driver_version, component_type, and nvlink_domain_uuid. Use the existing tracing sink entrypoints and span/event structure in the tracing sink code to capture emitted fields, then verify each field appears in both branches so future field-name drift or one-branch-only regressions are caught.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/health/src/endpoint/sources.rs`:
- Around line 131-139: The machine metadata path in sources.rs copies
machine.driver_version verbatim, which can diverge from the normalized behavior
used by unique_gpu_driver_version() in api_client.rs. Update the
EndpointMetadata::Machine construction to normalize driver_version the same way
as the discovery path by trimming whitespace and treating empty strings as
absent before emitting metadata, so static config and API-backed values produce
the same driver.version contract.
---
Nitpick comments:
In `@crates/health/src/sink/tracing.rs`:
- Around line 85-89: Add a subscriber-based test around the tracing sink in
tracing.rs that exercises both log branches and asserts the new metadata fields
are emitted: machine_serial, driver_version, component_type, and
nvlink_domain_uuid. Use the existing tracing sink entrypoints and span/event
structure in the tracing sink code to capture emitted fields, then verify each
field appears in both branches so future field-name drift or one-branch-only
regressions are caught.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6efdca6c-cd4b-44b6-8b72-8525abcbf073
📒 Files selected for processing (16)
crates/health/benches/collector_pipeline.rscrates/health/benches/processor_pipeline.rscrates/health/benches/sink_pipeline.rscrates/health/src/api_client.rscrates/health/src/config.rscrates/health/src/discovery/spawn.rscrates/health/src/endpoint/model.rscrates/health/src/endpoint/sources.rscrates/health/src/otlp/convert.rscrates/health/src/processor/health_report.rscrates/health/src/sink/events.rscrates/health/src/sink/health_report.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/prometheus.rscrates/health/src/sink/tracing.rs
Propagate machine identity, serial, GPU driver version, NVLink domain, and component type through health endpoint context into log-related sinks. Driver version is derived from NICo machine discovery when all reported GPUs agree, or from static TOML metadata for static targets. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Trim static machine driver_version before storing metadata and omit it when the value is empty, matching API discovery behavior. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
62aa7b4 to
95f97c0
Compare
|
@coderabbitai full review |
✅ Action performedFull review finished. |
There was a problem hiding this comment.
🧹 Nitpick comments (4)
crates/health/src/endpoint/model.rs (1)
124-129: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUpdate the
driver_versioncontract comment.This field is no longer populated only from API discovery.
StaticEndpointSource::from_config()also fills it fromStaticMachineEndpoint.driver_version, so the current doc comment now describes a narrower contract than the code actually implements.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/endpoint/model.rs` around lines 124 - 129, Update the `driver_version` documentation in `Machine-level GPU driver version` to match the actual contract: it is populated both from API discovery and from `StaticEndpointSource::from_config()` via `StaticMachineEndpoint.driver_version`, not only when discovery finds a single unique version. Adjust the comment near `driver_version` in the model so it describes all sources and the current absence/presence behavior accurately, referencing `StaticEndpointSource::from_config` and `StaticMachineEndpoint.driver_version` in the wording if needed.crates/health/src/endpoint/sources.rs (1)
383-465: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCollapse the driver-version cases into a table-driven test.
These cases exercise the same operation with different
driver_versioninputs and expectations. Converting them tovalue_scenarios!/check_valueswould make it easier to add more normalization edge cases without duplicating the fixture setup. As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate withscenarios!for fallible operations andvalue_scenarios!for total operations."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/endpoint/sources.rs` around lines 383 - 465, The two driver_version tests duplicate the same StaticEndpointSource::from_config and fetch_bmc_hosts setup, so collapse them into a single table-driven test using value_scenarios! and check_values. Keep one shared fixture for StaticBmcEndpoint/StaticMachineEndpoint and vary only the driver_version input plus expected normalized output (trimmed string vs None). Use the existing EndpointMetadata::Machine assertions inside the scenario loop so adding more normalization cases stays simple.Source: Coding guidelines
crates/health/src/sink/tracing.rs (1)
85-103: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a tracing-sink assertion for the new metadata fields.
This changes the tracing export contract, but I do not see a focused assertion for it comparable to the JSONL and OTLP coverage. A field-name or formatting regression here would silently drop the new context from tracing output while the other sink tests still pass. As per coding guidelines, "Add or update focused tests for bug fixes, shared behavior, API contracts, migrations, and cross-module changes." As per path instructions, "Prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/tracing.rs` around lines 85 - 103, Add a focused test for the tracing sink export contract to assert the new metadata fields are emitted. Update the existing tracing sink coverage in crates/health/src/sink/tracing.rs, using the sink path that builds the “Log event” record, and verify the output includes machine_id, machine_serial, driver_version, component_type, and nvlink_domain_uuid with the expected formatting. Keep the assertion local to the tracing sink so a regression in field names or formatting is caught even if the JSONL and OTLP tests still pass.Sources: Coding guidelines, Path instructions
crates/health/src/otlp/convert.rs (1)
417-450: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert
component.typeremains present for machine endpoints with sparse metadata.This case only proves optional fields are omitted.
component.typeis derived fromEndpointMetadata::Machine, so a regression that dropscompute_nodewhenmachine_serial/driver_versionareNonewould still pass.Suggested assertion
assert_eq!(attr_value(&attrs, "machine.serial"), None); assert_eq!(attr_value(&attrs, "driver.version"), None); + assert_eq!(attr_value(&attrs, "component.type"), Some("compute_node")); assert_eq!(attr_value(&attrs, "nvlink.domain.uuid"), None);As per coding guidelines, "Verification should exercise the behavior that changed," and as per path instructions, "Prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/otlp/convert.rs` around lines 417 - 450, The existing test in resource_attributes should also verify that component.type is still emitted for EndpointMetadata::Machine when optional MachineData fields are sparse. Update the resource_attributes_omit_absent_optional_machine_metadata test to assert the presence and expected value of component.type alongside the current checks for omitted machine.serial, driver.version, and nvlink.domain.uuid, using attr_value on the attrs returned by resource_attributes.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/health/src/endpoint/model.rs`:
- Around line 124-129: Update the `driver_version` documentation in
`Machine-level GPU driver version` to match the actual contract: it is populated
both from API discovery and from `StaticEndpointSource::from_config()` via
`StaticMachineEndpoint.driver_version`, not only when discovery finds a single
unique version. Adjust the comment near `driver_version` in the model so it
describes all sources and the current absence/presence behavior accurately,
referencing `StaticEndpointSource::from_config` and
`StaticMachineEndpoint.driver_version` in the wording if needed.
In `@crates/health/src/endpoint/sources.rs`:
- Around line 383-465: The two driver_version tests duplicate the same
StaticEndpointSource::from_config and fetch_bmc_hosts setup, so collapse them
into a single table-driven test using value_scenarios! and check_values. Keep
one shared fixture for StaticBmcEndpoint/StaticMachineEndpoint and vary only the
driver_version input plus expected normalized output (trimmed string vs None).
Use the existing EndpointMetadata::Machine assertions inside the scenario loop
so adding more normalization cases stays simple.
In `@crates/health/src/otlp/convert.rs`:
- Around line 417-450: The existing test in resource_attributes should also
verify that component.type is still emitted for EndpointMetadata::Machine when
optional MachineData fields are sparse. Update the
resource_attributes_omit_absent_optional_machine_metadata test to assert the
presence and expected value of component.type alongside the current checks for
omitted machine.serial, driver.version, and nvlink.domain.uuid, using attr_value
on the attrs returned by resource_attributes.
In `@crates/health/src/sink/tracing.rs`:
- Around line 85-103: Add a focused test for the tracing sink export contract to
assert the new metadata fields are emitted. Update the existing tracing sink
coverage in crates/health/src/sink/tracing.rs, using the sink path that builds
the “Log event” record, and verify the output includes machine_id,
machine_serial, driver_version, component_type, and nvlink_domain_uuid with the
expected formatting. Keep the assertion local to the tracing sink so a
regression in field names or formatting is caught even if the JSONL and OTLP
tests still pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6b8908ec-aa48-4de4-984e-3b15d248a76e
📒 Files selected for processing (16)
crates/health/benches/collector_pipeline.rscrates/health/benches/processor_pipeline.rscrates/health/benches/sink_pipeline.rscrates/health/src/api_client.rscrates/health/src/config.rscrates/health/src/discovery/spawn.rscrates/health/src/endpoint/model.rscrates/health/src/endpoint/sources.rscrates/health/src/otlp/convert.rscrates/health/src/processor/health_report.rscrates/health/src/sink/events.rscrates/health/src/sink/health_report.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/prometheus.rscrates/health/src/sink/tracing.rs
Propagate machine serial, GPU driver version, NVLink domain, and component type through health endpoint context into log-related sinks, so the downstream consumers can associate XID log records with the correct machine, driver, NVLink domain, and component type.
Driver version is derived from NICo machine discovery when all reported GPUs agree, or from static TOML metadata for static targets.
The following is an example for static configuration:
Related issues
Closes #2876
Type of Change
Breaking Changes
Testing