change: hw-health universal stage support#2938
Conversation
Signed-off-by: ianisimov <ianisimov@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (8)
🚧 Files skipped from review as they are similar to previous changes (6)
Summary by CodeRabbit
WalkthroughThe PR replaces the health crate’s collector, processor, and sink split with a unified HealthEvent, SyncEventNode, and EventGraph model. Collector orchestration, sink/export wiring, NVUE transport, and benchmarks were updated to emit, route, and consume the new event types. ChangesHealth Event Graph Refactor
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/health/src/lib.rs (1)
179-190: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winEnable report-producing processors for OTLP/tracing consumers too.
OtlpSinkconsumesHealthReportProduced, butHealthReportProcessoris not installed for OTLP-only configs. Similarly, intrusion-derived health reports are suppressed unless the Carbide health-report sink is enabled.Proposed fix
- if config.sinks.tracing.is_enabled() + let emits_health_reports = config.sinks.tracing.is_enabled() || config.sinks.health_report.is_enabled() + || config.sinks.otlp.is_enabled() || config.sinks.power_shelf_health_report.is_enabled() || config.sinks.switch_health_report.is_enabled() - || config.processors.leak_detection.is_enabled() - { + || config.processors.leak_detection.is_enabled(); + + if emits_health_reports { nodes.push(Arc::new(HealthReportProcessor::new())); } - if config.sinks.health_report.is_enabled() { + if config.sinks.health_report.is_enabled() + || config.sinks.tracing.is_enabled() + || config.sinks.otlp.is_enabled() + { nodes.push(Arc::new(BmcIntrusionSyncEventNode::new())); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/lib.rs` around lines 179 - 190, The node setup in HealthReportProcessor/BmcIntrusionSyncEventNode currently only enables report generation when the tracing or health-report sinks are enabled, which leaves OTLP-only consumers without HealthReportProduced events. Update the condition that pushes HealthReportProcessor so it also activates for OtlpSink consumers, and adjust the BmcIntrusionSyncEventNode gating so intrusion-derived health reports are installed whenever any report-producing consumer needs them, not only when config.sinks.health_report is enabled. Ensure the logic in this initialization block uses the existing config.sinks and processor symbols consistently.crates/health/src/collectors/sensors.rs (1)
159-180: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winKeep derived metrics on the shared metrics path. These samples are now emitted only from
SensorCollector::run_iteration, so disablingsensorswhile leavingmetricsenabled will silently drop drive/PSU derived metrics. Move this emission into the metrics collector or make the dependency explicit.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/sensors.rs` around lines 159 - 180, The derived-metric emission in SensorCollector::emit_derived_metrics is wired only through SensorCollector::run_iteration, so it disappears when sensors are disabled even if metrics remain enabled. Move this MeasurementObserved emission into the shared metrics collection path (for example, the metrics collector that already handles other MetricSample output) or add an explicit dependency so derived metrics still flow whenever metrics are enabled. Keep the existing MetricSample construction and entity-derived labels, but ensure the triggering path is no longer tied solely to sensors.
🧹 Nitpick comments (1)
crates/health/src/sink/otlp.rs (1)
219-238: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winReplace the
_wildcard with explicit variants to preserve compile-time exhaustiveness.This sink is the lone holdout: both
TracingSink::handle_eventandconvert_eventenumerate everyHealthEventvariant, so adding a new event forces the compiler to flag every handler. Here the trailing_ => return Vec::new()will silently swallow any future variant that ought to be exported, which is precisely the class of bug an exhaustive match prevents. Given this PR is actively expanding the event taxonomy toward universal stage support, I'd recommend spelling out the no-op variants.♻️ Suggested change
- _ => return Vec::new(), + HealthEvent::MeasurementObserved(_) + | HealthEvent::ScrapeRequested { .. } + | HealthEvent::InventoryDiscovered { .. } + | HealthEvent::InventoryUpdated { .. } + | HealthEvent::ScrapeBatchStarted + | HealthEvent::ScrapeBatchFinished + | HealthEvent::NodeRemoved => return Vec::new(), };As per coding guidelines: "Prefer designs that are hard to misuse. The more the compiler can catch bugs, the better."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/otlp.rs` around lines 219 - 238, The match in the sink’s event handling is using a wildcard fallback that can silently ignore future HealthEvent variants. Update the match in the otlp sink’s event-processing path to enumerate the known no-op variants explicitly instead of using “_ => return Vec::new()”, matching the exhaustiveness style already used by TracingSink::handle_event and convert_event. This keeps compile-time coverage intact when new HealthEvent variants are added and makes any missing export handling visible to the compiler.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/health/src/collectors/runtime.rs`:
- Around line 276-281: The mailbox routing in `interested_in` is still
hard-coded to `HealthEvent::InventoryDiscovered`, so collectors using
`PeriodicCollector::handle_event` for `ScrapeRequested`, `InventoryUpdated`, or
other universal-stage events will be filtered out. Update the event matching in
`PeriodicCollector::interested_in` to route all events that the collector can
actually handle based on its configured stages/collector behavior, rather than
only inventory events, and keep the endpoint_key check only where it applies.
---
Outside diff comments:
In `@crates/health/src/collectors/sensors.rs`:
- Around line 159-180: The derived-metric emission in
SensorCollector::emit_derived_metrics is wired only through
SensorCollector::run_iteration, so it disappears when sensors are disabled even
if metrics remain enabled. Move this MeasurementObserved emission into the
shared metrics collection path (for example, the metrics collector that already
handles other MetricSample output) or add an explicit dependency so derived
metrics still flow whenever metrics are enabled. Keep the existing MetricSample
construction and entity-derived labels, but ensure the triggering path is no
longer tied solely to sensors.
In `@crates/health/src/lib.rs`:
- Around line 179-190: The node setup in
HealthReportProcessor/BmcIntrusionSyncEventNode currently only enables report
generation when the tracing or health-report sinks are enabled, which leaves
OTLP-only consumers without HealthReportProduced events. Update the condition
that pushes HealthReportProcessor so it also activates for OtlpSink consumers,
and adjust the BmcIntrusionSyncEventNode gating so intrusion-derived health
reports are installed whenever any report-producing consumer needs them, not
only when config.sinks.health_report is enabled. Ensure the logic in this
initialization block uses the existing config.sinks and processor symbols
consistently.
---
Nitpick comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 219-238: The match in the sink’s event handling is using a
wildcard fallback that can silently ignore future HealthEvent variants. Update
the match in the otlp sink’s event-processing path to enumerate the known no-op
variants explicitly instead of using “_ => return Vec::new()”, matching the
exhaustiveness style already used by TracingSink::handle_event and
convert_event. This keeps compile-time coverage intact when new HealthEvent
variants are added and makes any missing export handling visible to the
compiler.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a6fc8e4a-b40c-4329-92c6-b016ccd55161
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (43)
crates/health/Cargo.tomlcrates/health/benches/collector_pipeline.rscrates/health/benches/processor_pipeline.rscrates/health/benches/sink_pipeline.rscrates/health/src/collectors/discovery.rscrates/health/src/collectors/entity_metrics.rscrates/health/src/collectors/firmware.rscrates/health/src/collectors/inventory.rscrates/health/src/collectors/leak_detector.rscrates/health/src/collectors/logs/periodic.rscrates/health/src/collectors/logs/sse.rscrates/health/src/collectors/mod.rscrates/health/src/collectors/nmxt.rscrates/health/src/collectors/nvue/gnmi/on_change_processor.rscrates/health/src/collectors/nvue/gnmi/sample_processor.rscrates/health/src/collectors/nvue/gnmi/subscriber.rscrates/health/src/collectors/nvue/rest/client.rscrates/health/src/collectors/nvue/rest/collector.rscrates/health/src/collectors/runtime.rscrates/health/src/collectors/sensors.rscrates/health/src/discovery/cleanup.rscrates/health/src/discovery/context.rscrates/health/src/discovery/iteration.rscrates/health/src/discovery/spawn.rscrates/health/src/lib.rscrates/health/src/otlp/convert.rscrates/health/src/otlp/drain.rscrates/health/src/processor/health_report.rscrates/health/src/processor/intrusion_events.rscrates/health/src/processor/leak_events.rscrates/health/src/processor/mod.rscrates/health/src/processor/rack_leak.rscrates/health/src/sink/composite.rscrates/health/src/sink/events.rscrates/health/src/sink/health_report.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/otlp.rscrates/health/src/sink/power_shelf_health_report.rscrates/health/src/sink/prometheus.rscrates/health/src/sink/rack_health_report.rscrates/health/src/sink/switch_health_report.rscrates/health/src/sink/tracing.rs
💤 Files with no reviewable changes (3)
- crates/health/src/discovery/cleanup.rs
- crates/health/Cargo.toml
- crates/health/src/collectors/mod.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Signed-off-by: ianisimov <ianisimov@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
crates/health/src/collectors/runtime.rs (1)
292-315: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftMake workload scrape delivery lossless under mailbox backpressure.
ScrapeRequestedis routed into the bounded mailbox, buttry_senddrops it when the queue is full. Inventory refreshes may self-heal; a workload scrape request can be one-shot, so losing it breaks the dispatcher-driven scrape path. Split coalescable inventory updates from command-like scrape requests, or use a lossless/coalesced command path forScrapeRequested.Also applies to: 353-355
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/runtime.rs` around lines 292 - 315, The mailbox send path in runtime collector handling drops one-shot ScrapeRequested events under backpressure, so split the handling in interested_in/handle_event between coalescable inventory events and command-like scrape requests. Keep InventoryDiscovered and InventoryUpdated on the bounded, droppable path if needed, but make ScrapeRequested lossless by routing it through a non-dropping or coalesced command path in RuntimeCollector and the event dispatcher so scrape requests are never silently lost.crates/health/src/sink/otlp.rs (1)
237-239: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winAvoid deduping distinct firmware observations by component name only.
Line 238 keys firmware OTLP entries by
info.component, which is populated from the Redfish firmware display name in the collector. If an endpoint reports multiple firmware inventory records with the same component name,save_latestwill silently replace earlier records before export. Include at leastinfo.version, or preferably a stable firmware inventory id if available.Minimal safer key
- let key = format!("{}|firmware|{}", context.endpoint_key, info.component); + let key = format!( + "{}|firmware|{}|{}", + context.endpoint_key, info.component, info.version + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/otlp.rs` around lines 237 - 239, The firmware OTLP dedup key in HealthEvent::FirmwareObserved is only using info.component, which can collapse multiple inventory records with the same display name. Update the key generation in the sink logic to include a more stable disambiguator such as info.version, or a firmware inventory identifier if one is available from the event payload, so save_latest keeps distinct firmware observations separate before export.crates/health/src/collectors/inventory.rs (1)
134-137: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winPreserve the processor metric label contract.
Line 134 still reads
processor_type, but Line 136 now exports it asnode_type. That makes the metric semantically misleading and can break dashboards/alerts expecting the previousprocessor_typelabel.Suggested fix
- attrs.push(( - Cow::Borrowed("node_type"), - node_type.to_snake_case().to_string(), - )); + attrs.push(( + Cow::Borrowed("processor_type"), + node_type.to_snake_case().to_string(), + ));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/inventory.rs` around lines 134 - 137, The inventory collector is renaming the processor metric label from processor_type to node_type, which breaks the existing label contract. Update the attribute key in the relevant branch of inventory.rs so the collector continues to emit processor_type while still deriving the value from entity.raw().processor_type in the inventory collector path. Keep the change localized around the attrs.push call in the collector logic so dashboards and alerts using the Processor metric label remain compatible.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/health/src/collectors/inventory.rs`:
- Around line 134-137: The inventory collector is renaming the processor metric
label from processor_type to node_type, which breaks the existing label
contract. Update the attribute key in the relevant branch of inventory.rs so the
collector continues to emit processor_type while still deriving the value from
entity.raw().processor_type in the inventory collector path. Keep the change
localized around the attrs.push call in the collector logic so dashboards and
alerts using the Processor metric label remain compatible.
In `@crates/health/src/collectors/runtime.rs`:
- Around line 292-315: The mailbox send path in runtime collector handling drops
one-shot ScrapeRequested events under backpressure, so split the handling in
interested_in/handle_event between coalescable inventory events and command-like
scrape requests. Keep InventoryDiscovered and InventoryUpdated on the bounded,
droppable path if needed, but make ScrapeRequested lossless by routing it
through a non-dropping or coalesced command path in RuntimeCollector and the
event dispatcher so scrape requests are never silently lost.
In `@crates/health/src/sink/otlp.rs`:
- Around line 237-239: The firmware OTLP dedup key in
HealthEvent::FirmwareObserved is only using info.component, which can collapse
multiple inventory records with the same display name. Update the key generation
in the sink logic to include a more stable disambiguator such as info.version,
or a firmware inventory identifier if one is available from the event payload,
so save_latest keeps distinct firmware observations separate before export.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3ea15a8b-6009-46e9-92f1-139a2e98590c
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (25)
crates/health/src/collectors/discovery.rscrates/health/src/collectors/entity_metrics.rscrates/health/src/collectors/firmware.rscrates/health/src/collectors/inventory.rscrates/health/src/collectors/leak_detector.rscrates/health/src/collectors/nmxt.rscrates/health/src/collectors/nvue/rest/client.rscrates/health/src/collectors/nvue/rest/collector.rscrates/health/src/collectors/runtime.rscrates/health/src/collectors/sensors.rscrates/health/src/discovery/context.rscrates/health/src/discovery/spawn.rscrates/health/src/lib.rscrates/health/src/otlp/convert.rscrates/health/src/otlp/drain.rscrates/health/src/otlp/metrics_drain.rscrates/health/src/processor/health_report.rscrates/health/src/processor/intrusion_events.rscrates/health/src/processor/leak_events.rscrates/health/src/processor/mod.rscrates/health/src/processor/rack_leak.rscrates/health/src/sink/composite.rscrates/health/src/sink/events.rscrates/health/src/sink/mod.rscrates/health/src/sink/otlp.rs
✅ Files skipped from review due to trivial changes (1)
- crates/health/src/otlp/metrics_drain.rs
🚧 Files skipped from review as they are similar to previous changes (17)
- crates/health/src/otlp/drain.rs
- crates/health/src/sink/events.rs
- crates/health/src/discovery/context.rs
- crates/health/src/processor/intrusion_events.rs
- crates/health/src/collectors/nvue/rest/client.rs
- crates/health/src/collectors/firmware.rs
- crates/health/src/collectors/nmxt.rs
- crates/health/src/otlp/convert.rs
- crates/health/src/collectors/leak_detector.rs
- crates/health/src/lib.rs
- crates/health/src/sink/composite.rs
- crates/health/src/collectors/discovery.rs
- crates/health/src/collectors/nvue/rest/collector.rs
- crates/health/src/discovery/spawn.rs
- crates/health/src/processor/leak_events.rs
- crates/health/src/processor/rack_leak.rs
- crates/health/src/processor/health_report.rs
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Refactored HW health to support universal stage.
Opens road to TelemetryService handling as well as nv-redfish dispatcher integration.
Related issues
Closes #2937
Type of Change
Breaking Changes
Testing
Additional Notes