firmware: resolve firmware artifacts from files[] #2915

Open
rahmonov wants to merge 1 commit into NVIDIA:main from rahmonov:migration-to-files

@rahmonov

Contributor

We are moving towards being able to manage firmware configuration at runtime.

One of the requirements for this is to download the firmware files at runtime rather than packaging files into the filesystem at deployment (which means updates require re-deployment).

But we need to be careful with regard to backward compatibility. That is why this PR introduces resolution logic.

If a URL is present in the config, the file needs to be downloaded. If not, the code looks for it on the local filesystem.
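
The rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the PR's code: `FileArtifact`, `ArtifactSource`, `resolve_source`, and the `local_dir` parameter are hypothetical names, and the sketch assumes that an empty or whitespace-only URL counts as absent.

```rust
use std::path::PathBuf;

/// Hypothetical, simplified view of one `files[]` entry.
#[derive(Debug, Clone, Default)]
pub struct FileArtifact {
    pub filename: Option<String>,
    pub url: Option<String>,
}

/// Where the firmware bytes should come from.
#[derive(Debug, PartialEq, Eq)]
pub enum ArtifactSource {
    /// Download from this URL at runtime.
    Remote(String),
    /// Read a file that was packaged into the filesystem at deployment.
    Local(PathBuf),
}

/// Treats empty or whitespace-only values as absent (an assumption of this sketch).
fn non_empty(value: &Option<String>) -> Option<&str> {
    value.as_deref().map(str::trim).filter(|s| !s.is_empty())
}

/// A URL, when present, wins; otherwise fall back to the local file.
pub fn resolve_source(artifact: &FileArtifact, local_dir: &str) -> Result<ArtifactSource, String> {
    if let Some(url) = non_empty(&artifact.url) {
        return Ok(ArtifactSource::Remote(url.to_string()));
    }
    match non_empty(&artifact.filename) {
        Some(name) => Ok(ArtifactSource::Local(PathBuf::from(local_dir).join(name))),
        None => Err("artifact has neither url nor filename".to_string()),
    }
}
```

Checking the URL first keeps existing configs working, since entries that never set a URL still resolve to the local filesystem.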

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Suggestion to the reviewer:

  1. Check the small changes in crates/api-model/src/firmware.rs.
  2. Then review handler.rs, which gives a high-level view of the impact of the changes.
  3. Everything else is downstream from them.

@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 803a5ac4-62e4-48e4-a373-92b693df2610

📥 Commits

Reviewing files that changed from the base of the PR and between dff3559 and be081f1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • Cargo.toml
  • crates/api-model/src/firmware.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/artifact_cache.rs
  • crates/firmware/src/downloader.rs
  • crates/firmware/src/lib.rs
  • crates/firmware/src/tests/config.rs
  • crates/firmware/src/tests/downloader.rs
  • crates/machine-controller/src/config/firmware_global.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/firmware_artifact.rs
  • crates/machine-controller/src/scout_firmware_scripts.rs
  • docs/observability/core_metrics.md
💤 Files with no reviewable changes (1)
  • Cargo.toml
✅ Files skipped from review due to trivial changes (1)
  • docs/observability/core_metrics.md
🚧 Files skipped from review as they are similar to previous changes (10)
  • crates/firmware/src/lib.rs
  • crates/machine-controller/src/config/firmware_global.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/tests/downloader.rs
  • crates/api-model/src/firmware.rs
  • crates/firmware/src/artifact_cache.rs
  • crates/machine-controller/src/handler.rs
  • crates/firmware/src/tests/config.rs
  • crates/firmware/src/downloader.rs
  • crates/machine-controller/src/handler/firmware_artifact.rs

Summary by CodeRabbit

  • New Features

    • Firmware downloads now verify integrity with SHA-256 instead of MD5.
    • Firmware artifacts can be defined with a filename and/or a URL, each carrying its own SHA-256.
    • Added a configurable firmware download cache directory (defaulting to /mnt/persistence/fw/download-cache).
  • Bug Fixes

    • Improved validation for firmware artifact inputs, with clearer errors when required data is missing or empty.
    • Cached firmware artifacts are re-verified with SHA-256 and stale/invalid cache files are removed and re-downloaded.
    • Scout firmware script lookup now normalizes model names with spaces.
  • Documentation

    • Added carbide_site_explorer_last_run_status to NICo core metrics docs.
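
The cache behavior in the Bug Fixes list (re-verify a cached file, remove it when the digest does not match) might look roughly like the sketch below. It is an illustration under stated assumptions: `cached_file_is_valid` is a hypothetical name, and the digest routine is passed in as `digest_hex` so that the example needs only the standard library. In the PR the digest would be SHA-256.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns `Ok(true)` when `path` exists and its digest matches `expected_hex`.
/// A file whose digest does not match is deleted so the caller can re-download it.
///
/// `digest_hex` is a stand-in for a real SHA-256 routine; it is injected here
/// so the sketch has no external dependencies.
pub fn cached_file_is_valid(
    path: &Path,
    expected_hex: &str,
    digest_hex: impl Fn(&[u8]) -> String,
) -> io::Result<bool> {
    let bytes = match fs::read(path) {
        Ok(bytes) => bytes,
        // Nothing cached yet: not valid, but not an error either.
        Err(err) if err.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(err) => return Err(err),
    };
    if digest_hex(bytes.as_slice()).eq_ignore_ascii_case(expected_hex.trim()) {
        return Ok(true);
    }
    // Stale or corrupt cache entry: remove it rather than serve wrong bytes.
    fs::remove_file(path)?;
    Ok(false)
}
```

The important property is that an existing file is never trusted on its own; a cache hit only counts after the digest check passes.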

Walkthrough

Firmware metadata now supports optional artifact URLs, downloads verify SHA-256, cache filenames derive from URLs, and machine-controller artifact handling uses a configured cache directory. One observability metrics row was added.
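
As a rough illustration of deriving a cache filename from a URL, the sketch below takes the last path segment. It is not the PR's `firmware_cache_filename`: the function name `cache_filename_from_url` is hypothetical, and it uses plain string handling instead of the `url` crate that the PR adds.

```rust
/// Hypothetical stand-in for a URL-derived cache filename helper.
/// Returns the last non-empty path segment of `url`, or `None` when there is none.
pub fn cache_filename_from_url(url: &str) -> Option<String> {
    // Drop the fragment and query, which are not part of the path.
    let without_fragment = url.split('#').next().unwrap_or("");
    let without_query = without_fragment.split('?').next().unwrap_or("");
    // Require "scheme://authority/path" and keep only the path.
    let (_scheme, rest) = without_query.split_once("://")?;
    let (_authority, path) = rest.split_once('/')?;
    let name = path.rsplit('/').next().unwrap_or("");
    match name {
        // A trailing slash or a dot segment gives no usable basename.
        "" | "." | ".." => None,
        _ => Some(name.to_string()),
    }
}
```

Returning `None` instead of inventing a name leaves the caller free to decide what to do when a URL has no usable basename.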

Changes

Firmware artifact resolution and SHA-256 migration

| Layer | File(s) | Summary |
| --- | --- | --- |
| API model and legacy counts | crates/api-model/src/firmware.rs, crates/firmware/src/tests/config.rs | FirmwareFileArtifact accepts optional filename and url, artifact_count() prefers files over legacy fields, and config parsing tests cover the updated file shape. |
| Firmware cache helper | Cargo.toml, crates/firmware/Cargo.toml, crates/firmware/src/artifact_cache.rs, crates/firmware/src/lib.rs | The firmware crates remove md5, add sha2 and url, and export firmware_cache_filename for URL-derived cache paths. |
| Downloader SHA-256 checks | crates/firmware/src/downloader.rs, crates/firmware/src/tests/downloader.rs | FirmwareDownloader now verifies SHA-256 digests, and the downloader tests cover missing, matching, stale, and invalid checksum cases. |
| Artifact resolution and cache config | crates/machine-controller/src/config/firmware_global.rs, crates/machine-controller/src/handler/firmware_artifact.rs | FirmwareGlobal adds a firmware download cache directory, and the new resolver returns local paths and scout file artifacts from optional firmware metadata. |
| Handler wiring | crates/machine-controller/src/handler.rs, crates/machine-controller/src/scout_firmware_scripts.rs | The handler resolves scout and host artifacts through the new resolver, uses the resolved local path for uploads, advances multi-artifact installs with artifact_count(), and normalizes scout lookup keys with underscore-separated whitespace handling. |
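
To make the `artifact_count()` precedence concrete, here is a minimal sketch. The struct `FirmwareEntry` and its field types are simplified assumptions, not the PR's model. Counting a url-only legacy entry as one artifact follows a review suggestion made later in this thread and may not match the merged code.

```rust
/// Hypothetical, simplified firmware entry with the new and legacy fields.
#[derive(Debug, Default)]
pub struct FirmwareEntry {
    /// New `files[]` shape (reduced to names for this sketch).
    pub files: Vec<String>,
    /// Legacy multi-file field.
    pub filenames: Vec<String>,
    /// Legacy single-file field.
    pub filename: Option<String>,
    /// Legacy single-URL field.
    pub url: Option<String>,
}

impl FirmwareEntry {
    /// `files` wins when populated; legacy fields are only consulted otherwise.
    pub fn artifact_count(&self) -> usize {
        if !self.files.is_empty() {
            self.files.len()
        } else if !self.filenames.is_empty() {
            self.filenames.len()
        } else if self.filename.is_some() || self.url.is_some() {
            1
        } else {
            0
        }
    }
}
```

A handler that advances through a multi-artifact install can compare its current position against this count without caring which shape the config used.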

Observability metrics documentation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Metrics table row | docs/observability/core_metrics.md | The observability core metrics table includes carbide_site_explorer_last_run_status as a new gauge entry. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2756: Both PRs touch crates/machine-controller/src/handler.rs and the scout firmware artifact URL construction path.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: resolving firmware artifacts from files[]. |
| Description check | ✅ Passed | The description is directly related to runtime firmware artifact resolution and backward compatibility. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 81.67% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
Comment @coderabbitai help to get the list of available commands.

@rahmonov

Contributor Author

@coderabbitai ptal

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

@rahmonov: I’ll review the changes, with particular attention to the firmware resolution path, backward compatibility, and the downstream impacts highlighted in the PR notes.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rahmonov

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-model/src/firmware.rs`:
- Around line 210-214: The firmware artifact model currently allows entries with
only sha256, but these are invalid for downstream resolution. Add validation to
the firmware config/model in firmware.rs so `files[]` entries must provide at
least one of `filename` or `url`, and reject the shape during
load/deserialization rather than later in `firmware_artifact` handling. Use the
existing artifact struct fields `filename`, `url`, and `sha256` to enforce this
boundary check and keep bad metadata out of machine-controller.

In `@crates/firmware/src/downloader.rs`:
- Around line 61-71: `available()` and `available_actual()` currently
short-circuit on an existing `filename`, which skips the new `sha256` check and
can return stale cached bytes as valid. Update the cache-hit path in
`Downloader::available_actual` so existing files are revalidated against the
requested digest before returning `true`, using the same checksum verification
logic already used for downloaded artifacts. Keep the behavior consistent for
both direct calls through `available()` and the async download flow, and add a
regression test that pre-creates `filename` with mismatched contents and asserts
the cache hit is rejected.

In `@crates/machine-controller/src/handler/firmware_artifact.rs`:
- Around line 60-137: Give url precedence in both resolver functions: in
resolve_firmware_artifact and resolve_scout_file_artifact, branch on
artifact.url before artifact.filename so a populated URL is treated as a remote
artifact. Update the local-path/source selection in resolve_firmware_artifact to
use the URL path first and only fall back to filename when URL is absent, and
adjust resolve_scout_file_artifact so it preserves the remote URL instead of
rewriting it through PXE-local handling when url is provided.
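
The first finding above (reject `files[]` entries that carry neither `filename` nor `url`) could be enforced with a check along these lines. This is a sketch only: `FileArtifact`, `ArtifactError`, and `validate` are hypothetical names, and whether `sha256` is mandatory is left open because the thread does not say.

```rust
/// Hypothetical, simplified `files[]` entry.
#[derive(Debug, Default)]
pub struct FileArtifact {
    pub filename: Option<String>,
    pub url: Option<String>,
    pub sha256: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ArtifactError {
    /// Neither `filename` nor `url` was provided.
    MissingSource,
    /// A field was present but empty or whitespace-only.
    EmptyField(&'static str),
}

/// Boundary check intended to run when the config is loaded.
pub fn validate(artifact: &FileArtifact) -> Result<(), ArtifactError> {
    let fields = [
        ("filename", &artifact.filename),
        ("url", &artifact.url),
        ("sha256", &artifact.sha256),
    ];
    for (name, value) in fields {
        if matches!(value, Some(v) if v.trim().is_empty()) {
            return Err(ArtifactError::EmptyField(name));
        }
    }
    if artifact.filename.is_none() && artifact.url.is_none() {
        return Err(ArtifactError::MissingSource);
    }
    Ok(())
}
```

Running this at load time keeps a sha256-only entry from reaching the resolver, where the failure would be harder to trace back to the config.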

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f09737c5-fd80-4f27-97da-20a7391280f5

📥 Commits

Reviewing files that changed from the base of the PR and between 52a0a54 and 5e760dd.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • Cargo.toml
  • crates/api-model/src/firmware.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/artifact_cache.rs
  • crates/firmware/src/downloader.rs
  • crates/firmware/src/lib.rs
  • crates/firmware/src/tests/config.rs
  • crates/firmware/src/tests/downloader.rs
  • crates/machine-controller/src/config/firmware_global.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/firmware_artifact.rs
💤 Files with no reviewable changes (1)
  • Cargo.toml

Comment thread crates/api-model/src/firmware.rs
Comment thread crates/firmware/src/downloader.rs
Comment thread crates/machine-controller/src/handler/firmware_artifact.rs
@rahmonov rahmonov force-pushed the migration-to-files branch from 5e760dd to 3395c31 Compare June 28, 2026 01:47
@rahmonov rahmonov marked this pull request as ready for review June 28, 2026 17:00
@rahmonov rahmonov requested a review from a team as a code owner June 28, 2026 17:00

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (1)
crates/machine-controller/src/handler/firmware_artifact.rs (1)

173-496: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use table-driven resolver tests.

These cases are repeated input/output variants for the same resolver functions. Consolidating them with value_scenarios!, scenarios!, or explicit check_values / check_cases will keep coverage easier to extend. As per coding guidelines, “Use table-driven test style when writing tests in Rust.”
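
For readers unfamiliar with the style, a table-driven test keeps one row per scenario and a single loop that runs them all. The sketch below uses plain Rust and does not use the repository's `value_scenarios!` or `scenarios!` macros, whose exact syntax is not shown in this thread. `pick_source`, `CASES`, and `run_table` are hypothetical names, and `pick_source` is a deliberately tiny stand-in for the real resolvers.

```rust
/// Hypothetical miniature of the resolver decision: which field supplies the artifact.
pub fn pick_source(url: Option<&str>, filename: Option<&str>) -> Option<&'static str> {
    match (url, filename) {
        (Some(u), _) if !u.is_empty() => Some("remote"),
        (_, Some(f)) if !f.is_empty() => Some("local"),
        _ => None,
    }
}

/// One row per scenario: (name, url, filename, expected).
pub const CASES: &[(&str, Option<&str>, Option<&str>, Option<&str>)] = &[
    ("url only", Some("https://example.com/a.bin"), None, Some("remote")),
    ("filename only", None, Some("a.bin"), Some("local")),
    ("both set, url wins", Some("https://example.com/a.bin"), Some("a.bin"), Some("remote")),
    ("empty url falls back", Some(""), Some("a.bin"), Some("local")),
    ("neither", None, None, None),
];

/// Runs every row; in a real crate this body would sit under `#[test]`.
pub fn run_table() {
    for (name, url, filename, expected) in CASES {
        assert_eq!(pick_source(*url, *filename), *expected, "case: {name}");
    }
}
```

Adding a scenario then means adding one row to `CASES` rather than another test function.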

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler/firmware_artifact.rs` around lines 173
- 496, The resolver tests are written as many nearly identical case-by-case unit
tests instead of a table-driven style. Refactor the repeated scenarios around
resolve_firmware_artifact and resolve_scout_file_artifact into
parameterized/table-based tests using the existing test helpers or macros like
value_scenarios!, scenarios!, check_values, or check_cases. Keep the same
assertions and coverage, but group the input/output variants into compact case
lists so new cases can be added without duplicating test bodies.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-model/src/firmware.rs`:
- Around line 324-334: The artifact_count method currently treats legacy entries
with only url set as empty, so update the single-artifact branch in
artifact_count to also return 1 when self.url is present. Make the change
alongside the existing checks for self.files, self.filenames, and self.filename,
and add a regression test covering a url-only Firmware entry to verify
backward-compatible behavior.

In `@crates/machine-controller/src/handler/firmware_artifact.rs`:
- Around line 157-164: Percent-encode the PXE URL path segments before building
the return value in firmware_artifact.rs, because using the `relative` string
from `Component::Normal` directly in the `Ok(format!(...))` path can produce
invalid URLs or alter the fragment/query. Update the `relative.to_str()` /
`Ok(format!(...))` flow so each path segment is encoded before concatenation,
while keeping the existing `StateHandlerError::GenericError` handling for
invalid UTF-8 intact.
- Around line 72-79: `resolve_firmware_artifact()` currently rejects remote
artifacts when `firmware_cache_filename()` cannot derive a basename from the
URL, even if `files[].filename` is present. Update the `local_path` handling in
`firmware_artifact.rs` so the `Remote` case first tries
`firmware_cache_filename(...)` and, if that returns `None`, falls back to a
cache path built from `files[].filename` while keeping `source` as `Remote`. Use
the existing `firmware.version`, `pos`, and `files[].filename` context to keep
the error path only for cases where neither source is usable.

---

Nitpick comments:
In `@crates/machine-controller/src/handler/firmware_artifact.rs`:
- Around line 173-496: The resolver tests are written as many nearly identical
case-by-case unit tests instead of a table-driven style. Refactor the repeated
scenarios around resolve_firmware_artifact and resolve_scout_file_artifact into
parameterized/table-based tests using the existing test helpers or macros like
value_scenarios!, scenarios!, check_values, or check_cases. Keep the same
assertions and coverage, but group the input/output variants into compact case
lists so new cases can be added without duplicating test bodies.
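
The percent-encoding finding above can be illustrated with a standard-library-only sketch. The names `encode_segment` and `pxe_url`, the `base` parameter, and the plain `String` error type are assumptions for illustration; the PR reports errors through `StateHandlerError::GenericError`, and a real implementation may prefer an existing encoding crate.

```rust
use std::path::{Component, Path};

/// Percent-encodes one path segment, leaving only RFC 3986 unreserved characters as-is.
fn encode_segment(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for byte in segment.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Builds `<base>/<encoded segments>` from a relative path.
/// Only plain components are accepted, so `..` and absolute paths are rejected.
pub fn pxe_url(base: &str, relative: &Path) -> Result<String, String> {
    let mut segments = Vec::new();
    for component in relative.components() {
        match component {
            Component::Normal(part) => {
                let part = part.to_str().ok_or("path is not valid UTF-8")?;
                segments.push(encode_segment(part));
            }
            other => return Err(format!("unsupported path component: {:?}", other)),
        }
    }
    if segments.is_empty() {
        return Err("empty relative path".to_string());
    }
    Ok(format!("{}/{}", base.trim_end_matches('/'), segments.join("/")))
}
```

Encoding each segment separately keeps the `/` separators intact while stopping characters such as `#` and `?` from being read as a fragment or query.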

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b35efd5-f1f9-4666-8d70-e4d9fca9d2a3

📥 Commits

Reviewing files that changed from the base of the PR and between 5e760dd and 3395c31.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • Cargo.toml
  • crates/api-model/src/firmware.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/artifact_cache.rs
  • crates/firmware/src/downloader.rs
  • crates/firmware/src/lib.rs
  • crates/firmware/src/tests/config.rs
  • crates/firmware/src/tests/downloader.rs
  • crates/machine-controller/src/config/firmware_global.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/firmware_artifact.rs
💤 Files with no reviewable changes (1)
  • Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (8)
  • crates/firmware/src/lib.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/tests/config.rs
  • crates/firmware/src/artifact_cache.rs
  • crates/firmware/src/tests/downloader.rs
  • crates/machine-controller/src/config/firmware_global.rs
  • crates/firmware/src/downloader.rs
  • crates/machine-controller/src/handler.rs

Comment thread crates/api-model/src/firmware.rs
Comment thread crates/machine-controller/src/handler/firmware_artifact.rs
Comment thread crates/machine-controller/src/handler/firmware_artifact.rs
@rahmonov rahmonov force-pushed the migration-to-files branch from 3395c31 to dff3559 Compare June 28, 2026 17:29
@github-actions

github-actions Bot commented Jun 28, 2026

🔍 Container Scan Summary

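| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 285 | 6 | 25 | 103 | 7 | 144 |
| machine-validation-runner | 748 | 30 | 189 | 272 | 36 | 221 |
| machine_validation | 748 | 30 | 189 | 272 | 36 | 221 |
| machine_validation-aarch64 | 748 | 30 | 189 | 272 | 36 | 221 |
| nvmetal-carbide | 748 | 30 | 189 | 272 | 36 | 221 |
| TOTAL | 3283 | 126 | 781 | 1197 | 151 | 1028 |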

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Support the new files[] firmware artifact shape for host firmware upgrades.
Artifacts with URLs are resolved as remote downloads into the firmware cache;
artifacts without URLs, and legacy entries without files[], are treated as
local files.

Move Scout file artifact URL resolution into the firmware artifact module and
fail firmware upgrades when a required local artifact is missing instead of
deferring indefinitely.
@rahmonov rahmonov force-pushed the migration-to-files branch from dff3559 to be081f1 Compare June 28, 2026 21:39