[sonic-installer] uboot: fix two-slot semantics, boot_once handling, and platform check#4489
[sonic-installer] uboot: fix two-slot semantics, boot_once handling, and platform check#4489william8545 wants to merge 2 commits into
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…tform check
Bring sonic-installer's U-Boot bootloader in line with the contracts that
grub.py and aboot.py already implement, and add the missing platform check
that has effectively turned `--skip-platform-check` into a no-op on every
U-Boot SONiC platform.
Slot mapping
* Replace the `if image in images[N]` substring match with explicit
`_get_image_slot(image)` (exact-equality lookup against
`sonic_version_<N>`). The previous code wrote the wrong slot when one
image's version string was a substring of another's (e.g. a release
`SONiC-OS-bmc.0` vs a dirty build `SONiC-OS-bmc.0-dirty-...`).
* Replace the `images[0]` / `images[1]` list-index assumption with a
`{slot: version}` dict returned by `_read_slots()`. The list-index
code broke whenever one slot was empty: with slot 1 = NONE and the
only image in slot 2, `images[0]` was slot 2's image and the code
still wrote `boot_next = run sonic_image_1` — pointing at the empty
slot.
* `set_default_image` and `set_next_image` now return False when the
image is not installed (previously returned True for any input,
making main.py's failure-handling branch dead code).
boot_once handling
* `get_next_image` now reads `boot_once` first, then falls back to
`boot_next` — matching the U-Boot bootcmd order. Before this change
`sonic-installer list` and `verify-next-image` lied about the next
image whenever `set-next-boot` had been used.
* When `boot_once` or `boot_next` references an empty slot, return the
raw `sonic_version_<N>` value rather than silently substituting
another image. Surfaces the broken state to the user instead of
hiding it.
* `set_default_image` clears `boot_once` so a stale one-shot can no
longer shadow the newly-set default (mirrors `grub-set-default`).
* `install_image` clears `boot_once` so a stale one-shot can no
longer shadow the freshly-installed image.
remove_image
* Clears `boot_once` when it still references the slot being removed
— previously the bootcmd would consume `boot_once = run sonic_image_N`
on next reboot and try to bootm a `.fit` that had just been rm -rf'd.
* Clears per-slot aux env vars using a vendor-portable pair-check:
only writes `<var> = ""` when both `<var>` and its sibling exist in
the env. This covers ASPEED's `_old` convention plus
`sonic_dir_1`/`sonic_dir_2` (Centec/Pensando), and intentionally
skips globally-shared vars (Centec uses a single `linuxargs` for
both slots) so the surviving slot stays bootable.
* Tracks the platform-survey of 11 per-slot pairs across the five
U-Boot SONiC platforms (Aspeed, Marvell-prestera arm64, Marvell
armhf, Centec, Pensando) in a single class-level table.
verify_image_platform
* Port grub.py's `installer/platforms_asic` check (extract via
`sed -e '1,/^exit_marker$/d' | tar xf - installer/platforms_asic -O`,
then `grep -Fxq` for the running platform). Previously this method
was `return os.path.isfile(image_path)`, which meant
`sonic-installer install` happily accepted any existing file —
making the documented `--skip-platform-check` flag a no-op in both
directions on U-Boot platforms.
* Backward-compatible with images that don't ship `platforms_asic`
(tar exits non-zero → returns True, matching grub.py).
Signed-off-by: William Tsai <willtsai@nvidia.com>
0c0d251 to
6120a35
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Obsersavations:
|
Yes, accepting any file is incorrect here; it needs to be a SONiC image. |
There was a problem hiding this comment.
Pull request overview
This PR corrects SONiC’s U-Boot bootloader implementation (sonic_installer/bootloader/uboot.py) to behave consistently with the existing Grub/Aboot contracts: robust two-slot selection, correct boot_once semantics, safer cleanup behavior, and a real platform verification check so --skip-platform-check is no longer effectively ignored on U-Boot systems.
Changes:
- Fix two-slot mapping by switching from list-index/substrings to explicit slot reading and exact image-to-slot lookup.
- Make
get_next_image()honor U-Boot boot order (boot_oncebeforeboot_next) and surface empty-slot / unknown-selector states instead of masking them. - Implement platform validation by parsing
installer/platforms_asicfrom the image payload (matching Grub’s behavior) and add/expand unit tests for the new semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
sonic_installer/bootloader/uboot.py |
Correct slot semantics, boot_once precedence/clearing, slot-aux-var cleanup, and add a real platforms_asic-based platform check. |
tests/installer_bootloader_uboot_test.py |
Add regression/unit tests for slot mapping, boot_once behavior, remove-image cleanup, and platform-check behavior. |
Comments suppressed due to low confidence (1)
tests/installer_bootloader_uboot_test.py:22
- MockProc.communicate is currently defined without a self parameter and returns an undefined name (commandline). If this stub is invoked, it will raise (TypeError/NameError) instead of behaving like subprocess.Popen().communicate().
class MockProc():
commandline = "linuxargs="
def communicate():
return commandline, None
| def test_install_image(mock_run_cmd): | ||
| image_path = ['sonic_image'] | ||
| expected_call = [call(['bash', image_path])] | ||
| expected_call = [ | ||
| call(['bash', image_path]), | ||
| call(['/usr/bin/fw_setenv', 'boot_once', '']), | ||
| ] | ||
|
|
||
| bootloader = uboot.UbootBootloader() | ||
| bootloader.install_image(image_path) | ||
| assert mock_run_cmd.call_args_list == expected_call |
There was a problem hiding this comment.
fix in latest commit
| @patch("sonic_installer.bootloader.uboot.device_info") | ||
| @patch("sonic_installer.bootloader.uboot.subprocess.Popen") | ||
| def test_verify_image_platform_mismatch(popen_patch, device_info_patch, | ||
| tmp_path): | ||
| """tar rc=0 + grep rc=1 -> False (platform not in manifest).""" | ||
| image = tmp_path / "sonic.bin" |
There was a problem hiding this comment.
fix in latest commit
Follow-up addressing PR review feedback on the U-Boot two-slot work.
remove_image
* Drop ("linuxargs", "linuxargs_old") from SLOT_AUX_VAR_PAIRS.
set_fips()/get_fips() read and write `linuxargs` as a
slot-agnostic kernel cmdline (they persist sonic_fips= into it
regardless of slot), so clearing it when removing a slot could
wipe the surviving image's boot args / FIPS setting. The slot is
already made non-bootable via sonic_version_<N>=NONE + boot_next
repoint, so the leftover var is inert.
get_next_image
* Fix an inaccurate comment: the no-selector fallback returns the
currently-running image, not images[0] -- it does NOT "mirror
grub.py" (grub falls back to images[0]). No behavior change.
tests
* test_install_image: pass image_path as a str, matching how
sonic_installer.main.install() actually calls install_image()
(avoids encoding a nested argv shape).
* Add test_verify_image_platform_no_manifest: regression for the
tar-nonzero branch (image ships no installer/platforms_asic ->
allow install), preventing accidental tightening of the
backward-compatible contract.
* Add test_remove_image_preserves_linuxargs: regression guarding
that linuxargs is never cleared on remove.
Signed-off-by: William Tsai <willtsai@nvidia.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@judyjoseph could you please help review the change? thanks. |
|
sorry, I forget to close this |
Partially addresses sonic-net/sonic-utilities#4548.
What I did
Bring the U-Boot bootloader
sonic_installer/bootloader/uboot.pyin line with the contracts thatgrub.pyandaboot.pyalready implement, and add the missing platform check that has effectively turned--skip-platform-checkinto a no-op on every U-Boot SONiC platform.This PR is scoped to the Python
uboot.pybootloader implementation and its tests. It does not touch FIPS code paths, reboot scripts, the sharedOnieInstallerBootloaderbase class, the bootloader detection helper, or the buildimage installer shell scripts.Issue coverage from #4548:
boot_onceis ignored, so the reported "Next" image can be wrongcleanupcan remove the user's pinned one-time boot imageset-fipscan corrupt or duplicate thesonic_fipsbootargget-fipstreats multi-character values beginning with1as enabledboot_oncecan shadowset-defaultand image installationboot_oncepointing to that imageNonevsNONE)soft-rebootcan pair the next image's kernel with the current image's bootargsThis PR fully fixes 9 of the 18 reported issues and fixes issue 12 for the tested ASPEED/BMC per-slot U-Boot environment. The remaining 8 are out of scope (FIPS, marker spelling, detection breadth, scripts, buildimage, shared base class) and are tracked for follow-up. Issue 12 still has a shared-
linuxargsportability caveat described below.Issues 5, 6, and 10 were reproduced on an AST2700 BMC during validation and are intentionally deferred to a FIPS-focused follow-up PR.
set-fipsregex[^\s]corrupts multi-char values / requires leading space. The fix is a one-line regex change ([^\s]→\S+, and drop the leading-space requirement) butlinuxargsmutation also needs to be routed per-slot to fix issue 10. Both should land together in a FIPS-focused follow-up PR.get-fipssubstring check returns enabled for any value starting with1. Symmetric with issue 5; will land in the same FIPS follow-up.imageargument. Requires routing read/write tolinuxargsorlinuxargs_oldbased on_get_image_slot(image). Held back to keep this PR scoped, and because the FIPS contract change deserves its own PR for review and test focus."NONE"vs"None". Cosmetic only — both spellings passget_installed_images'IMAGE_PREFIXfilter. Aligning the markers requires touching the buildimage installer scripts as well asuboot.py, so it's better as a separate change that lands both halves together.detect()returnsTruefor any ARM/aarch64 machine. The detection order inbootloader/__init__.py(Aboot → Grub → Uboot) prevents this from being load-bearing today, so the change is defensive rather than user-visible. Worth a separate small PR that also adds a/usr/bin/fw_printenvexistence check.sonic-uboot-env-init.sh), not insonic-utilities. Belongs in asonic-buildimagePR.get_current_imageregex assumesloop=<image>/fs.squashfs. The method is inherited fromOnieInstallerBootloaderand is shared withgrub.py. Fixing it in the right place (probably the base class, with an opt-in override hook) requires touching multiple bootloaders and is a separate cross-cutting change. This PR only preventsget_next_image()from hiding or crashing on selected empty-slot U-Boot states; it does not fix the shared current-image parser.soft-rebootpairs the next image's kernel with the current image's bootargs. The defect is inscripts/soft-reboot, not insonic_installer/bootloader/uboot.py. The fix mirrors the per-slotlinuxargs/sonic_bootargslookup thatfast-reboot/warm-reboot/express-rebootalready use in their device-tree branch. Will land in a separate reboot-scripts PR.Files this PR did not touch at all:
sonic_installer/main.py— unchanged. The CLI surface, deprecation warnings, and click options are unmodified. The pre-existing duplicate-installif not bootloader.set_default_image(...): raise click.Abort()branch is now reachable, but its behaviour is unchanged. Directset-defaultandset-next-bootcommands still rely on the existing installed-image precheck; they do not add new handling for a false return after that precheck.sonic_installer/bootloader/grub.py,sonic_installer/bootloader/aboot.py,sonic_installer/bootloader/onie.py,sonic_installer/bootloader/bootloader.py— unchanged.scripts/reboot,scripts/fast-reboot,scripts/soft-reboot,scripts/warm-reboot,scripts/express-reboot— unchanged.sonic_installer/aliases.ini— unchanged. (Theget_fips/set_fips/verify_next_imageunderscore-alias gap is pre-existing mainline behaviour, not caused or affected by this PR.)upgrade-docker/rollback-dockercode paths — unchanged. (Thetemplate parsing error: ... .ContainerConfig.Labels.Tagwarning is a pre-existing mainline bug, unrelated to U-Boot.)How I did it
Slot mapping (issues 1, 2)
if image in images[N]substring branches inset_default_image/set_next_image/remove_imagewith an explicit_get_image_slot(image)helper that uses exact-equality matching againstsonic_version_<N>. Substring-related collisions (releaseSONiC-OS-X.0vs. dirtySONiC-OS-X.0-dirty-…) no longer silently pick the wrong slot.images[0]/images[1]list-index assumption with a_read_slots()helper returning{slot: version}. When slot 1 is empty and slot 2 holds the only image, the new code correctly writesboot_next='run sonic_image_2'instead of pointing at the empty slot 1.set_default_imageandset_next_imagenow returnFalsefor unknown images instead of always returningTrue. This restores the contract thatgrub.pyandaboot.pyhonor and makesmain.py's duplicate-install failure branch reachable again. The direct CLI handlers are intentionally unchanged in this PR and still rely on their installed-image precheck.boot_oncehandling (issues 3, 4, 7, 11, 15)get_next_imagenow consultsboot_oncefirst, then falls back toboot_next— matching U-Bootbootcmd's actual execution order. As a side effect,sonic-installer list,verify-next-image, andcleanup(which all consumeget_next_image) now agree with what U-Boot will actually do at the next reboot.boot_once/boot_nextreferences an empty slot, return the rawsonic_version_<N>value (e.g."NONE") rather than silently substituting another installed image. The broken state is visible to the user instead of being hidden behind a fake "Next:" line.boot_oncecontains an unrecognised selector (anything that doesn't namesonic_image_<1|2>), return the literal raw value solistdoesn't lie about which command will run first.set_default_imagenow writesfw_setenv boot_once ''after writingboot_next. This mirrorsgrub-set-default's implicit collapse ofnext_entry— a freshly chosen persistent default can no longer be shadowed by a stale one-shot.install_imagelikewise clearsboot_onceafter the installer script returns, so a one-shot set before install can't shadow the newly-installed image at next reboot.get_next_imageno longer hides selected empty-slot states (issue 15). Ifboot_onceorboot_nextpoints at an empty slot, the function returns the rawsonic_version_<N>marker instead of substituting another installed image. If no selector is configured at all, it falls back to the current image or the surviving slot.remove_image(issues 8, 12)boot_oncewhen it still references the slot being removed. Before this PR,bootcmdwould consumeboot_once='run sonic_image_N'on next reboot and try tobootma.fitblob that had just beenrm -rf'd.<var>and its sibling exist in the env. The pair table covers both naming conventions:_oldsuffix (image_dir/image_dir_old,fit_name/fit_name_old,linuxargs/linuxargs_old,sonic_bootargs/sonic_bootargs_old,sonic_boot_load/sonic_boot_load_old,initrd_name/initrd_name_old,fdt_name/fdt_name_old,ubi_sonic_boot_bootargs/ubi_sonic_boot_bootargs_old,ubi_sonic_boot_load/ubi_sonic_boot_load_old,image_name/image_name_old)._1/_2suffix (sonic_dir_1/sonic_dir_2).linuxargsis referenced by bothsonic_image_1andsonic_image_2. It is not a complete proof for every U-Boot vendor: if a shared-linuxargsplatform also has a stalelinuxargs_oldvariable, a future hardening patch should derive slot-local variables from the actualsonic_image_Ncommand before clearing them.verify_image_platform(issue 9)installer/platforms_asiccheck fromgrub.py: extract the manifest from the.binvia the standardsed -e '1,/^exit_marker$/d' | tar xf - installer/platforms_asic -Opipeline, thengrep -Fxqfor the running platform. Before this PR this method wasreturn os.path.isfile(image_path), which meant the platform-check layer accepted any existing file. In the full install flow, the image still had to pass the earlier SONiC binary-version check, but valid foreign-platform SONiC images were not rejected unless the user explicitly requested--skip-platform-check.platforms_asic:tarexits non-zero and the method returnsTrue(matchesgrub.py).Tests
22 regression tests in
tests/installer_bootloader_uboot_test.pycover the new behaviour. The tests useunittest.mockto drivesubprocess.Popen/run_commandagainst crafted env states and assert the exactfw_setenvcall list — same pattern as the existing tests in this file. No live hardware required to run the suite.Backward compatibility / risk
remove_imagewas designed against a survey of the per-slot env vars used by all five vendors and was live-tested on ASPEED/BMC. Vars that are not paired in the running env are intentionally left alone. No vendor hardware tests were run. Platforms with globally-sharedlinuxargsplus stale sibling variables should get additional regression coverage or follow-up hardening that derives slot-local variables from the actualsonic_image_Ncommand.set-defaultclearingboot_once: Behaviour change with a clear rationale (cross-bootloader alignment;grub-set-defaultdoes the same). A user who explicitly setsboot_onceand then runsset-defaultwill lose the one-shot — which is the intended cross-bootloader semantic.install_imageclearingboot_once: Same rationale — freshly-installed images shouldn't be silently shadowed by stale one-shots.verify_image_platformchange: The new check is permissive on images that don't shipinstaller/platforms_asic(returnsTrue), matchinggrub.py. Existing valid installs continue to work; only foreign-platform.bins that do ship a non-matchingplatforms_asicare newly rejected.How to verify it
Unit tests
cd sonic-utilities python3 -m pytest tests/installer_bootloader_uboot_test.py -v22 tests should pass. The non-trivial ones:
test_set_default_image_only_slot_2_populated/test_set_next_image_only_slot_2_populated— regression for the list-index bug. Slot 1 empty + image in slot 2 must writerun sonic_image_2(notrun sonic_image_1).test_set_default_image_unknown_returns_false/test_set_next_image_unknown_returns_false— contract:Falseon unknown image, not silentTrue.test_install_image— install clearsboot_once.test_remove_image_clears_boot_once_pointing_at_removed_slot/test_remove_image_preserves_boot_once_pointing_at_other_slot— the conditionalboot_onceclear inremove_image.test_remove_image_clears_slot_aux_vars/test_remove_image_skips_absent_aux_vars— the pair-check (clear only when both sibling names exist; otherwise leave the var alone).test_get_next_image_boot_once_overrides_boot_next—boot_oncetakes precedence overboot_next.test_get_next_image_boot_once_empty_slot_returns_marker/test_get_next_image_unknown_boot_once_returns_raw— empty-slot and unrecognised-selector handling.test_get_image_slot_substring_safe— exact-equality slot lookup with versions that share substrings.test_verify_image_platform_matches/test_verify_image_platform_mismatch— the newplatforms_asiccheck.Live device validation (AST2700 BMC, two installed images)
Previous command output (if the output of a command-line utility has changed)
sonic-installer listafterset-next-bootbefore this PR:sonic-installer set-defaultwith a version that is a prefix of another slot's version (substring bug):sonic-installer removeleaving stale aux vars andboot_oncebefore this PR:sonic-installer listwhenboot_nextpoints at an empty slot:verify_image_platformaccepting any existing file at the platform-check layer:New command output (if the output of a command-line utility has changed)
sonic-installer listafterset-next-bootwith this PR:set-defaultwith the same substring-prefix state:removenow clearsboot_onceand the per-slot aux vars for the ASPEED/BMC paired-slot environment:listwhenboot_nextpoints at an empty slot:installrejecting a valid foreign-platform SONiC image: