test(recovery): reap the whole detached box tree in pid-file recovery tests#870
Draft
G4614 wants to merge 4 commits into
Draft
test(recovery): reap the whole detached box tree in pid-file recovery tests#870G4614 wants to merge 4 commits into
G4614 wants to merge 4 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
Jun 26, 2026
…e-ai#870 reaper) recovery_with_{missing,corrupted}_pid_file deliberately destroy a live detached box's shim.pid to exercise recovery's `Absent` path. Recovery then clears the box's pid, so the teardown `remove(force)` has no pid to signal — leaking the live shim. It's a real on-pass orphan, and because shim.pid is gone, PerTestBoxHome's leak guard can't even see it. Reproduced directly on a host: every CLI command returns success yet the shim keeps running. Fix in the test, where the bug is: capture the recorded pid before destroying the file and SIGKILL it on drop (pass or panic) via a ShimReaper guard. Killing the outer bwrap pid tears down the box's pid namespace and the whole VM tree. This replaces the out-of-process reaper approach from this branch's earlier commit (scripts/test/reap_boxes.sh + make/test.mk EXIT traps + BOXLITE_TEST_HOME_BASE corral). That approach masked leaks at suite exit instead of fixing the tests that cause them — at odds with PerTestBoxHome's deliberate assert-don't-reap design. Those changes are reverted here; the net effect on the branch is just the recovery.rs fix. Audited every make test:integration suite — recovery.rs holds the only two tests of this kind: - pid_file.rs / sigstop_quiesce.rs / detach.rs reap via valid-pid stop/remove; - health_check.rs already reaps its survivor in teardown; - Python detached tests call box.stop() (kills the shim); - Node/C start no detached boxes. Verified: recovery test compiles (--features krun,gvproxy); cargo fmt clean. The leak and the SIGKILL-reap were both demonstrated on this host via a stand-in shim. The full cargo recovery run needs a KVM host with NPROC headroom — boxes can't boot here (jailer RLIMIT_NPROC=100 collides with the host's existing uid threads) — so that run was not performed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…safe) `make test:integration` / `test:stress:disk` could leak live detached shims onto the runner. Since boxlite-ai#851 detached boxes drop bwrap's `--die-with-parent`, so a leaked shim outlives the test binary, `make`, and the CI job — the run-over-run accumulation that poisons the self-hosted e2e runner (boxlite-ai#137/boxlite-ai#141). The existing guards cannot catch the dominant failure mode. The nextest `vm` profile uses `slow-timeout = { terminate-after }`, so a hung test is SIGKILL'd — bypassing every in-process cleanup: Rust `PerTestBoxHome` / `TestContext` / `BoxCleanup` Drop, pytest fixtures, vitest `afterAll`. boxlite-ai#860 only added a panic-time sweep to the CLI harness (unwind only, not timeout); Rust core, Python, and the detached-heavy `stress_disk` had no out-of-process backstop. Plug it where it survives a SIGKILL — outside the test process: - scripts/test/reap_boxes.sh: home-scoped reaper. Kills shims by `<home>/**/boxes/<id>/shim.pid`, plus an argv-scan catch-all for shims whose pid file was deleted/corrupted (recovery marks them Stopped + pid=None, so `rm --force` has no pid to signal). Only touches processes under the given per-run root — never a blanket `pkill boxlite-shim`, so a developer's real boxes are safe. Idempotent, best-effort, exit 0. - make/test.mk: every box-spawning integration recipe (rust, cli, python, node, stress:disk) corrals its homes under a per-run `BOXLITE_TEST_HOME_BASE` temp dir and runs `trap 'reap_boxes.sh <root>; rm -rf <root>' EXIT`, firing on pass, failure, timeout, and interrupt. (C SDK tests use non-detached boxes — `--die-with-parent` reaps them with the test process — so no trap.) - test-utils/home.rs: `PerTestBoxHome::{new,isolated}` honor `BOXLITE_TEST_HOME_BASE` so rust/cli per-test homes land under the sweepable root (default `/tmp` unchanged). Drop's deliberate assert-don't-reap behavior is untouched. - node integration-setup.ts: honor the same env as its home root. Verified: reaper functionally tested (precise + corrupt-pid catch-all kill, unrelated process survives, idempotent); recipe corral+trap simulated end-to-end (leaked shim reaped, root removed); `cargo check`/`fmt` and the 4 `home.rs` unit tests (incl. `drop_panics_if_shim_still_alive`) pass. The VM integration suites themselves need a KVM runner and are not run here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-ai#870 reaper) recovery_with_{missing,corrupted}_pid_file deliberately destroy a live detached box's shim.pid to exercise recovery's `Absent` path. Recovery then clears the box's pid, so the teardown `remove(force)` has no pid to signal — leaking the live shim. It's a real on-pass orphan, and because shim.pid is gone, PerTestBoxHome's leak guard can't even see it. Reproduced directly on a host: every CLI command returns success yet the shim keeps running. Fix in the test, where the bug is: capture the recorded pid before destroying the file and SIGKILL it on drop (pass or panic) via a ShimReaper guard. Killing the outer bwrap pid tears down the box's pid namespace and the whole VM tree. This replaces the out-of-process reaper approach from this branch's earlier commit (scripts/test/reap_boxes.sh + make/test.mk EXIT traps + BOXLITE_TEST_HOME_BASE corral). That approach masked leaks at suite exit instead of fixing the tests that cause them — at odds with PerTestBoxHome's deliberate assert-don't-reap design. Those changes are reverted here; the net effect on the branch is just the recovery.rs fix. Audited every make test:integration suite — recovery.rs holds the only two tests of this kind: - pid_file.rs / sigstop_quiesce.rs / detach.rs reap via valid-pid stop/remove; - health_check.rs already reaps its survivor in teardown; - Python detached tests call box.stop() (kills the shim); - Node/C start no detached boxes. Verified: recovery test compiles (--features krun,gvproxy); cargo fmt clean. The leak and the SIGKILL-reap were both demonstrated on this host via a stand-in shim. The full cargo recovery run needs a KVM host with NPROC headroom — boxes can't boot here (jailer RLIMIT_NPROC=100 collides with the host's existing uid threads) — so that run was not performed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed pid Correction to the previous commit on this branch. The first ShimReaper SIGKILL'd the pid captured from shim.pid — but that records the *outer* bwrap launcher, and since boxlite-ai#851 dropped --die-with-parent, killing it leaves the inner pid-ns tree (inner bwrap + shim + VM) alive. Running the test as root confirmed it: both tests passed yet 2 procs survived each box. Reap by box id instead: every box process carries the unique id in its bwrap bind paths, so `pkill -9 -f <id>` tears down the whole tree. ShimReaper now holds the box id (armed before the test strands the box) and reaps on drop — pass or panic. Verified as root on this host (boxes can boot once euid 0 bypasses the jailer's RLIMIT_NPROC ceiling), two-sided: - recorded-pid kill: 2 tests pass, 2 box procs leak; - pkill-by-id: 2 tests pass, 0 box procs leak. Compiles (--features krun,gvproxy); cargo fmt clean. Note (separate, not fixed here): the same root run showed production `boxlite stop` / `rm -f` also strand a detached box's inner tree — they call `kill_process(recorded_pid)`, which only signals the outer bwrap. `cgroup.kill` on the box's cgroup reaps it (3 -> 0). That's a runtime-side fix worth its own change; this commit only stops the recovery tests from leaking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Repeated/root verification of the prior fix surfaced two more recovery tests that leak the same detached inner tree (2 procs each, every run): - recovery_with_live_process cleans up via production remove(force=true), which only signals the recorded outer bwrap; - recovery_with_dead_process manually SIGKILLs the recorded outer bwrap. Both leave the inner pid-ns bwrap + shim + VM alive (post-boxlite-ai#851). Arm a ShimReaper (pkill -9 -f <box id>) in both, matching the two pid-file tests, so each reaps its whole tree on drop. Verified as root, repeated: pid-file tests x10 -> 0 leak; full recovery suite x3 -> 9 pass, 0 leak each. Compiles (--features krun,gvproxy); cargo fmt clean. The underlying production leak (stop / rm -f only signal the outer bwrap, so a detached box's inner tree survives; cgroup.kill reaps it) is a separate runtime-side bug, handled in its own draft PR — this commit only keeps the recovery suite from leaking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
902bfc1 to
0c4b13c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Ensure pid-file recovery tests reap detached box process trees they intentionally strand.
Test plan:
pkill -f <box id>leaked 0.cargo fmt --checkcargo test -p boxlite recovery_with_missing_pid_file recovery_with_corrupted_pid_file --features krun,gvproxyas root on the VM test host