Skip to content

test(recovery): reap the whole detached box tree in pid-file recovery tests#870

Draft
G4614 wants to merge 4 commits into
boxlite-ai:mainfrom
G4614:test/plug-integration-box-leaks
Draft

test(recovery): reap the whole detached box tree in pid-file recovery tests#870
G4614 wants to merge 4 commits into
boxlite-ai:mainfrom
G4614:test/plug-integration-box-leaks

Conversation

@G4614

@G4614 G4614 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Ensure pid-file recovery tests reap detached box process trees they intentionally strand.

Test plan:

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8bbed74b-a4b4-4299-b6fa-9a3405c263e7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

G4614 added a commit to G4614/boxlite that referenced this pull request Jun 26, 2026
…e-ai#870 reaper)

recovery_with_{missing,corrupted}_pid_file deliberately destroy a live detached
box's shim.pid to exercise recovery's `Absent` path. Recovery then clears the
box's pid, so the teardown `remove(force)` has no pid to signal — leaking the
live shim. It's a real on-pass orphan, and because shim.pid is gone,
PerTestBoxHome's leak guard can't even see it. Reproduced directly on a host:
every CLI command returns success yet the shim keeps running.

Fix in the test, where the bug is: capture the recorded pid before destroying
the file and SIGKILL it on drop (pass or panic) via a ShimReaper guard. Killing
the outer bwrap pid tears down the box's pid namespace and the whole VM tree.

This replaces the out-of-process reaper approach from this branch's earlier
commit (scripts/test/reap_boxes.sh + make/test.mk EXIT traps +
BOXLITE_TEST_HOME_BASE corral). That approach masked leaks at suite exit
instead of fixing the tests that cause them — at odds with PerTestBoxHome's
deliberate assert-don't-reap design. Those changes are reverted here; the net
effect on the branch is just the recovery.rs fix.

Audited every make test:integration suite — recovery.rs holds the only two
tests of this kind:
- pid_file.rs / sigstop_quiesce.rs / detach.rs reap via valid-pid stop/remove;
- health_check.rs already reaps its survivor in teardown;
- Python detached tests call box.stop() (kills the shim);
- Node/C start no detached boxes.

Verified: recovery test compiles (--features krun,gvproxy); cargo fmt clean. The
leak and the SIGKILL-reap were both demonstrated on this host via a stand-in
shim. The full cargo recovery run needs a KVM host with NPROC headroom — boxes
can't boot here (jailer RLIMIT_NPROC=100 collides with the host's existing uid
threads) — so that run was not performed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@G4614 G4614 changed the title test(integration): reap leaked detached boxes on suite exit (timeout-safe) test(recovery): reap stranded shim in pid-file recovery tests Jun 26, 2026
@G4614 G4614 changed the title test(recovery): reap stranded shim in pid-file recovery tests test(recovery): reap the whole detached box tree in pid-file recovery tests Jun 26, 2026
G4614 and others added 4 commits June 26, 2026 14:13
…safe)

`make test:integration` / `test:stress:disk` could leak live detached shims
onto the runner. Since boxlite-ai#851 detached boxes drop bwrap's `--die-with-parent`,
so a leaked shim outlives the test binary, `make`, and the CI job — the
run-over-run accumulation that poisons the self-hosted e2e runner (boxlite-ai#137/boxlite-ai#141).

The existing guards cannot catch the dominant failure mode. The nextest `vm`
profile uses `slow-timeout = { terminate-after }`, so a hung test is SIGKILL'd
— bypassing every in-process cleanup: Rust `PerTestBoxHome` / `TestContext` /
`BoxCleanup` Drop, pytest fixtures, vitest `afterAll`. boxlite-ai#860 only added a
panic-time sweep to the CLI harness (unwind only, not timeout); Rust core,
Python, and the detached-heavy `stress_disk` had no out-of-process backstop.

Plug it where it survives a SIGKILL — outside the test process:

- scripts/test/reap_boxes.sh: home-scoped reaper. Kills shims by
  `<home>/**/boxes/<id>/shim.pid`, plus an argv-scan catch-all for shims
  whose pid file was deleted/corrupted (recovery marks them Stopped +
  pid=None, so `rm --force` has no pid to signal). Only touches processes
  under the given per-run root — never a blanket `pkill boxlite-shim`, so a
  developer's real boxes are safe. Idempotent, best-effort, exit 0.
- make/test.mk: every box-spawning integration recipe (rust, cli, python,
  node, stress:disk) corrals its homes under a per-run `BOXLITE_TEST_HOME_BASE`
  temp dir and runs `trap 'reap_boxes.sh <root>; rm -rf <root>' EXIT`, firing
  on pass, failure, timeout, and interrupt. (C SDK tests use non-detached
  boxes — `--die-with-parent` reaps them with the test process — so no trap.)
- test-utils/home.rs: `PerTestBoxHome::{new,isolated}` honor
  `BOXLITE_TEST_HOME_BASE` so rust/cli per-test homes land under the sweepable
  root (default `/tmp` unchanged). Drop's deliberate assert-don't-reap
  behavior is untouched.
- node integration-setup.ts: honor the same env as its home root.

Verified: reaper functionally tested (precise + corrupt-pid catch-all kill,
unrelated process survives, idempotent); recipe corral+trap simulated
end-to-end (leaked shim reaped, root removed); `cargo check`/`fmt` and the 4
`home.rs` unit tests (incl. `drop_panics_if_shim_still_alive`) pass. The VM
integration suites themselves need a KVM runner and are not run here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-ai#870 reaper)

recovery_with_{missing,corrupted}_pid_file deliberately destroy a live detached
box's shim.pid to exercise recovery's `Absent` path. Recovery then clears the
box's pid, so the teardown `remove(force)` has no pid to signal — leaking the
live shim. It's a real on-pass orphan, and because shim.pid is gone,
PerTestBoxHome's leak guard can't even see it. Reproduced directly on a host:
every CLI command returns success yet the shim keeps running.

Fix in the test, where the bug is: capture the recorded pid before destroying
the file and SIGKILL it on drop (pass or panic) via a ShimReaper guard. Killing
the outer bwrap pid tears down the box's pid namespace and the whole VM tree.

This replaces the out-of-process reaper approach from this branch's earlier
commit (scripts/test/reap_boxes.sh + make/test.mk EXIT traps +
BOXLITE_TEST_HOME_BASE corral). That approach masked leaks at suite exit
instead of fixing the tests that cause them — at odds with PerTestBoxHome's
deliberate assert-don't-reap design. Those changes are reverted here; the net
effect on the branch is just the recovery.rs fix.

Audited every make test:integration suite — recovery.rs holds the only two
tests of this kind:
- pid_file.rs / sigstop_quiesce.rs / detach.rs reap via valid-pid stop/remove;
- health_check.rs already reaps its survivor in teardown;
- Python detached tests call box.stop() (kills the shim);
- Node/C start no detached boxes.

Verified: recovery test compiles (--features krun,gvproxy); cargo fmt clean. The
leak and the SIGKILL-reap were both demonstrated on this host via a stand-in
shim. The full cargo recovery run needs a KVM host with NPROC headroom — boxes
can't boot here (jailer RLIMIT_NPROC=100 collides with the host's existing uid
threads) — so that run was not performed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed pid

Correction to the previous commit on this branch. The first ShimReaper
SIGKILL'd the pid captured from shim.pid — but that records the *outer* bwrap
launcher, and since boxlite-ai#851 dropped --die-with-parent, killing it leaves the inner
pid-ns tree (inner bwrap + shim + VM) alive. Running the test as root confirmed
it: both tests passed yet 2 procs survived each box.

Reap by box id instead: every box process carries the unique id in its bwrap
bind paths, so `pkill -9 -f <id>` tears down the whole tree. ShimReaper now
holds the box id (armed before the test strands the box) and reaps on drop —
pass or panic.

Verified as root on this host (boxes can boot once euid 0 bypasses the jailer's
RLIMIT_NPROC ceiling), two-sided:
  - recorded-pid kill: 2 tests pass, 2 box procs leak;
  - pkill-by-id:       2 tests pass, 0 box procs leak.
Compiles (--features krun,gvproxy); cargo fmt clean.

Note (separate, not fixed here): the same root run showed production
`boxlite stop` / `rm -f` also strand a detached box's inner tree — they call
`kill_process(recorded_pid)`, which only signals the outer bwrap. `cgroup.kill`
on the box's cgroup reaps it (3 -> 0). That's a runtime-side fix worth its own
change; this commit only stops the recovery tests from leaking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Repeated/root verification of the prior fix surfaced two more recovery tests
that leak the same detached inner tree (2 procs each, every run):
- recovery_with_live_process cleans up via production remove(force=true), which
  only signals the recorded outer bwrap;
- recovery_with_dead_process manually SIGKILLs the recorded outer bwrap.
Both leave the inner pid-ns bwrap + shim + VM alive (post-boxlite-ai#851).

Arm a ShimReaper (pkill -9 -f <box id>) in both, matching the two pid-file
tests, so each reaps its whole tree on drop.

Verified as root, repeated: pid-file tests x10 -> 0 leak; full recovery suite
x3 -> 9 pass, 0 leak each. Compiles (--features krun,gvproxy); cargo fmt clean.

The underlying production leak (stop / rm -f only signal the outer bwrap, so a
detached box's inner tree survives; cgroup.kill reaps it) is a separate
runtime-side bug, handled in its own draft PR — this commit only keeps the
recovery suite from leaking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the test/plug-integration-box-leaks branch from 902bfc1 to 0c4b13c Compare June 26, 2026 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant