Clean up orphaned VM data dirs from crashed VMs #198
Open
JAORMX wants to merge 1 commit into
When a VM crashes or the bbox process is killed (SIGKILL, OOM) before WithCleanDataDir() runs, the COW-cloned rootfs under ~/.config/broodbox/vms/<name>/data/rootfs-work/ survives, potentially hundreds of MB per orphan, accumulating unbounded. CleanupStaleLogs already removes a whole VM directory when its bbox sentinel's PID is dead, but no sentinel is written for runs that use --log-file, so those data clones leak.

Add CleanupStaleVMData, keyed off go-microvm's per-VM state file (state.Manager.Load) rather than the sentinel: a data dir whose state is still active but whose runner PID is dead was orphaned and is reclaimed. A live runner PID is left untouched, so it is safe to run at startup alongside concurrent VMs. After removing the data dir it best-effort removes the now-empty parent (reclaiming sentinel-less --log-file runs) while leaving parents that still hold logs to CleanupStaleLogs.

Wire it into the composition root next to CleanupStaleLogs and CleanupStaleSnapshots.

Fixes #87
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VP887qH8BMW4PMUXBuEqGc
Summary
Fixes #87 — orphaned COW rootfs clones from crashed VMs leak disk indefinitely.
When a VM crashes or bbox is killed (SIGKILL, OOM) before WithCleanDataDir() runs, the COW-cloned rootfs at ~/.config/broodbox/vms/<name>/data/rootfs-work/ survives, potentially hundreds of MB per orphan, accumulating unbounded across crashed runs. CleanupStaleLogs already removes a whole VM directory when its bbox sentinel PID is dead, but no sentinel is written for runs that pass --log-file, so those data clones leak.
Approach
Add CleanupStaleVMData, keyed off go-microvm's per-VM state file (state.Manager.Load(), which reads go-microvm-state.json without locking) rather than the bbox sentinel:
- Scan ~/.config/broodbox/vms/*/data/.
- A data dir whose state is active: true but whose recorded runner PID is dead was orphaned → remove it.
- A live runner PID is left untouched, so the cleanup is safe to run at startup alongside concurrent VMs.
- After removing the data dir, best-effort remove the now-empty parent (reclaiming sentinel-less --log-file runs) via a non-recursive Remove, which only succeeds on an empty dir, so it never deletes lingering logs or a sentinel that CleanupStaleLogs still owns.
Wired into the composition root next to CleanupStaleLogs / CleanupStaleSnapshots.
Tests
New table of cases in cleanup_test.go (state fixtures written via the canonical state package).
Verified: task lint (0 issues), full task test passes (incl. -race).
🤖 Generated with Claude Code