Clean up orphaned VM data dirs from crashed VMs #198

Open
JAORMX wants to merge 1 commit into main from fix/87-cleanup-orphaned-vm-data

@JAORMX JAORMX commented Jun 26, 2026

Contributor

Summary

Fixes #87: orphaned COW rootfs clones from crashed VMs leak disk space indefinitely.

When a VM crashes or bbox is killed (SIGKILL, OOM) before WithCleanDataDir() runs, the COW-cloned rootfs at ~/.config/broodbox/vms/<name>/data/rootfs-work/ survives. Each orphan can hold hundreds of MB, and orphans accumulate without bound across crashed runs.

CleanupStaleLogs already removes a whole VM directory when its bbox sentinel PID is dead, but no sentinel is written for runs that pass --log-file, so those data clones leak.

Approach

Add CleanupStaleVMData, keyed off go-microvm's per-VM state file (state.Manager.Load(), which reads go-microvm-state.json without locking) rather than the bbox sentinel:

  • Scan ~/.config/broodbox/vms/*/data/.
  • A data dir whose state is active: true but whose recorded runner PID is dead was orphaned → remove it.
  • A live runner PID is left untouched, so it's safe to run at startup alongside concurrent VMs (each VM has its own uniquely named dir). A recycled PID now hosting an unrelated live process is conservatively treated as alive and skipped; the orphan is reclaimed on a later run.
  • After removing the data dir, it makes a best-effort attempt to remove the now-empty parent (reclaiming sentinel-less --log-file runs) with a non-recursive Remove. That call succeeds only on an empty dir, so it never deletes lingering logs or a sentinel that CleanupStaleLogs still owns.

Wired into the composition root next to CleanupStaleLogs / CleanupStaleSnapshots.

Note: the issue referenced state.json; the actual go-microvm state file is go-microvm-state.json. Reusing state.Manager.Load() means we don't hardcode the filename or schema.

Tests

New table of cases in cleanup_test.go (state fixtures written via the canonical state package):

  • orphaned (active + dead PID) → data dir and empty parent removed
  • live runner (active + our own PID) → preserved
  • inactive state (clean shutdown) → preserved
  • orphan with parent logs → only data dir removed, parent left for log cleanup
  • no state file → preserved
  • nonexistent vms dir → no panic
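
The state-based cases above reduce to a small decision table. The sketch below is illustrative only: `vmState` is a hypothetical stand-in for whatever go-microvm's state package actually returns, and `isOrphan` and `orphanCases` are names invented here. It covers only the remove-or-preserve decision, not the filesystem behavior (parent logs, missing vms dir).

```go
package main

// vmState is a hypothetical stand-in for the per-VM state loaded from the
// go-microvm state file; the real type and field names may differ.
type vmState struct {
	Active bool
	PID    int
}

// isOrphan reports whether a data dir should be reclaimed. A nil state
// means no state file was found, which is treated as "preserve".
func isOrphan(st *vmState, alive func(pid int) bool) bool {
	return st != nil && st.Active && !alive(st.PID)
}

// orphanCases mirrors the state-based scenarios listed above.
var orphanCases = []struct {
	name     string
	st       *vmState
	pidAlive bool
	want     bool
}{
	{"orphaned (active + dead PID)", &vmState{Active: true, PID: 4242}, false, true},
	{"live runner (active + live PID)", &vmState{Active: true, PID: 4242}, true, false},
	{"inactive state (clean shutdown)", &vmState{Active: false, PID: 4242}, false, false},
	{"no state file", nil, false, false},
}
```

Injecting the liveness check as a function keeps the decision testable without depending on real process IDs.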

Verified: task lint (0 issues), full task test passes (incl. -race).

🤖 Generated with Claude Code

When a VM crashes or the bbox process is killed (SIGKILL, OOM) before
WithCleanDataDir() runs, the COW-cloned rootfs under
~/.config/broodbox/vms/<name>/data/rootfs-work/ survives — potentially
hundreds of MB per orphan, accumulating unbounded.

CleanupStaleLogs already removes a whole VM directory when its bbox
sentinel's PID is dead, but no sentinel is written for runs that use
--log-file, so those data clones leak.

Add CleanupStaleVMData, keyed off go-microvm's per-VM state file
(state.Manager.Load) rather than the sentinel: a data dir whose state is
still active but whose runner PID is dead was orphaned and is reclaimed.
A live runner PID is left untouched, so it is safe to run at startup
alongside concurrent VMs. After removing the data dir it best-effort
removes the now-empty parent (reclaiming sentinel-less --log-file runs)
while leaving parents that still hold logs to CleanupStaleLogs.

Wire it into the composition root next to CleanupStaleLogs and
CleanupStaleSnapshots.

Fixes #87

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VP887qH8BMW4PMUXBuEqGc
Development

Successfully merging this pull request may close these issues.

Clean up orphaned rootfs-work dirs from crashed VMs