From d429bd0c13c8416c3bf2c595a3d526d4ba62cbf3 Mon Sep 17 00:00:00 2001
From: Skylar Simoncelli
Date: Tue, 9 Jun 2026 16:00:57 +0100
Subject: [PATCH] ci: prune stale earthly buildkit cache on self-hosted host

The shared earthly-buildkitd on the self-hosted runners accumulates
build cache (pulled base images such as nixos/nix for compactc, plus
cargo/target cache mounts) across all slots. cache_size_pct is a lazy
high-water mark that overshoots, so the earthly-cache volume creeps
toward the 1.7 TB /var ceiling and large link steps fail with
"LLVM ERROR: IO failure on output stream: No space left on device"
(seen on the +test-pallet-fixtures link step).

Add an always() step to the local-environment-tests job that runs
`buildctl prune --keep-duration=24h` against the shared
earthly-buildkitd, alongside the existing per-slot local-env stack
teardown.

buildctl prune is concurrency-safe (records held by in-flight builds
are skipped) and does not change the daemon settings hash, so it never
restarts buildkitd. The host-side reaper timer (shielded-iac runner
role) is the backstop for hard-killed slots.
---
 .github/workflows/continuous-integration.yml | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/.github/workflows/continuous-integration.yml b/.github/workflows/continuous-integration.yml
index fffe26cf8..1c0e90e2d 100644
--- a/.github/workflows/continuous-integration.yml
+++ b/.github/workflows/continuous-integration.yml
@@ -1166,6 +1166,24 @@ jobs:
           docker volume ls -q --filter "label=com.docker.compose.project=${proj}" \
             | xargs -r docker volume rm -f || true
 
+      # Bound the shared earthly-buildkitd cache. It accumulates pulled base
+      # images and cache mounts (e.g. nixos/nix for compactc) across all slots
+      # on the self-hosted host, and is NOT covered by the local-env teardown
+      # above. Left unbounded it creeps toward the 1.7 TB /var ceiling and large
+      # link steps die with "No space left on device". buildctl prune is
+      # concurrency-safe (it skips records held by in-flight builds), and
+      # --keep-duration keeps a day of hot cache so other PRs still hit it. This
+      # does not change the daemon's settings hash, so it never restarts the
+      # shared buildkitd. The host-side reaper timer (shielded-iac runner role)
+      # is the backstop for slots whose job is hard-killed before this runs.
+      - name: Prune stale earthly buildkit cache (defensive)
+        if: always()
+        shell: bash
+        run: |
+          if docker inspect earthly-buildkitd >/dev/null 2>&1; then
+            docker exec earthly-buildkitd buildctl prune --keep-duration=24h || true
+          fi
+
       - uses: ./.github/actions/tree-cache-guard/save
         if: steps.guard.outputs.hit != 'true'
         with:
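
Reviewer note (illustrative only, not part of the patch): the guard-then-prune logic of the new step can be sketched as a small shell function, which makes the two branches easier to exercise outside CI. The function name `prune_buildkit_cache` and its positional parameters are hypothetical; only `docker inspect`, `docker exec`, and `buildctl prune --keep-duration` are taken from the step itself.

```shell
# Hypothetical helper mirroring the workflow step; not present in the repo.
# $1: buildkitd container name (defaults to the one used in the patch)
# $2: retention window passed to --keep-duration (defaults to 24h)
prune_buildkit_cache() {
  local container="${1:-earthly-buildkitd}"
  local keep="${2:-24h}"

  # Skip quietly when the shared daemon container does not exist on this host.
  if ! docker inspect "$container" >/dev/null 2>&1; then
    echo "skip: ${container} not found"
    return 0
  fi

  # Best effort: a failed prune must never fail the calling job.
  docker exec "$container" buildctl prune --keep-duration="$keep" || true
}
```

Calling `prune_buildkit_cache` with no arguments would behave like the step in the diff; passing a second argument such as `48h` would keep two days of cache instead. Because the function always returns success, it keeps the defensive character of the original `|| true`.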