ci: prune stale earthly buildkit cache on self-hosted host #1665
skylar-simoncelli wants to merge 4 commits into
The shared earthly-buildkitd on the self-hosted runners accumulates build cache (pulled base images such as nixos/nix for compactc, plus cargo/target cache mounts) across all slots. cache_size_pct is a lazy high-water mark that overshoots, so the earthly-cache volume creeps toward the 1.7 TB /var ceiling and large link steps fail with "LLVM ERROR: IO failure on output stream: No space left on device" (seen on the +test-pallet-fixtures link step). Add an always() step to the local-environment-tests job that runs `buildctl prune --keep-duration=24h` against the shared earthly-buildkitd, alongside the existing per-slot local-env stack teardown. buildctl prune is concurrency-safe (records held by in-flight builds are skipped) and does not change the daemon settings hash, so it never restarts buildkitd. The host-side reaper timer (shielded-iac runner role) is the backstop for hard-killed slots.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d429bd0c13
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8a737169a
    shell: bash
    run: |
      if docker inspect earthly-buildkitd >/dev/null 2>&1; then
        docker exec earthly-buildkitd buildctl prune --keep-duration=24h || true
Point buildctl at the TCP daemon
On self-hosted runs where Earthly manages earthly-buildkitd with TCP transport (the repo’s self-hosted config disables TLS for that shared daemon), this docker exec invokes buildctl without --addr/BUILDKIT_HOST, so it falls back to the default Unix socket inside the container instead of the daemon’s TCP listener. The prune then fails and || true hides it, leaving the shared cache unbounded in exactly the disk-pressure scenario this step is meant to fix; pass the daemon address explicitly (for example the container-local TCP endpoint). Fresh evidence: the current .earthly/config.selfhosted.yml still configures the shared no-TLS Earthly daemon, while the added command still relies on buildctl defaults.
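One way to act on this suggestion is sketched below. It passes the daemon address through buildctl's global `--addr` flag and reports a failed prune as a workflow warning instead of swallowing it with `|| true`. The helper name `prune_shared_cache`, the `BUILDKIT_ADDR` and `KEEP_DURATION` variables, and the default endpoint `tcp://127.0.0.1:8372` are assumptions made for illustration; the real listener address should be taken from the daemon's configuration.

```shell
# Assumed container-local TCP endpoint of the shared daemon; adjust as needed.
BUILDKIT_ADDR="${BUILDKIT_ADDR:-tcp://127.0.0.1:8372}"
# Retention window, matching the value used in this PR.
KEEP_DURATION="${KEEP_DURATION:-24h}"

# Hypothetical helper: prune the shared cache via an explicit address.
prune_shared_cache() {
  # Nothing to do if the shared daemon container does not exist on this host.
  if ! docker inspect earthly-buildkitd >/dev/null 2>&1; then
    echo "earthly-buildkitd not found; skipping prune"
    return 0
  fi
  # Surface a failed prune as a GitHub Actions warning rather than hiding it.
  if ! docker exec earthly-buildkitd \
      buildctl --addr "$BUILDKIT_ADDR" prune --keep-duration="$KEEP_DURATION"; then
    echo "::warning::buildctl prune failed against $BUILDKIT_ADDR"
  fi
  # Cleanup is best-effort, so the job result is never changed by this step.
  return 0
}
```

Calling `prune_shared_cache` from the `always()` step keeps the step non-fatal while making a misdirected prune visible in the job log.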
Problem
`+test-pallet-fixtures` (and other large link steps) intermittently fail on the self-hosted runners with `LLVM ERROR: IO failure on output stream: No space left on device`.

Root cause is the shared `earthly-buildkitd` cache on the self-hosted host. It accumulates build cache across all 8 runner slots: pulled base images (e.g. `nixos/nix:latest`, observed cached 69× from the compactc submodule build) plus cargo/target cache mounts. The `cache_size_pct: 50` cap is a lazy buildkit high-water mark that overshoots (seen at ~70% of the 1.7 TB `/var`), so the `earthly-cache` volume creeps toward the disk ceiling and the linker runs out of scratch space mid-build.

A manual `buildctl prune --keep-duration=30m` reclaimed 764 GB and took `/var` from 77% to 33%, with zero impact on in-flight builds.

Change
Add an `always()` step to the `local-environment-tests` job (self-hosted) that runs `buildctl prune --keep-duration=24h` against the shared `earthly-buildkitd`, right after the existing per-slot local-env stack teardown.

- `buildctl prune` skips records held by in-flight builds, so it never disturbs concurrent jobs. Worst case for a build is a cache miss (recompile), never a failure.
- It does not `docker rm -f` / restart the shared buildkitd (which would cancel other jobs).
- `--keep-duration=24h` keeps a full day of hot cache so other PRs still benefit.

The host-side reaper timer in `shielded-iac` (runner role) is the backstop for slots whose job is hard-killed before this step runs; a companion PR covers it there.
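To check the effect of a prune on a runner, the used percentage of the volume can be sampled before and after, which is how figures like the 77% to 33% drop above can be reproduced. This is a minimal sketch: the helper name `disk_used_pct` and the `MOUNT` variable are illustrative, and it assumes the filesystem name reported by `df` contains no spaces.

```shell
# Print the used percentage (integer, no % sign) of the filesystem holding a path.
disk_used_pct() {
  # -P forces the portable one-line-per-filesystem format; column 5 is capacity.
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# MOUNT is an assumption; the PR reports its figures for /var.
MOUNT="${MOUNT:-/var}"

before=$(disk_used_pct "$MOUNT")
echo "used before prune: ${before}%"

# ... run the prune step here ...

after=$(disk_used_pct "$MOUNT")
echo "used after prune: ${after}%"
```

Logging both values in the cleanup step would make it easy to see whether the 24h retention window keeps the volume below the level at which link steps start to fail.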