ci: prune stale earthly buildkit cache on self-hosted host #1665

Open
skylar-simoncelli wants to merge 4 commits into main from skylar/ci-prune-earthly-buildkit-cache

@skylar-simoncelli

Contributor

Problem

+test-pallet-fixtures (and other large link steps) intermittently fail on the self-hosted runners with:

LLVM ERROR: IO failure on output stream: No space left on device
collect2: fatal error: ld terminated with signal 7 [Bus error], core dumped
ERROR Earthfile:985

Root cause is the shared earthly-buildkitd cache on the self-hosted host. It accumulates build cache across all 8 runner slots — pulled base images (e.g. nixos/nix:latest, observed cached 69× from the compactc submodule build) plus cargo/target cache mounts. The cache_size_pct: 50 cap is a lazy buildkit high-water mark that overshoots (seen at ~70% of the 1.7 TB /var), so the earthly-cache volume creeps toward the disk ceiling and the linker runs out of scratch space mid-build.

A manual buildctl prune --keep-duration=30m reclaimed 764 GB and took /var from 77% to 33%, with zero impact on in-flight builds.
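For anyone reproducing the before/after numbers above, a minimal sketch of reading the usage percentage for a mount point follows. The function names are hypothetical and not part of this PR; it assumes POSIX `df -P` output, where the fifth column of the second line is the capacity.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this PR) for checking disk pressure.

# Read `df -P` output on stdin and print the Capacity column without the '%'.
parse_used_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Print the used percentage of the filesystem that holds the given path.
used_pct() {
  df -P "$1" | parse_used_pct
}

# Succeed when usage of the given path is at or above the given threshold.
over_threshold() {
  [ "$(used_pct "$1")" -ge "$2" ]
}
```

Running `used_pct /var` before and after a manual prune would give the kind of figures quoted above; `over_threshold /var 70` could gate an alert.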

Change

Add an always() step to the local-environment-tests job (self-hosted) that runs buildctl prune --keep-duration=24h against the shared earthly-buildkitd, right after the existing per-slot local-env stack teardown.

  • Concurrency-safe: buildctl prune skips records held by in-flight builds, so it never disturbs concurrent jobs. Worst case for a build is a cache miss (recompile), never a failure.
  • No daemon restart: it does not alter the daemon's settings hash, so it won't docker rm -f / restart the shared buildkitd (which would cancel other jobs).
  • --keep-duration=24h keeps a full day of hot cache so other PRs still benefit.

The host-side reaper timer in shielded-iac (runner role) is the backstop for slots whose job is hard-killed before this step runs — companion PR there.
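As a rough illustration of the step described above, the body of such a `run:` block might look like the sketch below. It is not the workflow's actual contents: the function names, the `DRY_RUN` switch, and the warning text are hypothetical. Only the container name `earthly-buildkitd` and the `buildctl prune --keep-duration=24h` invocation come from this description.

```shell
#!/usr/bin/env bash
# Sketch of a prune step body; names other than the container and the
# buildctl invocation are illustrative assumptions.
BUILDKIT_CONTAINER="${BUILDKIT_CONTAINER:-earthly-buildkitd}"
KEEP_DURATION="${KEEP_DURATION:-24h}"

# Print the prune command as a single line, for logging or dry runs.
prune_command() {
  printf 'docker exec %s buildctl prune --keep-duration=%s\n' \
    "$BUILDKIT_CONTAINER" "$KEEP_DURATION"
}

# Prune only when the shared daemon container exists; never fail the job.
prune_shared_cache() {
  if ! docker inspect "$BUILDKIT_CONTAINER" >/dev/null 2>&1; then
    echo "no $BUILDKIT_CONTAINER container; skipping prune"
    return 0
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    prune_command
    return 0
  fi
  # Emit a GitHub Actions warning annotation instead of hiding a failed prune.
  docker exec "$BUILDKIT_CONTAINER" buildctl prune --keep-duration="$KEEP_DURATION" \
    || echo "::warning::buildctl prune failed; cache was not reclaimed"
}
```

In a workflow, the step would call `prune_shared_cache` under `if: always()` so that it still runs after a failed or cancelled test step. Emitting a warning rather than a bare `|| true` is a design choice of this sketch: the job stays green, but a prune that silently stops working remains visible in the run summary.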

The shared earthly-buildkitd on the self-hosted runners accumulates build
cache (pulled base images such as nixos/nix for compactc, plus cargo/target
cache mounts) across all slots. cache_size_pct is a lazy high-water mark that
overshoots, so the earthly-cache volume creeps toward the 1.7 TB /var ceiling
and large link steps fail with "LLVM ERROR: IO failure on output stream: No
space left on device" (seen on the +test-pallet-fixtures link step).

Add an always() step to the local-environment-tests job that runs
`buildctl prune --keep-duration=24h` against the shared earthly-buildkitd,
alongside the existing per-slot local-env stack teardown. buildctl prune is
concurrency-safe (records held by in-flight builds are skipped) and does not
change the daemon settings hash, so it never restarts buildkitd. The host-side
reaper timer (shielded-iac runner role) is the backstop for hard-killed slots.
skylar-simoncelli requested a review from a team as a code owner June 9, 2026 15:01

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d429bd0c13

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread .github/workflows/continuous-integration.yml

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8a737169a

shell: bash
run: |
if docker inspect earthly-buildkitd >/dev/null 2>&1; then
docker exec earthly-buildkitd buildctl prune --keep-duration=24h || true

P2: Point buildctl at the TCP daemon

On self-hosted runs where Earthly manages earthly-buildkitd with TCP transport (the repo’s self-hosted config disables TLS for that shared daemon), this docker exec invokes buildctl without --addr/BUILDKIT_HOST, so it falls back to the default Unix socket inside the container instead of the daemon’s TCP listener. The prune then fails and || true hides it, leaving the shared cache unbounded in exactly the disk-pressure scenario this step is meant to fix; pass the daemon address explicitly (for example the container-local TCP endpoint). Fresh evidence: the current .earthly/config.selfhosted.yml still configures the shared no-TLS Earthly daemon, while the added command still relies on buildctl defaults.
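A minimal sketch of what passing the address explicitly could look like, using buildctl's global `--addr` option. The endpoint `tcp://127.0.0.1:8372` is an assumption for illustration only and must be replaced with whatever listener the shared daemon actually exposes; the function name is likewise hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: BUILDKIT_ADDR below is an assumed value, not taken from the repo.
BUILDKIT_CONTAINER="${BUILDKIT_CONTAINER:-earthly-buildkitd}"
BUILDKIT_ADDR="${BUILDKIT_ADDR:-tcp://127.0.0.1:8372}"
KEEP_DURATION="${KEEP_DURATION:-24h}"

# Print a prune command that names the daemon address instead of relying
# on buildctl's default socket.
prune_command_with_addr() {
  printf 'docker exec %s buildctl --addr %s prune --keep-duration=%s\n' \
    "$BUILDKIT_CONTAINER" "$BUILDKIT_ADDR" "$KEEP_DURATION"
}
```

Setting `BUILDKIT_HOST` inside the `docker exec` call (for example with `docker exec -e BUILDKIT_HOST=...`) would be an equivalent way to supply the address.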
