From 5036d22a02c33befbdc0ce5930ba2c61d3e9b341 Mon Sep 17 00:00:00 2001
From: Kris Kowal
Date: Thu, 14 May 2026 20:29:44 -0700
Subject: [PATCH] ci(ocapn-guile-interop): cache the Guix runtime store across runs

Iteration III of the guix-CI resilience pattern. PR #82 (iteration I)
established the two-substitute-server pattern and the installer-tarball
cache; PR #255 (iteration II) reordered the substitute URLs and widened
the polling and timeout windows after a Bordeaux outage exposed the
"first server unreachable" slow path.

The 2026-05-14 outage exposed a third failure mode that neither prior
iteration addressed: both substitute servers degraded simultaneously,
so the daemon could not resolve the runtime closure (guile + fibers +
websocket + gnutls + gcrypt) from either upstream. Reordering and wider
timeouts did not help because no amount of waiting brings up a server
that is down.

The existing cache step amortizes only the installer tarball, not the
runtime store the daemon resolves on every run. Each run pays the full
substitute-fetch cost end-to-end, and a both-servers-degraded day means
that cost cannot be paid at all.

This change adds a second `actions/cache` step that caches the daemon's
runtime store (`/gnu/store`) together with the daemon database
(`/var/guix/db`). Both paths are root-owned with strict permissions, so
the cache targets a runner-owned staging directory containing a zstd
tarball of the two store paths; paired `sudo tar` shell steps wrap the
actual extract and create, matching the install step's existing
`sudo tar --extract` pattern. On a cache hit the daemon's next
`guix build` finds the resolved closure already on disk and the daemon
DB already records it as valid, so the substitute round-trip is
short-circuited entirely. A degraded-substitute-server day no longer
blocks the workflow's runtime path.

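As a rough illustration of the staging pattern described above, the
following sketch runs the same create/extract round trip against
scratch directories instead of `/gnu/store` and `/var/guix/db`, so it
needs neither `sudo` nor a Guix install. The scratch paths, the
`abc-guile` item, and the uncompressed `store.tar` are assumptions made
for the sketch; the workflow itself uses `sudo tar` with `--zstd` on
create and `--no-overwrite-dir` on extract.

```shell
set -eu

# Hypothetical scratch roots standing in for the real filesystem root,
# the runner-owned staging directory, and a freshly installed machine.
root="$(mktemp -d)"
staging="$(mktemp -d)"
restored="$(mktemp -d)"

# Fake store item and daemon database (illustrative contents only).
mkdir -p "$root/gnu/store/abc-guile" "$root/var/guix/db"
printf 'closure\n' > "$root/gnu/store/abc-guile/item"
printf 'valid paths\n' > "$root/var/guix/db/db.sqlite"

# Snapshot: one tarball holding both trees, written into staging.
tar --create --file "$staging/store.tar" --directory "$root" \
  gnu/store var/guix/db

# Restore: extract on top of directories that already exist, the way
# the installer has already laid down /gnu and /var in the workflow.
mkdir -p "$restored/gnu" "$restored/var"
tar --extract --file "$staging/store.tar" --directory "$restored"

cat "$restored/gnu/store/abc-guile/item"
```

Both trees travel in a single archive, which is what lets a restore
bring back the store paths and the database that records them together.
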
The cache key includes the pinned Guix version (a version bump may
change the on-disk DB schema) and a hash of the workflow file (the
package set and daemon configuration both live in the workflow, so any
change to either forces a fresh snapshot). A `restore-keys` prefix lets
a workflow edit that does not actually invalidate the store (a comment
tweak, a timeout bump) still seed from the prior snapshot and re-save
at job end.

The daemon is stopped across the restore extract so the on-disk store
and the daemon's in-memory view of it cannot diverge mid-flight; a
divergence would surface later as missing-store-item errors from
`guix build`. The snapshot step at the end leaves the daemon running:
tar reads the SQLite DB at the filesystem layer, which yields a usable
copy as long as the daemon is idle while tar runs.

The change is additive to iteration II's reorder + widen; both
mitigations stay in place and remain load-bearing for the
one-server-degraded case where the cache key changes (a workflow edit
that flushes the snapshot leaves the daemon dependent on substitute
fetches for the next run).
---
 .github/workflows/ocapn-guile-interop.yml | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/.github/workflows/ocapn-guile-interop.yml b/.github/workflows/ocapn-guile-interop.yml
index ae02fc24b3..71cd274148 100644
--- a/.github/workflows/ocapn-guile-interop.yml
+++ b/.github/workflows/ocapn-guile-interop.yml
@@ -98,6 +98,41 @@ jobs:
           # Cache key includes the version and sha256 so a version
           # bump or upstream re-publish forces a fresh download.
           key: guix-binary-x86_64-linux-${{ env.GUIX_VERSION }}-${{ env.GUIX_TARBALL_SHA256 }}
+      - name: Restore Guix store cache
+        # Iteration III of the guix-CI resilience pattern (#82 was I,
+        # #255 was II). The tarball cache above amortizes the
+        # installer download; this second cache amortizes the
+        # *runtime store* the daemon resolves on every run. Without
+        # it, both substitute servers being degraded means each
+        # `guix build` re-fetches the full guile + fibers + websocket
+        # + gnutls + gcrypt closure end-to-end. With it, the daemon
+        # finds the resolved store paths already present and skips
+        # the substitute round-trip, so a degraded-server
+        # day no longer blocks the workflow's runtime path.
+        #
+        # `/gnu/store` and `/var/guix/db` are both root-owned with
+        # strict permissions, so `actions/cache` (which runs as
+        # `runner`) cannot read or write them directly. We cache a
+        # runner-owned staging directory containing a zstd tarball
+        # of the two store paths; the *Restore* and *Snapshot* shell
+        # steps below wrap the actual `sudo tar` extract and create.
+        # This mirrors the install step's existing
+        # `sudo tar --extract` of the installer tarball.
+        id: guix-store-cache
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830 # v4.3.0
+        with:
+          path: ~/guix-store-cache
+          # The key includes the pinned Guix version (a version bump
+          # may change the on-disk DB schema) and a hash of the
+          # workflow file (the package set and daemon configuration
+          # both live here, so any change to either forces a fresh
+          # snapshot). `restore-keys` lets a workflow edit that does
+          # not actually invalidate the store (a comment tweak, a
+          # timeout bump) still seed from the prior snapshot and
+          # re-save at job end.
+          key: guix-store-${{ env.GUIX_VERSION }}-${{ hashFiles('.github/workflows/ocapn-guile-interop.yml') }}
+          restore-keys: |
+            guix-store-${{ env.GUIX_VERSION }}-
       - name: Download Guix stable tarball
         # `ftp.gnu.org/gnu/guix/` is the GNU project's canonical
         # mirror for release binaries. It is operationally independent
@@ -191,6 +226,38 @@ jobs:
           sudo systemctl enable --now gnu-store.mount guix-daemon.service
           echo "$GUIX_PATH/bin" >> "$GITHUB_PATH"
 
+      - name: Restore Guix store from cache snapshot
+        # Paired with the `Restore Guix store cache` step above. When
+        # a prior run's snapshot is present, extract it on top of the
+        # just-installed store so the daemon's next `guix build`
+        # finds the resolved closure already on disk and skips the
+        # substitute-server round-trip entirely. `--no-overwrite-dir`
+        # preserves the installer-laid `/gnu` and `/var` directory
+        # entries themselves. Files inside those directories overlap
+        # in two ways: store paths that the installer ships are
+        # byte-identical to the same store paths in the snapshot
+        # (Guix store contents are content-addressed by hash), so
+        # the default overwrite is safe; the daemon database under
+        # `/var/guix/db` in the snapshot is a strict superset of the
+        # installer's blank DB (it records every store path the
+        # daemon resolved on the prior run), so the overwrite is
+        # exactly what makes the cache effective.
+        #
+        # The daemon is stopped across the extract so the on-disk
+        # store and the daemon's in-memory view of it cannot diverge
+        # mid-flight. A divergence would surface later as
+        # missing-store-item errors from `guix build`.
+        if: steps.guix-store-cache.outputs.cache-matched-key != ''
+        run: |
+          set -euo pipefail
+          if [ ! -f ~/guix-store-cache/store.tar.zst ]; then
+            echo "cache key matched but no snapshot present at ~/guix-store-cache/store.tar.zst"
+            exit 0
+          fi
+          sudo systemctl stop guix-daemon.service
+          sudo tar --extract --file ~/guix-store-cache/store.tar.zst \
+            --directory / --no-overwrite-dir
+          sudo systemctl start guix-daemon.service
       - name: Authorize build farm
         # Authorizes both ci.guix.gnu.org and bordeaux.guix.gnu.org;
         # both .pub files ship inside the Guix tarball under
@@ -268,6 +335,37 @@ jobs:
           } > "$scratch"
           cat "$scratch" >> "$GITHUB_ENV"
 
+      - name: Snapshot Guix store for cache
+        # Paired with the `Restore Guix store cache` step at the top
+        # of the job. Snapshots `/gnu/store` and `/var/guix/db`
+        # together so a future run restores a coherent view: the
+        # store paths exist on disk *and* the daemon's DB records
+        # them as valid. Caching `/gnu/store` alone would leave the
+        # daemon re-resolving substitutes anyway because it would
+        # not trust the unrecorded paths.
+        #
+        # Skipped on an exact-key cache hit, where the snapshot we
+        # would write matches the one we restored. Otherwise the
+        # post-job save in `actions/cache` picks up the directory
+        # contents (the tarball we just wrote) and uploads it under
+        # the current key.
+        if: steps.guix-store-cache.outputs.cache-hit != 'true'
+        run: |
+          set -euo pipefail
+          mkdir -p ~/guix-store-cache
+          # zstd compression keeps the snapshot well inside the 10 GB
+          # per-repo cache budget; the test's runtime closure is on
+          # the order of a few hundred megabytes uncompressed. The
+          # daemon is left running for the snapshot: tar reads the
+          # SQLite DB at the filesystem layer, which yields a usable
+          # copy as long as the daemon is idle while tar runs.
+          sudo tar --create --file ~/guix-store-cache/store.tar.zst --zstd \
+            /gnu/store /var/guix/db
+          # `actions/cache` runs as the `runner` user and cannot
+          # upload a root-owned file. Hand the tarball to the runner
+          # so the post-job save sees a readable path.
+          sudo chown -R "$(id -u):$(id -g)" ~/guix-store-cache
+
       - name: Start Guile goblin-chat host
         working-directory: endo
         env: