Skip to content

Bake Mathlib build cache into CI image#307

Merged
Eduardogbg merged 8 commits into
mainfrom
edu/ci-mathlib-image
Jun 12, 2026
Merged

Bake Mathlib build cache into CI image#307
Eduardogbg merged 8 commits into
mainfrom
edu/ci-mathlib-image

Conversation

@Eduardogbg

Copy link
Copy Markdown
Collaborator

Mathlib cache migrated to Azure and they stopped serving our version's build cache. That and a couple of other reasons (e.g. cache get having a few types of transient errors, github itself being an unreliable mess) makes our CI pipelines very finicky, especially when you consider how long a cacheless Mathlib build takes.

This PR bakes the Mathlib build cache on our CI image. Since we rebuild the image whenever Mathlib (among other core dependencies) changes version, this has basically no downside.

A couple of other changes here like:

  • having another free space step, since regardless of our image, non-self-hosted Github Actions runners include a bunch of unwanted software like Android SDK which occupies a lot of disk space
  • provides lake manifests for the test packages

Comment thread .github/workflows/build-ci-image.yaml Outdated
Comment thread .github/workflows/build-ci-image.yaml Outdated
continue-on-error: true
run: python3 scripts/ci/verify_ci_image.py ghcr.io/reilabs/lampe-ci:${{ steps.versions.outputs.image_tag }}
- name: Free Runner Disk Space
if: steps.check.outcome == 'failure' && inputs.allow_push

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about always removing it if don't need it at all? is the step very slow and it's better to do it only when in need to rebuild the image?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also all other steps share same condition that could be a condition for another job in the pipeline (like run build image job only if there is no image already build).
https://github.com/orgs/community/discussions/25941

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok good suggestion: build-ci-image is now check/build/guard jobs with job-level conditions, and the disk-space step only exists inside the build job, so it only runs when an image actually gets built. did the same with ci.yaml, the per-step cache-hit guards are gone

Comment thread .github/workflows/ci.yaml Outdated
if: ${{ steps.lampe-build-cache.outputs.cache-hit != 'true' }}
working-directory: Lampe
run: lake exe cache get
run: lake exe cache get || echo "cache get failed; falling back to artifacts baked into the CI image"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not always use baked in docker image files?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this command reads the local archive (via an env var pointing to it) and unpackages. it shouldn't hit the network

Ship only the ~0.4GB ltar store at /opt/mathlib-cache:
bake_mathlib_cache.sh fills it from mathlib's cloud cache when it still
serves the pinned rev, else from a full source build plus
'lake exe cache pack', and fails the image build on an incomplete store.
The image tag gains the mathlib rev and a hash of the docker build
context, so editing the recipe or any COPY'd file retags the image.
Container jobs clear the bind-mounted /__t; host runners drop
/opt/hostedtoolcache, the Android SDK, and the .NET tree.
check probes GHCR for the tag with 'docker manifest inspect' (replacing
verify_ci_image.py), build runs under a single job-level condition
(image missing and pushes allowed), and guard fails fork PRs whose
image is missing. ci-noir passes allow_push: true since it only
triggers on schedule and dispatch.
scripts/ci/link_lake_packages.py points a workspace's .lake/packages at
the shared $LAKE_PKG_DIR; test.xsh and the CI jobs invoke the same
helper. The lakefile/manifest rewriters only write when something
actually changes, since CI cache keys hash the rewritten files.
'lake exe cache get' is best-effort: the baked store restores offline
and 'lake build' can always finish from source. Tests run in sorted
order, and the single-purpose helpers in utils.xsh fold into the
regex-based ones.
Checked-in lake-manifest.json files keep lake from re-resolving mathlib
through proven-zk's .git-suffixed URL spelling and discarding existing
clones. MerkleFromScratch drops its packagesDir override in favor of
the shared packages symlink.
A keys job computes every cache key in one place and probes them with
lookup-only restores; the rustfmt/clippy/test/lean/stdlib/e2e jobs skip
at the job level on an exact hit, with !failure() && !cancelled()
guards on jobs downstream of skippable ones. Lean jobs share dependency
clones via $LAKE_PKG_DIR and restore builds through restore-keys
fallbacks, so lake builds stay incremental on partial hits. The CLI
travels between jobs only as an artifact (upload zips drop the exec
bit, hence the chmod steps). The unused CI_IMAGE env var is gone.
@Eduardogbg Eduardogbg force-pushed the edu/ci-mathlib-image branch from 93093e4 to 60542d4 Compare June 11, 2026 19:18

@piohei piohei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good work!

@Eduardogbg Eduardogbg merged commit ffe9dca into main Jun 12, 2026
13 checks passed
@Eduardogbg Eduardogbg deleted the edu/ci-mathlib-image branch June 12, 2026 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants