osdc/hf-cache: shared HuggingFace model cache for OSDC runners (gated, not enabled)#809
Closed
huydhn wants to merge 1 commit into
…enabled)
Adds a new `hf-cache` module giving OSDC runners a shared, read-only
HuggingFace model cache at /mnt/hf_cache — the OSDC equivalent of the old
EC2 CI mount. Jobs read model weights from a local cache instead of pulling
from the Hub on every run.
Design (cloud-portable, no metadata engine, no EFS):
- The cache is plain, symlink-free HF cache-layout files in a single shared,
private S3 bucket (the portable source of truth).
- A privileged per-node rclone FUSE DaemonSet mounts the bucket read-only at
the host /mnt/hf_cache, lazily fetching files on first read and caching
them on node-local NVMe (bounded by --vfs-cache-max-size + LRU), so a cold
Karpenter node only pulls the models its jobs touch.
- A refresh CronJob is the only writer: it downloads the curated model set
(models.txt) and publishes a symlink-free copy to S3 (rclone copy -L,
dropping blobs/).
- Per-cluster IRSA: hf-cache-mount (read-only), hf-cache-refresh (read/write).
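To illustrate what "symlink-free copy, dropping blobs/" means for the standard HF cache layout (where `snapshots/<rev>/<file>` is a symlink into `blobs/`), here is a minimal local sketch. It is not the PR's `refresh_cache.py` and it does not call rclone; `flatten_cache` and `demo` are hypothetical names, and the sketch assumes only file symlinks need dereferencing.

```python
import os
import shutil
import tempfile
from pathlib import Path


def flatten_cache(src: Path, dst: Path, skip_dirs=("blobs",)) -> list[str]:
    """Copy src to dst, replacing file symlinks with the bytes they point
    to and skipping any directory named in skip_dirs."""
    copied = []
    # os.walk does not descend into directory symlinks by default; this
    # sketch assumes only file symlinks (snapshots/<rev>/<file>) matter.
    for root, dirs, files in os.walk(src):
        dirs[:] = [d for d in dirs if d not in skip_dirs]  # prune blobs/
        for name in files:
            source = Path(root) / name
            rel = source.relative_to(src)
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            # shutil.copyfile follows symlinks by default, so the target
            # is a regular file holding the blob's content.
            shutil.copyfile(source, target)
            copied.append(rel.as_posix())
    return sorted(copied)


def demo() -> list[str]:
    """Build a tiny fake HF-style cache and flatten it (hypothetical data)."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "src" / "models--org--name"
        (repo / "blobs").mkdir(parents=True)
        (repo / "blobs" / "abc123").write_text("{}")
        snap = repo / "snapshots" / "rev1"
        snap.mkdir(parents=True)
        os.symlink("../../blobs/abc123", snap / "config.json")

        out = Path(tmp) / "dst"
        copied = flatten_cache(Path(tmp) / "src", out)
        published = out / "models--org--name" / "snapshots" / "rev1" / "config.json"
        assert not published.is_symlink()
        assert published.read_text() == "{}"
        return copied
```

The result keeps the `snapshots/` tree as regular files, which is the shape an object store without symlink support can hold; whether loaders accept that shape is the open item listed further down.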
Runner integration (ARC kubernetes mode) is gated behind a
`# BEGIN_HF_CACHE` block in the runner job-pod template, kept/stripped by
generate_runners.py based on whether the cluster enables the `hf-cache`
module — mirroring the existing pypi-cache pattern. This PR adds `hf_cache`
defaults to clusters.yaml but does NOT add `hf-cache` to any cluster's
modules list, so it changes behavior on zero clusters until explicitly
enabled.
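The keep/strip gating described above can be sketched as follows. This is not the actual `generate_runners.py` logic: `render_gated` is a hypothetical helper, the `# END_HF_CACHE` closing marker is an assumption (the PR text only names `# BEGIN_HF_CACHE`), and the template fragment is invented for illustration.

```python
def render_gated(template: str, name: str, enabled: bool) -> str:
    """Drop the marker lines; keep the block body only when enabled."""
    begin, end = f"# BEGIN_{name}", f"# END_{name}"  # END marker is assumed
    out, inside = [], False
    for line in template.splitlines():
        marker = line.strip()
        if marker == begin:
            inside = True
            continue
        if marker == end:
            if not inside:
                raise ValueError(f"{end} without matching {begin}")
            inside = False
            continue
        if inside and not enabled:
            continue  # module not enabled: strip the gated body
        out.append(line)
    if inside:
        raise ValueError(f"unterminated {begin} block")
    return "\n".join(out) + "\n"


# Hypothetical template fragment, not the real runner.yaml.tpl.
TEMPLATE = """\
volumes:
  - name: work
# BEGIN_HF_CACHE
  - name: hf-cache
    hostPath:
      path: /mnt/hf_cache
# END_HF_CACHE
"""

# The gate: the block survives only if the cluster lists the module.
cluster = {"modules": ["arc-runners", "pypi-cache"]}  # hypothetical cluster
rendered = render_gated(TEMPLATE, "HF_CACHE", "hf-cache" in cluster["modules"])
```

With the module absent from `modules`, `rendered` contains only the `work` volume, which is why adding defaults without enabling the module leaves every cluster's rendered template unchanged.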
Open items (see PR description): validate the symlink-free layout resolves
via from_pretrained/vLLM; confirm the privileged-FUSE DaemonSet is acceptable
under cluster Pod Security; single-region shared bucket vs per-region.
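Before the first open item is tested with a real `from_pretrained` or vLLM load, a cheap structural preflight could confirm that a published tree really is symlink-free and carries no `blobs/` directory. The sketch below is an assumption about what such a check might look like, not part of the PR; `audit_layout` and `demo` are hypothetical names, and passing this check does not prove that a loader resolves the layout.

```python
import os
import tempfile
from pathlib import Path


def audit_layout(cache_root: Path) -> list[str]:
    """Report symlinks and leftover blobs/ directories under cache_root."""
    problems = []
    for root, dirs, files in os.walk(cache_root):
        for name in dirs + files:
            path = Path(root) / name
            rel = path.relative_to(cache_root).as_posix()
            if path.is_symlink():
                problems.append(f"symlink: {rel}")
            elif name == "blobs" and path.is_dir():
                problems.append(f"blobs dir: {rel}")
    return sorted(problems)


def demo() -> tuple[list[str], list[str]]:
    """Audit a clean tree, then the same tree after adding a symlink."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        snap = root / "models--org--name" / "snapshots" / "rev1"
        snap.mkdir(parents=True)
        (snap / "config.json").write_text("{}")
        clean = audit_layout(root)
        os.symlink("config.json", snap / "alias.json")
        dirty = audit_layout(root)
        return clean, dirty
```

An empty result only says the tree has the intended shape; the actual validation remains loading a model from it in offline mode.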
What
Adds a new `hf-cache` module giving OSDC runners a shared, read-only HuggingFace model cache at `/mnt/hf_cache`, the OSDC equivalent of the old EC2 CI mount. Jobs read model weights from a local cache (`HF_HOME=/mnt/hf_cache`, offline mode) instead of pulling from the Hub on every run.
Important
This is a draft and changes behavior on ZERO clusters. The runner integration is gated behind a `# BEGIN_HF_CACHE` block that is stripped unless a cluster lists `hf-cache` in its `modules:`. No cluster's `modules:` list is modified here.
Design
Cloud-portable, no metadata engine, no EFS — just S3 + rclone:
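- The cache is plain, symlink-free HF cache-layout files in a single shared, private S3 bucket (`pytorch-hf-model-cache`). Any object store can host the same layout; migration is a plain `rclone sync`.
- A privileged per-node rclone FUSE DaemonSet mounts the bucket read-only at the host `/mnt/hf_cache`. Reads are lazy and cached on node-local NVMe (`--vfs-cache-mode full`, bounded by `--vfs-cache-max-size` + LRU), so a cold Karpenter node only pulls the models its jobs touch: TB in S3, a small subset on disk.
- A refresh CronJob is the only writer: it downloads the curated `models.txt` set (with `HF_TOKEN` for gated models) and publishes a symlink-free copy to S3 (`rclone copy -L`, excluding `blobs/`).
- Per-cluster IRSA: `hf-cache-mount` (read-only), `hf-cache-refresh` (read/write).
Runner integration (ARC kubernetes mode)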
- `generate_runners.py` keeps or strips a `# BEGIN_HF_CACHE` block in the job-pod hook template based on module presence (same mechanism as `pypi-cache`).
- When enabled, every job pod gets the read-only `/mnt/hf_cache` hostPath mount (`HostToContainer` propagation) plus the `HF_HOME` / `HF_HUB_CACHE` / offline env.
- `from_pretrained(...)` / `vllm.LLM(model=...)` resolve transparently, no code changes.
Files
- `modules/hf-cache/`: terraform (shared bucket + per-cluster IRSA), `deploy.sh`, mount DaemonSet + refresh CronJob templates, `refresh_cache.py` + unit tests, `models.txt`, README.
- `modules/arc-runners/templates/runner.yaml.tpl` + `generate_runners.py`: gated `HF_CACHE` block + module gate.
- `clusters.yaml`: `hf_cache` defaults (not enabled on any cluster).
Testing
- `just lint`: 13/13 pass.
- `just test`: pass, 98.71% coverage.
Open items (intentionally a draft)
- Validate that `from_pretrained` / vLLM resolve correctly from an `rclone -L`, `blobs/`-excluded layout (spike before enabling on a real cluster).
- Single shared bucket in `us-east-2`: cross-region S3 reads for other regions (node cache absorbs repeats); per-region buckets / replication is a follow-up.
- Offline mode (`HF_HUB_OFFLINE=1`): an uncached model errors out (matches EC2). An online-fallback overlay is a possible enhancement.
Design doc: (Google Doc link to be added)