osdc/hf-cache: shared HuggingFace model cache for OSDC runners (gated, not enabled) by huydhn · Pull Request #809 · pytorch/ci-infra

huydhn · 2026-06-23T22:56:58Z

What

Adds a new hf-cache module giving OSDC runners a shared, read-only HuggingFace model cache at /mnt/hf_cache — the OSDC equivalent of the old EC2 CI mount. Jobs read model weights from a local cache (HF_HOME=/mnt/hf_cache, offline mode) instead of pulling from the Hub on every run.

Important

This is a draft and changes behavior on zero clusters. The runner integration is gated behind a # BEGIN_HF_CACHE block that is stripped unless a cluster lists hf-cache in its modules: list. No cluster's modules: list is modified here.

Design

Cloud-portable, no metadata engine, no EFS — just S3 + rclone:

Truth: plain, symlink-free HF cache-layout files in a single shared, private S3 bucket (pytorch-hf-model-cache). Any object store can host the same layout; migration is a plain rclone sync.
Mount: a privileged per-node rclone FUSE DaemonSet exposes the bucket read-only at host /mnt/hf_cache. Reads are lazy and cached on node-local NVMe (--vfs-cache-mode full, bounded by --vfs-cache-max-size + LRU), so a cold Karpenter node only pulls the models its jobs touch — TB in S3, small subset on disk.
Refresh: a CronJob is the only writer — downloads the curated models.txt set (with HF_TOKEN for gated models) and publishes a symlink-free copy to S3 (rclone copy -L, excluding blobs/).
IAM: per-cluster IRSA — hf-cache-mount (read-only), hf-cache-refresh (read/write).
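
As a rough illustration of the Mount item above, the sketch below assembles an rclone mount command line with the options named in this description. The DaemonSet template itself is not shown here, so the remote name (`s3`), the NVMe cache directory, the 200G cache bound, and the use of `--allow-other` are assumptions chosen for the example, not values taken from this PR.

```python
import shlex
from typing import List


def build_mount_argv(
    remote: str,
    mount_point: str = "/mnt/hf_cache",
    cache_dir: str = "/mnt/nvme/rclone-vfs",  # hypothetical NVMe path
    cache_max_size: str = "200G",             # hypothetical bound
) -> List[str]:
    """Return an rclone mount argv for a read-only, disk-cached FUSE mount."""
    return [
        "rclone", "mount", remote, mount_point,
        "--read-only",                 # jobs must never write to the bucket
        "--allow-other",               # assumption: lets other users read the mount
        "--vfs-cache-mode", "full",    # cache file data on local disk
        "--vfs-cache-max-size", cache_max_size,  # bound on the on-disk cache
        "--cache-dir", cache_dir,      # where the VFS cache lives
    ]


def render_command(remote: str) -> str:
    """Shell-quoted form of the command, handy for logging."""
    return shlex.join(build_mount_argv(remote))
```

Calling `render_command("s3:pytorch-hf-model-cache")` yields a single shell-quoted line that could be compared against whatever the real DaemonSet template runs.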

S3 (plain model files) ──rclone FUSE DaemonSet (NVMe cache)──> /mnt/hf_cache (RO) ──hostPath──> job pods
        ▲── refresh CronJob (HF Hub → rclone copy -L)
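
To make the "symlink-free copy" idea concrete, here is a minimal local sketch of what `rclone copy -L` with `blobs/` excluded does to an HF cache tree: symlinks under `snapshots/` are followed and written out as regular files, and the `blobs/` directories are skipped. It is a pure-Python stand-in for illustration only; the function name `materialize` is hypothetical and this is not the code in refresh_cache.py.

```python
import os
import shutil
from pathlib import Path
from typing import List


def materialize(src: Path, dst: Path) -> List[str]:
    """Copy src to dst, dereferencing symlinks and skipping blobs/ dirs.

    Returns the sorted relative paths of the files that were written.
    """
    copied = []
    for root, dirs, files in os.walk(src):
        # Prune blobs/ so content-addressed files are not copied twice.
        dirs[:] = [d for d in dirs if d != "blobs"]
        for name in files:
            source = Path(root) / name
            rel = source.relative_to(src)
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            # copyfile follows symlinks by default, so the target is a
            # regular file holding the blob's bytes.
            shutil.copyfile(source, target)
            copied.append(rel.as_posix())
    return sorted(copied)
```

The resulting tree keeps `refs/` and `snapshots/<revision>/` but has no `blobs/` and no links, which is the layout the first open item below still needs to validate against from_pretrained and vLLM.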

Runner integration (ARC kubernetes mode)

generate_runners.py keeps/strips a # BEGIN_HF_CACHE block in the job-pod hook template based on module presence (same mechanism as pypi-cache). When enabled, every job pod gets the read-only /mnt/hf_cache hostPath mount (HostToContainer propagation) + HF_HOME/HF_HUB_CACHE/offline env. from_pretrained(...) / vllm.LLM(model=...) resolve transparently, no code changes.
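
The keep/strip gate can be sketched roughly as follows. Only the `# BEGIN_HF_CACHE` marker is named in this PR; the closing `# END_HF_CACHE` marker, the function name `render_gate`, and the choice to drop the marker lines themselves when the block is kept are assumptions made for this illustration, not a description of generate_runners.py.

```python
BEGIN_MARKER = "# BEGIN_HF_CACHE"
END_MARKER = "# END_HF_CACHE"  # hypothetical closing marker


def render_gate(template: str, enabled: bool) -> str:
    """Keep the gated block when enabled, strip it otherwise.

    Marker lines are removed in both cases so the rendered YAML carries
    no trace of the gate.
    """
    kept = []
    inside = False
    for line in template.splitlines(keepends=True):
        marker = line.strip()
        if marker == BEGIN_MARKER:
            inside = True
            continue
        if marker == END_MARKER:
            inside = False
            continue
        if inside and not enabled:
            continue  # module absent: drop the gated line
        kept.append(line)
    return "".join(kept)
```

With `enabled=False` the output contains nothing from the block, which is the property behind the "byte-identical job pod" check in Testing.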

Files

modules/hf-cache/ — terraform (shared bucket + per-cluster IRSA), deploy.sh, mount DaemonSet + refresh CronJob templates, refresh_cache.py + unit tests, models.txt, README.
modules/arc-runners/templates/runner.yaml.tpl + generate_runners.py — gated HF_CACHE block + module gate.
clusters.yaml — hf_cache defaults (not enabled on any cluster).

Testing

just lint — 13/13 pass.
just test — pass, 98.71% coverage.
Verified the gated template renders valid YAML both ways: disabled = no HF env/mount/volume (byte-identical job pod to today); enabled = mount + env present.

Open items (intentionally a draft)

Symlink-free layout — validate from_pretrained/vLLM resolve correctly from an rclone -L, blobs/-excluded layout (spike before enabling on a real cluster).
Privileged FUSE DaemonSet — confirm acceptable under the cluster's Pod Security posture.
Single shared bucket in us-east-2: runners in other regions read S3 cross-region (the node cache absorbs repeat reads); per-region buckets or replication are a follow-up.
Strict offline mode (HF_HUB_OFFLINE=1): a job that requests an uncached model errors out, matching the EC2 behavior. An online-fallback overlay is a possible enhancement.
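
Because strict offline mode turns a missing model into a hard error, a job could check the cache up front and fail with a clearer message. The sketch below resolves a model against the standard HF hub cache layout (`models--<org>--<name>/refs/<revision>` pointing at `snapshots/<commit>/`). It is not part of this PR; the helper name `cached_snapshot` is hypothetical, and the hub cache directory is passed in explicitly because the value of HF_HUB_CACHE is not stated here.

```python
from pathlib import Path
from typing import Optional


def cached_snapshot(
    hub_cache: Path, repo_id: str, revision: str = "main"
) -> Optional[Path]:
    """Return the snapshot directory for repo_id, or None if not cached."""
    repo_dir = hub_cache / ("models--" + repo_id.replace("/", "--"))
    ref_file = repo_dir / "refs" / revision
    if ref_file.is_file():
        # refs/<revision> holds the commit hash the revision points at.
        commit = ref_file.read_text().strip()
    else:
        # Otherwise treat the revision as a commit hash already.
        commit = revision
    snapshot = repo_dir / "snapshots" / commit
    return snapshot if snapshot.is_dir() else None
```

A `None` result means the model is not in the curated set on this mount, so the job would error out under HF_HUB_OFFLINE=1.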

Design doc: (Google Doc link to be added)

…enabled)

Adds a new `hf-cache` module giving OSDC runners a shared, read-only HuggingFace model cache at /mnt/hf_cache, the OSDC equivalent of the old EC2 CI mount. Jobs read model weights from a local cache instead of pulling from the Hub on every run.

Design (cloud-portable, no metadata engine, no EFS):
- The cache is plain, symlink-free HF cache-layout files in a single shared, private S3 bucket (the portable source of truth).
- A privileged per-node rclone FUSE DaemonSet mounts the bucket read-only at the host /mnt/hf_cache, lazily fetching files on first read and caching them on node-local NVMe (bounded by --vfs-cache-max-size + LRU), so a cold Karpenter node only pulls the models its jobs touch.
- A refresh CronJob is the only writer: it downloads the curated model set (models.txt) and publishes a symlink-free copy to S3 (rclone copy -L, dropping blobs/).
- Per-cluster IRSA: hf-cache-mount (read-only), hf-cache-refresh (read/write).

Runner integration (ARC kubernetes mode) is gated behind a `# BEGIN_HF_CACHE` block in the runner job-pod template, kept/stripped by generate_runners.py based on whether the cluster enables the `hf-cache` module, mirroring the existing pypi-cache pattern.

This PR adds `hf_cache` defaults to clusters.yaml but does NOT add `hf-cache` to any cluster's modules list, so it changes behavior on zero clusters until explicitly enabled.

Open items (see PR description): validate the symlink-free layout resolves via from_pretrained/vLLM; confirm the privileged-FUSE DaemonSet is acceptable under cluster Pod Security; single-region shared bucket vs per-region.

huydhn · 2026-06-23T23:17:13Z

Superseded by a ghstack stack for easier review: #811 (terraform) → #812 (hf-cache module) → #813 (clusters.yaml + arc-runners wiring).

huydhn closed this Jun 23, 2026

huydhn wants to merge 1 commit into pytorch:main from huydhn:huydhn/osdc-hf-cache
