osdc/hf-cache: shared HuggingFace model cache for OSDC runners (gated, not enabled)#809

Closed
huydhn wants to merge 1 commit into pytorch:main from huydhn:huydhn/osdc-hf-cache

@huydhn huydhn commented Jun 23, 2026

What

Adds a new hf-cache module giving OSDC runners a shared, read-only HuggingFace model cache at /mnt/hf_cache — the OSDC equivalent of the old EC2 CI mount. Jobs read model weights from a local cache (HF_HOME=/mnt/hf_cache, offline mode) instead of pulling from the Hub on every run.

Important

This is a draft and changes behavior on ZERO clusters. The runner integration is gated behind a # BEGIN_HF_CACHE block that is stripped unless a cluster lists hf-cache in its modules:. No cluster's modules: list is modified here.

Design

Cloud-portable, no metadata engine, no EFS — just S3 + rclone:

  • Truth: plain, symlink-free HF cache-layout files in a single shared, private S3 bucket (pytorch-hf-model-cache). Any object store can host the same layout; migration is a plain rclone sync.
  • Mount: a privileged per-node rclone FUSE DaemonSet exposes the bucket read-only at host /mnt/hf_cache. Reads are lazy and cached on node-local NVMe (--vfs-cache-mode full, bounded by --vfs-cache-max-size + LRU), so a cold Karpenter node only pulls the models its jobs touch — TB in S3, small subset on disk.
  • Refresh: a CronJob is the only writer — downloads the curated models.txt set (with HF_TOKEN for gated models) and publishes a symlink-free copy to S3 (rclone copy -L, excluding blobs/).
  • IAM: per-cluster IRSA — hf-cache-mount (read-only), hf-cache-refresh (read/write).
S3 (plain model files) ──rclone FUSE DaemonSet (NVMe cache)──> /mnt/hf_cache (RO) ──hostPath──> job pods
        ▲── refresh CronJob (HF Hub → rclone copy -L)
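
The "symlink-free" publish step can be illustrated locally. A standard HF cache keeps real file contents under `blobs/` and exposes them through symlinks in `snapshots/<revision>/`. The sketch below is a rough stand-in for what `rclone copy -L` with `blobs/` excluded produces, using only the Python standard library. The function name `flatten_hf_cache` is hypothetical and is not taken from `refresh_cache.py`.

```python
import shutil
from pathlib import Path


def flatten_hf_cache(src: Path, dst: Path) -> None:
    """Copy an HF cache tree to dst with symlinks resolved and blobs/ dropped.

    Hypothetical helper: a local analogue of publishing with symlinks
    dereferenced and blobs/ excluded, not the module's actual code.
    """
    shutil.copytree(
        src,
        dst,
        symlinks=False,  # follow symlinks and copy the target's contents
        ignore=shutil.ignore_patterns("blobs"),  # skip every entry named blobs
        dirs_exist_ok=True,
    )
```

After this copy, each file under `snapshots/<revision>/` in `dst` is a regular file, so the tree can live in an object store that has no notion of symlinks. Whether `from_pretrained` and vLLM resolve models from such a tree is exactly open item 1 and is not settled by this sketch.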

Runner integration (ARC kubernetes mode)

generate_runners.py keeps/strips a # BEGIN_HF_CACHE block in the job-pod hook template based on module presence (same mechanism as pypi-cache). When enabled, every job pod gets the read-only /mnt/hf_cache hostPath mount (HostToContainer propagation) + HF_HOME/HF_HUB_CACHE/offline env. from_pretrained(...) / vllm.LLM(model=...) resolve transparently, no code changes.
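
A minimal sketch of the keep/strip gate described above, for readers unfamiliar with the pypi-cache pattern. It assumes the block is closed by a matching `# END_HF_CACHE` line and that the marker lines themselves are dropped from the output. Both are assumptions, as is the function name `render_gated_block`; the real logic lives in `generate_runners.py` and may differ.

```python
def render_gated_block(template: str, marker: str, enabled: bool) -> str:
    """Keep or strip the lines between '# BEGIN_<marker>' and '# END_<marker>'.

    Hypothetical helper: the END marker name and the choice to drop the
    marker lines are assumptions, not confirmed by the PR.
    """
    begin, end = f"# BEGIN_{marker}", f"# END_{marker}"
    out, inside = [], False
    for line in template.splitlines():
        stripped = line.strip()
        if stripped == begin:
            inside = True
            continue  # never emit the marker line itself
        if stripped == end:
            inside = False
            continue
        if inside and not enabled:
            continue  # module disabled: drop the gated body
        out.append(line)
    return "\n".join(out) + "\n"
```

A caller would compute the flag from the cluster config, for example `enabled = "hf-cache" in cluster_modules`, where `cluster_modules` is a hypothetical name for the cluster's `modules:` list.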

Files

  • modules/hf-cache/ — terraform (shared bucket + per-cluster IRSA), deploy.sh, mount DaemonSet + refresh CronJob templates, refresh_cache.py + unit tests, models.txt, README.
  • modules/arc-runners/templates/runner.yaml.tpl + generate_runners.py — gated HF_CACHE block + module gate.
  • clusters.yaml: hf_cache defaults (not enabled on any cluster).

Testing

  • just lint — 13/13 pass.
  • just test — pass, 98.71% coverage.
  • Verified the gated template renders valid YAML both ways: disabled = no HF env/mount/volume (byte-identical job pod to today); enabled = mount + env present.

Open items (intentionally a draft)

  1. Symlink-free layout — validate from_pretrained/vLLM resolve correctly from an rclone -L, blobs/-excluded layout (spike before enabling on a real cluster).
  2. Privileged FUSE DaemonSet — confirm acceptable under the cluster's Pod Security posture.
  3. Single shared bucket in us-east-2 — cross-region S3 reads for other regions (node cache absorbs repeats); per-region buckets / replication is a follow-up.
  4. Strict-offline (HF_HUB_OFFLINE=1): uncached model errors out (matches EC2). Online-fallback overlay is a possible enhancement.
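
For item 4, a job could check up front whether a model is present instead of failing inside `from_pretrained`. The sketch below is stdlib-only and relies on the standard HF cache directory naming (`models--<org>--<name>/snapshots/<revision>/` under the hub cache directory). The helper name `is_model_cached` is hypothetical, and the `/mnt/hf_cache` fallback for `HF_HOME` mirrors this PR's mount point rather than the library default.

```python
import os
from pathlib import Path


def is_model_cached(repo_id: str, env=os.environ) -> bool:
    """Return True if at least one snapshot of repo_id exists in the cache.

    Hypothetical preflight helper; it only checks directory presence and
    does not validate that the files inside are complete.
    """
    # Assumption: fall back to this PR's mount point when HF_HOME is unset.
    hf_home = Path(env.get("HF_HOME", "/mnt/hf_cache"))
    # HF_HUB_CACHE defaults to $HF_HOME/hub when not set explicitly.
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub"))
    repo_dir = hub_cache / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())
```

A job could call `is_model_cached("org/name")` and print a clear message naming the missing model, pointing the author at `models.txt`, before the strict-offline load fails.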

Design doc: (Google Doc link to be added)

…enabled)

Adds a new `hf-cache` module giving OSDC runners a shared, read-only
HuggingFace model cache at /mnt/hf_cache — the OSDC equivalent of the old
EC2 CI mount. Jobs read model weights from a local cache instead of pulling
from the Hub on every run.

Design (cloud-portable, no metadata engine, no EFS):
  - The cache is plain, symlink-free HF cache-layout files in a single shared,
    private S3 bucket (the portable source of truth).
  - A privileged per-node rclone FUSE DaemonSet mounts the bucket read-only at
    the host /mnt/hf_cache, lazily fetching files on first read and caching
    them on node-local NVMe (bounded by --vfs-cache-max-size + LRU), so a cold
    Karpenter node only pulls the models its jobs touch.
  - A refresh CronJob is the only writer: it downloads the curated model set
    (models.txt) and publishes a symlink-free copy to S3 (rclone copy -L,
    dropping blobs/).
  - Per-cluster IRSA: hf-cache-mount (read-only), hf-cache-refresh (read/write).

Runner integration (ARC kubernetes mode) is gated behind a
`# BEGIN_HF_CACHE` block in the runner job-pod template, kept/stripped by
generate_runners.py based on whether the cluster enables the `hf-cache`
module — mirroring the existing pypi-cache pattern. This PR adds `hf_cache`
defaults to clusters.yaml but does NOT add `hf-cache` to any cluster's
modules list, so it changes behavior on zero clusters until explicitly
enabled.

Open items (see PR description): validate the symlink-free layout resolves
via from_pretrained/vLLM; confirm the privileged-FUSE DaemonSet is acceptable
under cluster Pod Security; single-region shared bucket vs per-region.

huydhn commented Jun 23, 2026


Superseded by a ghstack stack for easier review: #811 (terraform) → #812 (hf-cache module) → #813 (clusters.yaml + arc-runners wiring).

@huydhn huydhn closed this Jun 23, 2026