[vla-fine-tuning] perf: ~5× per-step speedup; zero data spillage by irradiantlife · Pull Request #705 · anyscale/templates

irradiantlife · 2026-05-20T14:08:13Z

Summary

Engineering-level perf cleanup of the VLA fine-tuning template. No model / numerical changes -- removes synchronous host stalls in the training loop and the producer overruns they caused. The primary motivation is removing CPU-GPU synchronization by means of an async prefetch helper. The helper's cost is one extra batch of reserved GPU memory.

Smoke benchmark, Anyscale workspace, 4× L4, MAX_TRAIN_STEPS=100, freshly-restarted cluster:

| Metric | Before | After | Delta |
|---|---|---|---|
| Per-step training body | 3.17 s | 0.60 s | -81 % |
| Dataset producer time | 316.87 s | 60.37 s | -81 % |
| Object-store spillage (peak) | 262 GB | 0 GB | -100 % |

Per-step is computed as dataset_exec_time / num_steps -- under this template's overlapped producer/consumer pipeline, the dataset producer runs as long as consumers are pulling, so this captures the steady-state training-body cost cleanly (excluding cluster setup and checkpoint upload).

(Total wall-clock improvement is workload-dependent: it is dominated by the ~5× per-step speedup once setup/teardown amortizes. On the 100-step smoke run this is ~40 % wall-clock; longer runs converge toward the 5× ratio. Numbers were reproduced on a fresh workspace -- back-to-back runs on the same Ray cluster are contaminated because per-node spill files persist between runs.)
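The amortization argument can be made concrete with a small calculation. This is only a sketch: the per-step times come from the table above, while `ASSUMED_OVERHEAD_S` (setup plus teardown) is an illustrative placeholder, not a measured value.

```python
# Per-step training-body times from the benchmark table.
BEFORE_STEP_S = 3.17
AFTER_STEP_S = 0.60


def wall_clock_speedup(num_steps: int, fixed_overhead_s: float) -> float:
    """Ratio of total run time before vs. after, with a fixed setup/teardown cost."""
    before = fixed_overhead_s + num_steps * BEFORE_STEP_S
    after = fixed_overhead_s + num_steps * AFTER_STEP_S
    return before / after


# Hypothetical fixed cost for cluster setup and checkpoint upload (assumption).
ASSUMED_OVERHEAD_S = 300.0

for steps in (100, 1_000, 10_000):
    ratio = wall_clock_speedup(steps, ASSUMED_OVERHEAD_S)
    print(f"{steps:>6} steps: {ratio:.2f}x faster end to end")
```

With zero fixed overhead the ratio is exactly 3.17 / 0.60, roughly 5.3×; any fixed cost pulls short runs below that, and the gap closes as the step count grows.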

Why

The template's GPU consumers were never starved, but the consumer-side plumbing forced repeated host syncs:

1. The collate did synchronous (default-stream) H2D copies, so even with `non_blocking=True` the H2D serialized against compute on the device.
2. Per-step `.item()` / `.as_py()` calls forced host syncs and Arrow scalar boxing inside the hot loops.
3. No GPU prefetch -- compute and H2D fought for the default stream, no overlap.

The 262 GB of object-store spillage in the baseline was a symptom of the slow consumer giving Ray Data producers a wide window to overrun. Once the consumer-side stalls are removed, Ray's backpressure system reaches equilibrium on its own; no producer concurrency cap is needed.
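A minimal sketch of the overlap that was missing, assuming batches arrive as dicts of CPU tensors. This is not the PR's code: `prefetch_to_device` and its helpers are hypothetical names. It copies batch N+1 on a side stream while batch N is being consumed, and the usage loop keeps the loss on-device so that `.item()` runs only once at the end.

```python
import torch


def prefetch_to_device(batches, device):
    """Yield batches on `device`, staging the next batch while the current one is in use."""
    if device.type != "cuda":
        # CPU fallback so the sketch runs anywhere; no overlap happens here.
        for batch in batches:
            yield {k: v.to(device) for k, v in batch.items()}
        return

    copy_stream = torch.cuda.Stream(device=device)

    def stage(batch):
        # Enqueue the H2D copies on the side stream; pinned memory makes them async.
        with torch.cuda.stream(copy_stream):
            tensors = {
                k: v.pin_memory().to(device, non_blocking=True)
                for k, v in batch.items()
            }
        ready = torch.cuda.Event()
        ready.record(copy_stream)  # marks the end of this batch's copies only
        return tensors, ready

    def release(staged):
        tensors, ready = staged
        compute_stream = torch.cuda.current_stream(device)
        # Wait on this batch's event, not the whole copy stream, so the next
        # batch's copy (already enqueued) can still overlap with compute.
        compute_stream.wait_event(ready)
        for t in tensors.values():
            # Tell the caching allocator the compute stream also uses this memory.
            t.record_stream(compute_stream)
        return tensors

    pending = None
    for cpu_batch in batches:
        staged = stage(cpu_batch)
        if pending is not None:
            yield release(pending)
        pending = staged
    if pending is not None:
        yield release(pending)


# Usage with a toy model (illustrative data, not the template's pipeline).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 1).to(device)
cpu_batches = [{"x": torch.randn(4, 8), "y": torch.randn(4, 1)} for _ in range(3)]

running_loss = torch.zeros((), device=device)  # 0-d accumulator stays on-device
steps = 0
for batch in prefetch_to_device(cpu_batches, device):
    loss = torch.nn.functional.mse_loss(model(batch["x"]), batch["y"])
    running_loss += loss.detach()  # no host sync inside the loop
    steps += 1

mean_loss = running_loss.item() / steps  # single host sync at the boundary
```

The cost is the one described in the summary: one extra batch stays resident on the GPU while the current batch is being processed.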

Changes

- **`util.py`**
  - `NumpyToTorchCollate`: produce pinned-CPU tensors (no H2D); pair with new `cuda_prefetcher` for device-level overlap.
  - `cuda_prefetcher`: 1-batch GPU prefetch on a dedicated CUDA stream so batch N+1's H2D copy overlaps with batch N's fwd/bwd.
  - `enable_gpu_perf_flags`: TF32 + cudnn.benchmark, called from the worker.
  - `make_trainable_optimizer`: fused-CUDA AdamW + caches trainable param list so `clip_grad_norm_` doesn't re-walk every step.
  - `train_step`: return loss as a 0-d device tensor (was `float()`), eliminating the per-step host sync.
- **`vla.py`** (mirrored in `README.ipynb`)
  - Fuse `transpose_images` (stack + transpose + astype) into a single pre-allocated float32 buffer.
  - Loss accumulator stays on-device as a 0-d tensor; `.item()` only at log boundary + end-of-epoch.
- **`lerobot_datasource.py`**
  - Hoist Arrow `.as_py()` boxing out of the per-row read loop -- convert columns to python lists once per parquet table.

Test plan

- [x] `pre-commit run --files <changed>` clean
- [x] `python ci/validate_build_yaml.py --no-network` passes
- [x] Anyscale workspace smoke (4× L4) reproduced cleanly.
- [ ] `/test-template vla-fine-tuning` on the PR for the Buildkite smoke (currently `tests/vla-fine-tuning/tests.sh` is fully commented out, so this only verifies the workspace+notebook pipeline path).

To reproduce on a freshly-restarted Anyscale workspace:

export HF_TOKEN=hf_...                # needs "Read access to gated repos"
export MAX_TRAIN_STEPS=100
time uv run papermill README.ipynb /tmp/out.ipynb -k python3 --log-output

Look for "Dataset train_<id> execution finished in N.NN seconds" -- divide by MAX_TRAIN_STEPS for an apples-to-apples per-step number. Object-store spillage shows up as "Spilled N MiB" log lines (should be absent on the perf branch).
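The log check above can be scripted. The sketch below assumes the log lines match the two quoted patterns exactly; real Ray output may be formatted differently, and `summarize_run` and the sample log text are hypothetical.

```python
import re

# Patterns taken from the two log messages quoted above (assumed exact wording).
FINISHED_RE = re.compile(
    r"Dataset train_\S* execution finished in (\d+(?:\.\d+)?) seconds"
)
SPILLED_RE = re.compile(r"Spilled (\d+(?:\.\d+)?) MiB")


def summarize_run(log_text: str, max_train_steps: int):
    """Return (seconds per step or None, list of spilled MiB values) for one run log."""
    match = FINISHED_RE.search(log_text)
    per_step = float(match.group(1)) / max_train_steps if match else None
    spilled_mib = [float(value) for value in SPILLED_RE.findall(log_text)]
    return per_step, spilled_mib


# Made-up log excerpt using the "After" producer time from the table.
sample_log = "Dataset train_7_0 execution finished in 60.37 seconds\n"
per_step, spilled = summarize_run(sample_log, max_train_steps=100)
print(per_step, spilled)
```

An empty spill list corresponds to the "should be absent on the perf branch" expectation.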


Aydin-ab · 2026-05-21T21:22:00Z

/test-template vla-fine-tuning

Aydin-ab · 2026-05-21T21:24:28Z

Thank you for the contribution

Hi @shorbaji, I believe you wrote the template. Can you review this?

Aydin-ab · 2026-05-21T21:24:52Z

/test-template vla-fine-tuning

irradiantlife and others added 2 commits May 20, 2026 07:53

Merge branch 'main' into perf/vla-gpu-cpu-bottlenecks

bcaebac

Aydin-ab requested a review from shorbaji May 21, 2026 21:22

irradiantlife wants to merge 2 commits into anyscale:main from irradiantlife:perf/vla-gpu-cpu-bottlenecks
