Update base image to CUDA 13.3 / Ubuntu 24.04 and raise CI timeouts #158
Open
hannahli-nv wants to merge 19 commits into
537eb30 to 9a56281
xjmxyt approved these changes on Jun 29, 2026
8b8374d to 969f17e
Description
Move the test/benchmark image to a newer base and make the GPU CI scale as the
test and benchmark suites grow.
Base image
- `cuda:13.2.0-devel-ubuntu22.04` → `cuda:13.3.0-devel-ubuntu24.04` (newer default toolchain: GCC 13 + Python 3.12).
- Switch to the `ubuntu2404` path and add `--break-system-packages` to the `uv` bootstrap `pip install` (Ubuntu 24.04 marks the system Python as externally managed under PEP 668).
test-ops (sharded + balanced)
- Shard `test-ops` across parallel GPU runners with `pytest-split` (`--splits`/`--group`), balanced by a committed `.github/.test_durations` map; scale out by adding entries to the matrix.
- Run with `-n 8` and set `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.8` to bound peak GPU memory and avoid intermittent CUDA OOM errors. (`expandable_segments` was tried and dropped: it conflicts with cuTile kernel launches.)
- … the `chunk_gated_delta_rule` cuTile case (tracked separately).
test-benchmark (sharded + aggregate)
- Split `test-benchmark` into a run-shard matrix (files assigned round-robin) plus a `benchmark-aggregate` job that merges all shard results and runs the existing regression check / summary / baseline update over the union.
- … JIT-compiled per shape, and sweeping the full shape grid made the benchmark run for tens of minutes (the backend remains covered by the ops tests).
- … no longer leak into the results table / summary.
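The two sharding strategies above (duration-balanced for `test-ops`, round-robin plus an aggregate step for `test-benchmark`) can be contrasted in a short Python sketch. This is not pytest-split's code or the workflow's actual scripts: `round_robin`, `balance_by_duration`, and `merge_shard_results` are hypothetical helpers, the greedy longest-first heuristic is only a stand-in for whatever algorithm pytest-split applies to `.github/.test_durations`, and shard results are assumed to be flat dicts keyed by benchmark name.

```python
from typing import Dict, List, Sequence


def round_robin(files: Sequence[str], splits: int) -> List[List[str]]:
    """Benchmark-style assignment: file i goes to shard i % splits (no timing data needed)."""
    if splits < 1:
        raise ValueError("splits must be >= 1")
    shards: List[List[str]] = [[] for _ in range(splits)]
    for i, path in enumerate(files):
        shards[i % splits].append(path)
    return shards


def balance_by_duration(durations: Dict[str, float], splits: int) -> List[List[str]]:
    """Ops-style assignment (hypothetical stand-in): longest tests first,
    each placed on the shard with the smallest running total."""
    if splits < 1:
        raise ValueError("splits must be >= 1")
    shards: List[List[str]] = [[] for _ in range(splits)]
    totals = [0.0] * splits
    for test_id, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        lightest = totals.index(min(totals))
        shards[lightest].append(test_id)
        totals[lightest] += seconds
    return shards


def merge_shard_results(shard_results: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate step: union of per-shard results. Each file runs on exactly one
    shard, so a name reported twice is treated as an error (assumed policy)."""
    merged: Dict[str, float] = {}
    for result in shard_results:
        duplicates = merged.keys() & result.keys()
        if duplicates:
            raise ValueError(f"reported by more than one shard: {sorted(duplicates)}")
        merged.update(result)
    return merged


if __name__ == "__main__":
    print(round_robin(["a.py", "b.py", "c.py", "d.py", "e.py"], 2))
    # [['a.py', 'c.py', 'e.py'], ['b.py', 'd.py']]
    print(balance_by_duration({"t1": 30.0, "t2": 20.0, "t3": 15.0, "t4": 10.0}, 2))
    # [['t1', 't4'], ['t2', 't3']]  (shard totals 40s and 35s)
    print(merge_shard_results([{"bench_a": 1.2}, {"bench_b": 3.4}]))
    # {'bench_a': 1.2, 'bench_b': 3.4}
```

Round-robin needs no timing data but can leave shards uneven when per-file cost varies; duration balancing evens out wall-clock time at the cost of keeping the durations map current.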
Nightly maintenance
- A nightly `update-test-durations` job that, when the ops test set changes (tests added or removed, not timing drift), opens a reviewable PR refreshing `.github/.test_durations`.
CI Configuration
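The nightly `update-test-durations` trigger rule (refresh only when tests are added or removed, ignoring timing drift) reduces to a set comparison on test IDs. A minimal sketch, assuming the durations file is a JSON object keyed by pytest node ID and that the current IDs come from a collection step such as `pytest --collect-only -q`; `load_durations`, `diff_test_ids`, and `needs_refresh` are hypothetical names, not the job's real code.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def load_durations(path: Path) -> Dict[str, float]:
    """Read a durations map assumed to be a JSON object {node_id: seconds}."""
    return json.loads(path.read_text())


def diff_test_ids(recorded: Dict[str, float], collected: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) test IDs; the recorded timing values are ignored."""
    known = set(recorded)
    current = set(collected)
    return sorted(current - known), sorted(known - current)


def needs_refresh(recorded: Dict[str, float], collected: Iterable[str]) -> bool:
    """True only when the set of tests changed, so timing drift never opens a PR."""
    added, removed = diff_test_ids(recorded, collected)
    return bool(added or removed)


if __name__ == "__main__":
    recorded = {"tests/ops/test_a.py::test_x": 12.5, "tests/ops/test_a.py::test_y": 3.0}
    # Same IDs as recorded: no refresh, whatever the new timings are.
    print(needs_refresh(recorded, ["tests/ops/test_a.py::test_x", "tests/ops/test_a.py::test_y"]))
    # One test added, one removed: refresh.
    print(diff_test_ids(recorded, ["tests/ops/test_a.py::test_x", "tests/ops/test_b.py::test_z"]))
```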
Checklist
- … (`./format.sh`)