Update base image to CUDA 13.3 / Ubuntu 24.04 and raise CI timeouts #158

Open: hannahli-nv wants to merge 19 commits into main from update-docker-cuda13.3

hannahli-nv (Collaborator) commented Jun 24, 2026

Description

Move the test/benchmark image to a newer base and make the GPU CI scale as the
test and benchmark suites grow.

Base image

  • Update the transformers Dockerfile base from cuda:13.2.0-devel-ubuntu22.04 to
    cuda:13.3.0-devel-ubuntu24.04 (newer default toolchain: GCC 13 + Python 3.12).
  • Update the nsight-systems devtools apt repo to the ubuntu2404 path and add
    --break-system-packages to the uv bootstrap pip install (PEP 668 on 24.04).

test-ops (sharded + balanced)

  • Split test-ops across parallel GPU runners with pytest-split (--splits/--group),
    balanced by a committed .github/.test_durations map; scale out by adding entries to
    the matrix.
  • Run with pytest-xdist at -n 8 and set
    PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.8 to bound peak GPU memory and
    avoid intermittent CUDA OOM. (expandable_segments was tried and dropped because it
    conflicts with cuTile kernel launches.)
  • Skip the known-failing chunk_gated_delta_rule cuTile case (tracked separately).
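
To illustrate what "balanced by a committed durations map" means, here is a minimal sketch of duration-based sharding. It is not pytest-split's actual implementation (its algorithm may differ); it only shows the idea of assigning the longest tests first to whichever group is currently lightest. The function name `balance_shards` and the test IDs and timings are hypothetical.

```python
import heapq


def balance_shards(durations: dict[str, float], splits: int) -> list[list[str]]:
    """Greedily assign tests to `splits` groups, longest test first,
    always placing the next test in the group with the smallest total."""
    if splits < 1:
        raise ValueError("splits must be >= 1")
    # Min-heap of (total_seconds, group_index): the lightest group pops first.
    heap = [(0.0, i) for i in range(splits)]
    heapq.heapify(heap)
    groups: list[list[str]] = [[] for _ in range(splits)]
    # Sort by descending duration; break ties by test ID for determinism.
    for test_id, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        groups[idx].append(test_id)
        heapq.heappush(heap, (total + seconds, idx))
    return groups


# Hypothetical durations map (test ID -> seconds), for illustration only.
example_durations = {
    "tests/test_a.py::test_slow": 30.0,
    "tests/test_b.py::test_mid": 20.0,
    "tests/test_c.py::test_small": 10.0,
    "tests/test_d.py::test_tiny": 5.0,
}
shards = balance_shards(example_durations, splits=2)
shard_totals = [sum(example_durations[t] for t in group) for group in shards]
```

With these made-up timings the two groups total 35 s and 30 s, rather than the 50 s / 15 s split that assigning the tests in file order would give.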

test-benchmark (sharded + aggregate)

  • Split test-benchmark into a run-shard matrix (files assigned round-robin) plus a
    benchmark-aggregate job that merges all shard results and runs the existing regression
    check / summary / baseline update over the union.
  • Drop the cuTile backend from the layernorm/rmsnorm benchmarks: those kernels are
    JIT-compiled per shape and sweeping the full shape grid made the benchmark run for tens
    of minutes (the backend remains covered by the ops tests).
  • Fix the benchmark output parser to ignore non-data diagnostic stdout so malformed rows no
    longer leak into the results table / summary.
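
The shard-then-aggregate flow can be sketched as follows. This is an illustration under assumptions, not the workflow's actual code: the helper names `round_robin` and `merge_shard_results`, the benchmark file names, and the shape of a shard result (benchmark name mapped to a number) are all hypothetical.

```python
def round_robin(files: list[str], num_shards: int) -> list[list[str]]:
    """Assign files to shards round-robin: file i goes to shard i % num_shards.
    Sorting first keeps the assignment stable across runs."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    ordered = sorted(files)
    return [ordered[i::num_shards] for i in range(num_shards)]


def merge_shard_results(shard_results: list[dict[str, float]]) -> dict[str, float]:
    """Union the per-shard results; a name reported by two shards is an error,
    since each benchmark file is assigned to exactly one shard."""
    merged: dict[str, float] = {}
    for results in shard_results:
        for name, value in results.items():
            if name in merged:
                raise ValueError(f"duplicate benchmark result: {name}")
            merged[name] = value
    return merged


# Hypothetical benchmark files and per-shard results, for illustration only.
bench_files = ["bench_e.py", "bench_a.py", "bench_c.py", "bench_b.py", "bench_d.py"]
assignment = round_robin(bench_files, num_shards=2)
merged = merge_shard_results([{"bench_a": 1.0, "bench_c": 3.0}, {"bench_b": 2.0}])
```

Running the regression check on the merged dictionary, rather than per shard, means the check sees the same union of results it saw before the split.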

Nightly maintenance

  • Add a nightly-only update-test-durations job that, when the ops test set changes (tests
    added or removed, not timing drift alone), opens a reviewable PR refreshing
    .github/.test_durations.
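
The "test set changed, not timing drift" condition can be sketched as a comparison of test IDs only, ignoring the recorded timings. This assumes the durations file is a JSON object mapping test IDs to seconds; the function name `diff_test_set` and the sample IDs are hypothetical, and the real job may detect changes differently.

```python
import json


def diff_test_set(durations_json: str, collected_ids: list[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) test IDs relative to the recorded durations map.
    Only the keys are compared, so a change in timing alone yields no difference."""
    recorded = set(json.loads(durations_json))
    current = set(collected_ids)
    return current - recorded, recorded - current


# Hypothetical recorded map and currently collected test IDs.
recorded_json = '{"tests/test_a.py::test_one": 1.5, "tests/test_a.py::test_old": 0.2}'
collected = ["tests/test_a.py::test_one", "tests/test_b.py::test_new"]

added, removed = diff_test_set(recorded_json, collected)
needs_refresh = bool(added or removed)
```

A refresh PR would be opened only when `needs_refresh` is true, which avoids nightly churn from run-to-run timing noise.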

CI Configuration

config:
  build: true
  # valid options are "ops", "benchmark", and "sanity"
  test: ["ops", "benchmark"]

Checklist

  • Code formatted and imports sorted via repo specifications (./format.sh)
  • Documentation updated (if needed)
  • CI configuration reviewed

copy-pr-bot (Bot) commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

hannahli-nv force-pushed the update-docker-cuda13.3 branch from 537eb30 to 9a56281 on June 28, 2026 15:26
hannahli-nv force-pushed the update-docker-cuda13.3 branch from 8b8374d to 969f17e on June 29, 2026 16:16