
fix(driver/cuda): gate FlashInfer Mamba SM90 SSU on min arch, not any arch #438

Merged
ingim merged 1 commit into main from task/16257 on Jun 21, 2026


@shsym shsym commented Jun 20, 2026

Contributor

Problem

The slim CUDA Docker image (added in #363) no longer builds on main. A fresh e2e rebuild on yecl-gpu-02 (RTX 4090, CUDA 12.9) fails compiling src/ops/flashinfer_mamba.cu:

flashinfer/mamba/kernel_selective_state_update_stp.cuh:
  error: namespace "cuda::device::experimental" has no member
         "cp_async_bulk_tensor_4d_global_to_shared"   (+ ~30 more)

Routine CI never caught it: the driver-cuda job is gated behind a self-hosted runner + ci-cuda label, so the CUDA kernels are not compiled in normal PR CI.

Root cause

The SM90 SSU vertical/horizontal kernels (new in v0.4.0; absent at the #363 merge) use CCCL's experimental TMA intrinsics, which <cuda/barrier> exposes only behind __cccl_lib_experimental_ctk12_cp_async_exposure — defined solely when __CUDA_MINIMUM_ARCH__ >= 900. __CUDA_MINIMUM_ARCH__ is the minimum of the whole CUDA_ARCHITECTURES list and is constant across device passes, so the fat 80;86;89;90 Docker build pins it to 800 and #ifdefs the intrinsics out of every pass — even the sm_90 one.
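
The arithmetic behind that can be sketched in plain C++. This is only a model of the behavior described above, not CCCL's actual header logic: `minimum_arch_macro` and `tma_intrinsics_exposed` are hypothetical names, and the architecture list is assumed to contain bare compute capabilities such as 80 or 90.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model of the list-wide minimum described above.
// Input: compute capabilities as listed in CUDA_ARCHITECTURES (80, 86, 89, 90).
// Output: the value the minimum-arch macro would take (800 for arch 80).
inline int minimum_arch_macro(const std::vector<int>& cuda_architectures) {
    if (cuda_architectures.empty()) return 0;  // assumption: nothing targeted
    // One value for the whole build; it does not change per device pass.
    return *std::min_element(cuda_architectures.begin(),
                             cuda_architectures.end()) * 10;
}

// Models the exposure condition: the experimental cp_async_bulk_tensor_*
// intrinsics are visible only when the minimum is at least 900.
inline bool tma_intrinsics_exposed(const std::vector<int>& cuda_architectures) {
    return minimum_arch_macro(cuda_architectures) >= 900;
}
```

Under this model the fat list 80;86;89;90 yields 800, so the intrinsics are hidden from every pass, while a 90;100 list yields 900 and exposes them.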

The existing gate set FLASHINFER_MAMBA_ENABLE_SM90 whenever any targeted arch was 90/100, which enabled the SM90 kernels in exactly the mixed builds that cannot compile them.

Fix

Enable the SM90 SSU compile gate only when every targeted arch is sm_90+ (a pure Hopper/Blackwell build — e.g. the per-compute-capability release binaries). Mixed/fat builds fall back to the simple SSU algorithm: flashinfer's kAuto dispatch already selects kSimple on every device incl. Hopper when FLASHINFER_MAMBA_ENABLE_SM90 is unset (verified in invokeSelectiveStateUpdate), so behavior is correct — just not the TMA-optimized path in a fat binary.
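
The fallback behavior can be illustrated with a small C++ sketch. It is a simplified stand-in, not flashinfer's code: `SsuPath`, `resolve_auto`, and the `mamba_sm90_enabled` parameter (standing in for whether FLASHINFER_MAMBA_ENABLE_SM90 was defined at compile time) are hypothetical, and the choice made when the gate is on is an assumption.

```cpp
// Hypothetical stand-ins; flashinfer's real enum and dispatch signature may differ.
enum class SsuPath { kSimple, kSm90Tma };

// Models kAuto resolution as described above.
// device_sm: compute capability of the running device (89 for RTX 4090, 90 for Hopper).
// mamba_sm90_enabled: stands in for FLASHINFER_MAMBA_ENABLE_SM90 being defined.
constexpr SsuPath resolve_auto(int device_sm, bool mamba_sm90_enabled) {
    // Without the compile gate the SM90 kernels are not in the binary, so every
    // device, Hopper included, takes the simple path.
    // Assumption: with the gate on, SM90+ devices take the TMA-optimized path.
    return (mamba_sm90_enabled && device_sm >= 90) ? SsuPath::kSm90Tma
                                                   : SsuPath::kSimple;
}

// A fat build (gate off) running on Hopper still resolves to the simple path.
static_assert(resolve_auto(90, false) == SsuPath::kSimple,
              "fat build on Hopper falls back to kSimple");
```

The point of the sketch is that the gate only removes an optimization: with it unset, the result is the same simple algorithm on every device.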

Hopper/FA3 attention gating is unchanged (it compiles fine in fat builds; it uses CUTLASS TMA, not the CCCL experimental gate).

Verification

e2e on yecl-gpu-02 (RTX 4090, driver 575.57.08, CUDA 12.9) — current main + this fix:

  • image builds, 4.11 GB runtime stage
  • pie doctor → GPU 0 detected, cuda_native compiled in
  • pie run text-completion --prompt "The capital of France is" → The capital of France is **Paris**.

Summary by CodeRabbit

  • Chores
    • Refined CUDA build configuration for FlashInfer Mamba SM90 optimization so the SM90+ path is enabled only when the build targets SM90+ exclusively, improving correctness and consistency for mixed-architecture builds.

coderabbitai Bot commented Jun 20, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between 7c01867 and 219488f.

📒 Files selected for processing (1)
  • driver/cuda/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • driver/cuda/CMakeLists.txt

Walkthrough

The CMake script changes how PIE_CUDA_FLASHINFER_MAMBA_SM90 is enabled. Instead of per-architecture loop assignments, two aggregate flags (_PIE_CUDA_HAS_SM90PLUS, _PIE_CUDA_HAS_PRE_SM90) are computed from the full CMAKE_CUDA_ARCHITECTURES list, and the define is enabled only when all targets are SM90 or newer.
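
The whole-list decision can be modeled in C++ as follows. This is a sketch of the logic the walkthrough describes, not the CMake code from the PR: `ArchFlags`, `arch_number`, `classify`, and `enable_mamba_sm90` are hypothetical names, and treating suffixed entries such as `90a` by their leading number (with unparsable entries counted as pre-SM90) is an assumption.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Mirrors the two aggregate flags named in the walkthrough.
struct ArchFlags {
    bool has_sm90plus = false;  // models _PIE_CUDA_HAS_SM90PLUS
    bool has_pre_sm90 = false;  // models _PIE_CUDA_HAS_PRE_SM90
};

// Assumption: entries may carry suffixes ("90a", "90-real"); only the leading
// number is used. Entries with no leading number parse as 0 (pre-SM90).
inline int arch_number(const std::string& entry) {
    return std::atoi(entry.c_str());
}

// Scan the full architecture list once and record both flags.
inline ArchFlags classify(const std::vector<std::string>& archs) {
    ArchFlags flags;
    for (const auto& arch : archs) {
        if (arch_number(arch) >= 90) flags.has_sm90plus = true;
        else flags.has_pre_sm90 = true;
    }
    return flags;
}

// The define is enabled only for a pure SM90+ target list:
// at least one SM90+ arch and zero pre-SM90 arches.
inline bool enable_mamba_sm90(const std::vector<std::string>& archs) {
    const ArchFlags flags = classify(archs);
    return flags.has_sm90plus && !flags.has_pre_sm90;
}
```

With this model the Docker image's 80;86;89;90 list leaves the gate off, while a 90-only or 90;100 list turns it on.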

Changes

Mamba SM90 compile gate refinement

| Layer / File(s) | Summary |
| --- | --- |
| SM90 exclusive-target gate (driver/cuda/CMakeLists.txt) | Replaces per-iteration PIE_CUDA_FLASHINFER_MAMBA_SM90 assignments (ON for 90 and 100) with two whole-list flags: _PIE_CUDA_HAS_SM90PLUS and _PIE_CUDA_HAS_PRE_SM90. The Hopper FA3 source selection remains tied to arch == 90, while PIE_CUDA_FLASHINFER_MAMBA_SM90 is now enabled only when the target list has at least one SM90+ arch and zero pre-SM90 arches. Adds inline comments explaining mixed vs. pure SM90+ build behavior. |

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: correcting the FlashInfer Mamba SM90 compile gate from checking 'any arch' to checking 'min arch' for proper multi-architecture builds. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


… arch

The SM90 SSU vertical/horizontal kernels use CCCL's experimental TMA
intrinsics (cuda::device::experimental::cp_async_bulk_tensor_* /
fence_proxy_async_shared_cta), which <cuda/barrier> exposes only behind
__cccl_lib_experimental_ctk12_cp_async_exposure — defined solely when
__CUDA_MINIMUM_ARCH__ >= 900. That macro is the *minimum* of the whole
CUDA_ARCHITECTURES list and is constant across device passes, so a fat
multi-arch build (e.g. the Docker image's 80;86;89;90) pins it to 800 and
#ifdef's the intrinsics out of every pass — even the sm_90 one — breaking
the build with "namespace cuda::device::experimental has no member ...".

flashinfer_mamba.cu built fine at v0.3.0 (no mamba kernel); the SSU kernels
arrived in v0.4.0 and the gate enabled FLASHINFER_MAMBA_ENABLE_SM90 whenever
*any* targeted arch was 90/100, a configuration a mixed build cannot compile.
Enable it only when *every* targeted arch is sm_90+ (a pure Hopper/Blackwell
build, e.g. the per-compute-capability release binaries); mixed builds fall
back to the simple SSU algorithm. flashinfer's kAuto dispatch already selects
kSimple on every device (incl. Hopper) when FLASHINFER_MAMBA_ENABLE_SM90 is
unset, so this is correct — just not the TMA-optimized path in a fat binary.

Hopper/FA3 attention gating is unchanged (it compiles fine in fat builds; it
uses CUTLASS TMA, not the CCCL experimental gate).

Unbreaks the slim CUDA Docker image build added in #363. Verified e2e on
yecl-gpu-02 (RTX 4090, CUDA 12.9): image builds (4.11 GB runtime), pie doctor
detects the GPU with cuda_native compiled in, and `pie run text-completion`
returns a coherent completion.
@ingim ingim merged commit b21aa5f into main Jun 21, 2026
9 checks passed