fix(driver/cuda): gate FlashInfer Mamba SM90 SSU on min arch, not any arch by shsym · Pull Request #438 · pie-project/pie

shsym · 2026-06-20T02:31:40Z

Problem

The slim CUDA Docker image (added in #363) no longer builds on main. A fresh e2e rebuild on yecl-gpu-02 (RTX 4090, CUDA 12.9) fails compiling src/ops/flashinfer_mamba.cu:

flashinfer/mamba/kernel_selective_state_update_stp.cuh:
  error: namespace "cuda::device::experimental" has no member
         "cp_async_bulk_tensor_4d_global_to_shared"   (+ ~30 more)

Routine CI never caught it: the driver-cuda job is gated behind a self-hosted runner + ci-cuda label, so the CUDA kernels are not compiled in normal PR CI.

Root cause

The SM90 SSU vertical/horizontal kernels (new in v0.4.0; absent at the #363 merge) use CCCL's experimental TMA intrinsics, which <cuda/barrier> exposes only behind __cccl_lib_experimental_ctk12_cp_async_exposure — defined solely when __CUDA_MINIMUM_ARCH__ >= 900. __CUDA_MINIMUM_ARCH__ is the minimum of the whole CUDA_ARCHITECTURES list and is constant across device passes, so the fat 80;86;89;90 Docker build pins it to 800 and #ifdefs the intrinsics out of every pass — even the sm_90 one.
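
To make the arithmetic concrete, here is a minimal host-side C++ sketch of the rule described above. It is not CCCL or nvcc code: `minimum_arch_macro` and `tma_intrinsics_exposed` are hypothetical helper names, and the model assumes architectures are given as plain integers such as 80 or 90.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the build-wide minimum-arch value: the smallest entry
// of the architecture list, scaled the way __CUDA_ARCH__ values are (sm_80 -> 800).
int minimum_arch_macro(const std::vector<int>& cuda_architectures) {
    assert(!cuda_architectures.empty() && "at least one target arch is required");
    int smallest = *std::min_element(cuda_architectures.begin(),
                                     cuda_architectures.end());
    return smallest * 10;
}

// Per the PR description, the experimental TMA intrinsics are exposed only
// when that minimum is >= 900. The result is the same for every device pass
// because it depends on the whole list, not on the pass being compiled.
bool tma_intrinsics_exposed(const std::vector<int>& cuda_architectures) {
    return minimum_arch_macro(cuda_architectures) >= 900;
}
```

Under this model the Docker list `{80, 86, 89, 90}` yields a minimum of 800, so the intrinsics are hidden in every pass, while `{90}` or `{90, 100}` exposes them.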

The existing gate set FLASHINFER_MAMBA_ENABLE_SM90 whenever any targeted arch was 90/100. In a mixed build that turns on kernels whose intrinsics have been compiled out, so the build can never succeed.

Fix

Enable the SM90 SSU compile gate only when every targeted arch is sm_90+ (a pure Hopper/Blackwell build — e.g. the per-compute-capability release binaries). Mixed/fat builds fall back to the simple SSU algorithm: flashinfer's kAuto dispatch already selects kSimple on every device incl. Hopper when FLASHINFER_MAMBA_ENABLE_SM90 is unset (verified in invokeSelectiveStateUpdate), so behavior is correct — just not the TMA-optimized path in a fat binary.
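
The fallback can be pictured with a small C++ model. This is a sketch under stated assumptions, not flashinfer's code: `resolve_auto` and `kSm90Tma` are hypothetical names (the latter stands in for the vertical/horizontal kernels), and the branch taken when the gate is set is a guess made only for illustration. The one behavior taken from the PR is that kAuto resolves to kSimple on every device when FLASHINFER_MAMBA_ENABLE_SM90 is unset.

```cpp
#include <cassert>

// kSimple is named in the PR; kSm90Tma is a hypothetical stand-in for the
// SM90 vertical/horizontal kernels.
enum class SsuAlgorithm { kSimple, kSm90Tma };

// Hypothetical model of what a kAuto request resolves to.
// compiled_with_sm90 stands for FLASHINFER_MAMBA_ENABLE_SM90 being defined.
SsuAlgorithm resolve_auto(bool compiled_with_sm90, int device_sm) {
    assert(device_sm > 0 && "expects a numeric compute capability such as 89 or 90");
    if (!compiled_with_sm90) {
        // From the PR: with the gate unset, kSimple is chosen on every device,
        // Hopper included.
        return SsuAlgorithm::kSimple;
    }
    // Assumption for illustration only: with the gate set, SM90+ devices take
    // the TMA path and older devices keep the simple one.
    return device_sm >= 90 ? SsuAlgorithm::kSm90Tma : SsuAlgorithm::kSimple;
}
```

In this model a fat binary (gate unset) returns `kSimple` for both an RTX 4090 (`device_sm = 89`) and a Hopper card (`device_sm = 90`), which is the correct-but-unoptimized behavior the fix accepts.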

Hopper/FA3 attention gating is unchanged (it compiles fine in fat builds; it uses CUTLASS TMA, not the CCCL experimental gate).

Verification

e2e on yecl-gpu-02 (RTX 4090, driver 575.57.08, CUDA 12.9) — current main + this fix:

- image builds, 4.11 GB runtime stage
- pie doctor → GPU 0 detected, cuda_native compiled in
- pie run text-completion --prompt "The capital of France is" → The capital of France is **Paris**.

Summary by CodeRabbit

Chores
- Refined CUDA build configuration for FlashInfer Mamba SM90 optimization so the SM90+ path is enabled only when the build targets SM90+ exclusively, improving correctness and consistency for mixed-architecture builds.

coderabbitai · 2026-06-20T02:31:52Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 7c01867 and 219488f.

📒 Files selected for processing (1)

driver/cuda/CMakeLists.txt

🚧 Files skipped from review as they are similar to previous changes (1)

driver/cuda/CMakeLists.txt

Walkthrough

The CMake script changes how PIE_CUDA_FLASHINFER_MAMBA_SM90 is enabled. Instead of per-architecture loop assignments, two aggregate flags (_PIE_CUDA_HAS_SM90PLUS, _PIE_CUDA_HAS_PRE_SM90) are computed from the full CMAKE_CUDA_ARCHITECTURES list, and the define is enabled only when all targets are SM90 or newer.
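
The difference between the old and new gate can be modeled in a few lines of C++. This is an illustrative sketch, not the CMake code from the PR: `ArchFlags`, `scan_archs`, `old_gate`, and `new_gate` are hypothetical names, and the model assumes the architecture list holds plain integers (no suffixed or `native` entries).

```cpp
#include <cassert>
#include <vector>

// Mirrors the two whole-list flags named in the walkthrough.
struct ArchFlags {
    bool has_sm90plus = false;  // at least one target is SM90 or newer
    bool has_pre_sm90 = false;  // at least one target is older than SM90
};

ArchFlags scan_archs(const std::vector<int>& archs) {
    ArchFlags flags;
    for (int arch : archs) {
        assert(arch > 0 && "expects numeric compute capabilities such as 80 or 90");
        if (arch >= 90) {
            flags.has_sm90plus = true;
        } else {
            flags.has_pre_sm90 = true;
        }
    }
    return flags;
}

// Old behavior: ON as soon as any targeted arch is 90 or 100.
bool old_gate(const std::vector<int>& archs) {
    for (int arch : archs) {
        if (arch == 90 || arch == 100) return true;
    }
    return false;
}

// New behavior: ON only when there is an SM90+ target and no pre-SM90 target.
bool new_gate(const std::vector<int>& archs) {
    ArchFlags flags = scan_archs(archs);
    return flags.has_sm90plus && !flags.has_pre_sm90;
}
```

For the fat list `{80, 86, 89, 90}` the old gate is on and the new gate is off; for a pure `{90}` or `{90, 100}` build both are on.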

Changes

Mamba SM90 compile gate refinement

| Layer / File(s) | Summary |
| --- | --- |
| SM90 exclusive-target gate `driver/cuda/CMakeLists.txt` | Replaces per-iteration `PIE_CUDA_FLASHINFER_MAMBA_SM90` assignments (ON for `90` and `100`) with two whole-list flags: `_PIE_CUDA_HAS_SM90PLUS` and `_PIE_CUDA_HAS_PRE_SM90`. The Hopper FA3 source selection remains tied to `arch == 90`, while `PIE_CUDA_FLASHINFER_MAMBA_SM90` is now enabled only when the target list has at least one SM90+ arch and zero pre-SM90 arches. Adds inline comments explaining mixed vs. pure SM90+ build behavior. |

… arch

The SM90 SSU vertical/horizontal kernels use CCCL's experimental TMA intrinsics (cuda::device::experimental::cp_async_bulk_tensor_* / fence_proxy_async_shared_cta), which <cuda/barrier> exposes only behind __cccl_lib_experimental_ctk12_cp_async_exposure, defined solely when __CUDA_MINIMUM_ARCH__ >= 900. That macro is the *minimum* of the whole CUDA_ARCHITECTURES list and is constant across device passes, so a fat multi-arch build (e.g. the Docker image's 80;86;89;90) pins it to 800 and #ifdef's the intrinsics out of every pass, even the sm_90 one, breaking the build with "namespace cuda::device::experimental has no member ...".

flashinfer_mamba.cu built fine at v0.3.0 (no mamba kernel); the SSU kernels arrived in v0.4.0 and the gate enabled FLASHINFER_MAMBA_ENABLE_SM90 whenever *any* targeted arch was 90/100, which a mixed build can never satisfy.

Enable it only when *every* targeted arch is sm_90+ (a pure Hopper/Blackwell build, e.g. the per-compute-capability release binaries); mixed builds fall back to the simple SSU algorithm. flashinfer's kAuto dispatch already selects kSimple on every device (incl. Hopper) when FLASHINFER_MAMBA_ENABLE_SM90 is unset, so this is correct, just not the TMA-optimized path in a fat binary.

Hopper/FA3 attention gating is unchanged (it compiles fine in fat builds; it uses CUTLASS TMA, not the CCCL experimental gate).

Unbreaks the slim CUDA Docker image build added in #363. Verified e2e on yecl-gpu-02 (RTX 4090, CUDA 12.9): image builds (4.11 GB runtime), pie doctor detects the GPU with cuda_native compiled in, and `pie run text-completion` returns a coherent completion.

shsym force-pushed the task/16257 branch from c44458d to 7c01867 Compare June 20, 2026 03:03

shsym force-pushed the task/16257 branch from 7c01867 to 219488f Compare June 20, 2026 04:31

ingim merged commit b21aa5f into main Jun 21, 2026
9 checks passed
