fix(driver/cuda): gate FlashInfer Mamba SM90 SSU on min arch, not any arch #438
… arch

The SM90 SSU vertical/horizontal kernels use CCCL's experimental TMA intrinsics (cuda::device::experimental::cp_async_bulk_tensor_* / fence_proxy_async_shared_cta), which <cuda/barrier> exposes only behind __cccl_lib_experimental_ctk12_cp_async_exposure, a macro defined solely when __CUDA_MINIMUM_ARCH__ >= 900. __CUDA_MINIMUM_ARCH__ is the *minimum* of the whole CUDA_ARCHITECTURES list and is constant across device passes, so a fat multi-arch build (e.g. the Docker image's 80;86;89;90) pins it to 800 and #ifdefs the intrinsics out of every pass, even the sm_90 one, breaking the build with "namespace cuda::device::experimental has no member ...".

flashinfer_mamba.cu built fine at v0.3.0 (no mamba kernel); the SSU kernels arrived in v0.4.0, and the gate enabled FLASHINFER_MAMBA_ENABLE_SM90 whenever *any* targeted arch was 90/100, a configuration that a mixed build can never compile.

Enable it only when *every* targeted arch is sm_90+ (a pure Hopper/Blackwell build, e.g. the per-compute-capability release binaries); mixed builds fall back to the simple SSU algorithm. flashinfer's kAuto dispatch already selects kSimple on every device (incl. Hopper) when FLASHINFER_MAMBA_ENABLE_SM90 is unset, so this is correct, just not the TMA-optimized path in a fat binary. Hopper/FA3 attention gating is unchanged (it compiles fine in fat builds; it uses CUTLASS TMA, not the CCCL experimental gate).

Unbreaks the slim CUDA Docker image build added in #363. Verified e2e on yecl-gpu-02 (RTX 4090, CUDA 12.9): image builds (4.11 GB runtime), pie doctor detects the GPU with cuda_native compiled in, and `pie run text-completion` returns a coherent completion.
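The minimum-arch behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the compiler's or CCCL's actual logic: the function names are hypothetical, and the only facts taken from the commit message are that `__CUDA_MINIMUM_ARCH__` follows the lowest entry of the arch list (80 becomes 800) and that the intrinsics are exposed only at 900 or above.

```python
import re


def parse_archs(cuda_architectures: str) -> list[int]:
    """Parse a CMake-style list such as "80;86;89;90" into integers.

    Assumes every entry starts with digits (suffixes such as "90a" are ignored).
    """
    entries = [e.strip() for e in cuda_architectures.split(";") if e.strip()]
    return [int(re.match(r"\d+", e).group()) for e in entries]


def cuda_minimum_arch(cuda_architectures: str) -> int:
    """Model __CUDA_MINIMUM_ARCH__: one value for the whole build, not per pass."""
    return min(parse_archs(cuda_architectures)) * 10


def tma_intrinsics_exposed(cuda_architectures: str) -> bool:
    """Model the exposure condition quoted above (>= 900)."""
    return cuda_minimum_arch(cuda_architectures) >= 900


if __name__ == "__main__":
    for arch_list in ("80;86;89;90", "90", "90;100"):
        print(arch_list, cuda_minimum_arch(arch_list), tma_intrinsics_exposed(arch_list))
```

Because the modelled value depends only on the lowest entry, adding 90 to a list that already contains 80 changes nothing, which is why the sm_90 pass of the fat build loses the intrinsics as well.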
Problem
The slim CUDA Docker image (added in #363) no longer builds on `main`. A fresh e2e rebuild on yecl-gpu-02 (RTX 4090, CUDA 12.9) fails compiling `src/ops/flashinfer_mamba.cu` with "namespace cuda::device::experimental has no member ...".

Routine CI never caught it: the `driver-cuda` job is gated behind a self-hosted runner + `ci-cuda` label, so the CUDA kernels are not compiled in normal PR CI.

Root cause
The SM90 SSU vertical/horizontal kernels (new in v0.4.0; absent at the #363 merge) use CCCL's experimental TMA intrinsics, which `<cuda/barrier>` exposes only behind `__cccl_lib_experimental_ctk12_cp_async_exposure`, a macro defined solely when `__CUDA_MINIMUM_ARCH__ >= 900`. `__CUDA_MINIMUM_ARCH__` is the minimum of the whole `CUDA_ARCHITECTURES` list and is constant across device passes, so the fat `80;86;89;90` Docker build pins it to 800 and `#ifdef`s the intrinsics out of every pass, even the sm_90 one.

The existing gate set `FLASHINFER_MAMBA_ENABLE_SM90` whenever any targeted arch was 90/100, a condition under which a mixed build can never compile.

Fix
Enable the SM90 SSU compile gate only when every targeted arch is sm_90+ (a pure Hopper/Blackwell build, e.g. the per-compute-capability release binaries). Mixed/fat builds fall back to the simple SSU algorithm: flashinfer's `kAuto` dispatch already selects `kSimple` on every device incl. Hopper when `FLASHINFER_MAMBA_ENABLE_SM90` is unset (verified in `invokeSelectiveStateUpdate`), so behavior is correct, just not the TMA-optimized path in a fat binary.

Hopper/FA3 attention gating is unchanged (it compiles fine in fat builds; it uses CUTLASS TMA, not the CCCL experimental gate).
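The difference between the old and new gate comes down to "any" versus "every". The sketch below models the two predicates in Python for comparison. It is not the project's build script: `old_gate` and `new_gate` are hypothetical names, and the old condition is assumed to be a membership test against 90 and 100, based only on the "90/100" wording above.

```python
def old_gate(archs: list[int]) -> bool:
    """Previous behaviour (as described): enable if ANY target is 90 or 100."""
    return any(arch in (90, 100) for arch in archs)


def new_gate(archs: list[int]) -> bool:
    """New behaviour (as described): enable only if EVERY target is sm_90+."""
    # An empty list enables nothing; all() alone would return True for it.
    return bool(archs) and all(arch >= 90 for arch in archs)


if __name__ == "__main__":
    fat_build = [80, 86, 89, 90]   # the Docker image's arch list
    pure_build = [90]              # a per-compute-capability release binary
    print("fat :", old_gate(fat_build), new_gate(fat_build))
    print("pure:", old_gate(pure_build), new_gate(pure_build))
```

For the fat list the old predicate is true while the new one is false, so under the new rule the define is simply left unset and the build takes the fallback path.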
Verification
e2e on yecl-gpu-02 (RTX 4090, driver 575.57.08, CUDA 12.9), current `main` + this fix:

- `pie doctor` → GPU 0 detected, `cuda_native` compiled in
- `pie run text-completion --prompt "The capital of France is"` → `The capital of France is **Paris**.`