Motivation
Currently, Alpha-MoE supports FP8 quantization with specific granularity requirements:
Activations (x): Expects per-token scale factors.
Weights (w, w2): Expects interleaved scales (implied per-channel or block-wise).
While per-token/per-channel quantization offers higher precision, per-tensor quantization (a single scalar scale for the entire tensor) is a standard format supported by many frameworks (e.g., standard FP8 GEMMs, Transformer Engine (TE), MS-AMP).
Adding support for per-tensor quantization would:
Reduce memory overhead by removing the need to store and load scale vectors.
Simplify integration with inference pipelines that rely on static, per-tensor calibration.
Improve performance by reducing global memory reads for scale factors inside the kernel.
Current Behavior
The kernel interface fused_moe_w8a8_up_down in alpha_moe_ops.py defines:
x_scale: Per-token scale factors for x
w_scale: Scale factors for w # (Interleaved)
Passing a single scalar or a per-tensor scale tensor currently fails or results in undefined behavior because the kernel expects specific dimensions for loading and broadcasting.
Proposed Change
Update fused_moe_w8a8_up_down and the underlying CUDA megakernel to support per-tensor scales.
The logic should ideally:
Allow x_scale to be a scalar or tensor of shape [1].
Allow w_scale / w2_scale to be scalars or shape [1].
Update the CUDA kernel to detect the scalar shape (or receive a stride of 0) so that it broadcasts the single scale factor across the GEMM operation instead of loading scale vectors.
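The stride-0 idea from the last point can be modeled in plain Python. This is only an illustrative sketch, not the megakernel's code: `scale_stride`, `load_scale` and `dequantize_rows` are hypothetical helper names, and the real kernel would do the equivalent index arithmetic on device pointers.

```python
def scale_stride(scales):
    """Return the element stride used to index a scale buffer.

    A single-element buffer is treated as a per-tensor scale and gets
    stride 0, so every row reads the same value. Anything longer is
    treated as one scale per row (stride 1).
    """
    return 0 if len(scales) == 1 else 1


def load_scale(scales, row, stride):
    # With stride == 0 this always reads scales[0], i.e. a broadcast.
    return scales[row * stride]


def dequantize_rows(x_q, x_scale):
    """Multiply each quantized row by its scale (hypothetical reference)."""
    stride = scale_stride(x_scale)
    return [
        [v * load_scale(x_scale, row, stride) for v in values]
        for row, values in enumerate(x_q)
    ]


x_q = [[1.0, 2.0], [3.0, 4.0]]
per_token = dequantize_rows(x_q, [0.5, 0.25])  # one scale per row
per_tensor = dequantize_rows(x_q, [0.5])       # single broadcast scale
```

Passing the stride from the host side keeps the load path identical for both layouts, so the kernel would not need a separate branch per granularity. Whether that fits the existing interleaved weight-scale layout is an open question.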
Implementation Plan
[ ] Modify alpha_moe_ops.py bindings to accept scalar/single-element tensors for scales.
[ ] Update the C++ / CUDA kernel logic to handle per-tensor broadcasting for x_scale and w_scale.
[ ] Add a unit test in test/ comparing per-tensor quantization output against a reference implementation (or validating against the per-token path with constant scales).
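The test in the last item could follow the shape below. It is a pure-Python sketch built on a hypothetical reference function `scaled_matmul`, and it deliberately does not call `fused_moe_w8a8_up_down`, whose full signature is not given here. A real test would replace the reference call with the kernel and compare within a tolerance.

```python
def scaled_matmul(x_q, w_q, x_scale, w_scale):
    """Reference x @ w.T with scales applied after accumulation.

    x_q is [tokens][k] and w_q is [n][k]. Each scale list holds either
    one entry per row (per-token / per-channel) or a single entry
    (per-tensor), which is then broadcast.
    """
    def pick(scales, i):
        return scales[0] if len(scales) == 1 else scales[i]

    out = []
    for t, x_row in enumerate(x_q):
        out_row = []
        for n, w_row in enumerate(w_q):
            acc = sum(a * b for a, b in zip(x_row, w_row))
            out_row.append(acc * pick(x_scale, t) * pick(w_scale, n))
        out.append(out_row)
    return out


def test_per_tensor_matches_constant_per_token():
    x_q = [[1.0, -2.0, 3.0], [4.0, 0.0, -1.0]]
    w_q = [[2.0, 1.0, 0.0], [-1.0, 3.0, 2.0]]
    s_x, s_w = 0.5, 0.25

    # Per-tensor path: a single scale for each operand.
    per_tensor = scaled_matmul(x_q, w_q, [s_x], [s_w])
    # Per-token / per-channel path fed with constant scales.
    per_token = scaled_matmul(x_q, w_q, [s_x] * len(x_q), [s_w] * len(w_q))

    assert per_tensor == per_token
    return per_tensor


result = test_per_tensor_matches_constant_per_token()
```

Comparing against the per-token path with constant scales checks the new broadcast logic against a code path that already works, without needing a separate reference implementation.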
Additional Context
Repo: Aleph-Alpha/Alpha-MoE
Relevant File: alpha_moe_ops.py (Python interface) / csrc/ (CUDA kernels)
Profiling results
Based on the initial implementation, I saw a 3-8% improvement in inter-token latency compared to vLLM 0.13.0 for static quantization. I also validated that the quality tests succeed, with a mean tensor diff < 1% ar e-3.