[fp8] Adding support for per tensor quantization #1

@hbikki

Motivation

Currently, Alpha-MoE supports FP8 quantization with specific granularity requirements:

Activations (x): Expects per-token scale factors.

Weights (w, w2): Expects interleaved scales (implied per-channel or block-wise).

While per-token/per-channel quantization offers higher precision, per-tensor quantization (a single scalar scale for the entire tensor) is a standard format supported by many frameworks (e.g., standard FP8 GEMMs, TE, MS-AMP).
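To make the difference in granularity concrete, here is a minimal pure-Python sketch that computes both kinds of scale for a small activation matrix. It assumes the FP8 E4M3 format (largest finite value 448) and the common convention scale = amax / 448. The helper names are hypothetical and are not part of Alpha-MoE.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)

def per_token_scales(x):
    """One scale per row (token): the row's amax divided by the FP8 max."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]

def per_tensor_scale(x):
    """A single scale for the whole tensor: the global amax divided by the FP8 max."""
    return max(abs(v) for row in x for v in row) / FP8_E4M3_MAX

# Two tokens with very different dynamic ranges.
x = [[1.0, -2.0, 0.5],
     [448.0, 3.0, -896.0]]

token_scales = per_token_scales(x)   # one value per token
tensor_scale = per_tensor_scale(x)   # one value for everything
```

The per-tensor scale always equals the largest per-token scale, so tokens with a small dynamic range lose resolution. That is the precision trade-off mentioned above.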

Adding support for per-tensor quantization would:

Reduce memory overhead by removing the need to store and load scale vectors.

Simplify integration with inference pipelines that rely on static, per-tensor calibration.

Improve performance by reducing global memory reads for scale factors inside the kernel.

Current Behavior

The kernel interface fused_moe_w8a8_up_down in alpha_moe_ops.py defines:

x_scale: Per-token scale factors for x
w_scale: Scale factors for w # (Interleaved)

Passing a single scalar or a per-tensor scale tensor currently fails or results in undefined behavior because the kernel expects specific dimensions for loading and broadcasting.

Proposed Change

Update fused_moe_w8a8_up_down and the underlying CUDA megakernel to support per-tensor scales.

The logic should ideally:

Allow x_scale to be a scalar or tensor of shape [1].

Allow w_scale / w2_scale to be scalars or shape [1].

Update the CUDA kernel to detect the scalar shape (or receive a stride of 0) to broadcast the single scale factor across the GEMM operation instead of loading vectors.
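The shape detection and stride-0 broadcast described above could look roughly like the following. This is a pure-Python sketch of the indexing logic only, not the actual binding or kernel code; `scale_stride` and `load_scale` are hypothetical names, and plain lists stand in for tensors.

```python
def scale_stride(scale, expected_len):
    """Pick the element stride used when indexing `scale`.

    0 means per-tensor: every index reads element 0.
    1 means per-token / per-channel: index i reads element i.
    """
    if len(scale) == 1:
        return 0
    if len(scale) == expected_len:
        return 1
    raise ValueError(
        f"scale has {len(scale)} elements, expected 1 or {expected_len}"
    )

def load_scale(scale, stride, index):
    """Mimic the kernel-side load: a stride of 0 broadcasts element 0."""
    return scale[index * stride]

num_tokens = 4
per_tensor = [0.5]                  # shape [1]
per_token = [0.1, 0.2, 0.3, 0.4]    # shape [num_tokens]

s0 = scale_stride(per_tensor, num_tokens)
s1 = scale_stride(per_token, num_tokens)
broadcast = [load_scale(per_tensor, s0, i) for i in range(num_tokens)]
gathered = [load_scale(per_token, s1, i) for i in range(num_tokens)]
```

Passing the stride instead of branching on shape keeps a single load expression for both granularities. A real kernel would more likely load the scalar once into a register rather than re-read it per element.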

Implementation Plan

[ ] Modify alpha_moe_ops.py bindings to accept scalar/single-element tensors for scales.

[ ] Update the C++ / CUDA kernel logic to handle per-tensor broadcasting for x_scale and w_scale.

[ ] Add a unit test in test/ comparing per-tensor quantization output against a reference implementation (or validating against the per-token path with constant scales).
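The last item, validating against the per-token path with constant scales, can be sketched as below. Since the exact signature of fused_moe_w8a8_up_down is not shown here, the sketch uses a hypothetical pure-Python reference `scaled_gemm` with small integers standing in for quantized values. A real test would call the kernel instead and compare with a tolerance.

```python
def scaled_gemm(xq, wq, x_scale, w_scale):
    """Reference dequantized GEMM.

    out[i][j] = x_scale[i] * w_scale[j] * sum_k xq[i][k] * wq[j][k]
    A scale list of length 1 is treated as per-tensor and broadcast.
    """
    def at(scale, i):
        return scale[0] if len(scale) == 1 else scale[i]

    return [
        [at(x_scale, i) * at(w_scale, j) * sum(a * b for a, b in zip(x_row, w_row))
         for j, w_row in enumerate(wq)]
        for i, x_row in enumerate(xq)
    ]

def test_per_tensor_matches_constant_per_token():
    xq = [[1, -2, 3], [4, 0, -1]]                        # [tokens, K]
    wq = [[2, 1, 0], [-1, 3, 2], [0, 0, 5], [1, 1, 1]]   # [N, K]
    sx, sw = 0.25, 0.5
    # Per-tensor path: a single-element scale for each operand.
    per_tensor = scaled_gemm(xq, wq, [sx], [sw])
    # Per-token / per-channel path fed with the same constant everywhere.
    per_token = scaled_gemm(xq, wq, [sx] * len(xq), [sw] * len(wq))
    assert per_tensor == per_token
    return per_tensor

result = test_per_tensor_matches_constant_per_token()
```

If the two paths disagree for constant scales, the bug is in the broadcast logic rather than in quantization accuracy, which makes this check a useful first gate before comparing against a higher-precision reference.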

Additional Context

Repo: Aleph-Alpha/Alpha-MoE

Relevant File: alpha_moe_ops.py (Python interface) / csrc/ (CUDA kernels)

Profiling results

Based on the initial implementation, I saw a 3-8% improvement in inter-token latency compared to vLLM 0.13.0 for static quantization. I also validated that the quality tests succeed, with a mean tensor diff < 1% ar e-3.
