[fp8] Adding support for per tensor quantization


**Motivation**

Currently, Alpha-MoE supports FP8 quantization with specific granularity requirements:

    Activations (x): Expects per-token scale factors.

    Weights (w, w2): Expects interleaved scales (implied per-channel or block-wise).

While per-token/per-channel quantization offers higher precision, per-tensor quantization (a single scalar scale for the entire tensor) is a standard format supported by many frameworks (e.g., standard FP8 GEMMs, TE, MS-AMP).

Adding support for per-tensor quantization would:

    Reduce memory overhead by removing the need to store and load scale vectors.

    Simplify integration with inference pipelines that rely on static, per-tensor calibration.

    Improve performance by reducing global memory reads for scale factors inside the kernel.
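
To make the difference in granularity concrete, here is a minimal, framework-free sketch of how the two kinds of scale could be computed. It is illustrative only: the helper names are hypothetical and not part of Alpha-MoE, and it assumes the FP8 E4M3 format, whose largest finite value is 448.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def per_token_scales(x):
    """Hypothetical helper: one scale per row (token), amax(row) / FP8 max."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]


def per_tensor_scale(x):
    """Hypothetical helper: a single scale for the whole tensor."""
    return max(abs(v) for row in x for v in row) / FP8_E4M3_MAX


# Two tokens with three features each.
x = [[0.5, -2.0, 1.0], [8.0, -4.0, 0.25]]

token_scales = per_token_scales(x)  # one value per token
tensor_scale = per_tensor_scale(x)  # one value shared by every token
```

The per-tensor scale equals the largest per-token scale, so tokens with small magnitudes lose some resolution. In exchange, the kernel needs one scalar rather than a vector of scales.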

**Current Behavior**

The kernel interface fused_moe_w8a8_up_down in alpha_moe_ops.py defines:

x_scale: Per-token scale factors for x
w_scale: Scale factors for w  # (Interleaved)

Passing a single scalar or a per-tensor scale tensor currently fails or results in undefined behavior because the kernel expects specific dimensions for loading and broadcasting.

**Proposed Change**

Update fused_moe_w8a8_up_down and the underlying CUDA megakernel to support per-tensor scales.

**The logic should ideally:**

    Allow x_scale to be a scalar or tensor of shape [1].

    Allow w_scale / w2_scale to be scalars or shape [1].

    Update the CUDA kernel to detect the scalar shape (or receive a stride of 0) to broadcast the single scale factor across the GEMM operation instead of loading vectors.
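
The stride-0 idea can be sketched in plain Python. This is not the Alpha-MoE kernel: `scale_stride` and `dequantize_rows` are hypothetical names, and the sketch only models how a single indexing expression could serve both per-token and per-tensor scales.

```python
def scale_stride(num_scales, num_rows):
    """Return the stride used to index the scale buffer.

    A stride of 0 makes every row read element 0 (per-tensor broadcast);
    a stride of 1 gives each row its own scale (per-token).
    """
    if num_scales == 1:
        return 0
    if num_scales == num_rows:
        return 1
    raise ValueError(
        f"expected 1 or {num_rows} scale values, got {num_scales}"
    )


def dequantize_rows(q, scales):
    """Multiply each row of q by its scale, using the same index math for both layouts."""
    stride = scale_stride(len(scales), len(q))
    return [[v * scales[r * stride] for v in row] for r, row in enumerate(q)]


q = [[1.0, 2.0], [3.0, 4.0]]
per_tensor = dequantize_rows(q, [0.5])       # both rows use scales[0]
per_token = dequantize_rows(q, [0.5, 2.0])   # row r uses scales[r]
```

With this approach the inner loop has no branch on the scale layout. A real kernel could go further and load the scalar once into a register instead of re-reading it for every row.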

**Implementation Plan**

    [ ] Modify alpha_moe_ops.py bindings to accept scalar/single-element tensors for scales.

    [ ] Update the C++ / CUDA kernel logic to handle per-tensor broadcasting for x_scale and w_scale.

    [ ] Add a unit test in test/ comparing per-tensor quantization output against a reference implementation (or validating against the per-token path with constant scales).
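
The last item, validating against the per-token path with constant scales, could look roughly like the sketch below. It uses a pure-Python stand-in for the GEMM rather than the real `fused_moe_w8a8_up_down` call, whose exact signature is not assumed here; `reference_gemm` and the test function are hypothetical.

```python
def reference_gemm(xq, wq, x_scales, w_scale):
    """Stand-in dequantizing GEMM.

    out[m][n] = (sum_k xq[m][k] * wq[k][n]) * x_scale[m] * w_scale,
    where x_scales holds either one value (per-tensor) or one per row.
    """
    x_stride = 0 if len(x_scales) == 1 else 1
    k_dim, n_dim = len(wq), len(wq[0])
    out = []
    for m, row in enumerate(xq):
        s = x_scales[m * x_stride] * w_scale
        out.append(
            [sum(row[k] * wq[k][n] for k in range(k_dim)) * s for n in range(n_dim)]
        )
    return out


def test_per_tensor_matches_constant_per_token():
    xq = [[1.0, -2.0], [3.0, 4.0], [0.0, 5.0]]  # "quantized" activations
    wq = [[2.0, 0.0], [1.0, -1.0]]              # "quantized" weights
    x_scale, w_scale = 0.125, 0.5

    per_tensor = reference_gemm(xq, wq, [x_scale], w_scale)
    per_token = reference_gemm(xq, wq, [x_scale] * len(xq), w_scale)

    # A per-tensor scale must behave like a per-token vector of equal values.
    assert per_tensor == per_token
    return per_tensor


result = test_per_tensor_matches_constant_per_token()
```

In the real test, the two calls would go through the fused kernel, and the comparison would likely need a tolerance rather than exact equality.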

**Additional Context**

    Repo: Aleph-Alpha/Alpha-MoE

    Relevant File: alpha_moe_ops.py (Python interface) / csrc/ (CUDA kernels)

**Profiling results**

_With the initial implementation, I saw a 3-8% improvement in **inter-token latency** compared to vLLM 0.13.0 for static quantization. I also validated that the quality tests pass, with a mean tensor diff below 1% (around e-3)._


