fix(fp8): patch FP8 expert scale_inv after layerwise reload by aoshen02 · Pull Request #248 · vllm-project/vime

aoshen02 · 2026-06-14T03:12:40Z

Summary

Root cause: In vLLM 0.22 layerwise reload, get_layer_size() counts the FP8 expert scale_inv numel in load_numel_total, but FusedMoE's weight_loader is never invoked for these tensors during online_process_loader replay. The scale buffers, allocated with torch.empty, are therefore never written and keep uninitialized memory (observed as NaN), causing all FP8 MoE rollout requests to fail with HTTP 400 (out of range float).
Fix: Worker extension (vLLMColocateWorkerExtension) stashes trainer-produced scale_inv tensors during update_weights_chunk. After finish_weight_update (which calls finalize_layerwise_reload), a new vime_apply_fp8_scales RPC copies the correct scales into FusedMoE fused buffers (w13_weight_scale_inv / w2_weight_scale_inv).
Evidence chain: 3-layer probe system (NANPROBE → NANTRACE → NANTRACE2) confirmed: source weights clean, converter output clean, n_scale_loads=0 in layerwise, NaN at POST-REPLAY stage. Full analysis in agent_run/reports/fp8-nan-bug-deep-dive.md.
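
The "out of range float" wording in the root-cause note matches the error that Python's standard json module raises when asked to serialize NaN in strict mode. The source does not say which component produces the HTTP 400, so the following is only a minimal sketch of that stdlib behavior with a hypothetical payload; it is not vLLM's or vime's serving code.

```python
import json
import math

# Hypothetical response payload: the NaN stands in for a value computed from a bad scale.
payload = {"logprob": float("nan")}


def strict_dumps(obj):
    """Serialize with allow_nan=False, the strict JSON mode that rejects NaN and Infinity."""
    return json.dumps(obj, allow_nan=False)


try:
    strict_dumps(payload)
    error_text = None
except ValueError as exc:
    # The message begins with "Out of range float values are not JSON compliant".
    error_text = str(exc)

# Finite values serialize normally.
clean_text = strict_dumps({"logprob": -0.25})
```

If the failing path does go through a strict JSON encoder, a single NaN anywhere in the response is enough to reject the whole request, which would be consistent with every FP8 MoE rollout request failing.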

Changes

update_weight_from_tensor.py: Add _apply_fp8_expert_scales() helper + scale stashing in update_weights_chunk + vime_apply_fp8_scales worker-ext method + trainer-side RPC call after finish_weight_update
vllm_engine.py: Add vime_apply_fp8_scales() HTTP route to collective_rpc
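
The ordering these changes rely on (stash during update_weights_chunk, apply only after finish_weight_update) can be sketched without any tensor library. This is a rough illustration under stated assumptions: the class name, the name filter, and the copy_fn callback are hypothetical, the tensor names follow an assumed HF per-expert naming scheme, and tensors are treated as opaque objects. It is not the PR's actual worker extension.

```python
class ScaleStashingWorker:
    """Hypothetical stand-in for a worker extension; tensors are opaque objects here."""

    def __init__(self):
        self._stashed = []          # (name, tensor) pairs waiting for the reload to finish
        self._reload_done = False

    def update_weights_chunk(self, named_tensors):
        """Stash FP8 expert scales from one chunk; the normal load path is omitted."""
        self._reload_done = False
        for name, tensor in named_tensors:
            if ".experts." in name and name.endswith(".weight_scale_inv"):
                self._stashed.append((name, tensor))

    def finish_weight_update(self):
        """Marks the point where the layerwise reload would be finalized."""
        self._reload_done = True

    def apply_fp8_scales(self, copy_fn):
        """Hand each stashed scale to copy_fn, clear the stash, return the count."""
        if not self._reload_done:
            raise RuntimeError("apply_fp8_scales must run after finish_weight_update")
        applied = 0
        for name, tensor in self._stashed:
            copy_fn(name, tensor)
            applied += 1
        self._stashed.clear()
        return applied


written = {}
worker = ScaleStashingWorker()
worker.update_weights_chunk([
    ("model.layers.0.mlp.experts.0.gate_proj.weight", "w0"),
    ("model.layers.0.mlp.experts.0.gate_proj.weight_scale_inv", "s0"),
])
worker.update_weights_chunk([
    ("model.layers.0.mlp.experts.0.down_proj.weight_scale_inv", "s1"),
])
worker.finish_weight_update()
n_applied = worker.apply_fp8_scales(written.__setitem__)

# A model without FP8 expert scales stashes nothing, so the apply step is a no-op.
bf16_worker = ScaleStashingWorker()
bf16_worker.update_weights_chunk([("model.layers.0.mlp.experts.0.gate_proj.weight", "w0")])
bf16_worker.finish_weight_update()
n_noop = bf16_worker.apply_fp8_scales(written.__setitem__)
```

Clearing the stash after each apply keeps a later weight update from re-applying stale scales, and the empty-stash case corresponds to the no-op behavior the test plan expects for bf16 and dense models.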

Test plan

FP8 Qwen3-30B-A3B rollout smoke on GB300: 0 NaN, throughput ≥ 160 tok/s/gpu
bf16 Qwen3-30B-A3B (no FP8): verify no regression (scale list empty → no-op)
Dense model (Qwen3-4B): verify no regression (no FusedMoE buffers → no-op)

🤖 Generated with Claude Code

read-the-docs-community · 2026-06-14T03:13:32Z

Documentation build overview

📚 vime | 🛠️ Build #33129928 | 📁 Comparing 2be0933 against latest (fa0b6e9)

🔍 Preview build

26 files changed · ± 26 modified

gemini-code-assist

Code Review

This pull request introduces a workaround for a vLLM layerwise reload bug where FP8 expert scale_inv tensors are silently dropped. It stashes these scales during weight updates and applies them to FusedMoE buffers afterwards. The review feedback highlights a critical bug in the slicing logic where target.shape[-2] is incorrectly used instead of target.shape[-1], which would cause runtime errors. Additionally, it is recommended to use public PyTorch APIs (named_buffers and named_parameters) instead of accessing private attributes directly.

gemini-code-assist · 2026-06-14T03:14:01Z

+        if proj in ("gate_proj", "up_proj"):
+            n_slice = target.shape[-2] // 2
+            offset = 0 if proj == "gate_proj" else n_slice
+            target.data[expert_id, offset:offset + scale_dev.shape[0]] = scale_dev
+        else:
+            target.data[expert_id] = scale_dev


There is a critical bug here: target.shape[-2] is used to compute n_slice. For a 2D scale tensor of shape [num_experts, 2] (per-tensor) or [num_experts, 2 * intermediate_size] (per-channel), target.shape[-2] corresponds to num_experts (e.g., 64), not the fused dimension size (2 or 2 * intermediate_size). This will cause an IndexError or incorrect slicing at runtime. Using target.shape[-1] correctly targets the fused dimension. Additionally, using .copy_() is more idiomatic and safer for in-place tensor updates in PyTorch.

Suggested change

-        if proj in ("gate_proj", "up_proj"):
-            n_slice = target.shape[-2] // 2
-            offset = 0 if proj == "gate_proj" else n_slice
-            target.data[expert_id, offset:offset + scale_dev.shape[0]] = scale_dev
-        else:
-            target.data[expert_id] = scale_dev
+        if proj in ("gate_proj", "up_proj"):
+            n_slice = target.shape[-1] // 2
+            offset = 0 if proj == "gate_proj" else n_slice
+            target.data[expert_id, offset:offset + scale_dev.shape[0]].copy_(scale_dev)
+        else:
+            target.data[expert_id].copy_(scale_dev)
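
Which axis holds the fused gate/up dimension depends on the scale buffer's layout, so the choice between shape[-2] and shape[-1] can be checked on shapes alone. A minimal sketch, using plain tuples in place of tensors: the helper name and all sizes are hypothetical, and the 3D block-scale layout is an assumption added for illustration that the review does not discuss.

```python
def fused_half_slice(target_shape, proj, fused_axis):
    """Return (start, stop) of the gate or up half along the fused axis."""
    fused = target_shape[fused_axis]
    if fused % 2 != 0:
        raise ValueError(f"fused dimension {fused} cannot be split in two")
    n_slice = fused // 2
    start = 0 if proj == "gate_proj" else n_slice
    return start, start + n_slice


# 2D layout described in the review: [num_experts, 2 * intermediate_size].
per_channel = (64, 2 * 768)
good_2d = fused_half_slice(per_channel, "up_proj", fused_axis=-1)
bad_2d = fused_half_slice(per_channel, "up_proj", fused_axis=-2)   # reads num_experts

# Assumed 3D layout: [num_experts, 2 * row_blocks, col_blocks].
block_scaled = (64, 2 * 6, 16)
good_3d = fused_half_slice(block_scaled, "up_proj", fused_axis=-2)
```

With the 2D layout, axis -2 is the expert dimension, so the offset it yields is unrelated to the projection size. With a 3D layout the same axis would be the fused one. Checking target.dim() before slicing is one way to guard against picking the wrong axis.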

gemini-code-assist · 2026-06-14T03:14:01Z

+    for name, module in model.named_modules():
+        for bname, buf in module._buffers.items():
+            if buf is not None and "weight_scale_inv" in bname:
+                fused_moe_buffers[f"{name}.{bname}"] = buf
+        for pname, param in module._parameters.items():
+            if param is not None and "weight_scale_inv" in pname:
+                fused_moe_buffers[f"{name}.{pname}"] = param


Accessing private attributes like _buffers and _parameters directly is discouraged in PyTorch. It is more idiomatic and robust to use the public APIs named_buffers(recurse=False) and named_parameters(recurse=False). These public APIs also guarantee that the returned buffers and parameters are not None, allowing us to simplify the conditional checks.

Suggested change

-    for name, module in model.named_modules():
-        for bname, buf in module._buffers.items():
-            if buf is not None and "weight_scale_inv" in bname:
-                fused_moe_buffers[f"{name}.{bname}"] = buf
-        for pname, param in module._parameters.items():
-            if param is not None and "weight_scale_inv" in pname:
-                fused_moe_buffers[f"{name}.{pname}"] = param
+    for name, module in model.named_modules():
+        for bname, buf in module.named_buffers(recurse=False):
+            if "weight_scale_inv" in bname:
+                fused_moe_buffers[f"{name}.{bname}"] = buf
+        for pname, param in module.named_parameters(recurse=False):
+            if "weight_scale_inv" in pname:
+                fused_moe_buffers[f"{name}.{pname}"] = param
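
The claim that the public API drops None entries can be checked on a toy module. A small sketch, assuming PyTorch is installed; the Demo class and its buffer contents are hypothetical and only borrow the buffer names from this PR.

```python
import torch
from torch import nn


class Demo(nn.Module):
    """Toy module with one real scale buffer and one registered-but-unset buffer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("w13_weight_scale_inv", torch.ones(2, 4))
        self.register_buffer("w2_weight_scale_inv", None)  # registered, never filled
        self.weight = nn.Parameter(torch.zeros(1))


m = Demo()

# The private dict keeps the None entry, hence the explicit `is not None` check.
private_names = list(m._buffers.keys())

# The public iterator skips None entries, so the loop needs only the name filter.
found = {}
for bname, buf in m.named_buffers(recurse=False):
    if "weight_scale_inv" in bname:
        found[bname] = buf
public_names = list(found.keys())
```

Passing recurse=False matters here because the outer loop already walks model.named_modules(); without it, each buffer would be visited once per ancestor module.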

vLLM 0.22's layerwise reload counts FP8 expert scale_inv numel in get_layer_size() but the FusedMoE weight_loader never receives these tensors during online_process_loader replay. This leaves torch.empty-allocated scale buffers with uninitialized NaN, causing all FP8 MoE rollout generate requests to fail (HTTP 400, out-of-range float).

Fix: stash trainer-produced scale_inv tensors in the worker extension during update_weights_chunk, then apply them via a new vime_apply_fp8_scales RPC after finish_weight_update completes layerwise finalize. The scales are mapped from HF per-expert names (gate_proj/up_proj → w13, down_proj → w2) into FusedMoE fused buffers.

Root cause analysis: agent_run/reports/fp8-nan-bug-deep-dive.md

Verified: FP8 rollout 3x clean, 167 tok/s/gpu, 0 NaN on GB300.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
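
The name mapping described in the commit message (gate_proj/up_proj → w13, down_proj → w2) could look roughly like the sketch below. The regular expression, the function name, and the example tensor names are assumptions about the HF per-expert naming scheme; they are not taken from the PR's code.

```python
import re

# Assumed HF-style name: <prefix>.experts.<id>.<proj>.weight_scale_inv
_EXPERT_SCALE_RE = re.compile(
    r"^(?P<prefix>.+\.experts)\.(?P<expert_id>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\.weight_scale_inv$"
)
_PROJ_TO_FUSED = {"gate_proj": "w13", "up_proj": "w13", "down_proj": "w2"}


def map_expert_scale_name(hf_name):
    """Map a per-expert scale name to (fused_buffer_name, expert_id, proj), or None."""
    match = _EXPERT_SCALE_RE.match(hf_name)
    if match is None:
        return None  # not an expert scale tensor, leave it to the normal load path
    proj = match.group("proj")
    fused_name = f"{match.group('prefix')}.{_PROJ_TO_FUSED[proj]}_weight_scale_inv"
    return fused_name, int(match.group("expert_id")), proj


up = map_expert_scale_name("model.layers.0.mlp.experts.3.up_proj.weight_scale_inv")
down = map_expert_scale_name("model.layers.0.mlp.experts.3.down_proj.weight_scale_inv")
other = map_expert_scale_name("model.layers.0.self_attn.q_proj.weight_scale_inv")
```

Returning the projection alongside the fused name lets the caller decide which half of the w13 buffer to write, since gate_proj and up_proj share one destination.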

gemini-code-assist Bot reviewed Jun 14, 2026

aoshen02 force-pushed the fix/fp8-expert-scale-patch branch from 2be0933 to 8f99b1e on June 16, 2026, 03:13

aoshen02 wants to merge 1 commit into main from fix/fp8-expert-scale-patch
