
fix(fp8): patch FP8 expert scale_inv after layerwise reload#248

Open
aoshen02 wants to merge 1 commit into main from fix/fp8-expert-scale-patch

@aoshen02

Collaborator

Summary

  • Root cause: in vLLM 0.22 layerwise reload, get_layer_size() counts the numel of the FP8 expert scale_inv tensors in load_numel_total, but FusedMoE's weight_loader is never invoked for these tensors during online_process_loader replay. The scale buffers, allocated with torch.empty, therefore keep their uninitialized contents (observed as NaN), and every FP8 MoE rollout request fails with HTTP 400 (out of range float).
  • Fix: Worker extension (vLLMColocateWorkerExtension) stashes trainer-produced scale_inv tensors during update_weights_chunk. After finish_weight_update (which calls finalize_layerwise_reload), a new vime_apply_fp8_scales RPC copies the correct scales into FusedMoE fused buffers (w13_weight_scale_inv / w2_weight_scale_inv).
  • Evidence chain: 3-layer probe system (NANPROBE → NANTRACE → NANTRACE2) confirmed: source weights clean, converter output clean, n_scale_loads=0 in layerwise, NaN at POST-REPLAY stage. Full analysis in agent_run/reports/fp8-nan-bug-deep-dive.md.
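
The "out of range float" wording matches the error Python's standard json module raises when it is asked to serialize NaN under strict JSON rules. The following is a minimal sketch of that failure mode, assuming (the PR does not confirm this) that the rollout response path serializes floats strictly; encode_logprob is a hypothetical helper, not code from this PR.

```python
import json
import math


def encode_logprob(value: float) -> str:
    """Serialize one float under strict JSON rules (hypothetical helper).

    allow_nan=False makes json.dumps reject NaN and +/-Infinity instead of
    emitting the non-standard tokens NaN / Infinity.
    """
    return json.dumps({"logprob": value}, allow_nan=False)


# A finite value serializes normally.
print(encode_logprob(-0.25))

# A NaN that leaked out of an uninitialized scale buffer is rejected.
try:
    encode_logprob(math.nan)
except ValueError as exc:
    print(f"rejected: {exc}")
```

If the server behaves this way, a single NaN in any returned float is enough to turn an otherwise successful generation into an error response.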

Changes

  • update_weight_from_tensor.py: Add _apply_fp8_expert_scales() helper + scale stashing in update_weights_chunk + vime_apply_fp8_scales worker-ext method + trainer-side RPC call after finish_weight_update
  • vllm_engine.py: Add vime_apply_fp8_scales() HTTP route to collective_rpc
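
The name mapping that the helper performs (gate_proj/up_proj → w13, down_proj → w2) can be sketched as follows. This is an illustration only: the regular expression assumes HF-style names of the form `<prefix>.experts.<id>.<proj>.weight_scale_inv`, and map_expert_scale and ScaleTarget are hypothetical names, not the ones used in update_weight_from_tensor.py.

```python
import re
from typing import NamedTuple, Optional

# Assumed HF-style per-expert scale name; the exact pattern is not shown in this PR.
_EXPERT_SCALE_RE = re.compile(
    r"^(?P<prefix>.+\.experts)\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\.weight_scale_inv$"
)


class ScaleTarget(NamedTuple):
    fused_name: str  # name of the fused FusedMoE buffer to patch
    expert_id: int   # index along the expert dimension of that buffer
    proj: str        # original projection, needed to pick the gate/up half


def map_expert_scale(hf_name: str) -> Optional[ScaleTarget]:
    """Map a per-expert scale name to its fused buffer, or None if it is not one."""
    m = _EXPERT_SCALE_RE.match(hf_name)
    if m is None:
        return None  # dense or non-expert tensor: the caller treats this as a no-op
    fused = "w2" if m["proj"] == "down_proj" else "w13"
    return ScaleTarget(
        f"{m['prefix']}.{fused}_weight_scale_inv", int(m["expert"]), m["proj"]
    )


print(map_expert_scale("model.layers.3.mlp.experts.17.up_proj.weight_scale_inv"))
print(map_expert_scale("model.layers.3.self_attn.q_proj.weight_scale_inv"))
```

Returning None for non-matching names is what makes the bf16 and dense cases in the test plan a no-op: nothing is stashed, so nothing is applied.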

Test plan

  • FP8 Qwen3-30B-A3B rollout smoke on GB300: 0 NaN, throughput ≥ 160 tok/s/gpu
  • bf16 Qwen3-30B-A3B (no FP8): verify no regression (scale list empty → no-op)
  • Dense model (Qwen3-4B): verify no regression (no FusedMoE buffers → no-op)

🤖 Generated with Claude Code


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a workaround for a vLLM layerwise reload bug where FP8 expert scale_inv tensors are silently dropped. It stashes these scales during weight updates and applies them to FusedMoE buffers afterwards. The review feedback highlights a critical bug in the slicing logic where target.shape[-2] is incorrectly used instead of target.shape[-1], which would cause runtime errors. Additionally, it is recommended to use public PyTorch APIs (named_buffers and named_parameters) instead of accessing private attributes directly.

Comment on lines +439 to +444
    if proj in ("gate_proj", "up_proj"):
        n_slice = target.shape[-2] // 2
        offset = 0 if proj == "gate_proj" else n_slice
        target.data[expert_id, offset:offset + scale_dev.shape[0]] = scale_dev
    else:
        target.data[expert_id] = scale_dev


critical

There is a critical bug here: target.shape[-2] is used to compute n_slice. For a 2D scale tensor of shape [num_experts, 2] (per-tensor) or [num_experts, 2 * intermediate_size] (per-channel), target.shape[-2] corresponds to num_experts (e.g., 64), not the fused dimension size (2 or 2 * intermediate_size). This will cause an IndexError or incorrect slicing at runtime. Using target.shape[-1] correctly targets the fused dimension. Additionally, using .copy_() is more idiomatic and safer for in-place tensor updates in PyTorch.

Suggested change
-    if proj in ("gate_proj", "up_proj"):
-        n_slice = target.shape[-2] // 2
-        offset = 0 if proj == "gate_proj" else n_slice
-        target.data[expert_id, offset:offset + scale_dev.shape[0]] = scale_dev
-    else:
-        target.data[expert_id] = scale_dev
+    if proj in ("gate_proj", "up_proj"):
+        n_slice = target.shape[-1] // 2
+        offset = 0 if proj == "gate_proj" else n_slice
+        target.data[expert_id, offset:offset + scale_dev.shape[0]].copy_(scale_dev)
+    else:
+        target.data[expert_id].copy_(scale_dev)
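
Which axis holds the fused gate/up dimension depends on the rank of the scale buffer, so it may help to make that axis explicit rather than hard-code it. The sketch below is pure Python over shape tuples; gate_up_slice is a hypothetical helper, and the 3D block-wise layout [num_experts, 2 * n_blocks, k_blocks] is an assumption that this PR does not confirm. Unlike the snippet above, it takes the slice length from the target shape rather than from scale_dev.shape[0].

```python
def gate_up_slice(target_shape, proj, fused_dim):
    """Return the slice of the fused gate/up axis that `proj` occupies.

    target_shape: shape of the fused w13 scale buffer, expert axis first.
    fused_dim:    index of the axis that concatenates gate_proj and up_proj.
    """
    if proj not in ("gate_proj", "up_proj"):
        raise ValueError(f"{proj!r} is not part of the fused w13 buffer")
    half = target_shape[fused_dim] // 2
    start = 0 if proj == "gate_proj" else half
    return slice(start, start + half)


# 2D per-channel layout from the review comment: [num_experts, 2 * intermediate_size].
print(gate_up_slice((64, 2 * 768), "up_proj", fused_dim=-1))

# Assumed 3D block-wise layout: [num_experts, 2 * n_blocks, k_blocks].
print(gate_up_slice((64, 12, 16), "up_proj", fused_dim=-2))

# Using -2 on the 2D layout halves num_experts instead, which is the bug described above.
print(gate_up_slice((64, 2 * 768), "up_proj", fused_dim=-2))
```

Checking target.dim() before choosing the axis would let one code path serve both layouts, if both can occur.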

Comment on lines +407 to +413
    for name, module in model.named_modules():
        for bname, buf in module._buffers.items():
            if buf is not None and "weight_scale_inv" in bname:
                fused_moe_buffers[f"{name}.{bname}"] = buf
        for pname, param in module._parameters.items():
            if param is not None and "weight_scale_inv" in pname:
                fused_moe_buffers[f"{name}.{pname}"] = param


medium

Accessing private attributes like _buffers and _parameters directly is discouraged in PyTorch. It is more idiomatic and robust to use the public APIs named_buffers(recurse=False) and named_parameters(recurse=False). These public APIs also guarantee that the returned buffers and parameters are not None, allowing us to simplify the conditional checks.

Suggested change
-    for name, module in model.named_modules():
-        for bname, buf in module._buffers.items():
-            if buf is not None and "weight_scale_inv" in bname:
-                fused_moe_buffers[f"{name}.{bname}"] = buf
-        for pname, param in module._parameters.items():
-            if param is not None and "weight_scale_inv" in pname:
-                fused_moe_buffers[f"{name}.{pname}"] = param
+    for name, module in model.named_modules():
+        for bname, buf in module.named_buffers(recurse=False):
+            if "weight_scale_inv" in bname:
+                fused_moe_buffers[f"{name}.{bname}"] = buf
+        for pname, param in module.named_parameters(recurse=False):
+            if "weight_scale_inv" in pname:
+                fused_moe_buffers[f"{name}.{pname}"] = param

vLLM 0.22's layerwise reload counts FP8 expert scale_inv numel in
get_layer_size() but the FusedMoE weight_loader never receives these
tensors during online_process_loader replay. This leaves torch.empty-
allocated scale buffers with uninitialized NaN, causing all FP8 MoE
rollout generate requests to fail (HTTP 400, out-of-range float).

Fix: stash trainer-produced scale_inv tensors in the worker extension
during update_weights_chunk, then apply them via a new
vime_apply_fp8_scales RPC after finish_weight_update completes
layerwise finalize. The scales are mapped from HF per-expert names
(gate_proj/up_proj → w13, down_proj → w2) into FusedMoE fused buffers.

Root cause analysis: agent_run/reports/fp8-nan-bug-deep-dive.md
Verified: FP8 rollout 3x clean, 167 tok/s/gpu, 0 NaN on GB300.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@aoshen02 force-pushed the fix/fp8-expert-scale-patch branch from 2be0933 to 8f99b1e on June 16, 2026 03:13