fix(fp8): patch FP8 expert scale_inv after layerwise reload#248
fix(fp8): patch FP8 expert scale_inv after layerwise reload#248aoshen02 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a vLLM layerwise reload bug where FP8 expert scale_inv tensors are silently dropped. It stashes these scales during weight updates and applies them to FusedMoE buffers afterwards. The review feedback highlights a critical bug in the slicing logic where target.shape[-2] is incorrectly used instead of target.shape[-1], which would cause runtime errors. Additionally, it is recommended to use public PyTorch APIs (named_buffers and named_parameters) instead of accessing private attributes directly.
| if proj in ("gate_proj", "up_proj"): | ||
| n_slice = target.shape[-2] // 2 | ||
| offset = 0 if proj == "gate_proj" else n_slice | ||
| target.data[expert_id, offset:offset + scale_dev.shape[0]] = scale_dev | ||
| else: | ||
| target.data[expert_id] = scale_dev |
There was a problem hiding this comment.
There is a critical bug here: target.shape[-2] is used to compute n_slice. For a 2D scale tensor of shape [num_experts, 2] (per-tensor) or [num_experts, 2 * intermediate_size] (per-channel), target.shape[-2] corresponds to num_experts (e.g., 64), not the fused dimension size (2 or 2 * intermediate_size). This will cause an IndexError or incorrect slicing at runtime. Using target.shape[-1] correctly targets the fused dimension. Additionally, using .copy_() is more idiomatic and safer for in-place tensor updates in PyTorch.
| if proj in ("gate_proj", "up_proj"): | |
| n_slice = target.shape[-2] // 2 | |
| offset = 0 if proj == "gate_proj" else n_slice | |
| target.data[expert_id, offset:offset + scale_dev.shape[0]] = scale_dev | |
| else: | |
| target.data[expert_id] = scale_dev | |
| if proj in ("gate_proj", "up_proj"): | |
| n_slice = target.shape[-1] // 2 | |
| offset = 0 if proj == "gate_proj" else n_slice | |
| target.data[expert_id, offset:offset + scale_dev.shape[0]].copy_(scale_dev) | |
| else: | |
| target.data[expert_id].copy_(scale_dev) |
| for name, module in model.named_modules(): | ||
| for bname, buf in module._buffers.items(): | ||
| if buf is not None and "weight_scale_inv" in bname: | ||
| fused_moe_buffers[f"{name}.{bname}"] = buf | ||
| for pname, param in module._parameters.items(): | ||
| if param is not None and "weight_scale_inv" in pname: | ||
| fused_moe_buffers[f"{name}.{pname}"] = param |
There was a problem hiding this comment.
Accessing private attributes like _buffers and _parameters directly is discouraged in PyTorch. It is more idiomatic and robust to use the public APIs named_buffers(recurse=False) and named_parameters(recurse=False). These public APIs also guarantee that the returned buffers and parameters are not None, allowing us to simplify the conditional checks.
| for name, module in model.named_modules(): | |
| for bname, buf in module._buffers.items(): | |
| if buf is not None and "weight_scale_inv" in bname: | |
| fused_moe_buffers[f"{name}.{bname}"] = buf | |
| for pname, param in module._parameters.items(): | |
| if param is not None and "weight_scale_inv" in pname: | |
| fused_moe_buffers[f"{name}.{pname}"] = param | |
| for name, module in model.named_modules(): | |
| for bname, buf in module.named_buffers(recurse=False): | |
| if "weight_scale_inv" in bname: | |
| fused_moe_buffers[f"{name}.{bname}"] = buf | |
| for pname, param in module.named_parameters(recurse=False): | |
| if "weight_scale_inv" in pname: | |
| fused_moe_buffers[f"{name}.{pname}"] = param |
vLLM 0.22's layerwise reload counts FP8 expert scale_inv numel in get_layer_size() but the FusedMoE weight_loader never receives these tensors during online_process_loader replay. This leaves torch.empty- allocated scale buffers with uninitialized NaN, causing all FP8 MoE rollout generate requests to fail (HTTP 400, out-of-range float). Fix: stash trainer-produced scale_inv tensors in the worker extension during update_weights_chunk, then apply them via a new vime_apply_fp8_scales RPC after finish_weight_update completes layerwise finalize. The scales are mapped from HF per-expert names (gate_proj/up_proj → w13, down_proj → w2) into FusedMoE fused buffers. Root cause analysis: agent_run/reports/fp8-nan-bug-deep-dive.md Verified: FP8 rollout 3x clean, 167 tok/s/gpu, 0 NaN on GB300. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2be0933 to
8f99b1e
Compare
Summary
get_layer_size()includes FP8 expertscale_invnumel inload_numel_total, but FusedMoE'sweight_loaderis never invoked for these tensors duringonline_process_loaderreplay. Thetorch.empty-allocated scale buffers retain uninitialized NaN, causing all FP8 MoE rollout requests to fail with HTTP 400 (out of range float).vLLMColocateWorkerExtension) stashes trainer-producedscale_invtensors duringupdate_weights_chunk. Afterfinish_weight_update(which callsfinalize_layerwise_reload), a newvime_apply_fp8_scalesRPC copies the correct scales into FusedMoE fused buffers (w13_weight_scale_inv/w2_weight_scale_inv).n_scale_loads=0in layerwise, NaN at POST-REPLAY stage. Full analysis inagent_run/reports/fp8-nan-bug-deep-dive.md.Changes
update_weight_from_tensor.py: Add_apply_fp8_expert_scales()helper + scale stashing inupdate_weights_chunk+vime_apply_fp8_scalesworker-ext method + trainer-side RPC call afterfinish_weight_updatevllm_engine.py: Addvime_apply_fp8_scales()HTTP route tocollective_rpcTest plan
🤖 Generated with Claude Code