[parallel] fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP2 by First-Frost-code · Pull Request #642 · ByteDance-Seed/VeOmni

First-Frost-code · 2026-04-10T08:28:42Z

Describe the bug
When training VLM/MoE models on certain hardware backends (especially Ascend NPUs with HCCL), the program frequently hangs indefinitely during the weight loading phase.

Root Cause
The original _dispatch_parameter uses dtensor_factory and then calls .copy_() on a flat parameter. When assigning a DTensor view to a flat parameter, PyTorch's __torch_dispatch__ implicitly triggers a Redistribute collective communication. This uncoordinated implicit network communication disrupts the backend streams on NPU, causing permanent deadlocks.

Proposed Solution
This PR adds a defensive feature toggle for NPU environments to bypass implicit DTensor network communications:

Extracts the local physical tensor using .to_local().
Manually parses the placements to perform pure mathematical .chunk(), assigning the correct shard based on the local mesh coordinate.
Performs a pure local physical .copy_().

This "physical bypass" effectively decouples the weight assignment from the distributed communication graph, completely resolving the deadlock while leaving the default GPU execution path untouched.
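The three steps above can be illustrated without a distributed runtime. The following is a minimal sketch, not the PR's code: it models placements as plain tuples and computes which index range of each tensor dimension a rank would keep, walking the mesh dimensions in the same order as the manual chunking loop. The name `local_bounds`, the tuple encoding of placements, and the even-divisibility requirement are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-ins for DTensor placements: ("shard", dim) or ("replicate",).
Placement = Tuple


def local_bounds(
    shape: Sequence[int],
    placements: Sequence[Placement],
    mesh_shape: Sequence[int],
    coordinate: Sequence[int],
) -> List[Tuple[int, int]]:
    """Return the (start, stop) range each tensor dim keeps on this rank.

    Assumes every sharded extent divides evenly by the mesh size.
    """
    bounds = [(0, n) for n in shape]
    for mesh_dim, placement in enumerate(placements):
        if placement[0] != "shard":
            continue  # replicated: every rank keeps the full range
        dim = placement[1]
        start, stop = bounds[dim]
        extent = stop - start
        world_size = mesh_shape[mesh_dim]
        if extent % world_size != 0:
            raise ValueError(
                f"dim {dim} extent {extent} is not divisible by {world_size}"
            )
        step = extent // world_size
        rank = coordinate[mesh_dim]
        # Narrow the range already kept from earlier mesh dimensions.
        bounds[dim] = (start + rank * step, start + (rank + 1) * step)
    return bounds


# An (8, 4) weight on a 2x2 mesh, sharded on dim 0 along both mesh dims.
print(local_bounds((8, 4), [("shard", 0), ("shard", 0)], (2, 2), (1, 0)))
# prints [(4, 6), (0, 4)]
```

In this example the rank at mesh coordinate (1, 0) first narrows dim 0 from rows 0..8 to 4..8, then to 4..6, while dim 1 is left whole. Only index arithmetic is involved, which is the property the bypass relies on: no collective is needed to decide which slice a rank copies.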

CLAassistant · 2026-04-10T08:29:25Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request updates the _dispatch_parameter function in veomni/models/module_utils.py to include a specialized handling path for NPU devices when a parallel plan is active. The change manually implements sharding for DTensor placements on NPUs. Feedback indicates that the current implementation moves the full tensor to the NPU before sharding, which could cause Out-Of-Memory (OOM) errors for large parameters; it is recommended to perform sharding on the CPU first. Additionally, a safety check should be added to verify that the tensor dimensions are large enough for the requested sharding to prevent runtime errors.

gemini-code-assist · 2026-04-10T08:31:18Z

+            local_device = orig_tensor.device
+            tensor = tensor.to(device=local_device, dtype=orig_tensor.dtype)
+            target_local = orig_tensor.to_local()
+
+            for mesh_dim, p in enumerate(placements):
+                if p.__class__.__name__ == "Shard":
+                    shard_dim = p.dim
+                    my_mesh_rank = device_mesh.get_coordinate()[mesh_dim]
+                    world_size = device_mesh.size(mesh_dim)
+
+                    shards = tensor.chunk(world_size, dim=shard_dim)
+                    tensor = shards[my_mesh_rank].contiguous()


The current implementation moves the entire tensor (or the EP-sharded version) to the NPU before performing the FSDP sharding. For large parameters (e.g., in MoE models), this can lead to unnecessary NPU memory spikes and potential OOM. It is more efficient to perform the sharding on the CPU first and only move the resulting local shard to the NPU. Additionally, it's safer to check that the tensor size is sufficient for the requested sharding: torch.chunk returns fewer chunks than requested when the dimension size is smaller than the mesh size, so indexing the result with the local mesh rank can raise an IndexError.

    for mesh_dim, p in enumerate(placements):
        if p.__class__.__name__ == "Shard":
            shard_dim = p.dim
            my_mesh_rank = device_mesh.get_coordinate()[mesh_dim]
            world_size = device_mesh.size(mesh_dim)
            if tensor.size(shard_dim) < world_size:
                raise ValueError(
                    f"Tensor size {tensor.size(shard_dim)} on dim {shard_dim} is too small "
                    f"for world_size {world_size} on mesh dimension {mesh_dim} "
                    f"for parameter {full_param_name}"
                )
            shards = tensor.chunk(world_size, dim=shard_dim)
            tensor = shards[my_mesh_rank]
    tensor = tensor.to(device=orig_tensor.device, dtype=orig_tensor.dtype).contiguous()
    target_local = orig_tensor.to_local()
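One caveat on the size check in the suggestion above: torch.chunk is documented to split into ceil-sized pieces, so it may return fewer chunks than requested even when the dimension is at least as large as the mesh size. The sketch below models that splitting rule in plain Python so the chunk count can be checked before indexing by rank. The helper names `chunk_sizes` and `every_rank_gets_a_chunk` are hypothetical and not part of the PR.

```python
from typing import List


def chunk_sizes(dim_size: int, chunks: int) -> List[int]:
    """Sizes of the pieces produced by a ceil-sized split of dim_size.

    Models the documented torch.chunk rule: every piece has size
    ceil(dim_size / chunks) except possibly the last one.
    """
    if dim_size <= 0 or chunks <= 0:
        raise ValueError("dim_size and chunks must be positive")
    step = -(-dim_size // chunks)  # ceiling division
    sizes = []
    remaining = dim_size
    while remaining > 0:
        sizes.append(min(step, remaining))
        remaining -= step
    return sizes


def every_rank_gets_a_chunk(dim_size: int, world_size: int) -> bool:
    """True only if the split yields exactly world_size pieces."""
    return len(chunk_sizes(dim_size, world_size)) == world_size


print(chunk_sizes(13, 6))              # prints [3, 3, 3, 3, 1]
print(every_rank_gets_a_chunk(13, 6))  # prints False
print(every_rank_gets_a_chunk(8, 4))   # prints True
```

With 13 elements and 6 ranks the split yields only 5 pieces, so the rank with index 5 would still hit an IndexError even though 13 is not smaller than 6. Comparing the number of returned chunks against world_size, or requiring divisibility, would be a stricter guard than the size comparison alone.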

…arameter dispatch Bypass implicit redistribute collective sync in DTensor copy by replacing dtensor_factory with manual chunking and local physical copy. This resolves critical Hang issues on Ascend NPUs when dispatching replicated or sharded weights like MoE gates.


FoolPlayer · 2026-04-14T04:39:52Z

Thanks for your PR. Can you share the deadlock case (model, size, training args)? In FSDP2 we load on rank 0 and broadcast to the other ranks to avoid OOM, so OOM can happen only if a single tensor is larger than the device's maximum memory.

VeOmni/veomni/distributed/torch_parallelize.py

Line 566 in 187ac87

    
           rank0_load_and_broadcast_weights(model, weights_path, get_device_type(), dtensor_factory=distribute_tensor)

Also, I notice that the fixed code is in the FSDP1 path, but the title says FSDP2.

First-Frost-code · 2026-04-15T02:33:45Z

Thanks for pointing that out! You brought up excellent points. Let me clarify the context and the root cause:

Regarding FSDP1 vs FSDP2:
You are completely right. I put FSDP2 in the title because my YAML config explicitly uses fsdp_mode: fsdp2. I didn't realize that under the hood, the MoE weight dispatching routing in torch_parallelize.py falls back to the FSDP1 logic. That's a great clarification, and I have updated the PR title to FSDP1.
The Deadlock Case (How to Reproduce):

Model: Qwen3-VL-30B-A3B-Instruct

Hardware: 16-card Ascend NPU cluster (e.g., 910C with HCCL)

Why the hang happens when EP > 1:
When ep_size is greater than 1, parallel_plan is generated. During rank0_load_and_broadcast_weights, the broadcast tensor (like gate.weight) enters _dispatch_parameter. When this DTensor view is assigned via .copy_(), PyTorch's __torch_dispatch__ silently attempts an implicit collective Redistribute. Because HCCL streams are highly sensitive, this unplanned implicit communication instantly causes an AclrtSynchronizeStreamWithTimeout deadlock across the cluster.

Why CPU Chunking?
By moving the .chunk() operation to the CPU and doing a pure local physical copy, we strictly bypass dtensor_factory and eradicate the implicit network collective, fixing the deadlock.
(You are absolutely correct that rank0_load_and_broadcast_weights yields one tensor at a time. The CPU chunking I added in my latest commit was indeed to address the gemini-code-assist bot's feedback, ensuring we don't cause an NPU memory spike when sharding a single massive expert tensor, but the core of this PR is to solve the HCCL hang.)

Thanks again for the rigorous review! Please let me know if any further adjustments are needed.

FoolPlayer · 2026-04-15T12:10:33Z

+
+        else:
+            # Default execution path for GPUs or non-EP scenarios
+            tensor =


GPU case missed

FoolPlayer · 2026-04-15T12:19:43Z

Sorry, my mistake: only FSDP2 uses _dispatch_parameter. Also, can you help check whether PR #648 fixes your issue?

First-Frost-code · 2026-04-17T08:50:56Z

My apologies! 😅 My code editor completely messed up the copy-paste during my last commit, which accidentally truncated the GPU execution path and broke the syntax. I just pulled the branch back and pushed a clean fix to restore the logic. Thanks for catching that!

Thanks for double-checking and confirming that it is indeed the FSDP2 routing. I have updated the PR title back to FSDP2 to keep it accurate.

Regarding PR #648, I have reviewed its code and logic carefully. It does NOT fix my issue, and this PR (#642) is still absolutely necessary. Here is why:

Different Execution Paths: PR #648 explicitly targets and fixes the load_model_weights (all-ranks-read) path. As the author of #648 noted in their PR description: "The rank0_load_and_broadcast_weights path keeps src_data_rank=0 — it legitimately needs scatter since only rank 0 reads."

My Trigger Condition: In my training setup (loading HF format initially), the framework falls back to the broadcast path. My training logs explicitly show: >> Loading model weights from disk on rank0 then broadcasting to other ranks...

The Deadlock Remains: Because PR #648 leaves the rank0_load_and_broadcast_weights path unchanged, my setup still executes the old logic. It still passes the tensor to dtensor_factory, which triggers the implicit Redistribute collective via PyTorch's __torch_dispatch__. This uncoordinated implicit collective is exactly what disrupts the HCCL streams and causes the permanent hang on Ascend NPUs.

Conclusion: PR #648 is a fantastic fix for the direct-read scenario, but this PR (#642) acts as the essential safety net (physical bypass) for the broadcast scenario on NPUs. They complement each other perfectly!

Let me know if you need any more logs or tests from my side!

github-actions Bot added the ascend (everything about Ascend support) and fix labels Apr 10, 2026

gemini-code-assist Bot reviewed Apr 10, 2026


First-Frost-code force-pushed the fix/npu-dtensor-deadlock branch from 9a476f0 to 3ff6669 Compare April 10, 2026 08:33

First-Frost-code closed this Apr 10, 2026

First-Frost-code reopened this Apr 10, 2026

style: translate inline comments to English and match open-source conventions

df754a8

First-Frost-code changed the title ~~fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP2~~ fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP1 Apr 15, 2026

FoolPlayer reviewed Apr 15, 2026


fix: resolve syntax error and restore missing GPU execution path

5f35813

First-Frost-code changed the title ~~fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP1~~ fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP2 Apr 17, 2026

FoolPlayer changed the title ~~fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP2~~ [parallel] fix: NPU Hang/Deadlock during DTensor parameter loading in FSDP2 Apr 18, 2026

code format

d1b4f46

First-Frost-code wants to merge 4 commits into ByteDance-Seed:main from First-Frost-code:fix/npu-dtensor-deadlock
