Modify ep_moe_fused's backward to run with ep_size < 8 #171

Open
WhatGhost wants to merge 1 commit into ByteDance-Seed:main from WhatGhost:dev-test1

@WhatGhost

When I run test_ep_moe_fused.py on 4 GPUs:

NVSHMEM_DISABLE_CUDA_VMM=0  bash ./scripts/launch.sh --nproc_per_node=4 python/triton_dist/test/nvidia/test_ep_moe_fused.py --ntokens 8192 --hidden_dim 1536 --ffn_dim 480 --topk 8 --num_experts 64

I hit the following error:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/target/Triton-distributed/python/triton_dist/test/nvidia/test_ep_moe_fused.py", line 380, in <module>
[rank1]:     main()
[rank1]:   File "/target/Triton-distributed/python/triton_dist/test/nvidia/test_ep_moe_fused.py", line 301, in main
[rank1]:     triton_dist_fwd_bwd_time, triton_dist_fwd_bwd_mem = benchmark_latency_memory(
[rank1]:                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/target/Triton-distributed/python/triton_dist/profiler_utils.py", line 376, in benchmark_latency_memory
[rank1]:     func()
[rank1]:   File "/target/Triton-distributed/python/triton_dist/test/nvidia/test_ep_moe_fused.py", line 292, in triton_dist_fwd_bwd
[rank1]:     output.backward(grad_output)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 648, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 353, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank1]:     return user_fn(self, *args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 556, in decorate_bwd
[rank1]:     return bwd(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/target/Triton-distributed/python/triton_dist/function/nvidia/ep_moe_fused.py", line 207, in backward
[rank1]:     assert triton_dist_ep_ctx.ep_group.size() == 8  # only for intra-node
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError

It seems the assert statement in backward enforces the constraint that the EP group size must be exactly 8.

So I changed the check from `== 8` to `<= 8`, which lets it run on 4 GPUs.
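For reviewers, here is a minimal standalone sketch of the relaxed check. It is not the code in ep_moe_fused.py: the helper name `check_ep_group_size` and the constant `MAX_INTRA_NODE_EP_SIZE` are hypothetical, and the descriptive assertion message is a suggestion that goes beyond what this PR changes.

```python
# Assumption: "intra-node" means at most 8 ranks, matching the original assert.
MAX_INTRA_NODE_EP_SIZE = 8


def check_ep_group_size(ep_size: int) -> None:
    """Hypothetical helper mirroring the relaxed check in backward()."""
    # Original check: ep_size == 8, which rejects 2- or 4-GPU runs.
    # Relaxed check: any group size up to 8 is accepted.
    assert ep_size <= MAX_INTRA_NODE_EP_SIZE, (
        f"backward only supports intra-node EP "
        f"(ep_size <= {MAX_INTRA_NODE_EP_SIZE}), got ep_size={ep_size}"
    )


# Both the 4-GPU and the 8-GPU configuration pass the relaxed check.
for size in (4, 8):
    check_ep_group_size(size)
```

With a message attached, a run on more than 8 ranks would fail with an explanation instead of a bare `AssertionError`.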

After making the change, I ran the test again and it passed:

```
==================================================================================================================================
Expert Parallel MoE Benchmark Summary (SM_margin=0, topk=8, num_experts=64) (format: latency(ms)/peak_memory(MB)/precision)
==================================================================================================================================
 Ntokens   Hidden      FFN triton_dist_fwd triton_dist_fwd_bwd
==============================================================
    1024     1536      480   1.979/24.45/✅      4.659/105.76/✅
    2048     1536      480   2.317/36.53/✅      4.450/130.06/✅
    4096     1536      480   2.386/62.08/✅      5.133/181.14/✅
    8192     1536      480  2.472/110.80/✅      5.488/278.16/✅
```

@CLAassistant

CLAassistant commented May 26, 2026

CLA assistant check
All committers have signed the CLA.
