Merge with upstream main (up to 4-5 June 2025) | Add some FP8 metric-tracking by dhia680 · Pull Request #79 · swiss-ai/Megatron-LM

dhia680 · 2025-06-06T12:13:46Z

This PR merges swiss-ai/main with upstream main (up to 4-5 June 2025), including FP8 blockwise scaling support (for Hopper).
It also adds some FP8 metric-tracking.

Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-vscode-01.cm.cluster> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com> Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-login-01.cm.cluster> Co-authored-by: Vijay Korthikanti <vkorthikanti@cw-dfw-cs-001-login-01.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-279-012.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-316-012.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-258-026.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-008-033.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-236-026.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-267-012.cm.cluster>

Co-authored-by: root <root@cw-dfw-h100-004-211-013.cm.cluster> Co-authored-by: Vijay Korthikanti <vkorthikanti@cw-dfw-cs-001-login-01.cm.cluster> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: root <root@cw-dfw-h100-004-279-012.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-316-012.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-258-026.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-008-033.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-236-026.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-267-012.cm.cluster>

sbak5 and others added 30 commits April 25, 2025 17:11

ADLR/megatron-lm!3155 - Fix the sync issue in TemporalAsyncWorker

48cc46f

Merge branch 'sbak/ckpt_sync_issue' into 'main'

222adb8

Fix the sync issue in `TemporalAsyncWorker` See merge request ADLR/megatron-lm!3155

ADLR/megatron-lm!2971 - Add ModelOpt speculative decoding finetune

c8f6279

Co-authored-by: Chenhan Yu <chenhany@nvidia.com> Co-authored-by: Chen-Han Yu <chenhany@cw-dfw-cs-001-login-01.cm.cluster> Co-authored-by: Ye Yu <yeyu@cw-dfw-cs-001-dc-01.cm.cluster>

Merge branch 'yeyu/finetune' into 'main'

154a7a8

Add ModelOpt speculative decoding finetune See merge request ADLR/megatron-lm!2971

ADLR/megatron-lm!3083 - Moe fix for Llama4

4ca4309

Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: Chenhan Yu <chenhany@nvidia.com>

Merge branch 'yuya/moe_sigmoid_fix' into 'main'

aab56ce

Moe fix for Llama4 See merge request ADLR/megatron-lm!3083

ADLR/megatron-lm!2910 - [custom FSDP] Support EP + FSDP training for …

5fe1eeb

…DeepSeek-v3 Co-authored-by: jianbinc <shjwudp@gmail.com>

Merge branch 'custom_fsdp_dsv3' into 'main'

f7a25e5

[custom FSDP] Support EP + FSDP training for DeepSeek-v3 See merge request ADLR/megatron-lm!2910

ADLR/megatron-lm!3178 - Fix extra tokens in returned generation

a1843ac

Merge branch 'helenn-fix-seqlen-chopping' into 'main'

ceed1b7

Fix extra tokens in returned generation Closes dl/JoC/nemo-ci#2075 See merge request ADLR/megatron-lm!3178

ADLR/megatron-lm!3160 - Update current scaling supported TE version t…

b764f2d

…o 2.2.0.dev0

Merge branch 'donghyukc/te_min_version' into 'main'

57d21c3

Update current scaling supported TE version to 2.2.0.dev0 See merge request ADLR/megatron-lm!3160

Merge branch 'seperate_chunk_allocator' into 'main'

e733d7d

Seperate chunk allocator See merge request ADLR/megatron-lm!3121

ADLR/megatron-lm!3180 - Revert inference_context.is_decode_only() to …

4f16de3

…inference_context.sequence_len_offset > 0

Merge branch 'helenn-fix-seqlenoffset' into 'main'

33a193d

Revert inference_context.is_decode_only() to inference_context.sequence_len_offset > 0 See merge request ADLR/megatron-lm!3180

ADLR/megatron-lm!3058 - [BUG FIX]: fix the bug of indices-to-multihot…

bc70535

…-fusion will throw an exception when topk/num_local_experts is not the power of 2.

Merge branch 'incidices_to_multihot' into 'main'

885a245

[BUG FIX]: fix the bug of indices-to-multihot-fusion will throw an exception when topk/num_local_experts is not the power of 2. See merge request ADLR/megatron-lm!3058

ADLR/megatron-lm!3015 - Refactor Inference Process Groups by replacin…

8208937

…g global ones with optional local ones for better parallelism flexibility Co-authored-by: Zhiyu Li <zhiyul@NVIDIA.com>

Merge branch 'zhiyul/orthotope/inference' into 'main'

7118d88

Refactor Inference Process Groups by replacing global ones with optional local ones for better parallelism flexibility See merge request ADLR/megatron-lm!3015

ADLR/megatron-lm!3179 - Update te patch to include 1626

9bb34bf

Merge branch 'donghyukc/te_patch_update' into 'main'

2f4463e

Update te patch to include 1626 See merge request ADLR/megatron-lm!3179

Revert "Revert "ADLR/megatron-lm!2581 - Add support for ZeRO-2 with P…

4429e8e

…yTorch FSDP2"" This reverts commit 1eaed21.

Merge branch 'fa3_inference' into 'main'

47e3bd3

Use FlashAttention 3 for inference See merge request ADLR/megatron-lm!3120

ADLR/megatron-lm!3167 - No RoPE for Llama4

8d1367f

Merge branch 'aot/no_rope_llama4' into 'main'

5807d1c

No RoPE for Llama4 See merge request ADLR/megatron-lm!3167

ADLR/megatron-lm!3010 - Enable --fp8-param-gather for NV sub-channel …

72afd63

…recipe

Merge branch 'nv_subchannel_native_fp8' into 'main'

1eb5fe5

Enable --fp8-param-gather for NV sub-channel recipe See merge request ADLR/megatron-lm!3010

ADLR/megatron-lm!3133 - fix: fix FP8 support in recompute; fix fused …

cf6d208

…swiglu perf Co-authored-by: lit <lit@nvidia.com>

AleHD and others added 30 commits May 29, 2025 17:22

metrics tracker

31468d8

ADLR/megatron-lm!3363 - tests: Update frozen-checkpoints

c6b08c2

Merge branch 'ko3n1g/tests/frozen-cpkt' into 'main'

8a39761

tests: Update frozen-checkpoints See merge request ADLR/megatron-lm!3363

ADLR/megatron-lm!3375 - Consolidate eval methods across train and gen…

8d08685

…eration Co-authored-by: root <root@pool0-00755.cm.cluster>

Merge branch 'matthieul/consolidate_eval' into 'main'

13898cb

Consolidate eval methods across train and generation See merge request ADLR/megatron-lm!3375

ADLR/megatron-lm!3388 - ci: Auto-restart on nan

de245df

Merge branch 'ko3n1g/ci/restart-on-nan' into 'main'

0a438ed

ci: Auto-restart on nan See merge request ADLR/megatron-lm!3388

swissai/fp8 merged into nvidia/main

e132238

ADLR/megatron-lm!2949 - perf(mla, experimental): MLA RoPE fusion and …

23e6471

…YARN embedding cache Co-authored-by: xuwenc <xuwenc@nvidia.com>

Merge branch 'hongxiaob/mla_rope' into 'main'

9c1a535

perf(mla, experimental): MLA RoPE fusion and YARN embedding cache Closes NVIDIA#429 See merge request ADLR/megatron-lm!2949

ADLR/megatron-lm!3280 - Fix custom FSDP float8 tensor set_item

da3f0ff

Co-authored-by: jianbinc <shjwudp@gmail.com>

Merge branch 'fix_cfsdp_fp8_param_load' into 'main'

549d637

Fix custom FSDP float8 tensor set_item See merge request ADLR/megatron-lm!3280

fix circular imports (due to xielu) + add missing import in argument.py

7461ef2

tiny merge fix

1650f79

ADLR/megatron-lm!3401 - ci: Move queue blocker

24c60db

Merge branch 'ko3n1g/ci/move-queue-blocker' into 'main'

cfea2ea

ci: Move queue blocker See merge request ADLR/megatron-lm!3401

ADLR/megatron-lm!3400 - ci: Improve error-handling of missing logs

37b0afd

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'ko3n1g/ci/better-log-failure-handling' into 'main'

6a62a54

ci: Improve error-handling of missing logs See merge request ADLR/megatron-lm!3400

ADLR/megatron-lm!3408 - ci: Control job concurrency

4648912

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'ko3n1g/ci/job-concurrency' into 'main'

cde60ce

ci: Control job concurrency See merge request ADLR/megatron-lm!3408

ADLR/megatron-lm!3412 - ci: Catch missing logs

eab047c

Merge branch 'ko3n1g/ci/fix-no-log' into 'main'

25a26ca

ci: Catch missing logs See merge request ADLR/megatron-lm!3412

ADLR/megatron-lm!3411 - ci: Remove tests from A100

9bdfe31

Merge branch 'ko3n1g/ci/move-tests' into 'main'

ff64f96

ci: Remove tests from A100 See merge request ADLR/megatron-lm!3411

ADLR/megatron-lm!3393 - Add an option to skip counting zeros in grad …

d960800

…of ChainedOptimizer

Merge branch 'no_count_zeros' into 'main'

b47a9bb

Add an option to skip counting zeros in grad of ChainedOptimizer See merge request ADLR/megatron-lm!3393

ADLR/megatron-lm!3326 - Add an interface to set high priority stream …

bc80491

…groups

Merge branch 'comm-priority-setting' into 'main'

957f348

Add an interface to set high priority stream groups See merge request ADLR/megatron-lm!3326

fix missing import due to merge

940ccb5

merging latest commits from nvidia/main

10e9c5f

dhia680 wants to merge 908 commits into main from sai-into-nvidia

20 participants
