Merge with upstream main (up to 4-5 June 2025) | Add some FP8 metric-tracking#79

Open
dhia680 wants to merge 908 commits into main from sai-into-nvidia

dhia680 commented Jun 6, 2025

Member

This PR merges swiss-ai/main with upstream main (commits up to 4-5 June 2025), which brings in FP8 blockwise scaling support (for Hopper).
It also adds some FP8 metric tracking.
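
For readers unfamiliar with the terms, here is a minimal, library-free sketch of the general idea behind blockwise scaling and of one kind of FP8 metric that could be tracked. It is not the code in this PR: the function names, the block size, the toy data, and the choice of "underflow fraction" as the metric are all assumptions made for illustration. The only fixed facts used are the FP8 E4M3 limits (largest finite magnitude 448, smallest positive subnormal 2^-9).

```python
FP8_E4M3_MAX = 448.0                # largest finite magnitude in FP8 E4M3
FP8_E4M3_MIN_SUBNORMAL = 2.0 ** -9  # smallest positive E4M3 value


def _amax_scale(block):
    """Scale that maps the block's largest magnitude onto FP8_E4M3_MAX."""
    amax = max((abs(v) for v in block), default=0.0)
    return FP8_E4M3_MAX / amax if amax > 0 else 1.0


def per_tensor_scale(values):
    """Hypothetical helper: a single scale for the whole tensor."""
    return _amax_scale(values)


def blockwise_scales(values, block_size):
    """Hypothetical helper: one scale per contiguous block of elements."""
    return [_amax_scale(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def underflow_fraction(values, scales, block_size):
    """Illustrative metric: share of nonzero values whose scaled magnitude
    falls below the smallest positive E4M3 value."""
    nonzero = underflow = 0
    for i, v in enumerate(values):
        if v == 0:
            continue
        nonzero += 1
        if abs(v) * scales[i // block_size] < FP8_E4M3_MIN_SUBNORMAL:
            underflow += 1
    return underflow / nonzero if nonzero else 0.0


# Toy data: one block of large values, one block of tiny values.
values = [1000.0] * 4 + [1e-4] * 4
block = 4
tensor_wide = [per_tensor_scale(values)] * (len(values) // block)
per_block = blockwise_scales(values, block)

print(underflow_fraction(values, tensor_wide, block))
print(underflow_fraction(values, per_block, block))
```

On this toy input the single tensor-wide scale is dominated by the large block, so the tiny values land below the representable range, while per-block scales keep both blocks in range. The metric makes that difference visible as a number that could be logged during training.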

sbak5 and others added 30 commits April 25, 2025 17:11
Fix the sync issue in `TemporalAsyncWorker`

See merge request ADLR/megatron-lm!3155
Co-authored-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Chen-Han Yu <chenhany@cw-dfw-cs-001-login-01.cm.cluster>
Co-authored-by: Ye Yu <yeyu@cw-dfw-cs-001-dc-01.cm.cluster>
Add ModelOpt speculative decoding finetune

See merge request ADLR/megatron-lm!2971
Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
Co-authored-by: Chenhan Yu <chenhany@nvidia.com>
Moe fix for Llama4

See merge request ADLR/megatron-lm!3083
…DeepSeek-v3

Co-authored-by: jianbinc <shjwudp@gmail.com>
[custom FSDP] Support EP + FSDP training for DeepSeek-v3

See merge request ADLR/megatron-lm!2910
Fix extra tokens in returned generation

Closes dl/JoC/nemo-ci#2075

See merge request ADLR/megatron-lm!3178
Update current scaling supported TE version to 2.2.0.dev0

See merge request ADLR/megatron-lm!3160
Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-vscode-01.cm.cluster>
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>
Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-login-01.cm.cluster>
Co-authored-by: Vijay Korthikanti <vkorthikanti@cw-dfw-cs-001-login-01.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-279-012.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-316-012.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-258-026.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-008-033.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-236-026.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-267-012.cm.cluster>
Seperate chunk allocator

See merge request ADLR/megatron-lm!3121
Revert inference_context.is_decode_only() to inference_context.sequence_len_offset > 0

See merge request ADLR/megatron-lm!3180
…-fusion will throw an exception when topk/num_local_experts is not the power of 2.
[BUG FIX]: fix the bug of indices-to-multihot-fusion will throw an exception when topk/num_local_experts is not the power of 2.

See merge request ADLR/megatron-lm!3058
…g global ones with optional local ones for better parallelism flexibility

Co-authored-by: Zhiyu Li <zhiyul@NVIDIA.com>
Refactor Inference Process Groups by replacing global ones with optional local ones for better parallelism flexibility

See merge request ADLR/megatron-lm!3015
Update te patch to include 1626

See merge request ADLR/megatron-lm!3179
Co-authored-by: root <root@cw-dfw-h100-004-211-013.cm.cluster>
Co-authored-by: Vijay Korthikanti <vkorthikanti@cw-dfw-cs-001-login-01.cm.cluster>
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
Co-authored-by: root <root@cw-dfw-h100-004-279-012.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-316-012.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-258-026.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-008-033.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-236-026.cm.cluster>
Co-authored-by: root <root@cw-dfw-h100-004-267-012.cm.cluster>
Use FlashAttention 3 for inference

See merge request ADLR/megatron-lm!3120
No RoPE for Llama4

See merge request ADLR/megatron-lm!3167
Enable --fp8-param-gather for NV sub-channel recipe

See merge request ADLR/megatron-lm!3010
…swiglu perf

Co-authored-by: lit <lit@nvidia.com>
AleHD and others added 30 commits May 29, 2025 17:22
tests: Update frozen-checkpoints

See merge request ADLR/megatron-lm!3363
…eration

Co-authored-by: root <root@pool0-00755.cm.cluster>
Consolidate eval methods across train and generation

See merge request ADLR/megatron-lm!3375
ci: Auto-restart on nan

See merge request ADLR/megatron-lm!3388
…YARN embedding cache

Co-authored-by: xuwenc <xuwenc@nvidia.com>
perf(mla, experimental): MLA RoPE fusion and YARN embedding cache

Closes NVIDIA#429

See merge request ADLR/megatron-lm!2949
Co-authored-by: jianbinc <shjwudp@gmail.com>
Fix custom FSDP float8 tensor set_item

See merge request ADLR/megatron-lm!3280
ci: Move queue blocker

See merge request ADLR/megatron-lm!3401
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
ci: Improve error-handling of missing logs

See merge request ADLR/megatron-lm!3400
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
ci: Control job concurrency

See merge request ADLR/megatron-lm!3408
ci: Catch missing logs

See merge request ADLR/megatron-lm!3412
ci: Remove tests from A100

See merge request ADLR/megatron-lm!3411
Add an option to skip counting zeros in grad of ChainedOptimizer

See merge request ADLR/megatron-lm!3393
Add an interface to set high priority stream groups

See merge request ADLR/megatron-lm!3326