Partial upstream update #69

Closed
AleHD wants to merge 308 commits into main from partial-upstream-update

AleHD commented Apr 10, 2025

Collaborator

This PR includes all upstream commits up to https://github.com/NVIDIA/Megatron-LM/tree/3db741160bb77ea31af60d2fb57aaee8d76e612f. This specific commit introduces various fp8 options, which are high priority for the ongoing training. This PR should be easier to merge than the full update in #34.

cyme and others added 30 commits February 14, 2025 19:00
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
Co-authored-by: Cyril Meurillon <cmeurillon@cs-oci-ord-vscode-02.cm.cluster>
Various improvements to RerunStateMachine

See merge request ADLR/megatron-lm!2659
Re-enable MoE flaky unit tests.

See merge request ADLR/megatron-lm!2676
build: Guard NVRX

See merge request ADLR/megatron-lm!2679
Fix DDP over-param-gather issue when param order and forward mismatch

See merge request ADLR/megatron-lm!2673
ci: Remove triton

See merge request ADLR/megatron-lm!2664
ci: Read `package_info.py`

See merge request ADLR/megatron-lm!2693
docs: Add changelog

See merge request ADLR/megatron-lm!2694
Guard against 'common_step'=None

See merge request ADLR/megatron-lm!2682
ci: Set legacy suite

See merge request ADLR/megatron-lm!2699
ci: Fix release

See merge request ADLR/megatron-lm!2696
Fix the PP backend error in cpu-only case

See merge request ADLR/megatron-lm!2697
Fix Distributed Checkpointing for Backward Compatibility Tests

See merge request ADLR/megatron-lm!2701
Add a fallback when tiktoken.offsets fail during generation

See merge request ADLR/megatron-lm!2688
…prevent memory issue in some memory intensive scenario

Co-authored-by: Slawek Kierat <skierat@nvidia.com>
Co-authored-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: Oliver Koenig <okoenig@nvidia.com>
Co-authored-by: Maanu Grover <maanug@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Co-authored-by: Zijie Yan <zijiey@nvidia.com>
Co-authored-by: Cyril Meurillon <cmeurillon@nvidia.com>
Co-authored-by: Keshav Santhanam <ksanthanam@nvidia.com>
Co-authored-by: Jack Chang <jianbinc@nvidia.com>
Co-authored-by: Tuomas Rintamaki <trintamaki@nvidia.com>
Co-authored-by: Parth Mannan <pmannan@nvidia.com>
Persistent Asynchronous Checkpoint Worker to prevent memory issue in some memory intensive scenario

See merge request ADLR/megatron-lm!2524
build: Add one-logger

See merge request ADLR/megatron-lm!2700
ko3n1g and others added 27 commits March 18, 2025 16:28
ci: Cosmetic changes

See merge request ADLR/megatron-lm!2908
ci: Refactor notification script

See merge request ADLR/megatron-lm!2909
ci: Configure OneLogger

See merge request ADLR/megatron-lm!2906
Co-authored-by: Robin Zhang <robinz@draco-oci-login-01.cm.cluster>
Add conditional cudagraph support for MoE models

See merge request ADLR/megatron-lm!2204
Co-authored-by: Boxin Wang <boxin.wbx@gmail.com>
Add ckpt step args in the test

See merge request ADLR/megatron-lm!2905
ci: Different image for notifications

See merge request ADLR/megatron-lm!2921
ci: Cosmetic changes 2

See merge request ADLR/megatron-lm!2917
ci: Allow for lightweight tests (but not use yet)

See merge request ADLR/megatron-lm!2918
ci: non blocking steps

See merge request ADLR/megatron-lm!2924
Co-authored-by: oliver könig <okoenig@nvidia.com>
test(moe): Add merge train test.

See merge request ADLR/megatron-lm!2919
Co-authored-by: Zijie Yan <zijiey@nvidia.com>
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
ci: Lightweight functional tests

See merge request ADLR/megatron-lm!2881
Fix typo in cudagraphs arg check

See merge request ADLR/megatron-lm!2916
ci: Fix typo in workflow rule

See merge request ADLR/megatron-lm!2928
…rst-last-bf16 also supported

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>

AleHD commented Apr 10, 2025

Collaborator Author

Surprisingly, the merge conflicts weren't too hard to resolve. However, two things might still not work: Ademamix (because of the distrib_opt.py changes) and the HF conversion (because of the loader_core.py changes). I'll test both features soon.

AleHD commented Jun 25, 2025

Collaborator Author

Dropped in favour of #82

AleHD closed this Jun 25, 2025