Partial upstream update by AleHD · Pull Request #69 · swiss-ai/Megatron-LM

AleHD · 2025-04-10T15:08:38Z

This PR includes all upstream commits up to https://github.com/NVIDIA/Megatron-LM/tree/3db741160bb77ea31af60d2fb57aaee8d76e612f. That commit introduces various FP8 options, which are high priority for the ongoing training. This PR should be easier to merge than the full update in #34.

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com> Co-authored-by: Cyril Meurillon <cmeurillon@cs-oci-ord-vscode-02.cm.cluster>

Various improvements to RerunStateMachine See merge request ADLR/megatron-lm!2659

Re-enable MoE flaky unit tests. See merge request ADLR/megatron-lm!2676

build: Guard NVRX See merge request ADLR/megatron-lm!2679

Fix DDP over-param-gather issue when param order and forward mismatch See merge request ADLR/megatron-lm!2673

ci: Remove triton See merge request ADLR/megatron-lm!2664

ci: Read `package_info.py` See merge request ADLR/megatron-lm!2693

docs: Add changelog See merge request ADLR/megatron-lm!2694

Guard against 'common_step'=None See merge request ADLR/megatron-lm!2682

ci: Set legacy suite See merge request ADLR/megatron-lm!2699

ci: Fix release See merge request ADLR/megatron-lm!2696

Fix the PP backend error in cpu-only case See merge request ADLR/megatron-lm!2697

Fix Distributed Checkpointing for Backward Compatibility Tests See merge request ADLR/megatron-lm!2701

Add a fallback when tiktoken.offsets fail during generation See merge request ADLR/megatron-lm!2688

Co-authored-by: Slawek Kierat <skierat@nvidia.com> Co-authored-by: Youngeun Kwon <youngeunk@nvidia.com> Co-authored-by: Oliver Koenig <okoenig@nvidia.com> Co-authored-by: Maanu Grover <maanug@nvidia.com> Co-authored-by: Dennis Liu <denliu@nvidia.com> Co-authored-by: Zijie Yan <zijiey@nvidia.com> Co-authored-by: Cyril Meurillon <cmeurillon@nvidia.com> Co-authored-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: Jack Chang <jianbinc@nvidia.com> Co-authored-by: Tuomas Rintamaki <trintamaki@nvidia.com> Co-authored-by: Parth Mannan <pmannan@nvidia.com>

Persistent Asynchronous Checkpoint Worker to prevent memory issue in some memory intensive scenario See merge request ADLR/megatron-lm!2524

build: Add one-logger See merge request ADLR/megatron-lm!2700

ci: Cosmetic changes See merge request ADLR/megatron-lm!2908

ci: Refactor notification script See merge request ADLR/megatron-lm!2909

ci: Configure OneLogger See merge request ADLR/megatron-lm!2906

Co-authored-by: Robin Zhang <robinz@draco-oci-login-01.cm.cluster>

Add conditional cudagraph support for MoE models See merge request ADLR/megatron-lm!2204

Co-authored-by: Boxin Wang <boxin.wbx@gmail.com>

Add ckpt step args in the test See merge request ADLR/megatron-lm!2905

ci: Different image for notifications See merge request ADLR/megatron-lm!2921

ci: Cosmetic changes 2 See merge request ADLR/megatron-lm!2917

ci: Allow for lightweight tests (but not use yet) See merge request ADLR/megatron-lm!2918

ci: non blocking steps See merge request ADLR/megatron-lm!2924

Co-authored-by: oliver könig <okoenig@nvidia.com>

test(moe): Add merge train test. See merge request ADLR/megatron-lm!2919

Co-authored-by: Zijie Yan <zijiey@nvidia.com> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

ci: Lightweight functional tests See merge request ADLR/megatron-lm!2881

Fix typo in cudagraphs arg check See merge request ADLR/megatron-lm!2916

ci: Fix typo in workflow rule See merge request ADLR/megatron-lm!2928

ADLR/megatron-lm!2766 - Add FP8 recipe selection to arguments with first-last-bf16 also supported Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>

AleHD · 2025-04-10T15:54:58Z

Surprisingly, the merge conflicts weren't too hard to resolve. However, two things might still not work: AdEMAMix (because of the distrib_opt.py changes) and the HF conversion (because of the loader_core.py changes). I'll test these two features soon.

AleHD · 2025-06-25T12:13:40Z

Dropped in favour of #82

cyme and others added 30 commits February 14, 2025 19:00

ADLR/megatron-lm!2659 - Various improvements to RerunStateMachine

4dc6b71

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com> Co-authored-by: Cyril Meurillon <cmeurillon@cs-oci-ord-vscode-02.cm.cluster>

Merge branch 'fix-backward-checkpoint' into 'main'

9a496c9

Various improvements to RerunStateMachine See merge request ADLR/megatron-lm!2659

ADLR/megatron-lm!2676 - Re-enable MoE flaky unit tests.

c1e71cc

Merge branch 'zijiey/enable_moe_flaky_ut' into 'main'

fe7f28a

Re-enable MoE flaky unit tests. See merge request ADLR/megatron-lm!2676

ADLR/megatron-lm!2679 - build: Guard NVRX

81f3cd1

Merge branch 'ko3n1g/build/guard-nvrx' into 'main'

b06494c

build: Guard NVRX See merge request ADLR/megatron-lm!2679

ADLR/megatron-lm!2673 - Fix DDP over-param-gather issue when param order and forward mismatch

a0430bf

Merge branch 'denliu/fix_ddp_param_gather' into 'main'

a0365bc

Fix DDP over-param-gather issue when param order and forward mismatch See merge request ADLR/megatron-lm!2673

ADLR/megatron-lm!2664 - ci: Remove triton

34fa7b4

Merge branch 'ko3n1g/build/triton' into 'main'

7dfd00b

ci: Remove triton See merge request ADLR/megatron-lm!2664

ADLR/megatron-lm!2693 - ci: Read package_info.py

b997545

Merge branch 'ko3n1g/ci/package-info' into 'main'

020cb6e

ci: Read `package_info.py` See merge request ADLR/megatron-lm!2693

ADLR/megatron-lm!2694 - docs: Add changelog

96b1c07

Merge branch 'ko3n1g/docs/changelog' into 'main'

ae82b26

docs: Add changelog See merge request ADLR/megatron-lm!2694

ADLR/megatron-lm!2682 - Guard against 'common_step'=None

86b157e

Merge branch 'maanug/common-step-guard' into 'main'

677382e

Guard against 'common_step'=None See merge request ADLR/megatron-lm!2682

ADLR/megatron-lm!2699 - ci: Set legacy suite

a551421

Merge branch 'ko3n1g/ci/legacy-suite' into 'main'

addeb0d

ci: Set legacy suite See merge request ADLR/megatron-lm!2699

ADLR/megatron-lm!2696 - ci: Fix release

cbc9be6

Merge branch 'ko3n1g/ci/fix-release' into 'main'

3312b08

ci: Fix release See merge request ADLR/megatron-lm!2696

ADLR/megatron-lm!2697 - Fix the PP backend error in cpu-only case

830a086

Merge branch 'fix_ucc_mr_for_cpu_only' into 'main'

61b2c4f

Fix the PP backend error in cpu-only case See merge request ADLR/megatron-lm!2697

ADLR/megatron-lm!2701 - Fix Distributed Checkpointing for Backward Compatibility Tests

c8780d5

Merge branch 'skierat/direct_args' into 'main'

e1586c2

Fix Distributed Checkpointing for Backward Compatibility Tests See merge request ADLR/megatron-lm!2701

ADLR/megatron-lm!2688 - Add a fallback when tiktoken.offsets fail during generation

5477d06

Merge branch 'sasatheesh/tiktoken-offsets' into 'main'

48dd00a

Add a fallback when tiktoken.offsets fail during generation See merge request ADLR/megatron-lm!2688

Merge branch 'sbak/dist_persistent' into 'main'

1396c1d

Persistent Asynchronous Checkpoint Worker to prevent memory issue in some memory intensive scenario See merge request ADLR/megatron-lm!2524

ADLR/megatron-lm!2700 - build: Add one-logger

b9799f7

Merge branch 'ko3n1g/ci/onboard-logger' into 'main'

373b99c

build: Add one-logger See merge request ADLR/megatron-lm!2700

ko3n1g and others added 27 commits March 18, 2025 16:28

Merge branch 'ko3n1g/ci/cosmetic-changes' into 'main'

58279d0

ci: Cosmetic changes See merge request ADLR/megatron-lm!2908

ADLR/megatron-lm!2909 - ci: Refactor notification script

a0932f0

Merge branch 'ko3n1g/ci/better-notifications' into 'main'

8a10bf3

ci: Refactor notification script See merge request ADLR/megatron-lm!2909

ADLR/megatron-lm!2906 - ci: Configure OneLogger

9f7ba0a

Merge branch 'ko3n1g/ci/set-onelogger' into 'main'

dd5e811

ci: Configure OneLogger See merge request ADLR/megatron-lm!2906

ADLR/megatron-lm!2204 - Add conditional cudagraph support for MoE models

a606486

Co-authored-by: Robin Zhang <robinz@draco-oci-login-01.cm.cluster>

Merge branch 'moe_cudagraph' into 'main'

e8ecbbb

Add conditional cudagraph support for MoE models See merge request ADLR/megatron-lm!2204

ADLR/megatron-lm!2905 - Add ckpt step args in the test

c15df6e

Co-authored-by: Boxin Wang <boxin.wbx@gmail.com>

Merge branch 'boxin/fix_ckpt_test' into 'main'

50aa17d

Add ckpt step args in the test See merge request ADLR/megatron-lm!2905

ADLR/megatron-lm!2921 - ci: Different image for notifications

490ef20

Merge branch 'ko3n1g/ci/notifications' into 'main'

7b176f0

ci: Different image for notifications See merge request ADLR/megatron-lm!2921

ADLR/megatron-lm!2917 - ci: Cosmetic changes 2

9ec1ebf

Merge branch 'ko3n1g/ci/cosmetic-changes-2' into 'main'

037c198

ci: Cosmetic changes 2 See merge request ADLR/megatron-lm!2917

ADLR/megatron-lm!2918 - ci: Allow for lightweight tests (but not use yet)

a8353d6

Merge branch 'ko3n1g/ci/allow-for-lightweight-tests' into 'main'

622ebdb

ci: Allow for lightweight tests (but not use yet) See merge request ADLR/megatron-lm!2918

ADLR/megatron-lm!2924 - ci: non blocking steps

457d34d

Merge branch 'ko3n1g/ci/non-blocking-steps' into 'main'

4e08006

ci: non blocking steps See merge request ADLR/megatron-lm!2924

ADLR/megatron-lm!2919 - test(moe): Add merge train test.

241b91a

Co-authored-by: oliver könig <okoenig@nvidia.com>

Merge branch 'zijiey/moe_merge_train_test' into 'main'

ea1ace9

test(moe): Add merge train test. See merge request ADLR/megatron-lm!2919

ADLR/megatron-lm!2881 - ci: Lightweight functional tests

6ca0e0a

Co-authored-by: Zijie Yan <zijiey@nvidia.com> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'ko3n1g/ci/lightweight-integration-tests' into 'main'

16ec748

ci: Lightweight functional tests See merge request ADLR/megatron-lm!2881

ADLR/megatron-lm!2916 - Fix typo in cudagraphs arg check

483512d

Merge branch 'helenn-cudagraphs-argcheck-typo' into 'main'

dd44628

Fix typo in cudagraphs arg check See merge request ADLR/megatron-lm!2916

ADLR/megatron-lm!2928 - ci: Fix typo in workflow rule

c84a153

Merge branch 'ko3n1g/ci/typo-in-rule' into 'main'

d19b344

ci: Fix typo in workflow rule See merge request ADLR/megatron-lm!2928

ADLR/megatron-lm!2766 - Add FP8 recipe selection to arguments with first-last-bf16 also supported

3db7411

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>

Merge branch 'main' into partial-upstream-update

36234d3

AleHD closed this Jun 25, 2025

Partial upstream update #69

AleHD wants to merge 308 commits into main from partial-upstream-update

20 participants