swiss by xrsrke · Pull Request #73 · swiss-ai/Megatron-LM

xrsrke · 2025-05-23T10:22:03Z

No description provided.

…attn for dynamic batching. Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com> Co-authored-by: root <root@cw-dfw-h100-004-211-013.cm.cluster> Co-authored-by: Vijay Korthikanti <vkorthikanti@cw-dfw-cs-001-login-01.cm.cluster> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: root <root@cw-dfw-h100-004-279-012.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-316-012.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-258-026.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-008-033.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-236-026.cm.cluster> Co-authored-by: root <root@cw-dfw-h100-004-267-012.cm.cluster>

ko3n1g and others added 30 commits April 15, 2025 14:17

Merge branch 'ko3n1g/ci/fix-publish-notify-job' into 'main'

65cf8d5

ci: Fix publish notify job See merge request ADLR/megatron-lm!3117

ADLR/megatron-lm!3106 - ci: Upload pipeline telemetrics

34b3723

Merge branch 'ko3n1g/ci/dashboard-functional-runs' into 'main'

9b9e374

ci: Upload pipeline telemetrics See merge request ADLR/megatron-lm!3106

ADLR/megatron-lm!3118 - Fix `post_training/test_get_gpt_modelopt_spec_interface`

69e284d

Merge branch 'fix_mo_spec_test' into 'main'

d46f999

Fix `post_training/test_get_gpt_modelopt_spec_interface` See merge request ADLR/megatron-lm!3118

ADLR/megatron-lm!3023 - Remove legacy bert tests

671f254

Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>

Merge branch 'remove-legacy-bert-test' into 'main'

8579a5d

Remove legacy bert tests See merge request ADLR/megatron-lm!3023

ADLR/megatron-lm!2601 - Alit/config mamba head

f5a57fe

Co-authored-by: Ali Taghibakhshi <ataghibakhsh@cw-dfw-cs-001-vscode-01.cm.cluster> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'alit/config_mamba_head' into 'main'

ecf8a10

Alit/config mamba head See merge request ADLR/megatron-lm!2601

ADLR/megatron-lm!3125 - Update CODEOWNERS to make modelopt review only for QAT.

cbbbacb

Merge branch 'shanmugamr-main-patch-70610' into 'main'

4597aaa

Update CODEOWNERS to make modelopt review only for QAT. See merge request ADLR/megatron-lm!3125

ADLR/megatron-lm!3119 - Run nemo2 tests instead of nemo1

f26cc41

Merge branch 'chtruong/update-functional-for-nemo2' into 'main'

0f52851

Run nemo2 tests instead of nemo1 See merge request ADLR/megatron-lm!3119

Merge branch 'vijay/unify_static_dynamic' into 'main'

d2e3ffc

Integrating paged attention feature of flash_attn for dynamic batching. See merge request ADLR/megatron-lm!2955

ADLR/megatron-lm!2960 - add l2 norm in torch_norm.py for LLAMA-4 support

d0534e9

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Chenhan Yu <chenhany@nvidia.com>

Merge branch 'yuya/add_l2_norm' into 'main'

e6bd64c

add l2 norm in torch_norm.py for LLAMA-4 support See merge request ADLR/megatron-lm!2960

ADLR/megatron-lm!3126 - fix: Improvements to the auto-reminder bot

202ad22

Merge branch 'ko3n1g/fix/reminder-bot-final-review-date' into 'main'

8e0215c

fix: Improvements to the auto-reminder bot See merge request ADLR/megatron-lm!3126

ADLR/megatron-lm!2475 - Fix Gemma TRTLLM export

966bb9a

Merge branch 'bobchen/fix_nemo2' into 'main'

9db6e55

Fix Gemma TRTLLM export See merge request ADLR/megatron-lm!2475

ADLR/megatron-lm!2691 - Fix MLA THD format support

c0b5c91

Co-authored-by: Yuzhong Wang <yuzhongw@nvidia.com> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

Merge branch 'mla_PackedSeqParams' into 'main'

0bbb642

Fix MLA THD format support See merge request ADLR/megatron-lm!2691

ADLR/megatron-lm!2914 - Dynamic inference example | Control checkpoint load strictness.

02c6d64

Merge branch 'lmcafee/ifb-broken-example-25.02' into 'main'

370508c

Dynamic inference example | Control checkpoint load strictness. See merge request ADLR/megatron-lm!2914

ADLR/megatron-lm!3057 - patch for fp8 primary weight custom fsdp support

b799e3f

Co-authored-by: jianbinc <shjwudp@gmail.com>

Merge branch 'fp8_patch_for_cfsdp' into 'main'

76d6bcf

patch for fp8 primary weight custom fsdp support See merge request ADLR/megatron-lm!3057

ADLR/megatron-lm!3129 - ci: Track info about MR

35fd148

Merge branch 'ko3n1g/feat/track-info-about-merge-request' into 'main'

7935dcf

ci: Track info about MR See merge request ADLR/megatron-lm!3129

ADLR/megatron-lm!3105 - ci: Handle nargs

04a5957

ko3n1g and others added 30 commits May 12, 2025 14:06

Merge branch 'skierat/write_item_signature' into 'main'

a3609ee

Adapt _write_item call to new signature with 'serialization_format' See merge request ADLR/megatron-lm!3243

ADLR/megatron-lm!2711 - Add in-process restart

d87ba91

Co-authored-by: Russell Hewett <rhewett@nvidia.com>

Merge branch 'inprocess_mr' into 'main'

0bdebc0

Add in-process restart See merge request ADLR/megatron-lm!2711

ci(hotfix): Update Dockerfile.ci.dev

5c7ecad

Revert "ADLR/megatron-lm!2711 - Add in-process restart"

e41dde6

This reverts commit d87ba91.

ADLR/megatron-lm!3292 - ci: Run on multiple clusters

f61b17c

Merge branch 'ko3n1g/ci/multi-cluster' into 'main'

c552e21

ci: Run on multiple clusters See merge request ADLR/megatron-lm!3292

ADLR/megatron-lm!3302 - ci: Allow specific TE-ref

55343df

Merge branch 'ko3n1g/ci/te-nightly' into 'main'

d50e830

ci: Allow specific TE-ref See merge request ADLR/megatron-lm!3302

ADLR/megatron-lm!3299 - ci(fix): Write logs to log_dir

8c4875f

Merge branch 'ko3n1g/ci/unit-tests-locally' into 'main'

d6eb60b

ci(fix): Write logs to log_dir See merge request ADLR/megatron-lm!3299

ADLR/megatron-lm!3253 - Address dist checkpointing PyT 24.08 failure

c58e57f

Merge branch 'dist-ckpt-2408' into 'main'

4a114e6

Address dist checkpointing PyT 24.08 failure See merge request ADLR/megatron-lm!3253

ADLR/megatron-lm!3307 - ci(hotfix): Downstream pipeline

d2cbe5a

Merge branch 'ko3n1g/ci/fix-downstream-pipeline' into 'main'

53d55fb

ci(hotfix): Downstream pipeline See merge request ADLR/megatron-lm!3307

ADLR/megatron-lm!3308 - MR feedback: added units for arguments, optional argparse flag to clear GPU...

9c586bf

Co-authored-by: Szymon Migacz <smigacz@nvidia.com>

Merge branch 'inprocess_mr' into 'main'

8416bff

MR feedback: added units for arguments, optional argparse flag to clear GPU... See merge request ADLR/megatron-lm!3308

ADLR/megatron-lm!2966 - Allow process group as optional argument for mamba class constructor

07b1992

Co-authored-by: Zhiyu Li <zhiyul@NVIDIA.com>

Merge branch 'zhiyul/orthotope/ssm' into 'main'

175497e

Allow process group as optional argument for mamba class constructor See merge request ADLR/megatron-lm!2966

ADLR/megatron-lm!2588 - Add NVTX ranges to categorize execution

7f9f2bf

Merge branch 'llama31_automated_breakdown' into 'main'

8a9e864

Add NVTX ranges to categorize execution See merge request ADLR/megatron-lm!2588

ADLR/megatron-lm!3116 - Move fsdp 2 import from _composable to public

1ff5a37

Merge branch 'boxiangw/public_fsdp_import' into 'main'

ed0d528

Move fsdp 2 import from _composable to public See merge request ADLR/megatron-lm!3116

ADLR/megatron-lm!3321 - ci: Add nemo-image to `ci-rebuild-mcore-nemo-image`

d70e2e4

Merge branch 'ko3n1g/ci/fix-rebuild-job' into 'main'

054fad5

ci: Add nemo-image to `ci-rebuild-mcore-nemo-image` See merge request ADLR/megatron-lm!3321

ADLR/megatron-lm!3197 - ci: Re-enable tests that failed on memory

e494219

Merge branch 'ko3n1g/ci/re-enable-broken-tests' into 'main'

bfc751a

ci: Re-enable tests that failed on memory See merge request ADLR/megatron-lm!3197

tests: Disable flaky test

a73b4d2

Signed-off-by: oliver könig <okoenig@nvidia.com>

ADLR/megatron-lm!3254 - Engine updates

407e504

Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-vscode-01.cm.cluster> Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>

Merge branch 'engine_updates' into 'main'

7fe8f69

Engine updates See merge request ADLR/megatron-lm!3254

xrsrke wants to merge 821 commits into swiss-ai:main from xrsrke:main

20 participants
