Partial upstream update #69
Closed
AleHD wants to merge 308 commits into
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com> Co-authored-by: Cyril Meurillon <cmeurillon@cs-oci-ord-vscode-02.cm.cluster>
Various improvements to RerunStateMachine See merge request ADLR/megatron-lm!2659
Re-enable MoE flaky unit tests. See merge request ADLR/megatron-lm!2676
build: Guard NVRX See merge request ADLR/megatron-lm!2679
Fix DDP over-param-gather issue when param order and forward mismatch See merge request ADLR/megatron-lm!2673
ci: Remove triton See merge request ADLR/megatron-lm!2664
ci: Read `package_info.py` See merge request ADLR/megatron-lm!2693
docs: Add changelog See merge request ADLR/megatron-lm!2694
Guard against 'common_step'=None See merge request ADLR/megatron-lm!2682
ci: Set legacy suite See merge request ADLR/megatron-lm!2699
ci: Fix release See merge request ADLR/megatron-lm!2696
Fix the PP backend error in cpu-only case See merge request ADLR/megatron-lm!2697
Fix Distributed Checkpointing for Backward Compatibility Tests See merge request ADLR/megatron-lm!2701
Add a fallback when tiktoken.offsets fail during generation See merge request ADLR/megatron-lm!2688
Co-authored-by: Slawek Kierat <skierat@nvidia.com> Co-authored-by: Youngeun Kwon <youngeunk@nvidia.com> Co-authored-by: Oliver Koenig <okoenig@nvidia.com> Co-authored-by: Maanu Grover <maanug@nvidia.com> Co-authored-by: Dennis Liu <denliu@nvidia.com> Co-authored-by: Zijie Yan <zijiey@nvidia.com> Co-authored-by: Cyril Meurillon <cmeurillon@nvidia.com> Co-authored-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: Jack Chang <jianbinc@nvidia.com> Co-authored-by: Tuomas Rintamaki <trintamaki@nvidia.com> Co-authored-by: Parth Mannan <pmannan@nvidia.com>
Persistent Asynchronous Checkpoint Worker to prevent memory issue in some memory intensive scenario See merge request ADLR/megatron-lm!2524
build: Add one-logger See merge request ADLR/megatron-lm!2700
ci: Cosmetic changes See merge request ADLR/megatron-lm!2908
ci: Refactor notification script See merge request ADLR/megatron-lm!2909
ci: Configure OneLogger See merge request ADLR/megatron-lm!2906
Co-authored-by: Robin Zhang <robinz@draco-oci-login-01.cm.cluster>
Add conditional cudagraph support for MoE models See merge request ADLR/megatron-lm!2204
Co-authored-by: Boxin Wang <boxin.wbx@gmail.com>
Add ckpt step args in the test See merge request ADLR/megatron-lm!2905
ci: Different image for notifications See merge request ADLR/megatron-lm!2921
ci: Cosmetic changes 2 See merge request ADLR/megatron-lm!2917
ci: Allow for lightweight tests (but not use yet) See merge request ADLR/megatron-lm!2918
ci: non blocking steps See merge request ADLR/megatron-lm!2924
Co-authored-by: oliver könig <okoenig@nvidia.com>
test(moe): Add merge train test. See merge request ADLR/megatron-lm!2919
Co-authored-by: Zijie Yan <zijiey@nvidia.com> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
ci: Lightweight functional tests See merge request ADLR/megatron-lm!2881
Fix typo in cudagraphs arg check See merge request ADLR/megatron-lm!2916
ci: Fix typo in workflow rule See merge request ADLR/megatron-lm!2928
…rst-last-bf16 also supported Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
Surprisingly, it wasn't too hard to resolve the merge conflicts. However, there are still two things that might not work: Ademamix (because of the
Dropped in favour of #82
This PR includes all upstream commits up to https://github.com/NVIDIA/Megatron-LM/tree/3db741160bb77ea31af60d2fb57aaee8d76e612f. That commit introduces various fp8 options that are high priority for the ongoing training. This PR should be easier to merge than the full update in #34.