Checkpoint - Add non-VPP to VPP model checkpoint converter#35
Open
limou102 wants to merge 23 commits into
Author
@microsoft-github-policy-service agree company="AMD"
Contributor
As discussed offline, let's make the code more general and support the uneven VPP case.
Author
Uneven pipeline mode is now supported.
yzygitzh
reviewed (3 times)
Jun 10, 2025
yzygitzh
reviewed (20 times)
Jun 11, 2025
Contributor
Please add a test case that runs in the CI/CD pipeline to test this PR. The test flow can be:
The test case should be in
Author
Resolve conversations above.

Marking as stale. No activity in 60 days.
pp_to_vpp
description
This tool converts a language model checkpoint saved without virtual pipeline parallelism (VPP) into one with VPP by increasing the virtual pipeline stage size.
Other model-parallel parameters (tensor-parallel-size, pipeline-parallel-size, expert-parallel-size, ...) remain unchanged.
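To make the effect of the conversion concrete, here is a minimal sketch of where layers end up once VPP is enabled. It assumes the interleaved layer placement used by Megatron-LM's virtual pipeline schedule and only covers the evenly divisible case. The function name `vpp_layer_assignment` is hypothetical and is not taken from this tool's code.

```python
def vpp_layer_assignment(num_layers, pp_size, vp_size):
    """Map (pp_rank, vp_rank) -> global layer indices, even split only.

    Hypothetical helper for illustration; it is not part of pp_to_vpp.
    """
    if num_layers % (pp_size * vp_size) != 0:
        raise ValueError("this sketch only handles evenly divisible layer counts")

    # Layers held by one model chunk (one virtual stage on one PP rank).
    layers_per_chunk = num_layers // (pp_size * vp_size)
    # Layers covered by one full pass over all PP ranks.
    layers_per_round = num_layers // vp_size

    assignment = {}
    for pp_rank in range(pp_size):
        for vp_rank in range(vp_size):
            offset = vp_rank * layers_per_round + pp_rank * layers_per_chunk
            assignment[(pp_rank, vp_rank)] = list(
                range(offset, offset + layers_per_chunk)
            )
    return assignment


# 8 layers, pipeline-parallel size 2, virtual pipeline size 2.
for key, layers in sorted(vpp_layer_assignment(8, 2, 2).items()):
    print(key, layers)
# (0, 0) [0, 1]
# (0, 1) [4, 5]
# (1, 0) [2, 3]
# (1, 1) [6, 7]
```

Without VPP, pipeline rank 0 would hold layers 0-3 and rank 1 layers 4-7. Under this assumed placement, layers 2-3 and 4-5 therefore have to change ranks, which is why a dedicated converter is needed rather than a simple rename.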
(2025-05-30) The tool now supports uneven pipeline mode, as well as cases where the number of layers in a pipeline stage is not divisible by the virtual pipeline degree.
See arguments:
The above two parameters must either both be provided or both be omitted. Providing them enables uneven pipeline mode and specifies the virtual pipeline layer distribution for the first and last pipeline stages (this distribution may be even, but it must still be given explicitly).
This feature builds on the following pull request:
#27
The converted model must be loaded with a Megatron-LM framework that has this pull request applied.
So far, the tool has been tested on the DeepSeek (v2, v3) and Mixtral models.
Note that all of the following configurations are currently required:
tensor_parallel_size=1
ckpt_format=torch
The checkpoint in each iteration folder should therefore look like this:
how to use
You can modify run_convert_pp_to_vpp.sh and launch it as an example.
examples
Some training logs from the tests are available in the logs directory for review.
NOTE
Training can also be continued by loading only the model weights, without the optimizer state: add the --no-load-optim argument when launching Megatron-LM, which resets the optimizer. Training performance may be affected at first, though it may recover after a few more iterations.