
Checkpoint - Add non-VPP to VPP model checkpoint converter #35

Open
limou102 wants to merge 23 commits into dev from limou/vpp_converter

@limou102 limou102 commented May 28, 2025


pp_to_vpp

description

This tool converts a language model checkpoint saved without virtual pipeline parallelism (VPP) into one that uses VPP, by increasing the virtual pipeline size.

Other model parallel parameters (tensor-parallel-size, pipeline-parallel-size, expert-parallel-size ...) remain unchanged.


(2025-05-30) It now supports uneven pipeline mode, as well as cases where the number of layers in a pipeline stage is not divisible by the virtual pipeline degree.

See the arguments:

--target-first-virtual-pipeline-num-layers-split
--target-last-virtual-pipeline-num-layers-split

These two parameters must either both be provided or both be omitted. Providing them enables uneven pipeline mode and specifies the virtual pipeline layer distribution for the first and last pipeline stages (this distribution may be even, but it still needs to be given explicitly).
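
The both-or-neither rule for these two arguments can be sketched as below. This is only an illustration, not the converter's actual code: the flag names are taken from the usage text, while `parse_uneven_args` and the `uneven_pipeline` attribute are hypothetical names introduced here.

```python
import argparse


def parse_uneven_args(argv):
    """Parse the VPP-related flags and enforce the both-or-neither rule."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--target-virtual-pipeline-model-parallel-size",
                        type=int, required=True)
    parser.add_argument("--target-first-virtual-pipeline-num-layers-split",
                        type=int, nargs="+", default=None)
    parser.add_argument("--target-last-virtual-pipeline-num-layers-split",
                        type=int, nargs="+", default=None)
    args = parser.parse_args(argv)

    first = args.target_first_virtual_pipeline_num_layers_split
    last = args.target_last_virtual_pipeline_num_layers_split
    # Exactly one of the two being set is an error.
    if (first is None) != (last is None):
        parser.error("the first/last virtual pipeline splits must be "
                     "provided together or omitted together")
    # Hypothetical convenience flag: uneven mode is on when both are given.
    args.uneven_pipeline = first is not None
    return args
```

With this sketch, passing only one of the two flags ends in an argparse error (exit status 2), while passing neither leaves `uneven_pipeline` set to `False`.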

This feature was introduced based on the following Pull Request.

#27

The converted model must be loaded with a Megatron-LM framework that has this Pull Request applied.


So far, the tool has been tested on the DeepSeek (v2, v3) and Mixtral models.

Note that all of the following configuration requirements must currently be met:

  • tensor_parallel_size=1
  • ckpt_format=torch

The checkpoint in each iteration folder should therefore look like this:

iter_0000050
├── mp_rank_00_000_000
│  ├── distrib_optim.pt
│  └── model_optim_rng.pt
├── mp_rank_00_000_001
│  ├── distrib_optim.pt
│  └── model_optim_rng.pt
├── mp_rank_00_000_002
│  ├── distrib_optim.pt
│  └── model_optim_rng.pt
...
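
A quick pre-flight check of this layout could look like the sketch below. It assumes the three numeric fields in each folder name are the tensor, pipeline, and expert ranks, in that order (an assumption about the naming, not something stated above); `parse_rank_dir` and `check_iteration_dir` are hypothetical helpers and are not part of this tool.

```python
import re
from pathlib import Path

# Assumed naming: mp_rank_<tensor rank>_<pipeline rank>_<expert rank>
RANK_DIR = re.compile(r"^mp_rank_(\d{2})_(\d{3})_(\d{3})$")
EXPECTED_FILES = ("distrib_optim.pt", "model_optim_rng.pt")


def parse_rank_dir(name):
    """Return (tp_rank, pp_rank, ep_rank) for a rank folder name, else None."""
    match = RANK_DIR.match(name)
    if match is None:
        return None
    tp_rank, pp_rank, ep_rank = (int(group) for group in match.groups())
    return tp_rank, pp_rank, ep_rank


def check_iteration_dir(iteration_dir):
    """List the ranks found and fail if a rank folder lacks an expected file."""
    found = []
    for child in sorted(Path(iteration_dir).iterdir()):
        ranks = parse_rank_dir(child.name)
        if ranks is None or not child.is_dir():
            continue  # ignore anything that is not a rank folder
        missing = [f for f in EXPECTED_FILES if not (child / f).is_file()]
        if missing:
            raise FileNotFoundError(f"{child} is missing {missing}")
        found.append(ranks)
    return found
```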

how to use

As an example, you can modify run_convert_pp_to_vpp.sh and launch it.

usage: main.py [-h] --load-iteration-dir LOAD_ITERATION_DIR --expert-model-parallel-size EXPERT_MODEL_PARALLEL_SIZE --pipeline-model-parallel-size PIPELINE_MODEL_PARALLEL_SIZE
               --save-iteration-dir SAVE_ITERATION_DIR --target-virtual-pipeline-model-parallel-size TARGET_VIRTUAL_PIPELINE_MODEL_PARALLEL_SIZE
               [--target-first-virtual-pipeline-num-layers-split TARGET_FIRST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT [TARGET_FIRST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT ...]]
               [--target-last-virtual-pipeline-num-layers-split TARGET_LAST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT [TARGET_LAST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT ...]]
               [--num-max-processing-processes NUM_MAX_PROCESSING_PROCESSES] [--pipeline-ranks-to-process PIPELINE_RANKS_TO_PROCESS]

convert a non-virtual pipeline checkpoint to virtual pipeline checkpoint

options:
  -h, --help            show this help message and exit
  --load-iteration-dir LOAD_ITERATION_DIR
                        iteration folder of source model checkpoint
  --expert-model-parallel-size EXPERT_MODEL_PARALLEL_SIZE
                        ep_size of original model and the target model
  --pipeline-model-parallel-size PIPELINE_MODEL_PARALLEL_SIZE
                        physical pp_size of original model and the target model
  --save-iteration-dir SAVE_ITERATION_DIR
                        iteration folder of target model checkpoint, need to be empty if existed
  --target-virtual-pipeline-model-parallel-size TARGET_VIRTUAL_PIPELINE_MODEL_PARALLEL_SIZE
                        vpp_size of target model
  --target-first-virtual-pipeline-num-layers-split TARGET_FIRST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT [TARGET_FIRST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT ...]
                        only used in uneven pipeline mode, virtual pipeline split of the first stage
  --target-last-virtual-pipeline-num-layers-split TARGET_LAST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT [TARGET_LAST_VIRTUAL_PIPELINE_NUM_LAYERS_SPLIT ...]
                        only used in uneven pipeline mode, virtual pipeline split of the last stage
  --num-max-processing-processes NUM_MAX_PROCESSING_PROCESSES
                        the maximum number of processing processes used by this script, increasing this value can speed up model conversion(but the final bottleneck may be disk
                        bandwidth), it will also consume more CPU memory.
  --pipeline-ranks-to-process PIPELINE_RANKS_TO_PROCESS
                        pipeline rank list to process using this script, to accelerate converting user can launch multiple tasks on different nodes, each one process part of pipeline
                        ranks. example : --pipeline-ranks-to-process 0 1 2 3 default is None, means process all pipeline ranks

examples

  1. The target model has virtual_pipeline_size=2, and the conversion uses 4 processes in parallel.
python main.py \
    --load-iteration-dir /path/to/src_checkpoints/iter_0000050 \
    --expert-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --save-iteration-dir /path/to/dst_checkpoints/iter_0000050 \
    --target-virtual-pipeline-model-parallel-size 2 \
    --num-max-processing-processes 4
  2. Convert the checkpoints generated by pipeline ranks [0,1,2,3] on node 1, and the checkpoints generated by pipeline ranks [4,5,6,7] on node 2 (useful when memory is limited on a single node).
# node1 :
python main.py \
    --load-iteration-dir /path/to/src_checkpoints/iter_0000050 \
    --expert-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --save-iteration-dir /path/to/dst_checkpoints/iter_0000050 \
    --target-virtual-pipeline-model-parallel-size 2 \
    --num-max-processing-processes 4 \
    --pipeline-ranks-to-process 0 1 2 3

# node2:
python main.py \
    --load-iteration-dir /path/to/src_checkpoints/iter_0000050 \
    --expert-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --save-iteration-dir /path/to/dst_checkpoints/iter_0000050 \
    --target-virtual-pipeline-model-parallel-size 2 \
    --num-max-processing-processes 4 \
    --pipeline-ranks-to-process 4 5 6 7
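
For more nodes or other pipeline sizes, the per-node rank lists can be generated instead of written by hand. The sketch below is a hypothetical helper (`split_pipeline_ranks` is not part of this tool) that splits the pipeline ranks into contiguous chunks, one chunk per node.

```python
def split_pipeline_ranks(pipeline_parallel_size, num_nodes):
    """Split ranks 0..pipeline_parallel_size-1 into contiguous per-node chunks."""
    base, extra = divmod(pipeline_parallel_size, num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional rank each.
        size = base + (1 if node < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


# Print the argument to pass on each node for the 8-rank, 2-node case above.
for node, ranks in enumerate(split_pipeline_ranks(8, 2), start=1):
    rank_list = " ".join(str(rank) for rank in ranks)
    print(f"node{node}: --pipeline-ranks-to-process {rank_list}")
```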
  3. Convert a model in uneven pipeline mode that was saved by Megatron-LM with the arguments
--decoder-first-pipeline-num-layers 8
--decoder-last-pipeline-num-layers 7
# suppose pipeline_parallel_size=4 and the model contains 31 layers in total, so the layer distribution across the pipeline stages is [8, 8, 8, 7]
# now we convert this model to increase the virtual pipeline size to 2;
#   the layer split in the first pipeline stage is [4, 4] and the layer split in the last pipeline stage is [4, 3]
#         vpp0          vpp1
# pp0  0, 1, 2, 3    16,17,18,19
# pp1  4, 5, 6, 7    20,21,22,23
# pp2  8, 9,10,11    24,25,26,27
# pp3 12,13,14,15    28,29,30

python main.py \
    --load-iteration-dir /path/to/src_checkpoints/iter_0000050 \
    --save-iteration-dir /path/to/dst_checkpoints/iter_0000050 \
    --expert-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --target-virtual-pipeline-model-parallel-size 2 \
    --target-first-virtual-pipeline-num-layers-split 4 4 \
    --target-last-virtual-pipeline-num-layers-split 4 3 \
    --num-max-processing-processes 8
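
The layer table in the comments above can be reproduced with a small sketch. It is not the converter's implementation: `assign_layers` is a hypothetical helper, and it takes an explicit split for every pipeline stage, with the middle stages assumed to be [4, 4] as in the table.

```python
def assign_layers(stage_splits):
    """Map per-stage VPP splits to global layer indices.

    stage_splits[pp][vpp] is the number of layers that pipeline stage `pp`
    holds in virtual chunk `vpp`. Layers are numbered chunk by chunk: all
    stages of vpp0 come first, then all stages of vpp1, and so on.
    """
    pp_size = len(stage_splits)
    vpp_size = len(stage_splits[0])
    if any(len(split) != vpp_size for split in stage_splits):
        raise ValueError("every stage needs one entry per virtual chunk")
    table = [[None] * vpp_size for _ in range(pp_size)]
    next_layer = 0
    for vpp in range(vpp_size):
        for pp in range(pp_size):
            count = stage_splits[pp][vpp]
            table[pp][vpp] = list(range(next_layer, next_layer + count))
            next_layer += count
    return table


# First stage [4, 4], last stage [4, 3]; middle stages assumed to be [4, 4].
table = assign_layers([[4, 4], [4, 4], [4, 4], [4, 3]])
for pp, chunks in enumerate(table):
    print(f"pp{pp}", chunks)
```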

Some training logs from the tests are available in the logs directory for review.

NOTE

It's also possible to continue training by loading only the model weights without the optimizer state (add the --no-load-optim argument when launching Megatron-LM, which resets the optimizer). In that case, performance may only recover after training for a few more iterations.

@limou102 limou102 requested a review from a team as a code owner May 28, 2025 02:49
@yzygitzh yzygitzh changed the title add pp to vpp model convert tool Checkpoint - Add non-VPP to VPP model checkpoint converter May 28, 2025
@limou102

Author

@limou102 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="AMD"

@yzygitzh

yzygitzh commented Jun 3, 2025

Contributor

As discussed offline, let's make the code more general and support the uneven VPP case.

@limou102

limou102 commented Jun 3, 2025

Author

Uneven pipeline mode is supported now.

@yzygitzh yzygitzh mentioned this pull request Jun 5, 2025
11 tasks
Comment thread tools/checkpoint/pp_tp_vpp/.gitignore Outdated
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py
Comment thread tools/checkpoint/pp_tp_vpp/run_convert_pp_to_vpp.sh Outdated
Comment thread tools/checkpoint/pp_tp_vpp/main.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/vpp_converter.py
Comment thread tools/checkpoint/pp_tp_vpp/utils.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/utils.py
Comment thread tools/checkpoint/pp_tp_vpp/utils.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/utils.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/utils.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/utils.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/model_optim_rng.py
Comment thread tools/checkpoint/pp_tp_vpp/model_optim_rng.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/model_optim_rng.py
Comment thread tools/checkpoint/pp_tp_vpp/model_optim_rng.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/model_optim_rng.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py Outdated
Comment thread tools/checkpoint/pp_tp_vpp/distrib_optim.py Outdated
@yzygitzh

Contributor

Please add a test case that runs in the CI/CD pipeline to test this PR. The test flow can be:

  • Set up a non-VPP model and save its ckpt
  • Convert it into a VPP ckpt, and load it into a corresponding VPP model
  • Compare each parameter and its corresponding optimizer states from both models, and make sure they're bitwise equal. I think DistributedOptimizer._get_main_param_and_optimizer_states should be helpful for this comparison.

The test case should be in tests/unit_tests/test_checkpointing.py. We can follow how tests/unit_tests/data/test_preprocess_data.py tests things in the tools folder.
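
The comparison step of this flow could be sketched as follows. `find_mismatches` is a hypothetical helper, not a Megatron-LM API, and it assumes the parameters and optimizer states of both models have already been collected into name-to-value mappings with matching names. For tensors, an exact comparison such as `torch.equal` could be passed as `equal`; plain bytes are used here only to keep the sketch self-contained.

```python
import operator


def find_mismatches(reference, converted, equal=operator.eq):
    """Return the sorted names whose values differ or exist on one side only."""
    mismatches = []
    for name in sorted(set(reference) | set(converted)):
        if name not in reference or name not in converted:
            mismatches.append(name)  # present in only one of the two mappings
        elif not equal(reference[name], converted[name]):
            mismatches.append(name)  # present in both, but the values differ
    return mismatches


# Toy stand-ins for parameters / optimizer states (raw bytes, not tensors).
non_vpp_state = {"layer0.weight": b"\x00\x01", "layer0.exp_avg": b"\x10"}
vpp_state = {"layer0.weight": b"\x00\x01", "layer0.exp_avg": b"\x11"}
print(find_mismatches(non_vpp_state, vpp_state))
```

A test would then assert that the returned list is empty.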

@yzygitzh yzygitzh mentioned this pull request Jun 14, 2025
13 tasks
@limou102

limou102 commented Jun 14, 2025

Author

Resolved the conversations above.
Added a unit test at tests/unit_tests/test_convert_checkpoint.py.

@github-actions


Marking as stale. No activity in 60 days.

@github-actions github-actions Bot added the stale label Aug 25, 2025