[RFC] Steps and Test Result for Running Qwen3-30B on NPU(A3) (use-rollout-routing-replay) #270

Description

@Windfeng8

This issue documents the end-to-end deployment and training workflow for running Qwen3-30B GRPO training (with use-rollout-routing-replay enabled) with VIME on Ascend NPU (Atlas 800I A3, 16× NPUs). It is intended as a reference for the community.

The reference guide PR is #292.

Preparing the Running Environment

Use docker

# Update the vime image
export IMAGE=quay.io/ascend/vime:vime-latest

docker run -d --name vime-npu -it --net=host --shm-size=1024g \
    --privileged=true \
    --cap-add=SYS_PTRACE \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /home:/home \
    -v /mnt:/mnt \
    -v /tmp:/tmp \
    -v /data:/data \
    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
    $IMAGE

docker exec -it vime-npu bash
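
Once inside the container, it can help to confirm that the device nodes and driver mounts passed to docker run are actually visible. The following is a minimal sketch, not part of the original guide; check_paths is a hypothetical helper name, and the paths in the usage comment are the ones from the docker run command above.

```shell
# Hypothetical helper: report every path that does not exist.
# Returns 0 if all paths exist, 1 if at least one is missing.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage inside the container (requires the mounts shown above):
#   check_paths /dev/davinci_manager /dev/hisi_hdc /dev/devmm_svm \
#       /usr/local/Ascend/driver /usr/local/bin/npu-smi && npu-smi info
```

If any path is reported missing, the corresponding --device or -v option on the host is the first thing to check.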

Use Source code

Python Version

Only Python 3.12 is currently supported.

conda create -n vime_release python=3.12
conda activate vime_release

Working Directory Setup

mkdir <WORKSPACE> && cd <WORKSPACE>

CANN Environment

Before working with vime on Ascend, install the CANN Toolkit, the Kernels operator package, and NNAL, version 9.0.0. See the installation guide for details. Then load the environment:

source <CANN_PATH>/ascend-toolkit/set_env.sh
source <CANN_PATH>/nnal/atb/set_env.sh
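
A quick way to confirm that both set_env.sh scripts took effect is to test for the variables they export. This sketch assumes the scripts export ASCEND_TOOLKIT_HOME and ATB_HOME_PATH respectively; verify the exact names against your CANN installation. require_env is a hypothetical helper.

```shell
# Hypothetical helper: fail if any named variable is unset or empty
# in the exported environment.
require_env() {
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "environment variable $name is not set" >&2
            return 1
        fi
    done
    return 0
}

# Possible usage after sourcing both set_env.sh scripts
# (variable names are assumptions, see above):
#   require_env ASCEND_TOOLKIT_HOME ATB_HOME_PATH
```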

PyTorch and PyTorch NPU

pip install torch-npu==2.10.0
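
After the install, a short smoke test can show whether PyTorch sees the NPUs. This is a sketch under the assumption that importing torch_npu registers the torch.npu namespace, as in current torch-npu releases; check_torch_npu is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: print the PyTorch version and NPU visibility.
check_torch_npu() {
    python -c '
import torch
import torch_npu  # importing this registers the "npu" device with PyTorch

print("torch version:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())
'
}

# Possible usage on a machine with the driver and CANN environment loaded:
#   check_torch_npu
```

On the 16-NPU machine described in this issue, one would expect the reported count to be 16 when ASCEND_RT_VISIBLE_DEVICES is not restricting visibility.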

Installing Dependencies

Megatron-Bridge

pip install git+https://github.com/ISEEKYAN/mbridge.git@89eb10887887bc74853f89a4de258c0702932a1c --no-deps

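cd <WORKSPACE>
git clone https://github.com/radixark/Megatron-Bridge.git -b bridge
cd Megatron-Bridge && git checkout 3fd3768045422d0aa5c97e90a4e6c659aea9acb9
# Quote the requirement so the shell does not treat ">=" as an output redirection
pip install "nvidia-modelopt[torch]>=0.37.0" --no-build-isolation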

Megatron-LM

cd <WORKSPACE>
git clone https://github.com/NVIDIA/Megatron-LM.git --recursive && \
  cd Megatron-LM/ && git checkout 3714d81d418c9f1bca4594fc35f9e8289f652862 && \
  pip install -e .

MindSpeed

cd <WORKSPACE>
git clone https://gitcode.com/Ascend/MindSpeed.git && \
  cd MindSpeed/ && git checkout fc63de5c48426dd019c3b3f39e65f5bdf56e4086 && \
  pip install -e .

Vime

cd <WORKSPACE>
git clone https://github.com/vllm-project/vime.git -b npu && cd vime
cp -r docker/npu_patch ../npu_patch
pip install -e .

Applying Patches

cd <WORKSPACE>/Megatron-LM
git apply ../npu_patch/megatron_common.patch
git apply ../npu_patch/megatron.patch

cd <WORKSPACE>/Megatron-Bridge
git apply ../npu_patch/megatron-bridge.patch

cd <WORKSPACE>/MindSpeed
git apply ../npu_patch/mindspeed.patch
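
Because the patches target specific commits, a dry run with git apply --check can catch a wrong checkout before the working tree is modified. The following is a minimal sketch; apply_patch_checked is a hypothetical helper, not part of vime.

```shell
# Hypothetical helper: apply a patch only if it applies cleanly.
# A relative patch path is resolved relative to the repository
# directory, because "git -C" runs git as if started there.
apply_patch_checked() {
    repo_dir=$1
    patch_file=$2
    if git -C "$repo_dir" apply --check "$patch_file"; then
        git -C "$repo_dir" apply "$patch_file"
    else
        echo "patch does not apply cleanly: $patch_file" >&2
        return 1
    fi
}

# Possible usage, mirroring the commands above:
#   apply_patch_checked Megatron-LM ../npu_patch/megatron_common.patch
#   apply_patch_checked Megatron-LM ../npu_patch/megatron.patch
```

A failure here usually means the repository is not on the commit listed in the install steps, or that the patch was already applied.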

Additional Dependencies

cd <WORKSPACE>/vime
pip install triton-ascend
pip install torch-npu==2.10.0
pip install torchvision==0.25.0
pip install numpy==1.26.4

Training script

Example script qwen3_30b_example.sh (the paths below assume <WORKSPACE> is /root):

export SLIME_SCRIPT_TRAIN_BACKEND=megatron
export PYTHONPATH="/root/Megatron-Bridge/src:/root/Megatron-LM/:$PYTHONPATH"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export CUDA_DEVICE_MAX_CONNECTIONS=1
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
export HYDRA_FULL_ERROR=1
export MASTER_PORT=$(shuf -i 20000-65000 -n 1)  # or any free port
export DISABLE_L2_CACHE=1
export VLLM_ASCEND_ENABLE_NZ=0


unset http_proxy
unset https_proxy


SCRIPT_DIR="/root/vime/scripts/"
source "${SCRIPT_DIR}/models/qwen3-30B-A3B.sh"
MODEL_ROOT="${MODEL_ROOT:-/root}"

python /root/vime/train.py \
  --train-backend megatron \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --rollout-num-gpus-per-engine 8 \
  "${MODEL_ARGS[@]}" \
  \
  --hf-checkpoint /mnt/weight/Qwen3-30B-A3B/ \
  \
  --prompt-data /mnt/dataset/dapo-math-17k/dapo-math-17k.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-shuffle \
  --rm-type deepscaler \
  \
  --rollout-backend vllm \
  --use-rollout-routing-replay \
  --vllm-weight-sync-mode native \
  --vllm-gpu-memory-utilization 0.7 \
  --vllm-enable-sleep-mode \
  --vllm-max-model-len $((1024 * 20)) \
  \
  --num-rollout 300 \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --rollout-max-response-len $((1024 * 8)) \
  --rollout-temperature 1.0 \
  --global-batch-size 256 \
  --balance-data \
  \
  --advantage-estimator grpo \
  --kl-loss-coef 0.0 \
  --kl-loss-type low_var_kl \
  --entropy-coef 0.0 \
  --eps-clip 0.2 \
  --eps-clip-high 0.28 \
  \
  --optimizer adam \
  --lr 1e-6 \
  --lr-decay-style constant \
  --weight-decay 0.1 \
  --adam-beta1 0.9 \
  --adam-beta2 0.98 \
  --optimizer-cpu-offload \
  --overlap-cpu-optimizer-d2h-h2d \
  --use-precision-aware-optimizer \
  \
  --tensor-model-parallel-size 4 \
  --sequence-parallel \
  --pipeline-model-parallel-size 1 \
  --context-parallel-size 1 \
  --expert-model-parallel-size 8 \
  --expert-tensor-parallel-size 1 \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --load /mnt/weight/Qwen3-30B-A3B/ \
  --megatron-to-hf-mode bridge \
  \
  --attention-dropout 0.0 \
  --hidden-dropout 0.0 \
  --accumulate-allreduce-grads-in-fp32 \
  --attention-softmax-in-fp32 \
  --attention-backend flash \
  --use-dynamic-batch-size \
  --max-tokens-per-gpu 20480 \
  --use-flash-attn \
  --no-gradient-accumulation-fusion \
  \
  --train-memory-margin-bytes 2147483648
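
The batch and device settings in the script can be sanity-checked with a little arithmetic. The sketch below assumes the flags mean what their names suggest: each rollout samples rollout-batch-size prompts with n-samples-per-prompt responses each, and the trainer consumes global-batch-size samples per optimizer step. The variable names are illustrative only.

```shell
# Values copied from the training command above.
rollout_batch_size=32      # prompts per rollout
n_samples_per_prompt=8     # responses sampled per prompt
global_batch_size=256      # samples per optimizer step
actor_npus=8               # --actor-num-nodes 1 x --actor-num-gpus-per-node 8
rollout_npus=8             # --rollout-num-gpus

samples_per_rollout=$((rollout_batch_size * n_samples_per_prompt))
steps_per_rollout=$((samples_per_rollout / global_batch_size))
total_npus=$((actor_npus + rollout_npus))

echo "samples per rollout: $samples_per_rollout"
echo "optimizer steps per rollout: $steps_per_rollout"
echo "NPUs used: $total_npus"
```

Under these assumptions, each rollout yields 256 samples, which is exactly one optimizer step, and training plus rollout together occupy all 16 NPUs listed in ASCEND_RT_VISIBLE_DEVICES.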

Start training

bash qwen3_30b_example.sh

Result Comparison

[Image: result comparison]
