This issue documents the end-to-end deployment and training workflow for running Qwen3-30B GRPO training (with use-rollout-routing-replay enabled) with VIME on Ascend NPUs (Atlas 800I A3, 16 NPUs). It is intended as a reference for the community.
Reference guide PR: #292
Preparing the Running Environment
Use Docker
# Update the vime image
export IMAGE=quay.io/ascend/vime:vime-latest
docker run -d --name vime-npu -it --net=host --shm-size=1024g \
--privileged=true \
--cap-add=SYS_PTRACE \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /mnt:/mnt \
-v /tmp:/tmp \
-v /data:/data \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
$IMAGE
docker exec -it vime-npu bash
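Once inside the container, it can be worth confirming that the device nodes and the npu-smi tool mapped in by the docker run command are actually visible. The following is a minimal sketch; check_paths is a hypothetical helper written for this guide, not part of vime or CANN.

```shell
check_paths() {
    # Print each argument that does not exist; return 1 if any is missing.
    local missing=0
    local p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}

# Paths taken from the --device and -v flags of the docker run command above.
if check_paths /dev/davinci_manager /dev/hisi_hdc /dev/devmm_svm /usr/local/bin/npu-smi; then
    npu-smi info
else
    echo "NPU devices or tools are not visible; re-check the docker run flags"
fi
```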
Use source code
Python Version
Currently, only Python 3.12 is supported.
conda create -n vime_release python=3.12
conda activate vime_release
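After activating the environment, a quick check that the interpreter on PATH really is a 3.12.x build can save a failed install later. This is a sketch only; is_supported_python is a hypothetical helper name.

```shell
is_supported_python() {
    # $1 is a version string such as "3.12.4"; succeed only for 3.12 or 3.12.x.
    case "$1" in
        3.12|3.12.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask the active interpreter for its version; fall back to "unknown" if python is absent.
ver="$(python -c 'import platform; print(platform.python_version())' 2>/dev/null || echo unknown)"
if is_supported_python "$ver"; then
    echo "Python $ver: OK"
else
    echo "Python $ver: expected 3.12.x"
fi
```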
Working Directory Setup
mkdir <WORKSPACE> && cd <WORKSPACE>
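The example training script later in this guide hard-codes /root as the location of the Megatron-Bridge and Megatron-LM checkouts. If your workspace lives elsewhere, the PYTHONPATH entries need to follow it. The sketch below shows one way to derive them; the default directory and the workspace_pythonpath helper are assumptions made for illustration.

```shell
# Hypothetical default location; set WORKSPACE yourself to use another directory.
WORKSPACE="${WORKSPACE:-$HOME/vime_workspace}"

workspace_pythonpath() {
    # Print the PYTHONPATH entries for the checkouts under the workspace $1.
    printf '%s/Megatron-Bridge/src:%s/Megatron-LM/\n' "$1" "$1"
}

echo "workspace:  $WORKSPACE"
echo "PYTHONPATH: $(workspace_pythonpath "$WORKSPACE")"
```

With /root as the workspace this prints the same two entries that the example script exports.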
CANN Environment
Before working with vime on Ascend, install the CANN Toolkit, the Kernels operator package, and NNAL (version 9.0.0). See the installation guide for details.
source <CANN_PATH>/ascend-toolkit/set_env.sh
source <CANN_PATH>/nnal/atb/set_env.sh
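If either set_env.sh file is missing, a plain source fails with a fairly terse error. A guarded variant is sketched below. It assumes CANN_PATH defaults to /usr/local/Ascend, which may not match your installation, and source_env is a hypothetical helper.

```shell
source_env() {
    # Source $1 if it exists; otherwise report it and return non-zero.
    if [ -f "$1" ]; then
        . "$1"
    else
        echo "not found: $1" >&2
        return 1
    fi
}

# Assumption: CANN_PATH defaults to /usr/local/Ascend; adjust to your install.
CANN_PATH="${CANN_PATH:-/usr/local/Ascend}"
for f in "$CANN_PATH/ascend-toolkit/set_env.sh" "$CANN_PATH/nnal/atb/set_env.sh"; do
    source_env "$f" || echo "skipping $f"
done
```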
PyTorch and PyTorch NPU
pip install torch-npu==2.10.0
Installing Dependencies
Megatron-Bridge
pip install git+https://github.com/ISEEKYAN/mbridge.git@89eb10887887bc74853f89a4de258c0702932a1c --no-deps
cd <WORKSPACE>
git clone https://github.com/radixark/Megatron-Bridge.git -b bridge && cd Megatron-Bridge
git checkout 3fd3768045422d0aa5c97e90a4e6c659aea9acb9
pip install "nvidia-modelopt[torch]>=0.37.0" --no-build-isolation
Megatron-LM
cd <WORKSPACE>
git clone https://github.com/NVIDIA/Megatron-LM.git --recursive && \
cd Megatron-LM/ && git checkout 3714d81d418c9f1bca4594fc35f9e8289f652862 && \
pip install -e .
MindSpeed
cd <WORKSPACE>
git clone https://gitcode.com/Ascend/MindSpeed.git && \
cd MindSpeed/ && git checkout fc63de5c48426dd019c3b3f39e65f5bdf56e4086 && \
pip install -e .
Vime
cd <WORKSPACE>
git clone https://github.com/vllm-project/vime.git -b npu && cd vime
cp -r docker/npu_patch ../npu_patch
pip install -e .
Applying Patches
cd <WORKSPACE>/Megatron-LM
git apply ../npu_patch/megatron_common.patch
git apply ../npu_patch/megatron.patch
cd <WORKSPACE>/Megatron-Bridge
git apply ../npu_patch/megatron-bridge.patch
cd <WORKSPACE>/MindSpeed
git apply ../npu_patch/mindspeed.patch
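Because the patches target the pinned commits, a mismatch in checkout can leave a repository half-patched. One way to guard against that is to run git apply --check first and apply only when the check passes. The helper below is a sketch with a hypothetical name, and it expects an absolute patch path because git -C changes directory before the path is resolved.

```shell
apply_patch_checked() {
    # apply_patch_checked REPO_DIR PATCH_FILE (PATCH_FILE must be an absolute path)
    # Dry-run first so a failing patch leaves the working tree untouched.
    if ! git -C "$1" apply --check "$2"; then
        echo "does not apply cleanly: $2" >&2
        return 1
    fi
    git -C "$1" apply "$2"
}

# Example (replace <WORKSPACE> with your absolute workspace path):
#   apply_patch_checked <WORKSPACE>/Megatron-LM <WORKSPACE>/npu_patch/megatron_common.patch
```

Running the helper a second time on an already patched tree reports a failure rather than applying the change twice.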
Additional Dependencies
cd <WORKSPACE>/vime
pip install triton-ascend
pip install torch-npu==2.10.0
pip install torchvision==0.25.0
pip install numpy==1.26.4
Training script
Example script qwen3_30b_example.sh:
export SLIME_SCRIPT_TRAIN_BACKEND=megatron
export PYTHONPATH="/root/Megatron-Bridge/src:/root/Megatron-LM/:$PYTHONPATH"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export CUDA_DEVICE_MAX_CONNECTIONS=1
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
export HYDRA_FULL_ERROR=1
export MASTER_PORT=$(shuf -i 20000-65000 -n 1) # or any free port
export DISABLE_L2_CACHE=1
export VLLM_ASCEND_ENABLE_NZ=0
unset http_proxy
unset https_proxy
SCRIPT_DIR="/root/vime/scripts/"
source "${SCRIPT_DIR}/models/qwen3-30B-A3B.sh"
MODEL_ROOT="${MODEL_ROOT:-/root}"
python /root/vime/train.py \
--train-backend megatron \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--rollout-num-gpus-per-engine 8 \
"${MODEL_ARGS[@]}" \
\
--hf-checkpoint /mnt/weight/Qwen3-30B-A3B/ \
\
--prompt-data /mnt/dataset/dapo-math-17k/dapo-math-17k.jsonl \
--input-key prompt \
--label-key label \
--apply-chat-template \
--rollout-shuffle \
--rm-type deepscaler \
\
--rollout-backend vllm \
--use-rollout-routing-replay \
--vllm-weight-sync-mode native \
--vllm-gpu-memory-utilization 0.7 \
--vllm-enable-sleep-mode \
--vllm-max-model-len $((1024 * 20)) \
\
--num-rollout 300 \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--rollout-max-response-len $((1024 * 8)) \
--rollout-temperature 1.0 \
--global-batch-size 256 \
--balance-data \
\
--advantage-estimator grpo \
--kl-loss-coef 0.0 \
--kl-loss-type low_var_kl \
--entropy-coef 0.0 \
--eps-clip 0.2 \
--eps-clip-high 0.28 \
\
--optimizer adam \
--lr 1e-6 \
--lr-decay-style constant \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.98 \
--optimizer-cpu-offload \
--overlap-cpu-optimizer-d2h-h2d \
--use-precision-aware-optimizer \
\
--tensor-model-parallel-size 4 \
--sequence-parallel \
--pipeline-model-parallel-size 1 \
--context-parallel-size 1 \
--expert-model-parallel-size 8 \
--expert-tensor-parallel-size 1 \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--load /mnt/weight/Qwen3-30B-A3B/ \
--megatron-to-hf-mode bridge \
\
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--accumulate-allreduce-grads-in-fp32 \
--attention-softmax-in-fp32 \
--attention-backend flash \
--use-dynamic-batch-size \
--max-tokens-per-gpu 20480 \
--use-flash-attn \
--no-gradient-accumulation-fusion \
\
--train-memory-margin-bytes 2147483648
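The sizes in the script above fit together as worked out below. This is a reading of the flags rather than anything confirmed by the vime source: it assumes training and rollout use disjoint NPUs, that data parallelism follows the usual Megatron rule (devices divided by TP x PP x CP), and that one rollout feeds samples / global-batch-size optimizer steps.

```shell
# Values copied from the example script above.
ACTOR_NPUS=8            # --actor-num-nodes 1 x --actor-num-gpus-per-node 8
ROLLOUT_NPUS=8          # --rollout-num-gpus
TP=4; PP=1; CP=1        # tensor / pipeline / context parallel sizes
ROLLOUT_BATCH_SIZE=32   # prompts per rollout
N_SAMPLES_PER_PROMPT=8
GLOBAL_BATCH_SIZE=256
MAX_MODEL_LEN=$((1024 * 20))      # --vllm-max-model-len
MAX_RESPONSE_LEN=$((1024 * 8))    # --rollout-max-response-len

TOTAL_NPUS=$((ACTOR_NPUS + ROLLOUT_NPUS))
DATA_PARALLEL=$((ACTOR_NPUS / (TP * PP * CP)))
SAMPLES_PER_ROLLOUT=$((ROLLOUT_BATCH_SIZE * N_SAMPLES_PER_PROMPT))
STEPS_PER_ROLLOUT=$((SAMPLES_PER_ROLLOUT / GLOBAL_BATCH_SIZE))
PROMPT_BUDGET=$((MAX_MODEL_LEN - MAX_RESPONSE_LEN))

echo "total NPUs:          $TOTAL_NPUS"
echo "data-parallel size:  $DATA_PARALLEL"
echo "samples per rollout: $SAMPLES_PER_ROLLOUT"
echo "steps per rollout:   $STEPS_PER_ROLLOUT"
echo "prompt token budget: $PROMPT_BUDGET"
```

Under these assumptions the run uses all 16 NPUs, trains with a data-parallel size of 2, and takes one optimizer step per rollout, leaving 12288 tokens of context for the prompt.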
Start training
bash qwen3_30b_example.sh
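If a launch fails with a port conflict, note that the example script draws MASTER_PORT from 20000-65000, a range that contains both HCCL port ranges (60000-60050 and 61000-61050). A sketch that redraws until the port falls outside those ranges is shown below. The helper names are hypothetical, and the code does not check whether the chosen port is actually free.

```shell
in_range() {
    # in_range PORT LOW HIGH: succeed when LOW <= PORT <= HIGH.
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

pick_master_port() {
    # Draw random ports until one falls outside both HCCL ranges.
    local port
    while :; do
        port="$(shuf -i 20000-65000 -n 1)"
        if ! in_range "$port" 60000 60050 && ! in_range "$port" 61000 61050; then
            echo "$port"
            return 0
        fi
    done
}

export MASTER_PORT="$(pick_master_port)"
echo "MASTER_PORT=$MASTER_PORT"
```

A simpler alternative is to narrow the shuf range so that it ends below 60000.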
Result Comparison
