Summary
Colocate (--colocate) GRPO training with Qwen3-4B on 8× Ascend 910B1 (A2) is validated end-to-end: Megatron training and vLLM rollout share the same NPUs. Training memory is released via torch_memory_saver during sleep; vLLM wake-up and rollout proceed on the freed memory; weights are synced over the IPC path in ~1.6–2.3 s/step.
Code: ascend branch @ 6211f114
Test Environment
| Item | Configuration |
| --- | --- |
| Hardware | 8 × Ascend 910B1 (A2), ~61 GiB HBM per NPU |
| Architecture | aarch64 |
| Model | Qwen3-4B |
| Algorithm | GRPO (--advantage-estimator grpo) |
| Layout | 1 node × 8 NPUs, colocate (actor 8 + rollout 8, --colocate) |
| Parallelism | TP=2, rollout-num-gpus-per-engine=2 |
Software Stack
| Component | Version / Commit |
| --- | --- |
| vime | ascend @ 6211f114 |
| PyTorch | 2.10.0 |
| torch_npu | 2.10.0 |
| CANN | 9.0.0 |
| Python | 3.11 |
| vLLM | 0.20.2 (bc150f5) |
| vllm-ascend | 0.20.2rc1 (367b8e6), __device_type__ = 'A2' |
| Megatron-LM | 3bec9aa97 |
| MindSpeed | 376e9cc |
| sgl-kernel-npu | 87df718 (2026.6.0) |
| torch_memory_saver | 0.0.8 (NPU build with aclrtMalloc hook + preload .so) |
Base image (Huawei internal): pytorch_ascend:vime_0.3.0 + CANN 9.0.0 + PyTorch 2.10.0
Patches applied via docker/patch/latest/ on the ascend branch: vllm.patch, vllm-ascend.patch, megatron-all-changes.patch, mindspeed.patch, torch_npu.patch, megatron.patch (weights_only=False), torch-memory-saver-npu.patch (aclrtMalloc hook).
Build & Setup
Option A — Dockerfile (recommended)
git clone -b ascend https://github.com/vllm-project/vime.git
cd vime
docker build -f docker/Dockerfile.npu -t vime-npu-colocate:latest .
Dockerfile.npu applies all patches, builds torch_memory_saver from sgl-kernel-npu (uses the bundled tree when present, otherwise clones from SGL_KERNEL_NPU_REPO), and installs the preload .so.
Option B — Manual steps on an existing NPU image
- Install vime
git clone -b ascend https://github.com/vllm-project/vime.git
cd vime && git checkout 6211f114
pip install -e . --no-deps --no-build-isolation
- Build torch_memory_saver (NPU colocate prerequisite)
cd /home/ma-user/sgl-kernel-npu
git checkout 87df718a2f15161d3120e4e1dfd19b30afca7cbf
patch -p1 < /path/to/vime/docker/patch/latest/torch-memory-saver-npu.patch
bash build.sh -a memory-saver
pip install output/torch_memory_saver-*.whl --no-deps --force-reinstall
cp contrib/torch_memory_saver/python/build/lib.linux-aarch64-cpython-311/torch_memory_saver_hook_mode_preload.abi3.so \
$(python3 -c "import torch_memory_saver, os; print(os.path.dirname(torch_memory_saver.__file__))")/
- Build vllm_ascend_C.so (CaMemAllocator for vLLM sleep/wake)
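cd /home/ma-user/vllm-ascend/csrc
# PYTHON_INCLUDE must point at the Python headers; derive it if it is not already set
export PYTHON_INCLUDE=${PYTHON_INCLUDE:-$(python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])")}
g++ -shared -o ../vllm_ascend/vllm_ascend_C.so camem_allocator.cpp \
-I$PYTHON_INCLUDE -I/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/include \
-L/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/lib64 -lascendcl -fPIC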
Run
Environment variables (colocate)
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export TMS_HOOK_MODE=preload
export TMS_INIT_ENABLE=1
export TMS_INIT_ENABLE_CPU_BACKUP=1
export HCCL_DETERMINISTIC=true
export HCCL_CONNECT_TIMEOUT=7200
export VLLM_ASCEND_ENABLE_NZ=0
export VLLM_SERVER_DEV_MODE=1
export LD_PRELOAD=$(python3 -c "import torch_memory_saver, os; print(os.path.join(os.path.dirname(torch_memory_saver.__file__), 'torch_memory_saver_hook_mode_preload.abi3.so'))")
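A quick pre-flight check can catch a missing or mistyped variable before the job is submitted. The following is a minimal sketch, assuming the values listed above are the ones required; `check_colocate_env` is a hypothetical helper for illustration and is not part of vime.

```python
import os

# Expected values copied from the environment block above.
EXPECTED = {
    "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:False",
    "TMS_HOOK_MODE": "preload",
    "TMS_INIT_ENABLE": "1",
    "TMS_INIT_ENABLE_CPU_BACKUP": "1",
    "HCCL_DETERMINISTIC": "true",
    "HCCL_CONNECT_TIMEOUT": "7200",
    "VLLM_ASCEND_ENABLE_NZ": "0",
    "VLLM_SERVER_DEV_MODE": "1",
}
PRELOAD_SO = "torch_memory_saver_hook_mode_preload.abi3.so"


def check_colocate_env(env):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only compares strings, it does not load the hook.
    """
    problems = []
    for name, want in EXPECTED.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    # LD_PRELOAD may hold several space- or colon-separated entries.
    preload = env.get("LD_PRELOAD", "").replace(":", " ").split()
    if not any(os.path.basename(p) == PRELOAD_SO for p in preload):
        problems.append(f"LD_PRELOAD does not include {PRELOAD_SO}")
    return problems
```

Calling `check_colocate_env(dict(os.environ))` in the shell that will submit the job returns an empty list when everything matches.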
Ray
ray start --head --num-gpus 0 --resources '{"NPU": 8}' --disable-usage-stats
Pass LD_PRELOAD and the TMS_* variables to the job through ray job submit --runtime-env-json (under env_vars) so that the Ray workers inherit them.
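One way to avoid hand-writing the JSON is to generate it from the current shell environment. This is a sketch only: `build_runtime_env` and `runtime_env_flag` are hypothetical helper names, and forwarding anything beyond LD_PRELOAD and the TMS_* variables (via `extra_names`) is an assumption rather than something this write-up requires.

```python
import json
import shlex


def build_runtime_env(env, extra_names=()):
    """Pick LD_PRELOAD and TMS_* variables and wrap them in Ray's env_vars field."""
    wanted = {"LD_PRELOAD", *extra_names}
    env_vars = {
        name: value
        for name, value in env.items()
        if name in wanted or name.startswith("TMS_")
    }
    return {"env_vars": env_vars}


def runtime_env_flag(env, extra_names=()):
    """Render a shell-quoted --runtime-env-json=... argument."""
    payload = json.dumps(build_runtime_env(env, extra_names), sort_keys=True)
    return "--runtime-env-json=" + shlex.quote(payload)
```

For example, `print(runtime_env_flag(dict(os.environ)))` (after `import os`) emits a single argument that can be pasted into the `ray job submit` command line.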
Training command (core flags)
ray job submit --address="http://127.0.0.1:<RAY_DASHBOARD_PORT>" \
--runtime-env-json='{"env_vars": { ... }}' \
--working-dir="/path/to/vime" \
-- python3 train.py \
--train-backend megatron \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--colocate \
--train-memory-margin-bytes 2147483648 \
--tensor-model-parallel-size 2 \
--rollout-num-gpus-per-engine 2 \
--vllm-gpu-memory-utilization 0.35 \
--vllm-enforce-eager \
--advantage-estimator grpo \
--hf-checkpoint /path/to/Qwen3-4B \
--ref-load /path/to/Qwen3-4B_torch_dist \
--prompt-data /path/to/dapo-math-17k.jsonl \
--num-rollout 3000 \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--global-batch-size 256 \
...
Full launcher script: scripts/models/qwen3-4B.sh + colocate env/Ray wiring (see start_colocate.sh pattern in internal deployment).
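The core flags imply a small amount of arithmetic that is easy to get wrong when changing one of them. The sketch below assumes that --rollout-batch-size counts prompts, that each prompt yields --n-samples-per-prompt responses, and that --global-batch-size counts samples per optimizer step; `ColocateLayout` is a hypothetical class used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColocateLayout:
    """Values taken from the training command above."""

    num_npus: int = 8              # --rollout-num-gpus
    npus_per_engine: int = 2       # --rollout-num-gpus-per-engine
    rollout_batch_size: int = 32   # --rollout-batch-size (assumed: prompts)
    samples_per_prompt: int = 8    # --n-samples-per-prompt
    global_batch_size: int = 256   # --global-batch-size (assumed: samples)

    def rollout_engines(self) -> int:
        """Number of vLLM engines that fit on the rollout NPUs."""
        if self.num_npus % self.npus_per_engine:
            raise ValueError("rollout NPUs must be a multiple of NPUs per engine")
        return self.num_npus // self.npus_per_engine

    def samples_per_rollout(self) -> int:
        """Responses generated in one rollout under the stated assumption."""
        return self.rollout_batch_size * self.samples_per_prompt

    def optimizer_steps_per_rollout(self) -> int:
        """Optimizer steps needed to consume one rollout's samples."""
        samples = self.samples_per_rollout()
        if samples % self.global_batch_size:
            raise ValueError("samples per rollout must be a multiple of the global batch size")
        return samples // self.global_batch_size


layout = ColocateLayout()
print(layout.rollout_engines(), layout.samples_per_rollout(), layout.optimizer_steps_per_rollout())
# prints: 4 256 1
```

Under these assumptions the configuration above gives 4 rollout engines and exactly one optimizer step per rollout.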
Test Results (steps 100–103)
Metrics
| Step | train_rollout_logprob_abs_diff | entropy_loss | kl_loss | grad_norm |
| --- | --- | --- | --- | --- |
| 100 | 0.0565 | 0.0730 | 0.1435 | 0.3398 |
| 101 | 0.0673 | 0.0800 | 0.1467 | 0.3223 |
| 102 | 0.0628 | 0.0635 | 0.1355 | 0.1319 |
| 103 | 0.1303 | 0.0810 | 0.1992 | 0.3645 |
Performance (step 100)
| Phase | Time |
| --- | --- |
| sleep | 0.57 s |
| update_weights | 2.27 s |
| wake_up | 0.97 s |
| train_wait (rollout) | 344.5 s |
| ref_logprobs | 27.6 s |
| actor_train | 84.8 s |
| step total | 457.4 s (~7.6 min) |
After torch_memory_saver.pause(), per-NPU free memory increased from ~15 GiB to ~46 GiB, which is enough for vLLM to wake up. Weight sync (~2 s) is no longer the bottleneck; rollout wait dominates at 344.5 s of the 457.4 s step (~75%).
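The share of each phase can be recomputed from the step-100 timings. The sketch below uses only the numbers reported above; `phase_shares` is a hypothetical helper. Shares are taken relative to the reported step total, and because the listed phases sum to about 460.7 s rather than 457.4 s, the percentages do not add up to exactly 100.

```python
# Step-100 timings in seconds, copied from the performance table.
PHASES_S = {
    "sleep": 0.57,
    "update_weights": 2.27,
    "wake_up": 0.97,
    "train_wait (rollout)": 344.5,
    "ref_logprobs": 27.6,
    "actor_train": 84.8,
}
STEP_TOTAL_S = 457.4


def phase_shares(phases, total):
    """Return each phase's share of the step time, in percent."""
    return {name: 100.0 * seconds / total for name, seconds in phases.items()}


shares = phase_shares(PHASES_S, STEP_TOTAL_S)
# Print phases from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{pct:6.1f}%")
```

Rollout wait comes out at roughly 75% and weight sync at about 0.5%, consistent with the observation that rollout, not weight sync, bounds the step time.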
Related
- Non-colocate NPU spike: #210
- A2 spike: #157