[NPU][Spike] Colocate Mode Test (Qwen3-4B GRPO) on Ascend 910B1 (A2) #255

Description

@CalvinXKY

Summary

Colocate (--colocate) GRPO training with Qwen3-4B on 8× Ascend 910B1 (A2) is validated end-to-end: Megatron training and vLLM rollout share the same NPUs. Training memory is released via torch_memory_saver during sleep; vLLM wake-up and rollout proceed on the freed memory; weights are synced over the IPC path in ~1.6–2.3 s/step.

Code: ascend branch @ 6211f114


Test Environment

| Item | Configuration |
| --- | --- |
| Hardware | 8 × Ascend 910B1 (A2), ~61 GiB HBM per NPU |
| Architecture | aarch64 |
| Model | Qwen3-4B |
| Algorithm | GRPO (--advantage-estimator grpo) |
| Layout | 1 node × 8 NPUs, colocate (actor 8 + rollout 8, --colocate) |
| Parallelism | TP=2, rollout-num-gpus-per-engine=2 |

Software Stack

| Component | Version / Commit |
| --- | --- |
| vime | ascend @ 6211f114 |
| PyTorch | 2.10.0 |
| torch_npu | 2.10.0 |
| CANN | 9.0.0 |
| Python | 3.11 |
| vLLM | 0.20.2 (bc150f5) |
| vllm-ascend | 0.20.2rc1 (367b8e6), __device_type__ = 'A2' |
| Megatron-LM | 3bec9aa97 |
| MindSpeed | 376e9cc |
| sgl-kernel-npu | 87df718 (2026.6.0) |
| torch_memory_saver | 0.0.8 (NPU build with aclrtMalloc hook + preload .so) |

Base image (Huawei internal): pytorch_ascend:vime_0.3.0 + CANN 9.0.0 + PyTorch 2.10.0

Patches applied via docker/patch/latest/ on the ascend branch: vllm.patch, vllm-ascend.patch, megatron-all-changes.patch, mindspeed.patch, torch_npu.patch, megatron.patch (weights_only=False), torch-memory-saver-npu.patch (aclrtMalloc hook).


Build & Setup

Option A — Dockerfile (recommended)

git clone -b ascend https://github.com/vllm-project/vime.git
cd vime
docker build -f docker/Dockerfile.npu -t vime-npu-colocate:latest .

Dockerfile.npu applies all patches, builds torch_memory_saver from sgl-kernel-npu (uses the bundled tree when present, otherwise clones from SGL_KERNEL_NPU_REPO), and installs the preload .so.

Option B — Manual steps on an existing NPU image

  1. Install vime
git clone -b ascend https://github.com/vllm-project/vime.git
cd vime && git checkout 6211f114
pip install -e . --no-deps --no-build-isolation
  2. Build torch_memory_saver (NPU colocate prerequisite)
cd /home/ma-user/sgl-kernel-npu
git checkout 87df718a2f15161d3120e4e1dfd19b30afca7cbf
patch -p1 < /path/to/vime/docker/patch/latest/torch-memory-saver-npu.patch
bash build.sh -a memory-saver
pip install output/torch_memory_saver-*.whl --no-deps --force-reinstall
cp contrib/torch_memory_saver/python/build/lib.linux-aarch64-cpython-311/torch_memory_saver_hook_mode_preload.abi3.so \
   $(python3 -c "import torch_memory_saver, os; print(os.path.dirname(torch_memory_saver.__file__))")/
  3. Build vllm_ascend_C.so (CaMemAllocator for vLLM sleep/wake)
cd /home/ma-user/vllm-ascend/csrc
g++ -shared -o ../vllm_ascend/vllm_ascend_C.so camem_allocator.cpp \
  -I$PYTHON_INCLUDE -I/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/include \
  -L/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/lib64 -lascendcl -fPIC

Run

Environment variables (colocate)

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export TMS_HOOK_MODE=preload
export TMS_INIT_ENABLE=1
export TMS_INIT_ENABLE_CPU_BACKUP=1
export HCCL_DETERMINISTIC=true
export HCCL_CONNECT_TIMEOUT=7200
export VLLM_ASCEND_ENABLE_NZ=0
export VLLM_SERVER_DEV_MODE=1
export LD_PRELOAD=$(python3 -c "import torch_memory_saver, os; print(os.path.join(os.path.dirname(torch_memory_saver.__file__), 'torch_memory_saver_hook_mode_preload.abi3.so'))")

Ray

ray start --head --num-gpus 0 --resources '{"NPU": 8}' --disable-usage-stats

Pass LD_PRELOAD and the TMS_* variables to the job through ray job submit --runtime-env-json so that the Ray workers inherit them.
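
One way to assemble the --runtime-env-json payload is to collect the colocate variables and serialize them under Ray's env_vars key. This is a minimal sketch, not part of vime: the helper name build_runtime_env and the COLOCATE_ENV_KEYS tuple are illustrative, and the LD_PRELOAD value in the demo is a placeholder path.

```python
import json
import os

# Variable names taken from the "Environment variables (colocate)" list.
COLOCATE_ENV_KEYS = (
    "PYTORCH_NPU_ALLOC_CONF",
    "TMS_HOOK_MODE",
    "TMS_INIT_ENABLE",
    "TMS_INIT_ENABLE_CPU_BACKUP",
    "HCCL_DETERMINISTIC",
    "HCCL_CONNECT_TIMEOUT",
    "VLLM_ASCEND_ENABLE_NZ",
    "VLLM_SERVER_DEV_MODE",
    "LD_PRELOAD",
)


def build_runtime_env(environ=None, keys=COLOCATE_ENV_KEYS):
    """Return a JSON string suitable for `ray job submit --runtime-env-json`.

    Hypothetical helper: it refuses to build a payload if any variable is
    missing, so a forgotten export fails before the job is submitted.
    """
    environ = os.environ if environ is None else environ
    missing = [k for k in keys if k not in environ]
    if missing:
        raise KeyError("colocate variables not set: " + ", ".join(missing))
    return json.dumps({"env_vars": {k: environ[k] for k in keys}})


# Demo with an explicit mapping instead of the real process environment.
demo_env = {
    "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:False",
    "TMS_HOOK_MODE": "preload",
    "TMS_INIT_ENABLE": "1",
    "TMS_INIT_ENABLE_CPU_BACKUP": "1",
    "HCCL_DETERMINISTIC": "true",
    "HCCL_CONNECT_TIMEOUT": "7200",
    "VLLM_ASCEND_ENABLE_NZ": "0",
    "VLLM_SERVER_DEV_MODE": "1",
    # Placeholder path, not a real install location.
    "LD_PRELOAD": "/path/to/torch_memory_saver_hook_mode_preload.abi3.so",
}
print(build_runtime_env(demo_env))
```

Calling build_runtime_env() with no arguments reads the launching shell's environment instead, so the exports above must already be in place.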

Training command (core flags)

ray job submit --address="http://127.0.0.1:<RAY_DASHBOARD_PORT>" \
  --runtime-env-json='{"env_vars": { ... }}' \
  --working-dir="/path/to/vime" \
  -- python3 train.py \
    --train-backend megatron \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --colocate \
    --train-memory-margin-bytes 2147483648 \
    --tensor-model-parallel-size 2 \
    --rollout-num-gpus-per-engine 2 \
    --vllm-gpu-memory-utilization 0.35 \
    --vllm-enforce-eager \
    --advantage-estimator grpo \
    --hf-checkpoint /path/to/Qwen3-4B \
    --ref-load /path/to/Qwen3-4B_torch_dist \
    --prompt-data /path/to/dapo-math-17k.jsonl \
    --num-rollout 3000 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    ...

Full launcher script: scripts/models/qwen3-4B.sh, combined with the colocate environment and Ray wiring above (this follows the start_colocate.sh pattern used in the internal deployment).
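
The batch and layout flags in the command are mutually consistent: 32 prompts × 8 samples per prompt gives 256 samples per rollout, which matches --global-batch-size 256, and 8 rollout NPUs at 2 per engine gives 4 vLLM engines. The check below is a sketch; the function name is illustrative, and the reading that each rollout is consumed in whole global batches is an assumption rather than something the issue states.

```python
def rollout_layout(rollout_batch_size, n_samples_per_prompt, global_batch_size,
                   rollout_num_gpus, gpus_per_engine):
    """Derive per-rollout sample count, train steps, and engine count.

    Assumes (not confirmed by the issue) that a rollout is split into
    whole global batches, one optimizer step per global batch.
    """
    samples = rollout_batch_size * n_samples_per_prompt
    if samples % global_batch_size != 0:
        raise ValueError("samples per rollout must be a multiple of the global batch size")
    if rollout_num_gpus % gpus_per_engine != 0:
        raise ValueError("rollout NPUs must divide evenly into engines")
    return {
        "samples_per_rollout": samples,
        "train_steps_per_rollout": samples // global_batch_size,
        "rollout_engines": rollout_num_gpus // gpus_per_engine,
    }


# Values from the training command.
layout = rollout_layout(
    rollout_batch_size=32,
    n_samples_per_prompt=8,
    global_batch_size=256,
    rollout_num_gpus=8,
    gpus_per_engine=2,
)
print(layout)
```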


Test Results (steps 100–103)

Metrics

| Step | train_rollout_logprob_abs_diff | entropy_loss | kl_loss | grad_norm |
| --- | --- | --- | --- | --- |
| 100 | 0.0565 | 0.0730 | 0.1435 | 0.3398 |
| 101 | 0.0673 | 0.0800 | 0.1467 | 0.3223 |
| 102 | 0.0628 | 0.0635 | 0.1355 | 0.1319 |
| 103 | 0.1303 | 0.0810 | 0.1992 | 0.3645 |

Performance (step 100)

| Phase | Time |
| --- | --- |
| sleep | 0.57 s |
| update_weights | 2.27 s |
| wake_up | 0.97 s |
| train_wait (rollout) | 344.5 s |
| ref_logprobs | 27.6 s |
| actor_train | 84.8 s |
| step total | 457.4 s (~7.6 min) |
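
The share of each phase in the step can be computed directly from the timings above. This is a small sketch with illustrative names. Note that the listed phases sum to about 460.7 s, slightly more than the reported 457.4 s step total, so the shares are approximate.

```python
# Step-100 phase timings in seconds, copied from the table above.
PHASES_S = {
    "sleep": 0.57,
    "update_weights": 2.27,
    "wake_up": 0.97,
    "train_wait (rollout)": 344.5,
    "ref_logprobs": 27.6,
    "actor_train": 84.8,
}
STEP_TOTAL_S = 457.4  # reported step total


def phase_shares(phases, total):
    """Return each phase's fraction of the reported step total."""
    return {name: seconds / total for name, seconds in phases.items()}


shares = phase_shares(PHASES_S, STEP_TOTAL_S)
# Print phases from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<22} {share:6.1%}")
```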

After torch_memory_saver.pause(), per-NPU free memory increased from ~15 GiB to ~46 GiB, enabling vLLM wake-up. Weight sync is no longer the bottleneck (~2 s); rollout wait dominates (344.5 s of 457.4 s, ~75% of step time).
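
A rough budget check makes the memory claim concrete. It assumes, as in upstream vLLM, that the memory-utilization fraction is applied to total device memory; whether vllm-ascend accounts for it the same way is not confirmed here. All figures are the approximate values quoted in this issue, and the function names are illustrative.

```python
GIB = 1024 ** 3


def vllm_budget_gib(total_hbm_gib, utilization):
    """Memory vLLM may claim, assuming the fraction applies to total HBM."""
    return total_hbm_gib * utilization


def fits(budget_gib, free_gib):
    """True if the requested budget fits into the currently free memory."""
    return budget_gib <= free_gib


budget = vllm_budget_gib(total_hbm_gib=61, utilization=0.35)
margin_gib = 2147483648 / GIB  # value of --train-memory-margin-bytes, in GiB

print(f"vLLM budget: ~{budget:.1f} GiB")
print(f"fits before pause (~15 GiB free): {fits(budget, 15)}")
print(f"fits after pause  (~46 GiB free): {fits(budget, 46)}")
print(f"train memory margin: {margin_gib:.0f} GiB")
```

Under these assumptions the ~21 GiB budget does not fit in the ~15 GiB available while training memory is resident, but fits comfortably once it is paused.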


Related

  • Non-colocate NPU spike: #210
  • A2 spike: #157
