[NPU][Spike] Colocate Mode Test (Qwen3-4B GRPO) on Ascend 910B1 (A2) #255

## Summary

Colocate (`--colocate`) GRPO training with **Qwen3-4B** on **8× Ascend 910B1 (A2)** is validated end-to-end, with Megatron training and vLLM rollout sharing the same NPUs. `torch_memory_saver` releases training memory during sleep, vLLM wakes up and runs rollout on the freed memory, and weights are synced over the IPC path in about **1.6–2.3 s/step**.

**Code:** [`ascend` branch @ `6211f114`](https://github.com/vllm-project/vime/commit/6211f114bbc45ae56b5383671fa4f3d052e8b5c1)

---

## Test Environment

| Item | Configuration |
|------|----------------|
| **Hardware** | 8 × Ascend **910B1** (A2), ~61 GiB HBM per NPU |
| **Architecture** | aarch64 |
| **Model** | Qwen3-4B |
| **Algorithm** | GRPO (`--advantage-estimator grpo`) |
| **Layout** | 1 node × 8 NPUs, colocate (actor 8 + rollout 8, `--colocate`) |
| **Parallelism** | TP=2, `rollout-num-gpus-per-engine=2` |

---

## Software Stack

| Component | Version / Commit |
|-----------|----------------|
| **vime** | `ascend` @ [`6211f114`](https://github.com/vllm-project/vime/commit/6211f114bbc45ae56b5383671fa4f3d052e8b5c1) |
| **PyTorch** | 2.10.0 |
| **torch_npu** | 2.10.0 |
| **CANN** | 9.0.0 |
| **Python** | 3.11 |
| **vLLM** | 0.20.2 (`bc150f5`) |
| **vllm-ascend** | 0.20.2rc1 (`367b8e6`), `__device_type__ = 'A2'` |
| **Megatron-LM** | `3bec9aa97` |
| **MindSpeed** | `376e9cc` |
| **sgl-kernel-npu** | `87df718` (2026.6.0) |
| **torch_memory_saver** | 0.0.8 (NPU build with `aclrtMalloc` hook + preload `.so`) |

**Base image (Huawei internal):** `pytorch_ascend:vime_0.3.0` + CANN 9.0.0 + PyTorch 2.10.0

Patches applied via `docker/patch/latest/` on the `ascend` branch: `vllm.patch`, `vllm-ascend.patch`, `megatron-all-changes.patch`, `mindspeed.patch`, `torch_npu.patch`, `megatron.patch` (`weights_only=False`), `torch-memory-saver-npu.patch` (`aclrtMalloc` hook).

---

## Build & Setup

### Option A — Dockerfile (recommended)

```bash
git clone -b ascend https://github.com/vllm-project/vime.git
cd vime
docker build -f docker/Dockerfile.npu -t vime-npu-colocate:latest .
```

`Dockerfile.npu` applies all patches, builds `torch_memory_saver` from `sgl-kernel-npu` (uses the bundled tree when present, otherwise clones from `SGL_KERNEL_NPU_REPO`), and installs the preload `.so`.

### Option B — Manual steps on an existing NPU image

1. **Install vime**

```bash
git clone -b ascend https://github.com/vllm-project/vime.git
cd vime && git checkout 6211f114
pip install -e . --no-deps --no-build-isolation
```

2. **Build torch_memory_saver (NPU colocate prerequisite)**

```bash
cd /home/ma-user/sgl-kernel-npu
git checkout 87df718a2f15161d3120e4e1dfd19b30afca7cbf
patch -p1 < /path/to/vime/docker/patch/latest/torch-memory-saver-npu.patch
bash build.sh -a memory-saver
pip install output/torch_memory_saver-*.whl --no-deps --force-reinstall
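# Copy the preload hook next to the installed package (destination quoted in case the path contains spaces)
cp contrib/torch_memory_saver/python/build/lib.linux-aarch64-cpython-311/torch_memory_saver_hook_mode_preload.abi3.so \
   "$(python3 -c "import torch_memory_saver, os; print(os.path.dirname(torch_memory_saver.__file__))")/"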
```

3. **Build vllm_ascend_C.so (CaMemAllocator for vLLM sleep/wake)**

```bash
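cd /home/ma-user/vllm-ascend/csrc
# PYTHON_INCLUDE must point at the Python headers; the sysconfig fallback below is an assumption
PYTHON_INCLUDE="${PYTHON_INCLUDE:-$(python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])")}"
g++ -shared -fPIC -o ../vllm_ascend/vllm_ascend_C.so camem_allocator.cpp \
  -I"$PYTHON_INCLUDE" -I/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/include \
  -L/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/lib64 -lascendcl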
```

---

## Run

### Environment variables (colocate)

```bash
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export TMS_HOOK_MODE=preload
export TMS_INIT_ENABLE=1
export TMS_INIT_ENABLE_CPU_BACKUP=1
export HCCL_DETERMINISTIC=true
export HCCL_CONNECT_TIMEOUT=7200
export VLLM_ASCEND_ENABLE_NZ=0
export VLLM_SERVER_DEV_MODE=1
export LD_PRELOAD=$(python3 -c "import torch_memory_saver, os; print(os.path.join(os.path.dirname(torch_memory_saver.__file__), 'torch_memory_saver_hook_mode_preload.abi3.so'))")
```

### Ray

```bash
ray start --head --num-gpus 0 --resources '{"NPU": 8}' --disable-usage-stats
```

Pass `LD_PRELOAD` and the `TMS_*` variables to the job through `ray job submit --runtime-env-json` so that the Ray workers inherit them.
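
One way to assemble the `--runtime-env-json` payload is to copy the colocate variables out of the current shell instead of typing them twice. The following is a minimal sketch, not the launcher used in this test: the helper name `build_runtime_env` and the `COLOCATE_ENV_KEYS` constant are illustrative assumptions, and only the `env_vars` key of Ray's runtime environment is used.

```python
import json
import os

# Variables exported in the "Environment variables (colocate)" block.
COLOCATE_ENV_KEYS = [
    "PYTORCH_NPU_ALLOC_CONF",
    "TMS_HOOK_MODE",
    "TMS_INIT_ENABLE",
    "TMS_INIT_ENABLE_CPU_BACKUP",
    "HCCL_DETERMINISTIC",
    "HCCL_CONNECT_TIMEOUT",
    "VLLM_ASCEND_ENABLE_NZ",
    "VLLM_SERVER_DEV_MODE",
    "LD_PRELOAD",
]


def build_runtime_env(environ=None):
    """Hypothetical helper: return a JSON string for --runtime-env-json.

    Raises KeyError when a variable is unset, so a missing LD_PRELOAD
    is caught before the job is submitted.
    """
    environ = os.environ if environ is None else environ
    missing = [key for key in COLOCATE_ENV_KEYS if key not in environ]
    if missing:
        raise KeyError(f"unset colocate variables: {missing}")
    env_vars = {key: environ[key] for key in COLOCATE_ENV_KEYS}
    return json.dumps({"env_vars": env_vars})
```

Calling `print(build_runtime_env())` from a small script, after the exports above have been run in the same shell, yields a string that can be substituted for the `--runtime-env-json` argument.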

### Training command (core flags)

```bash
ray job submit --address="http://127.0.0.1:<RAY_DASHBOARD_PORT>" \
  --runtime-env-json='{"env_vars": { ... }}' \
  --working-dir="/path/to/vime" \
  -- python3 train.py \
    --train-backend megatron \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --colocate \
    --train-memory-margin-bytes 2147483648 \
    --tensor-model-parallel-size 2 \
    --rollout-num-gpus-per-engine 2 \
    --vllm-gpu-memory-utilization 0.35 \
    --vllm-enforce-eager \
    --advantage-estimator grpo \
    --hf-checkpoint /path/to/Qwen3-4B \
    --ref-load /path/to/Qwen3-4B_torch_dist \
    --prompt-data /path/to/dapo-math-17k.jsonl \
    --num-rollout 3000 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    ...
```

Full launcher script: `scripts/models/qwen3-4B.sh`, combined with the colocate environment and Ray wiring described above (the internal deployment follows a `start_colocate.sh` pattern).
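
The sizing flags in the training command are related by simple arithmetic, sketched below. The numbers are copied from the command; the reading that `--global-batch-size` counts individual samples (so that one rollout maps to one optimizer step) is an assumption, not something this report states.

```python
# Values copied from the training command.
num_npus = 8                      # --rollout-num-gpus
rollout_num_gpus_per_engine = 2   # --rollout-num-gpus-per-engine
rollout_batch_size = 32           # prompts per rollout
n_samples_per_prompt = 8          # responses sampled per prompt
global_batch_size = 256           # --global-batch-size

# Each vLLM engine spans 2 NPUs, so 8 NPUs host 4 engines.
num_engines = num_npus // rollout_num_gpus_per_engine

# One rollout produces 32 * 8 = 256 samples.
samples_per_rollout = rollout_batch_size * n_samples_per_prompt

# Assumption: global_batch_size counts samples, giving 256 / 256 = 1 step per rollout.
steps_per_rollout = samples_per_rollout // global_batch_size

print(num_engines, samples_per_rollout, steps_per_rollout)  # 4 256 1
```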

---

## Test Results (steps 100–103)

### Metrics

| Step | train_rollout_logprob_abs_diff | entropy_loss | kl_loss | grad_norm |
|------|--------------------------------|--------------|---------|-----------|
| 100  | 0.0565 | 0.0730 | 0.1435 | 0.3398 |
| 101  | 0.0673 | 0.0800 | 0.1467 | 0.3223 |
| 102  | 0.0628 | 0.0635 | 0.1355 | 0.1319 |
| 103  | 0.1303 | 0.0810 | 0.1992 | 0.3645 |

### Performance (step 100)

| Phase | Time |
|-------|------|
| sleep | 0.57 s |
| update_weights | 2.27 s |
| wake_up | 0.97 s |
| train_wait (rollout) | 344.5 s |
| ref_logprobs | 27.6 s |
| actor_train | 84.8 s |
| **step total** | **457.4 s (~7.6 min)** |
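
The share of the step spent in each phase can be derived from the table above, as in the short sketch below. Shares are computed against the reported step total. The listed phases add up to 460.71 s, slightly more than the reported 457.4 s, and the report does not explain the difference, so the percentages should be read as approximate.

```python
# Timings for step 100, copied from the performance table (seconds).
phases = {
    "sleep": 0.57,
    "update_weights": 2.27,
    "wake_up": 0.97,
    "train_wait (rollout)": 344.5,
    "ref_logprobs": 27.6,
    "actor_train": 84.8,
}
step_total = 457.4  # reported step total, seconds

# Fraction of the reported step total spent in each phase.
shares = {name: seconds / step_total for name, seconds in phases.items()}

# Print phases from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {share:6.1%}")

print(f"sum of listed phases: {sum(phases.values()):.2f} s")
```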

After `torch_memory_saver.pause()`, per-NPU free memory rose from ~15 GiB to ~46 GiB, which is what allows vLLM to wake up. Weight sync is no longer the bottleneck (~2 s); waiting on rollout dominates at ~75% of step time (344.5 s of 457.4 s).
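
A rough memory budget shows why the pause matters. This is a sketch under one stated assumption: `--vllm-gpu-memory-utilization` is treated as a fraction of total device memory, which is how the option behaves in upstream vLLM but is not confirmed here for this NPU stack. All figures are the approximate values reported in this issue.

```python
GIB = 2**30

hbm_total_gib = 61.0             # per-NPU HBM, from the environment table
free_before_pause_gib = 15.0     # approximate, before torch_memory_saver.pause()
free_after_pause_gib = 46.0      # approximate, after torch_memory_saver.pause()
vllm_memory_utilization = 0.35   # --vllm-gpu-memory-utilization
train_margin_bytes = 2147483648  # --train-memory-margin-bytes

# Memory handed back by pausing the training allocations.
released_gib = free_after_pause_gib - free_before_pause_gib

# Assumption: the utilization flag is a fraction of total device memory.
vllm_budget_gib = vllm_memory_utilization * hbm_total_gib

# Free memory left once vLLM has taken its budget.
headroom_gib = free_after_pause_gib - vllm_budget_gib

print(f"released by pause : {released_gib:.2f} GiB")
print(f"vLLM budget       : {vllm_budget_gib:.2f} GiB")
print(f"headroom          : {headroom_gib:.2f} GiB")
print(f"train margin      : {train_margin_bytes / GIB:.2f} GiB")
print(f"fits before pause : {vllm_budget_gib <= free_before_pause_gib}")
print(f"fits after pause  : {vllm_budget_gib <= free_after_pause_gib}")
```

Under that assumption the vLLM budget is larger than the ~15 GiB that is free while training memory is resident, and fits comfortably in the ~46 GiB available after the pause.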

---

## Related

- Non-colocate NPU spike: [#210](https://github.com/vllm-project/vime/issues/210)
- A2 spike: [#157](https://github.com/vllm-project/vime/issues/157)




