Initial ROCm support for vime#273
Code Review
This pull request introduces support for AMD ROCm 7.0.2 (gfx950) by adding a dedicated Dockerfile, a Megatron no-fork patch to prevent segfaults, and run scripts for Qwen3-8B. Feedback highlights several issues: in the Dockerfile, Python dev packages might be skipped if the version matches the system default, the deprecated apt-key should be replaced with the modern signed-by keyring, and the find command for utils.py needs error handling. Additionally, the Megatron patch should locally import logging and time to prevent runtime NameErrors, and the shell scripts should avoid broad, disruptive pkill commands.
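The NameError concern raised in the review can be illustrated with a minimal sketch. It assumes the patch installs a function into a module that has not itself imported `logging` or `time`. The module name `fake_target_module`, the function `save_in_process`, and the `install` helper are hypothetical stand-ins, not names from the actual patch.

```python
import types

# Hypothetical stand-in for the module a patch writes into; it has not
# imported `logging` or `time` at module level.
target = types.ModuleType("fake_target_module")

# Variant that relies on module-level names the target does not define.
FRAGILE_SRC = '''
def save_in_process(state):
    start = time.time()  # fails: `time` is not defined in the target module
    logging.getLogger(__name__).info("saving %d keys", len(state))
    return time.time() - start
'''

# Variant that imports what it needs inside the function body.
ROBUST_SRC = '''
def save_in_process(state):
    import logging  # local imports keep the function self-contained
    import time
    start = time.time()
    logging.getLogger(__name__).info("saving %d keys", len(state))
    return time.time() - start
'''


def install(src, module):
    """Define save_in_process inside `module`'s namespace and return it."""
    exec(src, module.__dict__)
    return module.save_in_process


def fragile_fails():
    """Return True if the fragile variant raises NameError when called."""
    try:
        install(FRAGILE_SRC, target)({"w": 1})
    except NameError:
        return True
    return False


def robust_elapsed():
    """Run the robust variant and return the elapsed time it reports."""
    return install(ROBUST_SRC, target)({"w": 1})
```

With function-local imports, the patched function no longer depends on what the target module happens to import, which is the property the review asks for.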
Initial working setup for building/running vime on ROCm 7.0.2 (gfx950).
# Local debug scratch (repro scripts, mem traces, sampled logs, plots)
_*.py
_*.sh
I don't think we need to merge .dockerignore directly into the main branch.
Sure, I removed it.
… .dockerignore
- Resolve conflict in _build_subprocess_env: keep main's refactored server_args_dict["_visible_devices"] API and apply the ROCm HIP visibility line on top of it.
- Remove .dockerignore per review (@aoshen02): not needed in main.
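As a rough illustration of that conflict resolution, the sketch below builds a subprocess environment from `server_args_dict["_visible_devices"]` and sets the HIP visibility variable on top of it. It is not the code from this PR. The function signature, the `is_rocm` flag, the `base_env` parameter, and the assumption that `_visible_devices` holds a list of device indices are all hypothetical. `CUDA_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` are the standard environment variables.

```python
import os


def build_subprocess_env(server_args_dict, is_rocm, base_env=None):
    """Return an environment dict for a rollout subprocess (hypothetical sketch).

    Assumes server_args_dict["_visible_devices"] is a list of device indices.
    """
    env = dict(os.environ if base_env is None else base_env)

    visible = server_args_dict.get("_visible_devices")
    if visible is None:
        # Nothing to restrict; inherit the parent's device visibility.
        return env

    device_list = ",".join(str(d) for d in visible)
    env["CUDA_VISIBLE_DEVICES"] = device_list
    if is_rocm:
        # The ROCm-specific line: mirror the same device list for HIP.
        env["HIP_VISIBLE_DEVICES"] = device_list
    return env
```

For example, `build_subprocess_env({"_visible_devices": [2, 3]}, is_rocm=True, base_env={})` yields an environment in which both variables are set to `2,3`.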
# actor and rollout on disjoint GPUs, RCCL-broadcast weight sync (no colocate IPC).

# Clean leftovers from a previous run (vLLM orphans procs named VLLM::*).
ray stop --force
The format could align more closely with the other scripts, for example scripts/run-qwen3-8B-async-rocm.sh.
Initial AMD ROCm support for vime (ROCm 7.0.2, gfx950 / MI350/MI355X).
- docker/Dockerfile.rocm: full source build for ROCm 7.0.2
- docker/patch/megatron_nofork_patch.py: in-process checkpoint writer (works around a ROCm 7.0.2 fork segfault)
- scripts/run-qwen3-8B-rocm.sh, scripts/run-qwen3-8B-async-rocm.sh: colocate / async run scripts
- docs/en/get_started/quick_start_rocm.md: ROCm quick start guide (from Quick Start Guide for ROCm Support #293)

Build:
Main author: @pancake0003