vllm_dissag: unify 3 launchers into one two-axis driver + models.yaml#171
Open
raviguptaamd wants to merge 4 commits into
Consolidate vllm_disagg_server.sh / vllm_disagg_mori_ep.sh /
vllm_disagg_server_deepep.sh into a single launcher (vllm_disagg.sh) that
composes behavior from two orthogonal axes plus a validated EP backend:
CONNECTOR = rixl (NixlConnector) | moriio (MoRIIOConnector)
WIDE_EP = 0 (TP) | 1 (wide expert-parallel)
EP_BACKEND= mori | deepep (only when WIDE_EP=1; validated vs connector)
Connector-specific logic lives in connectors/{rixl,moriio}.sh; the TP-vs-wideEP
fork lives in parallelism.sh; per-model CLI flags + env overrides move from
inline declare -A maps / hardcoding into a models.yaml catalog. This adds a new
capability (MoRIIO + TP) and lets a model be added by editing data, not code.
Back-compat: RUN_MORI=1 / RUN_DEEPEP=1 still work (mapped to the new axes); the
no-flags default stays rixl + TP. run_xPyD_models.slurm resolves the axes,
keeps the VALID_MODELS allowlists, and plumbs the new env via docker -e.
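A minimal sketch of how the axis resolution and pairing validation described above could look. The function name `resolve_axes`, the exact shim mapping (`RUN_MORI=1` to moriio + wideEP + mori, `RUN_DEEPEP=1` to rixl + wideEP + deepep), and the per-connector default for `EP_BACKEND` are assumptions for illustration, not the launcher's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: resolve CONNECTOR / WIDE_EP / EP_BACKEND and reject bad pairings.

resolve_axes() {
    # Back-compat shims (assumed mapping): legacy flags only fill in axes left unset.
    if [ "${RUN_MORI:-0}" = "1" ]; then
        CONNECTOR="${CONNECTOR:-moriio}"
        WIDE_EP="${WIDE_EP:-1}"
        EP_BACKEND="${EP_BACKEND:-mori}"
    elif [ "${RUN_DEEPEP:-0}" = "1" ]; then
        CONNECTOR="${CONNECTOR:-rixl}"
        WIDE_EP="${WIDE_EP:-1}"
        EP_BACKEND="${EP_BACKEND:-deepep}"
    fi

    # No-flags default stays rixl + TP.
    CONNECTOR="${CONNECTOR:-rixl}"
    WIDE_EP="${WIDE_EP:-0}"

    case "$CONNECTOR" in
        rixl|moriio) ;;
        *) echo "error: CONNECTOR must be rixl or moriio (got '$CONNECTOR')" >&2; return 1 ;;
    esac

    case "$WIDE_EP" in
        0)
            # EP_BACKEND only applies to wideEP.
            EP_BACKEND=""
            ;;
        1)
            # Assumed default: each connector has one validated EP backend.
            if [ -z "${EP_BACKEND:-}" ]; then
                case "$CONNECTOR" in
                    rixl)   EP_BACKEND=deepep ;;
                    moriio) EP_BACKEND=mori ;;
                esac
            fi
            case "$CONNECTOR:$EP_BACKEND" in
                rixl:deepep|moriio:mori) ;;
                *) echo "error: EP_BACKEND=$EP_BACKEND is not valid with CONNECTOR=$CONNECTOR" >&2
                   return 1 ;;
            esac
            ;;
        *) echo "error: WIDE_EP must be 0 or 1 (got '$WIDE_EP')" >&2; return 1 ;;
    esac

    echo "CONNECTOR=$CONNECTOR WIDE_EP=$WIDE_EP EP_BACKEND=${EP_BACKEND:-none}"
}

# Demo in a subshell so the caller's environment is untouched: the new MoRIIO + TP cell.
( CONNECTOR=moriio WIDE_EP=0 resolve_axes )
```

Resolving once and printing the result keeps the 2x2 matrix in a single place, so the connector profiles never need to re-check the pairing.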
Testing: tests/parity_check.sh drives the real launcher under DRY_RUN=1 and
diffs the assembled `vllm serve` argv against golden fixtures for every
connector x parallelism x role cell (byte-identical to the 3 legacy launchers),
plus validation-rejection checks. Live-validated serving for dense models
(amd-Llama-3.3-70B, Qwen3-32B) on the MoRIIO+TP path. See tests/TEST_PLAN.md.
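The dry-run parity idea can be sketched as follows. `fake_launcher` is a hypothetical stand-in for the real launcher under `DRY_RUN=1`, and `check_cell` plus the one-file-per-cell golden layout are likewise assumptions; the actual `tests/parity_check.sh` may be organized differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a dry-run argv parity gate (not the PR's parity_check.sh).

# Stand-in launcher: assembles an argv; under DRY_RUN=1 it prints the argv
# (one argument per line) instead of executing it.
fake_launcher() {
    local role="$1"
    local port=8000
    if [[ "$role" == "decode" ]]; then
        port=8001
    fi
    local argv=(vllm serve "${MODEL_PATH:-/models/example}"
                --port "$port"
                --tensor-parallel-size "${TP_SIZE:-8}")
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        printf '%s\n' "${argv[@]}"
    else
        "${argv[@]}"
    fi
}

# Diff the dry-run argv for one cell against its golden fixture.
check_cell() {
    local cell="$1" golden_dir="$2" actual
    actual="$(DRY_RUN=1 fake_launcher "$cell")"
    # Send the diff to stderr so stdout carries only the PASS/FAIL verdict.
    if diff -u "${golden_dir}/${cell}.txt" <(printf '%s\n' "$actual") >&2; then
        echo "PASS ${cell}"
    else
        echo "FAIL ${cell}"
        return 1
    fi
}

# Demo: capture a golden fixture, then verify the launcher still matches it.
golden_dir="$(mktemp -d)"
DRY_RUN=1 fake_launcher prefill > "${golden_dir}/prefill.txt"
check_cell prefill "$golden_dir"
```

Because the gate compares text only, it needs no GPUs or containers, which is what makes it usable as an offline check.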
Catalog seeded with the existing 6 models + Qwen3-32B (dense) and Qwen3-30B-A3B.
Co-Authored-By: Claude <noreply@anthropic.com>
Mermaid diagrams documenting the unified launcher: component architecture (sbatch → driver → connector → vllm serve), axis resolution flow + 2x2 capability matrix, per-node runtime state machine, driver↔connector hook sequence, and the per-model config/env layering. Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This PR consolidates the vLLM disaggregated prefill/decode launch flow in scripts/vllm_dissag/ by replacing three legacy launcher scripts with a single two-axis driver (CONNECTOR × WIDE_EP) plus a models.yaml catalog for per-model flags/env, and adds an offline parity test gate to keep the assembled vllm serve argv consistent with legacy behavior.
Changes:
- Replace vllm_disagg_{server,mori_ep,server_deepep}.sh with a unified vllm_disagg.sh that sources connector + parallelism profiles and reads per-model config from models.yaml.
- Update run_xPyD_models.slurm to resolve axes/back-compat shims and plumb new env vars into Docker runs.
- Add tests/parity_check.sh + golden fixtures to enforce byte-identical argv parity vs legacy launchers (plus validation rejection checks).
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| scripts/vllm_dissag/vllm_disagg.sh | New unified launcher with axis resolution, role branching, YAML model config parsing, and DRY_RUN support. |
| scripts/vllm_dissag/connectors/rixl.sh | New connector profile implementing rixl TP + DeepEP wideEP behavior. |
| scripts/vllm_dissag/connectors/moriio.sh | New connector profile implementing MoRI-EP wideEP and new MoRIIO+TP path. |
| scripts/vllm_dissag/parallelism.sh | Shared helper for wideEP master/child role args. |
| scripts/vllm_dissag/models.yaml | New model catalog replacing inline associative arrays / hardcoding. |
| scripts/vllm_dissag/tests/TEST_PLAN.md | New before/after validation plan documenting parity + live matrix. |
| scripts/vllm_dissag/tests/parity_check.sh | New offline parity gate that diffs DRY_RUN argv vs goldens. |
| scripts/vllm_dissag/tests/golden/*.txt | New golden argv fixtures captured from legacy launchers. |
| scripts/vllm_dissag/tests/golden/gen_golden.sh | Utility to regenerate golden fixtures. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Updated slurm entry to use unified launcher and axis/back-compat resolution. |
| scripts/vllm_dissag/README.MD | Updated docs to match the unified launcher + models.yaml + test flow. |
| scripts/vllm_dissag/vllm_disagg_server.sh | Deleted legacy rixl TP launcher. |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Deleted legacy MoRI-EP launcher. |
| scripts/vllm_dissag/vllm_disagg_server_deepep.sh | Deleted legacy DeepEP launcher. |
Comment on lines +7 to +8
    # CONNECTOR = rixl | moriio (KV transfer; default moriio)
    # WIDE_EP = 0 (TP) | 1 (wideEP) (parallelism; default per back-compat shim)
Comment on lines +31 to +32
    SCRIPT_DIR="${NIXL_COOKBOOK_PATH:-$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)}"
Comment on lines +86 to +87
    echo "Listing NIXL_COOKBOOK_PATH : "
    ls ${NIXL_COOKBOOK_PATH}
Comment on lines +186 to +190
    --no-enable-prefix-caching \
    --all2all-backend "${_all2all}" \
    --trust-remote-code \
    --distributed-timeout-seconds "${DISTRIBUTED_TIMEOUT_SECONDS:-7200}" \
    "${exec_args[@]}" "${extra_args[@]}" "${kv_args[@]}"
Comment on lines +209 to +212
    "${exec_args[@]}" \
    "${extra_args[@]}" \
    "${kv_args[@]}" \
    2>&1 | tee /run_logs/${SLURM_JOB_ID}/${log_prefix}_NODE${NODE_RANK}.log >/dev/null &
Comment on lines +309 to +312
    --all2all-backend "${backend}" \
    ${DBO_ARGS} \
    "${extra_args[@]}" \
    --kv-transfer-config "${kv_config}"
Comment on lines +330 to +334
    --all2all-backend "${backend}" \
    ${DBO_ARGS} \
    "${extra_args[@]}" \
    --kv-transfer-config "${kv_config}" \
    2>&1 | tee /run_logs/${SLURM_JOB_ID}/${log_prefix}_NODE${NODE_RANK}.log >/dev/null &
Comment on lines +16 to +18
    # BOUNDARY (do NOT put these here — the launcher/connector owns them):
    # --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel /
    # --all2all-backend / --kv-transfer-config / --port / transfer backend.
    export BENCHMARK_CON="8 16 32"
    export BENCHMARK_COMBINATIONS="1024/1024 8192/1024"
    sbatch -N 2 -n 2 run_xPyD_models.slurm
    python3 benchmark_parser.py <log_path>/benchmark_XXX_CONCURRENCY.log
    MASTER_PORT="${MASTER_PORT:-23731}"
    NODE_RANK="${NODE_RANK:-0}"
    NNODES="${NNODES:-1}"
    MODEL_PATH=$MODEL_PATH
…uncher
Brings MAD-private #324's validated DeepSeek-V3 wide-EP support into the
consolidated format (driver + connectors + models.yaml), so MoE serves through
vllm_disagg.sh — not as a parallel launcher.
connectors/moriio.sh (parity with #324's mori launcher):
- runtime patch: use the idempotent Python patcher apply_39276_rebased.py with
topology-defaulted SKIP_RUNTIME_PATCH (explicit wins); the old bash patcher
double-patched a pre-patched image and corrupted the decode notify path.
- cudagraph: NEVER bare --enforce-eager (it routes fp8 quant through an AITER op
whose aiter_tensor_t signature crashes engine init); NONE -> compilation-config
cudagraph_mode:NONE +quant_fp8. Per-role PREFILL/DECODE_CUDAGRAPH_MODE.
- proxy: vllm_router (ROUTER_BINARY/PATH + --kv-connector moriio + discovery +
registration gate) and moriio_toy with online_serving/->disaggregated/ path
resolution.
- recipe knobs: --block-size ${KV_BLOCK_SIZE}, --kv-cache-memory-bytes (skip the
buggy profiling forward), per-role mori_high_throughput/mori_low_latency,
VLLM_ROCM_USE_AITER_MLA override, MORI_RDMA_TC/SL + MORI_SHMEM_HEAP_SIZE.
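A minimal sketch of the per-role cudagraph rule above: `NONE` is translated into a `--compilation-config` value and `--enforce-eager` is never emitted. The helper name `cudagraph_args` is hypothetical, and the JSON spelling (`cudagraph_mode`, `custom_ops` with `+quant_fp8`) is one reading of the commit message that should be checked against the vLLM build in the image.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: map PREFILL/DECODE_CUDAGRAPH_MODE to extra `vllm serve` args.

cudagraph_args() {
    # $1 = role (prefill|decode). Prints extra args, one per line.
    local role="$1" mode
    case "$role" in
        prefill) mode="${PREFILL_CUDAGRAPH_MODE:-}" ;;
        decode)  mode="${DECODE_CUDAGRAPH_MODE:-}" ;;
        *) echo "error: unknown role '$role'" >&2; return 1 ;;
    esac

    if [[ "$mode" == "NONE" ]]; then
        # Assumed JSON shape; deliberately NOT --enforce-eager (see the note above).
        printf '%s\n' --compilation-config \
            '{"cudagraph_mode":"NONE","custom_ops":["+quant_fp8"]}'
    fi
    # Other modes are out of scope for this sketch: nothing is emitted for them.
}

# Demo: build the extra args for a prefill worker with cudagraphs disabled.
mapfile -t extra < <(PREFILL_CUDAGRAPH_MODE=NONE cudagraph_args prefill)
printf 'extra arg: %s\n' "${extra[@]}"
```

Printing one argument per line lets the caller collect them into an array without word-splitting the JSON value.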
models.yaml: DeepSeek-V3/R1/V3-5layer env: blocks encode the full recipe (YAML
anchor) so MODEL_NAME=DeepSeek-V3 WIDE_EP=1 "just works" (= #324 .env auto-source).
slurm: BENCHMARK_SCRIPT selector (sweep/long_context); plumb the recipe + cache +
benchmark env via docker -e; driver runs $BENCHMARK_SCRIPT_FILE.
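One way to plumb env through `docker -e` is to forward a variable only when it is set, so that anything baked into the image wins otherwise. The sketch below assumes that approach; the helper name `build_docker_env_args` is hypothetical, and the variable names are taken from this PR purely as examples.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build `docker run -e NAME` arguments only for variables that are set.

build_docker_env_args() {
    # Fills the global array DOCKER_ENV_ARGS with "-e NAME" pairs.
    DOCKER_ENV_ARGS=()
    local name
    for name in "$@"; do
        # ${!name+set} expands to "set" only if the variable named by $name exists
        # (an empty-but-set variable is still forwarded).
        if [[ -n "${!name+set}" ]]; then
            DOCKER_ENV_ARGS+=(-e "$name")
        fi
    done
}

# Demo in a subshell. `docker run -e NAME` copies the value from the caller's
# environment, so the variables must be exported for docker to see them.
(
    export KV_BLOCK_SIZE=16 BENCHMARK_SCRIPT=sweep
    unset MORI_SHMEM_HEAP_SIZE
    build_docker_env_args KV_BLOCK_SIZE MORI_SHMEM_HEAP_SIZE BENCHMARK_SCRIPT
    # Echo instead of running, so the sketch works without docker installed.
    echo docker run "${DOCKER_ENV_ARGS[@]}" example-image
)
```

Passing `-e NAME` without a value avoids quoting problems, since the value never appears on the command line.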
docker: add vllm_disagg_mori_ep_fullsource (the buildable validated stack: MoRI
1db01d8, AITER 0.1.14, vLLM fork b10a9f7a, router #181) for DeepSeek MoE; the
public-base vllm_disagg_inference stays for dense.
Verified: moriio wideEP `vllm serve` argv is byte-identical to #324's launcher
(offline gate); rixl/deepep/dense parity unchanged; goldens regenerated.
Co-Authored-By: Claude <noreply@anthropic.com>
The moriio toy proxy cannot route the wideEP DP-rank KV-notify -> decode hangs
("remote blocks never arrived", deferred-write expiry). vllm_router carries the
DP-rank dpfix and is what #324's recipe.env uses. Default moriio to vllm_router
(matching rixl); moriio_toy still selectable. Golden regenerated (kv-transfer
proxy_port 10001->30000); argv stays byte-identical to #324 (router on both sides).
Co-Authored-By: Claude <noreply@anthropic.com>
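The proxy default described in this commit could be sketched as below. `select_proxy` and the `PROXY_TYPE` / `PROXY_PORT` variable names are hypothetical; only `ROUTER_PORT` (default 30000), the two proxy names, and the old toy port 10001 come from the PR text. The wideEP warning and the rejection of `moriio_toy` under rixl are assumed guards, not confirmed behavior.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: choose the proxy, defaulting to vllm_router for both connectors.

select_proxy() {
    local connector="${CONNECTOR:-rixl}"
    local proxy="${PROXY_TYPE:-vllm_router}"   # PROXY_TYPE is an assumed name

    case "$proxy" in
        vllm_router)
            echo "proxy=vllm_router port=${ROUTER_PORT:-30000}"
            ;;
        moriio_toy)
            if [[ "$connector" != "moriio" ]]; then
                echo "error: moriio_toy requires CONNECTOR=moriio" >&2
                return 1
            fi
            if [[ "${WIDE_EP:-0}" == "1" ]]; then
                # Per the commit message: the toy proxy cannot route the wideEP
                # DP-rank KV-notify, so decode hangs. Warn rather than abort here.
                echo "warning: moriio_toy cannot route wideEP DP-rank KV-notify" >&2
            fi
            echo "proxy=moriio_toy port=${PROXY_PORT:-10001}"
            ;;
        *)
            echo "error: unknown proxy '$proxy'" >&2
            return 1
            ;;
    esac
}

# Demo: the moriio wideEP path picks the router on its default port.
( CONNECTOR=moriio WIDE_EP=1 select_proxy )
```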
Summary
Two things, stacked:
1. Launcher consolidation: unify the three launchers in `scripts/vllm_dissag/` (`vllm_disagg_server.sh`, `vllm_disagg_mori_ep.sh`, `vllm_disagg_server_deepep.sh`) into a single launcher, `vllm_disagg.sh`, driven by two orthogonal axes + a validated EP backend.
2. #324 DeepSeek-V3 MoRI-EP port onto that launcher (connector + models.yaml + Dockerfile), so MoE serves through the same launcher, not a parallel script.
Axis model
| | `WIDE_EP=0` (TP) | `WIDE_EP=1` (wide expert-parallel) |
|---|---|---|
| `CONNECTOR=rixl` (NixlConnector) | rixl + TP | DeepEP (`EP_BACKEND=deepep`) |
| `CONNECTOR=moriio` (MoRIIOConnector) | MoRIIO + TP | MoRI-EP (`EP_BACKEND=mori`) |
- Connector-specific logic: `connectors/{rixl,moriio}.sh`
- TP-vs-wideEP fork: `parallelism.sh`
- Per-model flags/env: `models.yaml` catalog (replaces inline `declare -A` maps / hardcoding)
Adding a model is data-only (edit `models.yaml` + the slurm allowlist). Invalid connector/EP pairings (`moriio+deepep`, `rixl+mori`) abort with a clear error.
Code changes
Launcher consolidation
- `vllm_disagg.sh`: single driver. Axis resolution + validation, topology math, models.yaml parse, NODE_RANK role branch (prefill/decode × master/child), container barrier, proxy, benchmark, cleanup. `DRY_RUN=1` echoes the assembled `vllm serve` argv for offline parity.
- `connectors/rixl.sh` (NixlConnector: TP + DeepEP) and `connectors/moriio.sh` (MoRIIOConnector: MoRIIO+TP + MoRI-EP). Each implements the connector hook contract (`connector_init`, `connector_setup_env`, `connector_runtime_patch`, `connector_launch_worker`, `connector_wait_workers_ready`, `connector_start_proxy`).
- `parallelism.sh`: shared TP-vs-wideEP helpers.
- `models.yaml`: per-model flag/env catalog (sglang schema; `tp`/`dp` blocks selected by WIDE_EP).
- `tests/parity_check.sh` + `tests/golden/`: dry-run argv parity gate (byte-identical to the 3 legacy launchers for every connector × parallelism × role cell, plus validation-rejection checks).
- `ARCHITECTURE.md`: component + state diagrams. New `tests/TEST_PLAN.md`.
#324 DeepSeek-V3 MoRI-EP port
- `connectors/moriio.sh`: per-role mori all2all (`mori_high_throughput` / `mori_low_latency`); `--block-size ${KV_BLOCK_SIZE}`, `--kv-cache-memory-bytes`, `VLLM_ROCM_USE_AITER_MLA` override; MoRI fabric env (`MORI_RDMA_TC/SL`, `MORI_SHMEM_HEAP_SIZE`); cudagraph NONE emitted as `compilation-config cudagraph_mode:NONE +quant_fp8` (never bare `--enforce-eager`, which crashes engine init on these AITER images); per-role `PREFILL/DECODE_CUDAGRAPH_MODE`.
- Runtime patch: `apply_39276_rebased.py` (new), an idempotent Python patcher, with topology-defaulted `SKIP_RUNTIME_PATCH` (explicit value wins; images that bake the fixes set =1).
- `connector_start_proxy` supports `vllm_router` (ROUTER_BINARY/PATH + `--kv-connector moriio` + discovery address + registration gate) and `moriio_toy` (with `online_serving/` → `disaggregated/` path resolution). moriio defaults to `vllm_router` because the toy proxy cannot route the wideEP DP-rank KV-notify. Sets `ROUTER_PORT` (default 30000) so the router gets a valid `--port`.
- `models.yaml`: DeepSeek-V3 / R1 / V3-5layer `env:` blocks (YAML anchor) encode the validated recipe (block=16, MLA off, kv-cache-memory-bytes, per-role cudagraph + mori backends, SKIP_RUNTIME_PATCH=1, fabric tuning), so `MODEL_NAME=DeepSeek-V3 WIDE_EP=1` works without a wrapper.
- `docker/vllm_disagg_mori_ep_fullsource.ubuntu.amd.Dockerfile`: buildable MoRI-EP image (MoRI `1db01d8`, AITER `0.1.14`, vLLM fork `b10a9f7a`, vllm-router #181 + DP-rank dpfix) on a named nightly base, with build-time integrity asserts. The public-base `vllm_disagg_inference` Dockerfile is kept unchanged for dense models.
- `benchmark_long_context.sh`, `benchmark_parser.py`.
- `run_xPyD_models.slurm`: `BENCHMARK_SCRIPT` selector (sweep/long_context); `-e` plumbing for the recipe/cache/proxy env; cache-dir env forwarded only-if-set so a prewarmed image's baked cache wins; driver runs `$BENCHMARK_SCRIPT_FILE`; `apply_39276_rebased.py` added to REQUIRED_FILES.
Back-compat
- `RUN_MORI=1` / `RUN_DEEPEP=1` still work (mapped to the new axes); no-flags default stays rixl + TP.
Testing
- `tests/parity_check.sh`: byte-identical argv vs the 3 legacy launchers (offline, no GPUs).
Notes / follow-ups
- … they fail inside AITER's MoE GEMM path at engine init: an image concern, independent of the launcher.
- … `rocm/pytorch-private` nightly (`BASE_IMAGE` overridable).
- `VLLM_CACHE_PERSIST` (image-digest-keyed host JIT cache, to avoid per-run cold AITER/MoRI compiles) is a planned follow-up, not in this PR.
🤖 Generated with Claude Code