
vllm_dissag: unify 3 launchers into one two-axis driver + models.yaml#171

Open
raviguptaamd wants to merge 4 commits into ROCm:develop from raviguptaamd:vllm-disagg-unified-launcher

Conversation

@raviguptaamd

@raviguptaamd raviguptaamd commented Jun 24, 2026


Summary

Two things, stacked:

  1. Consolidates the three vLLM disaggregated-inference launchers in scripts/vllm_dissag/
    (vllm_disagg_server.sh, vllm_disagg_mori_ep.sh, vllm_disagg_server_deepep.sh) into a single
    launcher, vllm_disagg.sh, driven by two orthogonal axes + a validated EP backend.
  2. Ports the MAD-private #324 DeepSeek-V3 MoRI-EP recipe into that consolidated format (connector +
    models.yaml + Dockerfile), so MoE serves through the same launcher — not a parallel script.

Axis model

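|                                    | WIDE_EP=0 (TP)    | WIDE_EP=1 (wide expert-parallel) |
|------------------------------------|-------------------|----------------------------------|
| CONNECTOR=rixl (NixlConnector)     | dense / TP        | DeepEP (EP_BACKEND=deepep)       |
| CONNECTOR=moriio (MoRIIOConnector) | MoRIIO + TP (new) | MoRI-EP (EP_BACKEND=mori)        |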
  • Connector-specific logic → connectors/{rixl,moriio}.sh
  • TP-vs-wideEP fork → parallelism.sh
  • Per-model CLI flags + env overrides → models.yaml catalog (replaces inline declare -A maps / hardcoding)

Adding a model is data-only (edit models.yaml + the slurm allowlist). Invalid connector/EP pairings
(moriio+deepep, rixl+mori) abort with a clear error.
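
A minimal sketch of how the pairing check described above could look. The function name `validate_axes`, the error wording, the `EP_BACKEND=none` placeholder for TP, and the per-connector default backend are all assumptions for illustration; only the axis values and the two rejected pairings come from this PR description.

```bash
# Hypothetical sketch: validate the two axes plus the EP backend.
# validate_axes CONNECTOR WIDE_EP [EP_BACKEND]
# Prints the resolved triple on stdout; on an invalid combination prints an
# error on stderr and returns 1.
validate_axes() {
  local connector="$1" wide_ep="$2" ep_backend="${3:-}"

  case "$connector" in
    rixl|moriio) ;;
    *) echo "ERROR: unknown CONNECTOR='${connector}' (expected rixl|moriio)" >&2
       return 1 ;;
  esac

  case "$wide_ep" in
    0)
      # TP: no EP backend applies (placeholder value is an assumption).
      ep_backend="none"
      ;;
    1)
      # Assumed default: the one backend that is valid for the connector.
      if [[ -z "$ep_backend" ]]; then
        if [[ "$connector" == "rixl" ]]; then
          ep_backend="deepep"
        else
          ep_backend="mori"
        fi
      fi
      # Only rixl+deepep and moriio+mori are valid pairings.
      case "${connector}+${ep_backend}" in
        rixl+deepep|moriio+mori) ;;
        *) echo "ERROR: invalid pairing CONNECTOR=${connector} EP_BACKEND=${ep_backend}" >&2
           return 1 ;;
      esac
      ;;
    *) echo "ERROR: WIDE_EP must be 0 or 1 (got '${wide_ep}')" >&2
       return 1 ;;
  esac

  echo "CONNECTOR=${connector} WIDE_EP=${wide_ep} EP_BACKEND=${ep_backend}"
}
```

Calling `validate_axes moriio 1 deepep` returns non-zero with an error on stderr, which a driver could turn into an early abort before any container work starts.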

Code changes

Launcher consolidation

  • New vllm_disagg.sh — single driver: axis resolution + validation, topology math, models.yaml
    parse, NODE_RANK role branch (prefill/decode × master/child), container barrier, proxy, benchmark,
    cleanup. DRY_RUN=1 echoes the assembled vllm serve argv for offline parity.
  • New connectors/rixl.sh (NixlConnector: TP + DeepEP) and connectors/moriio.sh
    (MoRIIOConnector: MoRIIO+TP + MoRI-EP). Each implements the connector hook contract
    (connector_init, connector_setup_env, connector_runtime_patch, connector_launch_worker,
    connector_wait_workers_ready, connector_start_proxy).
  • New parallelism.sh — shared TP-vs-wideEP helpers.
  • New models.yaml — per-model flag/env catalog (sglang schema; tp/dp blocks selected by WIDE_EP).
  • Deleted the 3 legacy launchers.
  • New tests/parity_check.sh + tests/golden/ — dry-run argv parity gate (byte-identical to the
    3 legacy launchers for every connector × parallelism × role cell, plus validation-rejection checks).
  • New ARCHITECTURE.md — component + state diagrams. New tests/TEST_PLAN.md.
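
The connector hook contract listed above lends itself to a fail-fast check after a connector file is sourced. The sketch below is an assumption about how a driver might do that, not code from this PR: the hook names are the six listed in the description, while `REQUIRED_HOOKS` and `missing_hooks` are hypothetical names.

```bash
# Hook names come from the PR description; the checker is a hypothetical sketch.
REQUIRED_HOOKS=(
  connector_init
  connector_setup_env
  connector_runtime_patch
  connector_launch_worker
  connector_wait_workers_ready
  connector_start_proxy
)

# missing_hooks: print every required hook that is not defined as a shell
# function. Returns 0 when the contract is fully implemented, 1 otherwise.
missing_hooks() {
  local hook rc=0
  for hook in "${REQUIRED_HOOKS[@]}"; do
    # declare -F succeeds only if a function with this name exists.
    if ! declare -F "$hook" >/dev/null; then
      echo "$hook"
      rc=1
    fi
  done
  return "$rc"
}
```

A driver could run `missing_hooks` right after `source "connectors/${CONNECTOR}.sh"` and abort with the printed names, so an incomplete connector fails before any worker is launched.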

#324 DeepSeek-V3 MoRI-EP port

  • connectors/moriio.sh: per-role mori all2all (mori_high_throughput/mori_low_latency);
    --block-size ${KV_BLOCK_SIZE}, --kv-cache-memory-bytes, VLLM_ROCM_USE_AITER_MLA override;
    MoRI fabric env (MORI_RDMA_TC/SL, MORI_SHMEM_HEAP_SIZE); cudagraph NONE emitted as
    compilation-config cudagraph_mode:NONE +quant_fp8 (never bare --enforce-eager, which crashes
    engine init on these AITER images); per-role PREFILL/DECODE_CUDAGRAPH_MODE.
  • Runtime patch: switched to the idempotent Python patcher apply_39276_rebased.py (new) with
    topology-defaulted SKIP_RUNTIME_PATCH (explicit value wins; images that bake the fixes set =1).
  • Proxy: connector_start_proxy supports vllm_router (ROUTER_BINARY/PATH + --kv-connector moriio +
    discovery address + registration gate) and moriio_toy (with online_serving/ -> disaggregated/
    path resolution). moriio defaults to vllm_router because the toy proxy cannot route
    the wideEP DP-rank KV-notify. Sets ROUTER_PORT (default 30000) so the router gets a valid --port.
  • models.yaml: DeepSeek-V3 / R1 / V3-5layer env: blocks (YAML anchor) encode the validated recipe
    (block=16, MLA off, kv-cache-memory-bytes, per-role cudagraph + mori backends, SKIP_RUNTIME_PATCH=1,
    fabric tuning), so MODEL_NAME=DeepSeek-V3 WIDE_EP=1 works without a wrapper.
  • New docker/vllm_disagg_mori_ep_fullsource.ubuntu.amd.Dockerfile — buildable MoRI-EP image
    (MoRI 1db01d8, AITER 0.1.14, vLLM fork b10a9f7a, vllm-router #181 + DP-rank dpfix) on a named
    nightly base, with build-time integrity asserts. The public-base vllm_disagg_inference Dockerfile is
    kept unchanged for dense models.
  • New helper scripts benchmark_long_context.sh, benchmark_parser.py.
  • run_xPyD_models.slurm: BENCHMARK_SCRIPT selector (sweep/long_context); -e plumbing for the
    recipe/cache/proxy env; cache-dir env forwarded only-if-set so a prewarmed image's baked cache wins;
    driver runs $BENCHMARK_SCRIPT_FILE; apply_39276_rebased.py added to REQUIRED_FILES.
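
The "forwarded only-if-set" behavior for cache-dir env in the slurm bullet above can be illustrated with a small helper. This is a sketch under assumptions: `build_env_args` and `DOCKER_ENV_ARGS` are hypothetical names, and "set" is interpreted here as "set and non-empty", so that an empty value never overrides a path baked into the image.

```bash
# Hypothetical sketch: forward a variable into the container only when the
# host has a non-empty value for it, so the image's baked default wins otherwise.
# build_env_args NAME...: fills the global array DOCKER_ENV_ARGS with
# "-e NAME=value" pairs.
build_env_args() {
  DOCKER_ENV_ARGS=()
  local name
  for name in "$@"; do
    # ${!name:-} is indirect expansion: the value of the variable whose name
    # is stored in $name, or empty if it is unset.
    if [[ -n "${!name:-}" ]]; then
      DOCKER_ENV_ARGS+=(-e "${name}=${!name}")
    fi
  done
}
```

The resulting array can then be spliced into the container command line as `"${DOCKER_ENV_ARGS[@]}"`; keeping the pairs in an array avoids word-splitting problems with values that contain spaces.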

Back-compat

RUN_MORI=1 / RUN_DEEPEP=1 still work (mapped to the new axes); no-flags default stays rixl + TP.
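
One way the back-compat mapping could be expressed is sketched below. The function name `resolve_legacy_flags` is hypothetical, and two behaviors are assumptions rather than facts from this PR: explicitly set axis variables are left untouched, and RUN_MORI takes precedence if both legacy flags are set. The target axes per flag follow the axis table (MoRI-EP is moriio + mori, DeepEP is rixl + deepep).

```bash
# Hypothetical sketch: translate the legacy RUN_MORI / RUN_DEEPEP flags into
# the new axes. ":=" assigns only when the variable is unset or empty, so
# explicitly chosen axes are preserved (an assumption).
resolve_legacy_flags() {
  if [[ "${RUN_MORI:-0}" == "1" ]]; then
    : "${CONNECTOR:=moriio}" "${WIDE_EP:=1}" "${EP_BACKEND:=mori}"
  elif [[ "${RUN_DEEPEP:-0}" == "1" ]]; then
    : "${CONNECTOR:=rixl}" "${WIDE_EP:=1}" "${EP_BACKEND:=deepep}"
  fi
  # No flags at all: the default stays rixl + TP.
  : "${CONNECTOR:=rixl}" "${WIDE_EP:=0}"
}
```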

Testing

  • tests/parity_check.sh — byte-identical argv vs the 3 legacy launchers (offline, no GPUs).
  • moriio wideEP argv verified byte-identical to #324's launcher.
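
The offline parity gate described above boils down to a byte comparison between dry-run output and a golden file. The helper below is an illustrative sketch, not the contents of tests/parity_check.sh: `check_parity` is a hypothetical name, and it assumes the launcher prints the assembled argv on stdout when DRY_RUN=1.

```bash
# Hypothetical sketch of a dry-run parity gate.
# check_parity GOLDEN CMD [ARGS...]
# Runs CMD with DRY_RUN=1, captures stdout, and byte-compares it to GOLDEN.
check_parity() {
  local golden="$1"; shift
  local actual
  actual="$(mktemp)"

  # A failing command simply produces output that will not match the golden.
  DRY_RUN=1 "$@" >"$actual" 2>/dev/null || true

  if cmp -s "$golden" "$actual"; then
    echo "PASS: ${golden}"
    rm -f "$actual"
    return 0
  fi

  echo "FAIL: ${golden}"
  # Show what changed; diff exits non-zero on differences, which is expected.
  diff -u "$golden" "$actual" || true
  rm -f "$actual"
  return 1
}
```

Using `cmp` rather than a whitespace-tolerant comparison matches the "byte-identical" requirement: a reordered flag or an extra space counts as a regression.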

Notes / follow-ups

  • MoE models require a co-versioned AITER/vLLM image (the fullsource Dockerfile); on a mismatched image
    they fail inside AITER's MoE GEMM path at engine init — an image concern, independent of the launcher.
  • The fullsource base is a private rocm/pytorch-private nightly (BASE_IMAGE overridable).
  • VLLM_CACHE_PERSIST (image-digest-keyed host JIT cache, to avoid per-run cold AITER/MoRI compiles)
    is a planned follow-up, not in this PR.
  • #325 Hy3 (GQA) support is deferred.

🤖 Generated with Claude Code

Consolidate vllm_disagg_server.sh / vllm_disagg_mori_ep.sh /
vllm_disagg_server_deepep.sh into a single launcher (vllm_disagg.sh) that
composes behavior from two orthogonal axes plus a validated EP backend:

  CONNECTOR = rixl (NixlConnector) | moriio (MoRIIOConnector)
  WIDE_EP   = 0 (TP) | 1 (wide expert-parallel)
  EP_BACKEND= mori | deepep   (only when WIDE_EP=1; validated vs connector)

Connector-specific logic lives in connectors/{rixl,moriio}.sh; the TP-vs-wideEP
fork lives in parallelism.sh; per-model CLI flags + env overrides move from
inline declare -A maps / hardcoding into a models.yaml catalog. This adds a new
capability (MoRIIO + TP) and lets a model be added by editing data, not code.

Back-compat: RUN_MORI=1 / RUN_DEEPEP=1 still work (mapped to the new axes); the
no-flags default stays rixl + TP. run_xPyD_models.slurm resolves the axes,
keeps the VALID_MODELS allowlists, and plumbs the new env via docker -e.

Testing: tests/parity_check.sh drives the real launcher under DRY_RUN=1 and
diffs the assembled `vllm serve` argv against golden fixtures for every
connector x parallelism x role cell (byte-identical to the 3 legacy launchers),
plus validation-rejection checks. Live-validated serving for dense models
(amd-Llama-3.3-70B, Qwen3-32B) on the MoRIIO+TP path. See tests/TEST_PLAN.md.

Catalog seeded with the existing 6 models + Qwen3-32B (dense) and Qwen3-30B-A3B.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 24, 2026 04:08
Mermaid diagrams documenting the unified launcher: component architecture
(sbatch → driver → connector → vllm serve), axis resolution flow + 2x2
capability matrix, per-node runtime state machine, driver↔connector hook
sequence, and the per-model config/env layering.

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR consolidates the vLLM disaggregated prefill/decode launch flow in scripts/vllm_dissag/ by replacing three legacy launcher scripts with a single two-axis driver (CONNECTOR × WIDE_EP) plus a models.yaml catalog for per-model flags/env, and adds an offline parity test gate to keep the assembled vllm serve argv consistent with legacy behavior.

Changes:

  • Replace vllm_disagg_{server,mori_ep,server_deepep}.sh with a unified vllm_disagg.sh that sources connector + parallelism profiles and reads per-model config from models.yaml.
  • Update run_xPyD_models.slurm to resolve axes/back-compat shims and plumb new env vars into Docker runs.
  • Add tests/parity_check.sh + golden fixtures to enforce byte-identical argv parity vs legacy launchers (plus validation rejection checks).

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| scripts/vllm_dissag/vllm_disagg.sh | New unified launcher with axis resolution, role branching, YAML model config parsing, and DRY_RUN support. |
| scripts/vllm_dissag/connectors/rixl.sh | New connector profile implementing rixl TP + DeepEP wideEP behavior. |
| scripts/vllm_dissag/connectors/moriio.sh | New connector profile implementing MoRI-EP wideEP and new MoRIIO+TP path. |
| scripts/vllm_dissag/parallelism.sh | Shared helper for wideEP master/child role args. |
| scripts/vllm_dissag/models.yaml | New model catalog replacing inline associative arrays / hardcoding. |
| scripts/vllm_dissag/tests/TEST_PLAN.md | New before/after validation plan documenting parity + live matrix. |
| scripts/vllm_dissag/tests/parity_check.sh | New offline parity gate that diffs DRY_RUN argv vs goldens. |
| scripts/vllm_dissag/tests/golden/*.txt | New golden argv fixtures captured from legacy launchers. |
| scripts/vllm_dissag/tests/golden/gen_golden.sh | Utility to regenerate golden fixtures. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Updated slurm entry to use unified launcher and axis/back-compat resolution. |
| scripts/vllm_dissag/README.MD | Updated docs to match the unified launcher + models.yaml + test flow. |
| scripts/vllm_dissag/vllm_disagg_server.sh | Deleted legacy rixl TP launcher. |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Deleted legacy MoRI-EP launcher. |
| scripts/vllm_dissag/vllm_disagg_server_deepep.sh | Deleted legacy DeepEP launcher. |


Comment on lines +7 to +8
# CONNECTOR = rixl | moriio (KV transfer; default moriio)
# WIDE_EP = 0 (TP) | 1 (wideEP) (parallelism; default per back-compat shim)
Comment on lines +31 to +32
SCRIPT_DIR="${NIXL_COOKBOOK_PATH:-$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)}"

Comment on lines +86 to +87
echo "Listing NIXL_COOKBOOK_PATH : "
ls ${NIXL_COOKBOOK_PATH}
Comment on lines +186 to +190
--no-enable-prefix-caching \
--all2all-backend "${_all2all}" \
--trust-remote-code \
--distributed-timeout-seconds "${DISTRIBUTED_TIMEOUT_SECONDS:-7200}" \
"${exec_args[@]}" "${extra_args[@]}" "${kv_args[@]}"
Comment on lines +209 to +212
"${exec_args[@]}" \
"${extra_args[@]}" \
"${kv_args[@]}" \
2>&1 | tee /run_logs/${SLURM_JOB_ID}/${log_prefix}_NODE${NODE_RANK}.log >/dev/null &
Comment on lines +309 to +312
--all2all-backend "${backend}" \
${DBO_ARGS} \
"${extra_args[@]}" \
--kv-transfer-config "${kv_config}"
Comment on lines +330 to +334
--all2all-backend "${backend}" \
${DBO_ARGS} \
"${extra_args[@]}" \
--kv-transfer-config "${kv_config}" \
2>&1 | tee /run_logs/${SLURM_JOB_ID}/${log_prefix}_NODE${NODE_RANK}.log >/dev/null &
Comment on lines +16 to +18
# BOUNDARY (do NOT put these here — the launcher/connector owns them):
# --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel /
# --all2all-backend / --kv-transfer-config / --port / transfer backend.
export BENCHMARK_CON="8 16 32"
export BENCHMARK_COMBINATIONS="1024/1024 8192/1024"
sbatch -N 2 -n 2 run_xPyD_models.slurm
python3 benchmark_parser.py <log_path>/benchmark_XXX_CONCURRENCY.log
MASTER_PORT="${MASTER_PORT:-23731}"
NODE_RANK="${NODE_RANK:-0}"
NNODES="${NNODES:-1}"
MODEL_PATH=$MODEL_PATH
raviguptaamd and others added 2 commits June 24, 2026 15:05
…uncher

Brings MAD-private #324's validated DeepSeek-V3 wide-EP support into the
consolidated format (driver + connectors + models.yaml), so MoE serves through
vllm_disagg.sh — not as a parallel launcher.

connectors/moriio.sh (parity with #324's mori launcher):
- runtime patch: use the idempotent Python patcher apply_39276_rebased.py with
  topology-defaulted SKIP_RUNTIME_PATCH (explicit wins); the old bash patcher
  double-patched a pre-patched image and corrupted the decode notify path.
- cudagraph: NEVER bare --enforce-eager (it routes fp8 quant through an AITER op
  whose aiter_tensor_t signature crashes engine init); NONE -> compilation-config
  cudagraph_mode:NONE +quant_fp8. Per-role PREFILL/DECODE_CUDAGRAPH_MODE.
- proxy: vllm_router (ROUTER_BINARY/PATH + --kv-connector moriio + discovery +
  registration gate) and moriio_toy with online_serving/->disaggregated/ path
  resolution.
- recipe knobs: --block-size ${KV_BLOCK_SIZE}, --kv-cache-memory-bytes (skip the
  buggy profiling forward), per-role mori_high_throughput/mori_low_latency,
  VLLM_ROCM_USE_AITER_MLA override, MORI_RDMA_TC/SL + MORI_SHMEM_HEAP_SIZE.

models.yaml: DeepSeek-V3/R1/V3-5layer env: blocks encode the full recipe (YAML
anchor) so MODEL_NAME=DeepSeek-V3 WIDE_EP=1 "just works" (= #324 .env auto-source).

slurm: BENCHMARK_SCRIPT selector (sweep/long_context); plumb the recipe + cache +
benchmark env via docker -e; driver runs $BENCHMARK_SCRIPT_FILE.

docker: add vllm_disagg_mori_ep_fullsource (the buildable validated stack: MoRI
1db01d8, AITER 0.1.14, vLLM fork b10a9f7a, router #181) for DeepSeek MoE; the
public-base vllm_disagg_inference stays for dense.

Verified: moriio wideEP `vllm serve` argv is byte-identical to #324's launcher
(offline gate); rixl/deepep/dense parity unchanged; goldens regenerated.

Co-Authored-By: Claude <noreply@anthropic.com>
The moriio toy proxy cannot route the wideEP DP-rank KV-notify -> decode hangs
("remote blocks never arrived", deferred-write expiry). vllm_router carries the
DP-rank dpfix and is what #324's recipe.env uses. Default moriio to vllm_router
(matching rixl); moriio_toy still selectable. Golden regenerated (kv-transfer
proxy_port 10001->30000); argv stays byte-identical to #324 (router on both sides).

Co-Authored-By: Claude <noreply@anthropic.com>