CollectiveX: experimental cross-vendor collective/EP benchmark #1896

Open
Oseltamivir wants to merge 176 commits into main from collectivex

Oseltamivir (Collaborator) commented Jun 23, 2026

Adds CollectiveX under experimental/CollectiveX/ — a cross-vendor collective / expert-parallel benchmark — plus an orchestration-only workflow.

What it adds

  • Per-SKU launch adapters (launchers/launch_<sku>.sh, the launch_${RUNNER_NAME%%_*}.sh convention) that run any benchmark via a CX_BENCH selector (nccl|deepep|all) through a shared launchers/run_in_container.sh.
  • Benchmarks: run_nccl.py (stock nccl-tests → parsed flat JSON), run_deepep.py (DeepEP dispatch/combine, normal mode), env_capture.py (Layer-0 provenance), plot.py. Every result is correctness-gated and carries a topology-aware comparison_key.
  • Single multi-arch, digest-pinned container for all NVIDIA SKUs (lmsysorg/sglang@sha256:4219…, amd64+arm64); DeepEP via rebuild-deepep. See CONTAINERS.md.
  • .github/workflows/collectivex-experimental.yml: push to collectivex (paths experimental/CollectiveX/**) → GB200 NCCL smoke; workflow_dispatch → chosen sku+benchmark (B200, DeepEP, larger sweeps). Logic stays under experimental/.

Validated on hardware

  • NCCL primitives: B200 (8× NVLink island) + GB200 (4× NVL72 MNNVL), 4 ops, correctness-passed, topology-keyed distinctly.
  • DeepEP dispatch/combine on GB200: correctness-gated (token conservation + combine vs DeepEP's own reference), ~154 µs roundtrip, 1.66M tok/s.
  • Local: shellcheck/bash -n, py_compile, actionlint, parser fixtures.

Notes / deferred

  • Result JSONs are gitignored (captured env embeds hostnames/UUIDs); CI uploads them as workflow artifacts. Headline numbers are summarized in CONTAINERS.md.
  • Importing the exact multi-arch digest needs the runner's registry creds (validated on the pre-staged v0.5.11-cu130).
  • Precision axes (NVFP4/MXFP8/…), low-latency EP, MoRI, EPLB, multinode DeepEP, and other collectives are captured as roadmap in plan.md, not built.

Note

Low Risk
Changes are isolated to experimental/CollectiveX/ and a read-only workflow; no production benchmark matrix or serving launchers are modified. Risk is mainly operational (self-hosted GPU time, Slurm/enroot failures) rather than app or security impact.

Overview
Introduces CollectiveX under experimental/CollectiveX/ — an experimental cross-vendor collective and MoE EP benchmark — plus orchestration-only .github/workflows/collectivex-experimental.yml. Production serving paths are untouched.

Benchmark stack: run_nccl.py wraps nccl-tests/rccl-tests into provenance-tagged JSON; run_deepep.py and run_mori.py add correctness-gated DeepEP and AMD MoRI dispatch/combine; env_capture.py, summarize.py, and plot.py handle environment capture, CI summaries, and plots. Results use topology-aware comparison_keys so unlike fabrics are not merged blindly.

Execution: Per-SKU Slurm launchers (launch_b200-dgxc.sh, launch_gb200-nv.sh, launch_b200-dgxc-slurm.sh, launch_mi355x-amds.sh) follow the same launch_${RUNNER_NAME%%_*}.sh pattern as serving, with shared common.sh (enroot squash by tag, optional CX_STAGE_DIR rsync, in-container nccl/rccl builds). CX_BENCH selects nccl, deepep, mori, or all via run_in_container.sh.

CI: Push to collectivex runs MI355X MoRI on mi355x runners; workflow_dispatch picks SKU and benchmark (GB200/B200 NCCL, DeepEP, etc.), writes markdown to the job summary, and uploads gitignored results/*.json as artifacts.

Reviewed by Cursor Bugbot for commit 871086d. Bugbot is set up for automated code reviews on this repo. Configure here.

Per-SKU launch adapters (launch_<sku>.sh) that run any benchmark via a CX_BENCH selector through a shared run_in_container.sh; multi-arch digest-pinned sglang container; NCCL-primitive + DeepEP dispatch/combine benchmarks with provenance + correctness gating; and an on:push workflow (GB200 NCCL smoke; workflow_dispatch for B200/DeepEP/larger sweeps).

Validated on hardware: NCCL primitives on B200 (8x NVLink) and GB200 (4x NVL72 MNNVL); DeepEP dispatch/combine on GB200 (correctness-gated).
Comment thread experimental/CollectiveX/launchers/run_in_container.sh Outdated
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/plot.py Fixed
Comment thread experimental/CollectiveX/run_deepep.py Fixed
The GB200 on:push smoke hung 25 min in enroot import: a bare digest ref (repo@sha256:) can't form an anonymous Docker Hub token scope, so enroot prompted for a password and blocked in non-interactive CI. Import by the multi-arch TAG instead (anonymous auth works, same as the serving launchers) and add </dev/null so a missing token fails fast rather than hanging.

Use v0.5.11-cu130 (multi-arch amd64+arm64, index sha256:061fb71f…): v0.5.12-cu130's 62 layers overflow enroot's overlay-based squash creation on these nodes (failed to mount overlay … Invalid argument). v0.5.11-cu130 imports cleanly and is pre-staged on GB200.
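
The fail-fast behavior described in this commit message can be illustrated in Python. This is a minimal sketch, assuming a hypothetical helper named `run_noninteractive`; the launchers themselves do the equivalent in bash by appending `</dev/null` to the enroot import command.

```python
import subprocess
import sys


def run_noninteractive(argv, timeout_s=600):
    """Run a command with stdin closed so a credential prompt cannot block.

    Hypothetical helper for illustration only; not part of the PR.
    """
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # any read from stdin hits EOF at once
        capture_output=True,
        text=True,
        timeout=timeout_s,         # upper bound in case the tool still stalls
        check=False,
    )


if __name__ == "__main__":
    # Stand-in for a tool that prompts: it tries to read one line from stdin.
    probe = [
        sys.executable,
        "-c",
        "import sys; sys.exit(0 if sys.stdin.readline() else 1)",
    ]
    result = run_noninteractive(probe, timeout_s=30)
    print("exit code:", result.returncode)
```

With stdin closed, the stand-in command sees end-of-file immediately and exits non-zero instead of waiting for input, which is the property the commit relies on.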
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_nccl.py Outdated
On the GB200 Actions path, CX_STAGE_DIR makes the launcher rsync the tree to compute-visible Lustre and the container writes results/ there; upload-artifact reads the checkout's results/ (empty), so the green smoke produced no artifact. Add cx_collect_results to copy result JSONs from the stage dir back to the checkout after the run (no-op when no staging was used).
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/launchers/launch_gb200-nv.sh Outdated
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
Comment thread experimental/CollectiveX/run_deepep.py Outdated
is_token_in_rank=is_token_in_rank,
num_tokens_per_expert=num_tokens_per_expert,
)
combined_x, _, _ = buffer.combine(recv_x, handle, topk_weights=recv_topk_weights)

Dispatch dtype not applied

Medium Severity

The --dispatch-dtype / CX_DISPATCH_DTYPE value is stored in result metadata but never used when building inputs or calling buffer.dispatch. Runs always use bfloat16 token tensors regardless of fp8 vs bf16, so provenance and comparison keys can describe a different shape than what was measured.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit b384171. Configure here.

summarize.py --markdown emits GitHub-flavored markdown tables (NCCL + DeepEP); a per-job 'Results summary' workflow step appends it to $GITHUB_STEP_SUMMARY so the run page shows a rendered table (per the GitHub job-summaries feature). Plain-text mode still drives the in-container result gate.
--timestamp "$TS" || cx_log "WARN: parse $op failed"
done

cx_log "done — JSON artifacts under $CX_DIR/results/"

Multinode launcher ignores failures

High Severity

The B200 multinode adapter logs warnings when srun or run_nccl.py fail but always exits successfully. Unlike run_in_container.sh, it never runs summarize.py as a non-zero gate, so workflow_dispatch on b200-multinode can finish green with no valid NCCL results.

Reviewed by Cursor Bugbot for commit f48daed. Configure here.
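
One way to address this finding is to record each failure and derive the exit code from the record, rather than only logging a warning. The sketch below shows that pattern in Python; `run_ops` and `exit_code_for` are hypothetical names, and the actual launcher is a bash script whose fix may look different.

```python
import subprocess
import sys


def run_ops(commands):
    """Run each (name, argv) pair and return the names that failed.

    Hypothetical sketch: a failure is recorded instead of only being logged.
    """
    failed = []
    for name, argv in commands:
        proc = subprocess.run(argv, stdin=subprocess.DEVNULL, check=False)
        if proc.returncode != 0:
            print(f"WARN: {name} failed (exit {proc.returncode})", file=sys.stderr)
            failed.append(name)
    return failed


def exit_code_for(failed):
    """Map the failure list to a process exit code (0 only when all passed)."""
    return 1 if failed else 0
```

A caller would finish with `sys.exit(exit_code_for(failed))`, so a warning in the log always coincides with a red job.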

run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
- name: Results summary
if: always()
run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"

Workflow skips result failure gate

Medium Severity

Both jobs only run summarize.py --markdown, which is documented to always exit 0. The workflow never runs the plain summarize.py gate on the checkout’s results/ after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).

Additional Locations (1)

Reviewed by Cursor Bugbot for commit f48daed. Configure here.
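
A gate of the kind this comment asks for could look roughly like the following. It is a sketch under assumptions: the function name `gate_results` is invented, and the `valid` field name is a guess at the result schema, which this page does not show.

```python
import json
from pathlib import Path


def gate_results(results_dir, valid_field="valid"):
    """Return 0 if the directory holds result JSONs and every one is valid.

    `valid_field` is an assumed field name; the real schema may differ.
    """
    paths = sorted(Path(results_dir).glob("*.json"))
    if not paths:
        print(f"no result JSON under {results_dir}")
        return 1
    bad = []
    for path in paths:
        try:
            with path.open() as fh:
                record = json.load(fh)
        except (OSError, json.JSONDecodeError):
            bad.append(path.name)
            continue
        if not (isinstance(record, dict) and record.get(valid_field) is True):
            bad.append(path.name)
    for name in bad:
        print(f"invalid result: {name}")
    return 1 if bad else 0
```

Treating an empty directory as a failure is the important part here: it covers the staged-run case where the copy back to the checkout produced nothing.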

dst="$repo_root/experimental/CollectiveX/results"
mkdir -p "$dst"
cp "$mount_src/experimental/CollectiveX/results/"*.json "$dst/" 2>/dev/null || true
cx_log "copied results from stage dir -> $dst (for artifact upload)"

Result copy errors ignored

Medium Severity

cx_collect_results wraps the staged-to-checkout cp in 2>/dev/null || true and always logs success, so a failed or empty copy does not affect the launcher exit code and the workflow can pass without uploadable JSON.

Reviewed by Cursor Bugbot for commit f48daed. Configure here.
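
The copy step could instead report how many files it moved and fail when that number is zero. The following Python sketch illustrates the idea; `collect_results` and its argument names are illustrative, and the real `cx_collect_results` is a bash function.

```python
import shutil
from pathlib import Path


def collect_results(stage_results, checkout_results):
    """Copy *.json from the stage dir to the checkout and return the count.

    Raises FileNotFoundError when nothing was copied, so the caller's exit
    code reflects a missing artifact. Names are illustrative only.
    """
    src = Path(stage_results)
    dst = Path(checkout_results)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in sorted(src.glob("*.json")):
        shutil.copy2(path, dst / path.name)
        copied += 1
    if copied == 0:
        raise FileNotFoundError(f"no result JSON found under {src}")
    return copied
```

Logging the returned count rather than an unconditional success message also makes an empty copy visible in the job log.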

First AMD / cross-vendor reach, scaffolded ahead of Milestone 1:

- run_mori.py: MoRI dispatch+combine (normal mode), correctness-gated,
  mirroring ROCm/mori's dispatch_combine example — int32 routing indices,
  (n,0) fp8 scales, the zero-copy registered-combine-input-buffer staging
  step, and expected = input x (#unique destination ranks). Emits the same
  flat JSON shape (family=moe, backend=mori) with CUDA-event timing.
- launchers/launch_mi355x-amds.sh: AMD adapter — partition compute, no
  account, --cpus-per-task=128, node-local /var/lib/squash imported via srun
  on the allocated node, --container-writable --container-remap-root, forces
  CX_BENCH=mori, mounts the (compute-visible) checkout at /ix.
- launchers/run_in_container.sh: run_mori_suite + mori case (nccl|deepep|mori|all).
- launchers/common.sh: ROCm MoRI image (rocm/sgl-dev:...-mori-0227-2) in
  cx_default_image for mi355x*/mi350x*/mi325x*/mi300x*.
- workflow: mi355x sku + mori benchmark options for workflow_dispatch.
- docs: CONTAINERS.md AMD section, README files/run/risks, plan.md status.

Not yet hardware-validated (no MI355X access) — MoRI's Python API is
version-sensitive (marked ADAPT HERE); the first runner job is the
validation, as GB200 was for DeepEP. The ROCm image isn't digest-pinned yet.
Comment thread experimental/CollectiveX/run_mori.py Fixed
- workflow: replace the on:push GB200 NCCL smoke with the MI355X MoRI
  dispatch/combine run (runs-on: mi355x, CX_BENCH=mori), and name the job
  "CollectiveX Experimental" (no longer "smoke"). GB200/B200 NCCL + DeepEP
  remain on workflow_dispatch.
- launch_mi355x-amds.sh: adapt more faithfully to runners/launch_mi355x-amds.sh
  — squeue by job-name only (no -u), flock -w 600, and clear ROCm gpucore.*
  dumps after the run so the next checkout is clean. Bump default CX_TIME to 60
  for a cold ROCm-image import.
- summarize.py: drop the "N/N results valid." footer from both the job-summary
  (markdown) and plain output; the failure gate still reports invalid results.
  Relabel the MoE section "MoE dispatch+combine (DeepEP / MoRI)".
- docs: README/plan describe push -> MI355X MoRI.
rm -f \"$SQUASH_FILE\"
enroot import -o \"$SQUASH_FILE\" \"docker://$IMAGE\" </dev/null
fi
"

MI355X import errors ignored

High Severity

The node-local enroot import runs inside an srun bash snippet without set -e and with no check after import. A failed import still yields exit 0 from that snippet, so the job continues into pyxis with a missing or corrupt squash file.

Reviewed by Cursor Bugbot for commit d8ee9bf. Configure here.
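
In the bash snippet itself, the usual remedy is `set -e` or an explicit exit-status check after the import. The post-import check can also be expressed as a small predicate, sketched here in Python with hypothetical names (`squash_is_usable`, `import_succeeded`).

```python
import os


def squash_is_usable(path, min_bytes=1):
    """Return True only if the squash file exists and is not empty.

    A size check cannot prove the image is intact; it only catches the
    missing or zero-length file left behind by a failed import.
    """
    try:
        return os.path.getsize(path) >= min_bytes
    except OSError:
        return False


def import_succeeded(returncode, squash_path):
    """Require both a zero exit status and a non-empty squash file."""
    return returncode == 0 and squash_is_usable(squash_path)
```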

- name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }}
env:
RUNNER_NAME: ${{ runner.name }}
run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"

Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.

Reviewed by Cursor Bugbot for commit d8ee9bf. Configure here.

… default)

First MI355X run reached the MoRI dispatch kernel — salloc, ROCm-image import,
mount, torchrun, 8-rank Gloo + shmem init, and EpDispatchCombineConfig/op/dispatch
all worked, confirming the API signatures. It OOM'd MoRI's default 2 GiB static
symmetric heap (hidden=7168 dispatch/combine buffers across 8 ranks request
~0.9 GiB each).

run_mori.py now sets MORI_SHMEM_HEAP_SIZE before `import mori` (default 16 GiB,
override CX_MORI_HEAP_BYTES). Docstring + CONTAINERS.md record the finding;
correctness/timing validated by the heap-sized re-run.

salloc --partition="$PARTITION" --exclude="$EXCLUDE_NODES" --gres=gpu:"$NGPUS" \
--exclusive --cpus-per-task=128 --time="$TIME_MIN" --no-shell --job-name="$RUNNER_NAME"
JOB_ID="$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)"

Slurm job ID not scoped

Medium Severity

launch_mi355x-amds.sh resolves JOB_ID with squeue --name="$RUNNER_NAME" and no -u "$USER", while the other CollectiveX NVIDIA launchers filter by user. On a shared cluster, the first matching job name may belong to another account, so subsequent srun/scancel can target the wrong allocation.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit ac3f1b9. Configure here.
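
Scoping the lookup by user is a one-flag change. The sketch below builds the scoped `squeue` command and parses its output in Python; the helper names are hypothetical, while the `--name`, `-u`, `-h`, and `-o %A` options are the ones already used by the launchers in this PR.

```python
import getpass


def squeue_job_id_cmd(job_name, user=None):
    """Build a squeue command that matches on job name AND owning user."""
    user = user or getpass.getuser()
    return ["squeue", f"--name={job_name}", "-u", user, "-h", "-o", "%A"]


def first_job_id(squeue_stdout):
    """Return the first non-empty line of squeue output, or None."""
    for line in squeue_stdout.splitlines():
        line = line.strip()
        if line:
            return line
    return None
```

Returning `None` for empty output lets the caller abort before issuing `srun` or `scancel` against an unresolved job ID.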

The heap-bump run cleared the 2 GiB OOM but then failed registering the 16 GiB
symmetric heap as an RDMA memory region (errno 22 EINVAL, size=17179869184).
ROCm/mori's reference test uses MORI_SHMEM_HEAP_SIZE="6G" single-node — big
enough for the hidden=7168 dispatch/combine buffers, small enough to register.

Match it: default "6G" (override CX_MORI_HEAP_SIZE). The rest of the config
already matches the reference (max_num_inp_token_per_rank=4096, hidden=7168,
backend cpu:gloo,cuda:nccl), so this lands on the proven single-node setup.

Drove run_mori.py to a correct run on 8x MI355X (on-node via salloc+srun):
dispatch+combine numerically correct (combine within tol, max_rel ~2e-3),
~85us round-trip at the decode shape. The first runs surfaced five issues,
all fixed and re-validated:

- RDMA MR ceiling: MoRI registers the WHOLE symmetric heap as one RDMA MR at
  init (even single-node; no disable-RDMA knob). The ionic_rdma NICs cap GPU
  MRs at ~4 GiB — a 6 GiB heap fails (RegisterRdmaMemoryRegion errno 22), 2 GiB
  registers. Hold heap at MORI_SHMEM_HEAP_SIZE=2G (override CX_MORI_HEAP_SIZE).
- Buffer sizing: max_num_inp_token_per_rank 4096 -> max(512, n) so the buffers
  fit the 2 GiB heap (4096 was inherited from the reference test).
- Correctness shape: combine returns the full max-token buffer; compare only
  combined[:n] against expected.
- recv count: read total_recv BEFORE combine (combine resets recv_num, which
  made recv_nonzero a false negative).
- Teardown: MoRI's shmem teardown asserts (CheckStatusValid -> SIGABRT) when the
  op is destroyed after shmem_finalize(); hard-exit after writing results.

Docs (README/plan/CONTAINERS) updated from "scaffolded" to validated, with the
fabric constraints recorded.
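
The heap and buffer-sizing rules from this commit message can be sketched as two small functions. This is an illustration under assumptions: the function names are invented, and whether run_mori.py overwrites a `MORI_SHMEM_HEAP_SIZE` that is already set is not stated here.

```python
import os


def configure_mori_heap(env=None, default="2G"):
    """Set MORI_SHMEM_HEAP_SIZE, honoring a CX_MORI_HEAP_SIZE override.

    Must run before `import mori`, per the commit messages above.
    Hypothetical helper; overwriting a pre-set value is an assumption.
    """
    env = os.environ if env is None else env
    env["MORI_SHMEM_HEAP_SIZE"] = env.get("CX_MORI_HEAP_SIZE") or default
    return env["MORI_SHMEM_HEAP_SIZE"]


def max_tokens_per_rank(n):
    """Buffer sizing from the commit message: max(512, n)."""
    return max(512, n)
```

Passing a plain dict as `env` keeps the function testable without touching the process environment.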
Comment thread experimental/CollectiveX/run_mori.py Fixed
Comment thread experimental/CollectiveX/run_mori.py Fixed
…CH=nccl)

Adds the AMD collective-primitive path so all_reduce/reduce_scatter/all_gather/
alltoall run on MI355X, not just MoRI:

- common.sh: cx_build_rccl_tests — clones ROCm/rccl-tests and builds with `make`
  against /opt/rocm (amdclang++/librccl). It's a nccl-tests fork producing the
  same <op>_perf binaries and output format, so run_nccl.py parses it unchanged.
  Validated building + running all 4 ops in-container on MI355X (correctness OK).
- run_in_container.sh: run_nccl_suite picks rccl-tests on ROCm (/opt/rocm or
  hipcc), nccl-tests otherwise; identical op loop + run_nccl.py invocation.
- launch_mi355x-amds.sh: honor CX_BENCH (mori default | nccl) instead of forcing
  mori; same -g N single-node 8-GPU launch.
- docs: README/CONTAINERS note the rccl path.

B200 already has the nccl path; this makes primitives available on all three
SKUs via workflow_dispatch.
Comment thread experimental/CollectiveX/launchers/launch_mi355x-amds.sh
if name:
devices.append(name)
elif _run(["ibstat", "-l"]):
devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]

ibstat fallback may crash capture

Low Severity

In _rdma, the ibstat -l branch calls _run twice. If the first call succeeds but the second returns None, None.splitlines() raises and env_capture.py aborts before writing provenance JSON for that run.

Reviewed by Cursor Bugbot for commit 2b23573. Configure here.
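
Calling the command once and reusing its output removes the crash path. A minimal sketch, assuming `_run` returns the command's stdout as a string or `None` on failure (that contract is inferred from the comment, not shown in the diff); the function name `list_ib_devices` is hypothetical.

```python
def list_ib_devices(run):
    """Return device names from `ibstat -l`, or [] if the command fails.

    `run` stands in for env_capture.py's _run helper and is assumed to
    return the command's stdout as a string, or None on failure.
    """
    out = run(["ibstat", "-l"])  # call once and reuse the result
    if not out:
        return []
    return [d.strip() for d in out.splitlines() if d.strip()]
```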

…on-node

launch_gb200-nv.sh now branches on CX_NODES: 1 (default) keeps the single-tray
4-GPU dispatcher path; >1 runs across the NVL72 NVLink fabric (e.g. CX_NODES=2
= 8 GPU) by building nccl-tests MPI=1, running each op across WORLD ranks via
`srun --mpi=pmix` (1 GPU/rank) with the MNNVL env, and parsing on the login node
— mirroring launch_b200-dgxc-slurm but staying on NVLink instead of IB.

Validated on GB200 (2x watchtower-navy trays, 8 GPU): all 4 ops valid, peak
busbw all_reduce 822.8 / reduce_scatter 670.6 / all_gather 651.2 / alltoall
625.0 GB/s — ~30% over single-tray and on par with B200 8-GPU NVLink, i.e.
MNNVL engaged (not an IB fallback).

- common.sh: cx_build_nccl_tests auto-detects MPI_HOME for MPI=1 (Debian OpenMPI
  headers live under /usr/lib/<arch>/openmpi/include; MPI_HOME=/usr fails). Works
  x86_64 + aarch64.
- launch_b200-dgxc-slurm.sh: fix BUILD_IN_CTR path (.nccl-tests/nccl-tests/build).
- workflow: add `nodes` dispatch input -> CX_NODES.
…ngine

Wire the kv-cache 'mooncake' backend: tests/mooncake_transfer.py — Mooncake
TransferEngine, P2PHANDSHAKE metadata (no etcd), src/dst GPU buffers registered,
RDMA transfer_write_on_cuda/_on_hip loopback over a size sweep. run_in_container
pip-installs mooncake-transfer-engine (the directive's 'import a new one', as a pip
import). Auto-detects the RDMA NIC from /sys/class/infiniband; self-documents the API
+ device; absence of pkg/NIC is recorded. mooncake benchmark choice, both vendors.

Add _build_aiter to allreduce_fw_bench: tries aiter.dist.device_communicators.
custom_all_reduce.CustomAllreduce / quick_all_reduce (the AITER wrapper owns the IPC
buffer), else records the raw aiter.ops kernel as present-but-needs-wrapper. Registered
as the 'aiter' framework impl; import-guarded (skips on the NVIDIA image). Runs on
MI355X allreduce-fw alongside the RCCL baseline.
…P sweeps)

The concurrency group omitted inputs.nodes, so same sku/benchmark/dtype runs at
different node counts (EP16/32/64) shared one group -> GitHub kept 1 running + 1
pending and CANCELLED the rest (GB200 EP32 was cancelled while EP16/EP64 ran). Add
inputs.nodes so each EP size is its own group.
…Infer-MNNVL

DeepEP intranode caps at 8 ranks, but FlashInfer MoeAlltoAll's MNNVL workspace spans
the NVL72 NVLink domain: GB300/GB200 EP8/16/64 validated correct=True (EP32 re-run
after a concurrency-group fix). Cross-node-over-IB (H100/H200) is the remaining
internode-DeepEP/IBGDA gap (MNNVL doesn't span IB); cross-node MI355X needs multi-node alloc.
…-init (sglang/vllm)

The sglang/vllm CustomAllreduce skipped because it builds ca_comm only INSIDE the
framework's distributed init (initialize_model_parallel), not from a bare wrapper ctor.
New _sglang_vllm_ca_runner replicates that init (init_distributed_environment +
initialize_model_parallel) on the torchrun group, then uses the TP GroupCoordinator's
ca_comm.custom_all_reduce (with should_custom_ar size-gating -> _SkipSize). sglang runs
in-image; vllm runs under a vLLM container switch. Shared helper (sglang forked vllm's
parallel_state, identical API).
…benchmark)

allreduce-fw-vllm runs the framework-AR bench in a vLLM cuda image (vllm/vllm-openai:
latest) via CX_IMAGE — _build_vllm replicates vLLM's serving init (same helper proven
for sglang: 175 GB/s correct=True) and uses the TP GroupCoordinator's ca_comm. The
container switch the directive calls for (vLLM isn't in the sglang image).

aiter.dist.parallel_state forked vllm's (init_distributed_environment /
initialize_model_parallel / get_tp_group), with ca_comm nested under
device_communicator. Route _build_aiter through the shared _sglang_vllm_ca_runner
(helper now finds ca_comm on tp OR tp.device_communicator). The first bare-wrapper
version got a nan; replicating the init gives a working ca_comm (sglang proved the
pattern: 175 GB/s correct=True).
…with-device-API

vLLM: its CustomAllreduce is a CustomOp that asserts an active VllmConfig (observed
'Current vLLM config is not set' in vllm/vllm-openai). _build_vllm now enters
set_current_vllm_config(VllmConfig()) persistently around init + run; free() exits it.
NIXL EP: cx_probe_nixl_ep now builds UCX from source WITH CUDA (ships the device-API
header <ucp/api/device/ucp_device_impl.h> the dynamo image's UCX lacked) and points
pkg-config at it, then retries the nixl_ep meson — the directive's build-fix for the
'UCX GPU Device API: NO' wall.
…CX = driver wall

Framework all-reduce now DONE for all 3 via serving-init replication (sglang 175 GB/s,
aiter 367.8 GB/s, both correct=True) + vLLM via container switch + VllmConfig context
(correct=True). NIXL device-EP: UCX-from-source build attempted, device API STILL NO ->
the root cause is GPUDirect-Async/IBGDA driver+hardware support (not a build flag),
a base-platform capability. Evidenced terminal walls.

CX_NODES>1 on MI355X: salloc N nodes (pinned to the warm-squash nodes via CX_NODELIST
so no cold import), import the squash on each, multi-srun run_ep across NODES*8 ranks
(RANK/LOCAL_RANK from SLURM_*, MASTER_ADDR=first node) — the GB300 EP8 multi-srun shape.
MoRI is RDMA-native (ionic_rdma symmetric heap spans nodes), so this exercises true
cross-node EP. Reduced timing (MoRI wedge guard).
…L internode (goal 182)

run_in_container's run_ep torchrun gains multi-node rendezvous (CX_NNODES/CX_NODE_RANK/
CX_MASTER_ADDR -> torchrun --nnodes --node-rank --master-addr). launch_h200.sh CX_NODES>1:
salloc N nodes, one container task/node, run_in_container spans NODES*8 ranks over IB.
UCCL EP is internode-native (RDMA/IB) — the right backend (DeepEP normal-internode
asserts out). Squash+repo on compute-visible NFS. topology=h200-multinode-ib.
…er config)

Formalizes the 'newest-good-per-config kept; superseded moved aside' the .gitignore
references: groups results/ by comparison_key, keeps the newest 3 usable runs per
config (preserves repeat-run aggregation), moves older/superseded/stale-failed to
results/.superseded (out of the plot glob, recoverable). Genuinely-failed configs with
no valid counterpart are kept (preserve-failed-cases deliverable).
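
The core of the pruning rule, newest three per comparison_key, can be sketched as follows. This is a simplified illustration: the `timestamp` field name is an assumption, and the usability filter and the keep-failed-configs rule described above are left out.

```python
from collections import defaultdict


def split_superseded(records, keep=3):
    """Partition result records into (kept, superseded) lists.

    Each record is a dict with `comparison_key` and a sortable `timestamp`.
    The `timestamp` field name is an assumption about the result schema.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["comparison_key"]].append(rec)
    kept, superseded = [], []
    for recs in groups.values():
        newest_first = sorted(recs, key=lambda r: r["timestamp"], reverse=True)
        kept.extend(newest_first[:keep])        # newest `keep` runs stay
        superseded.extend(newest_first[keep:])  # the rest are moved aside
    return kept, superseded
```

Keeping three runs rather than one is what preserves the repeat-run aggregation mentioned in the commit message.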
…torch rendezvous)

Both cross-node attempts failed at the torch.distributed rendezvous, not the EP backend:
MI355X gloo 'connect refused remote=[127.0.1.1]' (hostname loopback-aliased in /etc/hosts)
and H200 'connect to worker-1:29561 timed out' (hostname not routable cross-node). Resolve
MASTER_ADDR via scontrol NodeAddr (the routable IP) in both multi-node launchers, fall back
to hostname. GB200/GB300 worked because their hostnames are routable.
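
Resolving the address with a hostname fallback can be sketched as a parser over `scontrol show node <name>` output. The helper name `node_addr` is hypothetical, and the launchers do this in bash.

```python
import re


def node_addr(scontrol_output, fallback_hostname):
    """Extract NodeAddr from `scontrol show node <name>` output.

    Falls back to the given hostname when the field is absent.
    """
    match = re.search(r"\bNodeAddr=(\S+)", scontrol_output)
    return match.group(1) if match else fallback_hostname
```

The result would then be exported as MASTER_ADDR before the ranks start.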
Cross-node EP (goal 182/183) failed at torch's gloo connectFullMesh with
remote=[127.0.1.1] despite MASTER_ADDR being the routable NodeAddr IP: the
per-rank mesh advertises each rank's hostname, which the MI355X/H200 /etc/hosts
aliases to loopback. Add runtime/_xnode_net.sh (sourced per-rank) to auto-pin
GLOO_SOCKET_IFNAME/NCCL_SOCKET_IFNAME to the routable 10.x NIC, and wire it into
the MI355X multi-srun WRAP and run_in_container's multi-node torchrun path.

probe_deepep_caps.py / probe_deepep_ll.py were one-off read-only DeepEP
capability probes from the earliest FP8/LL commit. The capability surface they
sampled is now owned canonically by tests/capability.py (+ ep_mori.py); the
probes have zero inbound references anywhere (code, docs, workflows). Remove
them as dead scaffolding.

prune_results.py is the canonical results-hygiene tool (newest-3-per-comparison_key
with publication_status gating + repeat-run preservation). tools/_keep_newest.py was
the older newest-1 variant; 0 inbound references, and its own docstring path
(launchers/_keep_newest.py) is stale. Remove the duplicate.
…fallback

Harden the cross-node bootstrap helper: always print the container's hostname +
every visible IPv4 (so a cross-node GHA log self-documents what each rank's
network namespace sees), and tolerate minimal images without iproute2. Clarify
that the iface pin cannot fix an unreachable MASTER_ADDR (a cluster/container-net
property), only the per-rank gloo connectFullMesh advertisement.
…ILE)

The env:// TCPStore rendezvous (MASTER_ADDR:PORT) times out cross-node on the
H100/H200/MI355X fleets because the rank-0 management-subnet NodeAddr is not
reachable from a peer rank's enroot container net namespace. When CX_RDZV_FILE
points at a path on the compute-visible shared mount, init the PG via a FileStore
instead: ranks exchange the store + NCCL unique-id through the shared file, and
NCCL connects peers over the IB fabric (routable cross-node) rather than the
unreachable management TCP. Default-off; single-node path is byte-identical.

Replace the one-container-per-node + torchrun path (whose elastic-agent TCPStore
timed out 900s on the unreachable management-subnet master-addr) with the proven
multi-srun shape used by MI355X/GB300: Slurm places NODES*NGPUS ranks directly
(RANK/LOCAL_RANK from SLURM_*), no torchrun agent. Ranks rendezvous via a
FileStore on the shared mount (CX_RDZV_FILE) so NCCL bootstraps over IB instead
of the unreachable management TCP. Parses CX_TIMING; sources _xnode_net.sh.

The FileStore rendezvous (CX_RDZV_FILE) got past the management-subnet TCPStore
wall, but the multi-srun-per-rank shape lacked uccl (pip-installed by
cx_build_uccl in run_in_container, which per-rank ephemeral containers skip).
Fix: keep one-container-per-node so run_in_container builds uccl once per node,
then when CX_NNODES>1 spawn NGPUS local ranks directly (global RANK =
CX_NODE_RANK*NGPUS+local) rendezvousing via the shared-mount FileStore instead of
torchrun — torchrun's elastic agent ran its own unreachable cross-node TCPStore.
run_ep_suite refactored to a shared EPARGS array driving both paths.
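
The rank arithmetic for the local-spawn path is small enough to show directly. The sketch below computes the per-rank settings without importing torch; the function name and the returned key names are illustrative, and the `file://` string is the standard torch.distributed init_method form for a shared-file rendezvous.

```python
def local_spawn_env(node_rank, nnodes, ngpus, local_rank, rdzv_file):
    """Compute per-rank settings for a direct (non-torchrun) local spawn.

    Global rank follows the commit message: CX_NODE_RANK * NGPUS + local.
    The returned key names are illustrative, not the script's own.
    """
    if not 0 <= local_rank < ngpus:
        raise ValueError("local_rank out of range")
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank out of range")
    return {
        "RANK": node_rank * ngpus + local_rank,
        "LOCAL_RANK": local_rank,
        "WORLD_SIZE": nnodes * ngpus,
        # torch.distributed accepts a file:// init_method on a shared path.
        "init_method": f"file://{rdzv_file}",
    }
```

For two 8-GPU nodes, local rank 3 on node 1 maps to global rank 11 in a world of 16.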
…vendors)

The canonical token-shuffle EP on pure torch.distributed all_to_all_single: the
ONLY EP backend that survives cross-node without GPUDirect-RDMA. UCCL's ibv_reg_mr
fails EINVAL->SIGSEGV and MoRI's RDMA registration aborts (both after the
rendezvous now forms via FileStore), but NCCL/RCCL host-stage the all-to-all over
IB. tests/ep_nccl.py (bf16/normal/layout-and-dispatch); run_ep + run_in_container
(run_nccl_ep_suite, no build) + capability (both vendors) + workflow choice wired.
MI355X multi-srun also gets CX_RDZV_FILE (nccl-ep uses pure rccl PG + FileStore,
sidestepping the gloo connectFullMesh 127.0.1.1 alias entirely).

The backend dispatch elif was added but the argparse choices list still rejected
'nccl-ep' (run 28326942401: 'invalid choice: nccl-ep'). Add it to choices.
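
A way to keep the dispatch and the argparse choices from drifting apart is to drive both from one tuple. This is a hedged sketch, not the script's actual structure: the backend names are taken from commit messages in this PR, and the handler-name mapping is invented for illustration.

```python
import argparse

# Backend names taken from this PR's commit messages; the real list may differ.
BACKENDS = ("deepep", "mori", "uccl", "flashinfer", "deepep-hybrid", "nccl-ep")


def build_parser():
    """Build a parser whose --backend choices come from BACKENDS."""
    parser = argparse.ArgumentParser(prog="run_ep")
    parser.add_argument("--backend", choices=BACKENDS, required=True)
    return parser


def dispatch(backend):
    """Return a (hypothetical) handler name for the chosen backend."""
    handlers = {name: f"run_{name.replace('-', '_')}" for name in BACKENDS}
    return handlers[backend]
```

With this layout, adding a backend to `BACKENDS` makes it both selectable on the command line and dispatchable in one step.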
…RDMA root cause)

Rewrite the cross-node section: goal 182 (H100/H200) is DONE via nccl-ep over IB
(H200 world=16, run 28327088942, correct=True). Document the two-layer root
cause: (1) rendezvous wall (management-subnet store unreachable from container
netns) solved by shared-mount FileStore + local-spawn; (2) custom-RDMA data-path
wall (UCCL ibv_reg_mr EINVAL→SIGSEGV, MoRI SIGABRT, DeepEP asserts) needs
GPUDirect-RDMA the HCAs lack, so NCCL/RCCL host-staged all-to-all is the portable
cross-node EP. MI355X (183) validation in flight on rccl.

The MI355X AMD-bench allowlist didn't include nccl-ep, so CX_BENCH=nccl-ep
silently fell back to mori — run 28327089664 ran MoRI cross-node (SIGABRT) instead
of the intended rccl all-to-all EP. nccl-ep IS AMD-supported (pure RCCL
all_to_all_single); add it to the allowlist so goal-183 cross-node runs on rccl.
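
The commit fixes this case by extending the allowlist. A stricter variant, which is an additional guard not stated in the PR, would reject an unknown CX_BENCH value instead of falling back. The sketch below shows that behavior with a hypothetical `select_bench` helper.

```python
def select_bench(requested, allowlist, default="mori"):
    """Resolve CX_BENCH against an allowlist without a silent fallback.

    An unset or empty request uses the default; an unknown request raises.
    Hypothetical helper; the launcher's real logic is in bash.
    """
    if not requested:
        return default
    if requested not in allowlist:
        allowed = ", ".join(sorted(allowlist))
        raise ValueError(f"CX_BENCH={requested!r} not supported here; allowed: {allowed}")
    return requested
```

Failing loudly would have surfaced the missing nccl-ep entry at launch time rather than after a MoRI run had already aborted.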
…ep/rccl

MI355X nodes=2/world=16 over RoCE/IB, run 28328718973 correct=True T=1-8. Both
cross-node EP points (182 H200, 183 MI355X) now done via the unified nccl-ep
path; the custom-RDMA GPUDirect wall is documented + routed around.

mooncake (HOST_GPU_BENCH amd-capable) wasn't in the MI355X bench allowlist, so it
silently fell back to mori (run 28340951096). Add it so run_mooncake_suite can
attempt the ROCm transfer-engine on MI355X (documents the wall if the wheel lacks
HIP support).

MoonCake on MI355X = evidenced ROCm wall (engine inits on rdma0 but the wheel has
no transfer_write_on_hip, only _on_cuda; run 28342781762 invalid/0 groups) — needs
an upstream Mooncake ROCm build. MI355X rccl-tests (All-reduce/All-gather tab)
keeps failing in the runner checkout/setup step (shared with the agentic fleet) —
a runner-contention infra flake, not an rccl limitation. mori-io (28.2), copy-
engine/SDMA, and rccl-kv (71.7 GB/s) backfilled successfully.
…ests)

The persistent MI355X rccl-primitives failure was capability.py rejecting
benchmark=nccl on amd (exit 3 in the Validate-capability step, before the
launcher ran) — masked earlier by the gharunner06 root-LOGS EACCES. But the
nccl BENCHMARK runs on both vendors: run_nccl_suite auto-picks rccl-tests on
ROCm. Make COLLECTIVE nccl valid on amd so the All-reduce/All-gather tabs get an
MI355X line.
…l-parity sweeps

Thread deepep_v2=true (kernel_gen=v2 from-source) and a --backend override that
remaps the deepep suite matrix onto uccl/flashinfer/deepep-hybrid/nccl-ep, with a
capability pre-filter (resolve() per case) so no doomed dispatch is fired. Enables
per-backend full-matrix parity: deepep-v2 242 / uccl 242 / flashinfer 162 /
deepep-hybrid 156 NVIDIA cases across H100/H200/B300.
…nodes

Add b200 (8x NVLink, sibling of b300) + gb200 (NVL72, sibling of gb300) to
platforms.yaml + every relevant suite's platform list (mirroring b300/gb300
coverage). Un-drop gb300 in _gha_suite.sh (runners online now) + map gb200/b200
in the SKU dict. Thread nodes for the rack-scale SKUs (gb200/gb300 = 4 GPU/tray,
so EP8 = 2 trays/nodes). Enables full-parity sweeps across all 7 SKUs.
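
The node-count threading for rack-scale SKUs reduces to a ceiling division of the EP size by GPUs per node. A minimal sketch with a hypothetical helper name:

```python
def nodes_for_ep(ep_size, gpus_per_node):
    """Return the number of nodes (trays) needed for a given EP size."""
    if ep_size <= 0 or gpus_per_node <= 0:
        raise ValueError("ep_size and gpus_per_node must be positive")
    return -(-ep_size // gpus_per_node)  # ceiling division
```

With 4 GPUs per tray, EP8 needs 2 nodes, matching the commit message, while an 8-GPU node covers EP8 on its own.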