Composable send/recv interface (transport + hooks + minimal NVLink path) (#2829)
Open
ChenYuHo wants to merge 2 commits into
@ChenYuHo has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107172780.
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 8, 2026
…th) (meta-pytorch#2829)

Summary:

Design doc: P2365280005
Code walk through: P2365344262

First, interface-only diff for the composable device send/recv framework that fused compute/comm kernels (all-to-all, allgather, ...) will build on. No production pipeline is wired; review overhead is intentionally low.

New package `comms/dsl/`:

* `ctx.py`: `Ctx`, the DSL-agnostic field-set spec of the hook contract, realized on-device per DSL (Triton `_aggregate` in `triton/ctx.py`).
* `ops.py`: the device transport-ops seam (`put` / `get` / `signal` / `wait`), the only transport-specific device primitives.
* `transport.py`: user-owned p2p transport objects:
  - `P2pTransport` Protocol + `PeerEndpoint` (host-resolved per-peer device state),
  - real, minimal `NvlTransport` + `nvl_rendezvous` (one collective rendezvous; no hidden cache, the user holds the object),
  - reserved `IbTransport` / `ib_rendezvous`, and `MeshTransport`, which routes per peer via `link_kind` so a single collective can mix NVLink (intra-domain) and IB (inter-domain) later,
  - `check_transfer()`: fail-loud guard that numel fits the per-peer staging region and num_blocks fits the signal-pad slots (a violation would silently overrun the next peer's region on the remote rank).
* `triton/`: minimal, real (no-pipeline, single-shot) device kernels `send_tiles`/`recv_tiles` written against the ops seam; hooks take a single `Ctx` aggregate (`produce(ctx) -> regs`, `consume(ctx, regs)`), and `triton/ctx.py` realizes `Ctx` as a Triton `_aggregate`. `nvl_ops` (real, over the framework's own self-contained PTX in `triton/device_utils.py`, no `comms/pipes` dep); `ib_ops` (reserved); `copy_*` default hooks; `send`/`recv`/`sendrecv` launchers + a commented mixed-transport sketch.
* `cute/`: `send`/`recv` interface stubs reserved for the CuTe backend (made real in the next diff).
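The `check_transfer()` guard listed under `transport.py` can be pictured with a small host-side sketch. Only the function name and the two conditions it checks come from the summary; the `StagingLayout` class, its field names, the one-slot-per-block reading of the signal pad, and the `ValueError` exception type are assumptions made for illustration and are not taken from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagingLayout:
    """Hypothetical per-peer capacity of the symmetric-memory buffers."""

    staging_numel_per_peer: int  # elements in one peer's staging region
    signal_slots_per_peer: int   # signal-pad slots reserved per peer


def check_transfer(layout: StagingLayout, numel: int, num_blocks: int) -> None:
    """Fail loudly on the host instead of overrunning a neighbour's region.

    Assumed semantics: one signal slot per thread block, and per-peer regions
    laid out back to back, so any excess would land in the next peer's region.
    """
    if numel > layout.staging_numel_per_peer:
        raise ValueError(
            f"numel={numel} exceeds per-peer staging capacity "
            f"{layout.staging_numel_per_peer}"
        )
    if num_blocks > layout.signal_slots_per_peer:
        raise ValueError(
            f"num_blocks={num_blocks} exceeds per-peer signal slots "
            f"{layout.signal_slots_per_peer}"
        )


layout = StagingLayout(staging_numel_per_peer=4096, signal_slots_per_peer=32)
check_transfer(layout, numel=4096, num_blocks=32)  # exactly fits: no error
```

Running the check on the host before launch is what makes the failure loud: once the kernel is running, an oversized transfer would simply write past the region boundary.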
Design notes:

* The hook seam is **full-leg** (one `produce`/`consume` per direction), which is minimal yet sufficient: it subsumes address-gather, value-transform, packed staging, accumulate, and compute-into-send. The `addr_fn`+`transform` form will be a composer on top (follow-up).
* Hooks take a single opaque `Ctx`, so the contract is churn-proof: future needs (pipeline `slot`/`step`, MoE `expert_counts`, FP8 `scales`, warp-spec role, ...) become `Ctx` fields and never change a hook signature. `Ctx` is also schedule-agnostic, so the same hooks serve a user-written schedule today and a library schedule skeleton later.
* Transport choice binds **per call-site** (ops as `constexpr`), which is what lets one kernel mix NVLink and IB once the IB ops/transport are filled in, with no change to `send`/`recv`, hooks, or `Ctx`.
* **CUDA-graph-ready (architecture).** The collective rendezvous + symm-mem allocation run at setup (outside capture); the captured region is pure kernel launches over persistent, user-held buffers. Verified on 2x H100 for the **Triton** path: send/recv captures and a single replay is correct (after a host-side signal-pad reset; kernels must be warmed up before capture). NOTE: repeated replay is not yet gated: the minimal single-shot signal (`seq=1`) is not re-armed, so guaranteed repeat-replay support lands with the monotonic-counter pipelined follow-up. (CuTe graph capture is a separate follow-up; see D107283879.)

Follow-up stacks: (1) Triton: real pipelined `send`/`recv` + port `all_to_all_single`; (2) CuTe: real CuTe `send`/`recv` on the same contract + cross-DSL interop test; (3) IB + mixed transport.

Differential Revision: D107172780
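To make the "churn-proof" claim about the hook contract concrete, here is a minimal host-side model in plain Python. It is a sketch under stated assumptions, not the Triton implementation: the `Ctx` field names (`src`, `dst`, `offset`, `size`), the hook names `copy_produce`/`copy_consume`/`scaled_produce`, and the `send_recv_tile` driver are hypothetical. Only the hook shapes `produce(ctx) -> regs` and `consume(ctx, regs)` and the idea of adding `scales` as a later `Ctx` field come from the notes above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ctx:
    """Hypothetical host-side stand-in for the on-device `Ctx` aggregate."""

    src: List[float]                      # local source buffer
    dst: List[float]                      # local destination buffer
    offset: int                           # start of this tile
    size: int                             # elements in this tile
    scales: Optional[List[float]] = None  # field added later (e.g. FP8 scales)


Produce = Callable[[Ctx], List[float]]
Consume = Callable[[Ctx, List[float]], None]


def copy_produce(ctx: Ctx) -> List[float]:
    """Default send hook: load the tile unchanged."""
    return ctx.src[ctx.offset:ctx.offset + ctx.size]


def copy_consume(ctx: Ctx, regs: List[float]) -> None:
    """Default receive hook: store the tile unchanged."""
    ctx.dst[ctx.offset:ctx.offset + ctx.size] = regs


def scaled_produce(ctx: Ctx) -> List[float]:
    """Value-transform hook that reads the newer `scales` field."""
    tile = ctx.src[ctx.offset:ctx.offset + ctx.size]
    return [v * s for v, s in zip(tile, ctx.scales)]


def send_recv_tile(ctx: Ctx, produce: Produce, consume: Consume) -> None:
    """Stand-in for one send leg plus one recv leg; the transport is elided."""
    regs = produce(ctx)
    consume(ctx, regs)


ctx = Ctx(src=[1.0, 2.0, 3.0, 4.0], dst=[0.0] * 4, offset=2, size=2)
send_recv_tile(ctx, copy_produce, copy_consume)
print(ctx.dst)  # [0.0, 0.0, 3.0, 4.0]
```

In this model, adding `scales` to `Ctx` left `copy_produce`, `copy_consume`, and the driver untouched; only the hook that needs the new field reads it.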
1a57b88 to 74a4d4f
4f7337f to c09f92f
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 9, 2026
c09f92f to f22b024
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 9, 2026
f22b024 to 46c244d
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 9, 2026
6a874e2 to 93e11a4
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 9, 2026
93e11a4 to f87199d
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 11, 2026
71ab468 to d712f15
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request on Jun 11, 2026
d712f15 to b562ac4
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 11, 2026
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
c7db6ed to aabc6f1
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
aabc6f1 to 6bd1698
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
added 2 commits
June 16, 2026 00:24
…torch#2858) Summary: Overall design doc (README) for the comms/dsl framework, landing first in the stack so the design is reviewed before the code. Covers the design principles (the framework owns the generic 95%: schedule, multi-peer addressing, signal/wait; the kernel owner writes the 5%: a per-tile hook + a transport; performance is autotuned, not hand-tuned; spectrum of control), plus two worked examples: 1) a custom collective (a2a non-contig) built on the framework in ~10 lines, and 2) the autotuner workflow that populates the optimal kernel config and re-tunes on shape changes with no kernel edits (target: ~5 min when TBD change their shapes/dimensions). Reviewed By: cenzhaometa Differential Revision: D108105252
…th) (meta-pytorch#2829)
6bd1698 to 68637eb
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026