Composable send/recv interface (transport + hooks + minimal NVLink path) (#2829) by ChenYuHo · Pull Request #2829 · meta-pytorch/torchcomms

ChenYuHo · 2026-06-08T23:40:44Z

Summary:

First, interface-only diff for the composable device send/recv framework that
fused compute/comm kernels (all-to-all, allgather, ...) will build on. No
production pipeline is wired; review overhead is intentionally low.

New package `comms/dsl/`:

* `ctx.py`: `Ctx`, the DSL-agnostic field-set spec of the hook contract,
  realized on-device per DSL (Triton `_aggregate` in `triton/ctx.py`).
* `ops.py`: the device transport-ops seam (`put` / `get` / `signal` / `wait`),
  the only transport-specific device primitives.
* `transport.py`: user-owned p2p transport objects:
  - `P2pTransport` Protocol + `PeerEndpoint` (host-resolved per-peer device state),
  - real, minimal `NvlTransport` + `nvl_rendezvous` (one collective rendezvous;
    no hidden cache, the user holds the object),
  - reserved `IbTransport` / `ib_rendezvous`, and `MeshTransport`, which routes
    per peer via `link_kind` so a single collective can mix NVLink (intra-domain)
    and IB (inter-domain) later,
  - `check_transfer()`: fail-loud guard that numel fits the per-peer staging
    region and num_blocks fits the signal-pad slots (a violation would silently
    overrun the next peer's region on the remote rank).
* `triton/`: minimal, real (no-pipeline, single-shot) device kernels
  `send_tiles`/`recv_tiles` written against the ops seam; hooks take a single
  `Ctx` aggregate (`produce(ctx) -> regs`, `consume(ctx, regs)`), which
  `triton/ctx.py` realizes as a Triton `_aggregate`. Also: `nvl_ops` (real, over
  the framework's own self-contained PTX in `triton/device_utils.py`, no
  `comms/pipes` dep); `ib_ops` (reserved); `copy_*` default hooks; `send`/`recv`/
  `sendrecv` launchers + a commented mixed-transport sketch.
* `cute/`: `send`/`recv` interface stubs reserved for the CuTe backend (made
  real in the next diff).
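
As a rough illustration of the `check_transfer()` guard described above, here is a plain-Python sketch. It is not the PR's code: the `StagingLayout` dataclass, its field names, the function signature, and the use of `ValueError` are all assumptions made for the example. Only the two checked conditions (numel against the per-peer staging region, num_blocks against the signal-pad slots) come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagingLayout:
    """Hypothetical per-peer capacities; the field names are assumptions."""

    staging_numel_per_peer: int  # elements reserved per peer in the staging buffer
    signal_slots_per_peer: int   # signal-pad slots reserved per peer


def check_transfer(layout: StagingLayout, numel: int, num_blocks: int) -> None:
    """Raise before launch if a transfer would overrun a per-peer region.

    Failing on the host is the point: on the device, an oversized transfer
    would write into the next peer's region without any error.
    """
    if numel < 0 or num_blocks <= 0:
        raise ValueError(f"invalid transfer: numel={numel}, num_blocks={num_blocks}")
    if numel > layout.staging_numel_per_peer:
        raise ValueError(
            f"numel={numel} exceeds the per-peer staging region "
            f"({layout.staging_numel_per_peer} elements)"
        )
    if num_blocks > layout.signal_slots_per_peer:
        raise ValueError(
            f"num_blocks={num_blocks} exceeds the per-peer signal-pad slots "
            f"({layout.signal_slots_per_peer})"
        )
```

Under these assumptions, a launcher would call the guard once on the host before every kernel launch, so a mis-sized transfer surfaces as a Python exception rather than as corrupted data on the remote rank.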

Design notes:

* The hook seam is **full-leg** (one `produce`/`consume` per direction), which is
  minimal yet sufficient: it subsumes address-gather, value-transform, packed
  staging, accumulate, and compute-into-send. The `addr_fn`+`transform` form will
  be a composer on top (follow-up).
* Hooks take a single opaque `Ctx`, so the contract is churn-proof: future needs
  (pipeline `slot`/`step`, MoE `expert_counts`, FP8 `scales`, warp-spec role, ...)
  become `Ctx` fields and never change a hook signature. `Ctx` is also
  schedule-agnostic, so the same hooks serve a user-written schedule today and a
  library schedule skeleton later.
* Transport choice binds **per call-site** (ops as `constexpr`), which is what
  lets one kernel mix NVLink and IB once the IB ops/transport are filled in, with
  no change to `send`/`recv`, hooks, or `Ctx`.
* **CUDA-graph-ready (architecture).** The collective rendezvous + symm-mem
  allocation run at setup (outside capture); the captured region is pure kernel
  launches over persistent, user-held buffers. Verified on 2x H100 for the
  **Triton** path: send/recv captures and a single replay is correct (after a
  host-side signal-pad reset; kernels must be warmed up before capture). NOTE:
  repeated replay is not yet gated: the minimal single-shot signal (`seq=1`) is
  not re-armed, so guaranteed repeat-replay support lands with the
  monotonic-counter pipelined follow-up. (CuTe graph capture is a separate
  follow-up; see D107283879.)
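
To make the "single opaque `Ctx`" point concrete, the sketch below models the hook contract in plain host-side Python. It is an illustration under stated assumptions, not the PR's Triton code: the `Ctx` field names (`src`, `dst`, `offset`, `numel`, `scales`), the hooks `scaled_produce` and `accumulate_consume`, and the `run_leg` helper are all hypothetical. The transport step between the two hooks is replaced by a direct hand-off. Only the hook shapes `produce(ctx) -> regs` and `consume(ctx, regs)` are taken from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Regs = List[float]


@dataclass
class Ctx:
    """Host-side stand-in for the hook context; every field name is an assumption."""

    src: List[float]
    dst: List[float]
    offset: int
    numel: int
    # A field added later (e.g. FP8 scales) extends Ctx, not the hook signatures.
    scales: Optional[List[float]] = None


Produce = Callable[[Ctx], Regs]
Consume = Callable[[Ctx, Regs], None]


def copy_produce(ctx: Ctx) -> Regs:
    """Default send-side hook: read the tile as-is."""
    return ctx.src[ctx.offset : ctx.offset + ctx.numel]


def copy_consume(ctx: Ctx, regs: Regs) -> None:
    """Default receive-side hook: store the tile as-is."""
    ctx.dst[ctx.offset : ctx.offset + ctx.numel] = regs


def scaled_produce(ctx: Ctx) -> Regs:
    """Value-transform hook that reads the newer `scales` field if present."""
    scale = ctx.scales[0] if ctx.scales else 1.0
    return [v * scale for v in copy_produce(ctx)]


def accumulate_consume(ctx: Ctx, regs: Regs) -> None:
    """Accumulate hook: add into the destination instead of overwriting."""
    for i, v in enumerate(regs):
        ctx.dst[ctx.offset + i] += v


def run_leg(ctx: Ctx, produce: Produce, consume: Consume) -> None:
    """One leg with the transport replaced by a direct hand-off of regs."""
    consume(ctx, produce(ctx))
```

In this model, swapping `copy_produce` for `scaled_produce` changes what is sent without touching `run_leg` or the `Produce`/`Consume` types, which is the property the design note relies on.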

Follow-up stacks: (1) Triton — real pipelined send/recv + port
all_to_all_single; (2) CuTe — real CuTe send/recv on the same contract +
cross-DSL interop test; (3) IB + mixed transport.
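
The repeat-replay caveat from the CUDA-graph note, and the monotonic counter planned for the pipelined Triton follow-up, can be illustrated with a small host-side model. This is one plausible reading of the caveat, not the PR's implementation: `SignalSlot`, the `value >= seq` wait condition, and both round functions are assumptions made for the example.

```python
from typing import List


class SignalSlot:
    """Model of one signal-pad slot: the sender writes it, the receiver polls it."""

    def __init__(self) -> None:
        self.value = 0

    def signal(self, seq: int) -> None:
        self.value = seq

    def ready(self, seq: int) -> bool:
        # Assumed wait condition: the receiver proceeds once value >= seq.
        return self.value >= seq


def single_shot_rounds(rounds: int, reset_between: bool) -> List[bool]:
    """Replay a seq=1 exchange; report per round whether the wait was stale.

    "Stale" means the wait condition already held before the sender ran in
    that round, so a receiver could read data that was not yet written.
    """
    slot = SignalSlot()
    stale = []
    for _ in range(rounds):
        stale.append(slot.ready(1))  # checked before the sender signals
        slot.signal(1)
        if reset_between:
            slot.value = 0           # host-side signal-pad reset
    return stale


def monotonic_rounds(rounds: int) -> List[bool]:
    """Same exchange, but each round waits on a fresh, increasing sequence number."""
    slot = SignalSlot()
    stale = []
    for step in range(rounds):
        seq = step + 1
        stale.append(slot.ready(seq))  # an old value can never satisfy a new seq
        slot.signal(seq)
    return stale
```

In this model, `single_shot_rounds(3, reset_between=False)` reports a stale wait on every round after the first, while a reset between rounds or a monotonic sequence number removes it. That matches the summary's statement that a single replay is correct after a host-side reset and that repeated replay waits on the counter-based follow-up.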

Differential Revision: D107172780

meta-codesync · 2026-06-08T23:40:59Z

@ChenYuHo has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107172780.

…th) (meta-pytorch#2829) Summary:
Design doc: P2365280005
Code walk through: P2365344262

…torch#2858) Summary: Overall design doc (README) for the comms/dsl framework, landing first in the stack so the design is reviewed before the code. Covers the design principles (the framework owns the generic 95%: schedule, multi-peer addressing, signal/wait; the kernel owner writes the 5%: a per-tile hook + a transport; performance is autotuned, not hand-tuned; spectrum of control), plus two worked examples: 1) a custom collective (a2a non-contig) built on the framework in ~10 lines, and 2) the autotuner workflow that populates the optimal kernel config and re-tunes on shape changes with no kernel edits (target: ~5 min when TBD change their shapes/dimensions).
Reviewed By: cenzhaometa
Differential Revision: D108105252

meta-cla Bot added the CLA Signed label Jun 8, 2026

meta-codesync Bot added the meta-exported label Jun 8, 2026

meta-codesync Bot changed the title ~~Composable send/recv interface (transport + hooks + minimal NVLink path)~~ Composable send/recv interface (transport + hooks + minimal NVLink path) (#2829) Jun 8, 2026

ChenYuHo force-pushed the export-D107172780 branch from 1a57b88 to 74a4d4f Compare June 8, 2026 23:41

ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from 4f7337f to c09f92f Compare June 9, 2026 17:53

ChenYuHo force-pushed the export-D107172780 branch from c09f92f to f22b024 Compare June 9, 2026 17:54

ChenYuHo force-pushed the export-D107172780 branch from f22b024 to 46c244d Compare June 9, 2026 17:55

ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from 6a874e2 to 93e11a4 Compare June 9, 2026 18:47

ChenYuHo force-pushed the export-D107172780 branch from 93e11a4 to f87199d Compare June 9, 2026 18:48

ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from 71ab468 to d712f15 Compare June 11, 2026 21:36

ChenYuHo force-pushed the export-D107172780 branch from d712f15 to b562ac4 Compare June 11, 2026 21:37

ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from c7db6ed to aabc6f1 Compare June 16, 2026 00:23

ChenYuHo force-pushed the export-D107172780 branch from aabc6f1 to 6bd1698 Compare June 16, 2026 00:50

Elton Ho added 2 commits June 16, 2026 00:24

ChenYuHo force-pushed the export-D107172780 branch from 6bd1698 to 68637eb Compare June 16, 2026 07:25

ChenYuHo wants to merge 2 commits into meta-pytorch:main from ChenYuHo:export-D107172780
