Composable send/recv interface (transport + hooks + minimal NVLink path) (#2829)

Open
ChenYuHo wants to merge 2 commits into meta-pytorch:main from ChenYuHo:export-D107172780

@ChenYuHo ChenYuHo commented Jun 8, 2026

Summary:

This is the first, interface-only diff for the composable device send/recv framework that
fused compute/comm kernels (all-to-all, allgather, ...) will build on. No
production pipeline is wired; review overhead is intentionally low.

New package comms/dsl/:

  • ctx.py: Ctx, the DSL-agnostic field-set spec of the hook contract,
    realized on-device per DSL (Triton _aggregate in triton/ctx.py).
  • ops.py — the device transport-ops seam (put / get /
    signal / wait): the only transport-specific device primitives.
  • transport.py — user-owned p2p transport objects:
    • P2pTransport Protocol + PeerEndpoint (host-resolved per-peer device state),
    • real, minimal NvlTransport + nvl_rendezvous (one collective rendezvous;
      no hidden cache — the user holds the object),
    • reserved IbTransport / ib_rendezvous, and MeshTransport which routes
      per peer via link_kind so a single collective can mix NVLink (intra-domain)
      and IB (inter-domain) later,
    • check_transfer() — fail-loud guard that numel fits the per-peer staging
      region and num_blocks fits the signal-pad slots (a violation would silently
      overrun the next peer's region on the remote rank).
  • triton/ — minimal, real (no-pipeline, single-shot) device kernels send_tiles/recv_tiles written
    against the ops seam; hooks take a single Ctx aggregate
    (produce(ctx) -> regs, consume(ctx, regs)) — triton/ctx.py realizes Ctx
    as a Triton _aggregate. nvl_ops (real, over the framework's own self-contained PTX in
    triton/device_utils.py, no comms/pipes dep); ib_ops (reserved); copy_* default hooks; send/recv/
    sendrecv launchers + a commented mixed-transport sketch.
  • cute/: send/recv interface stubs reserved for the CuTe backend (made
    real in the next diff).
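
A minimal host-side sketch of the kind of guard check_transfer() describes, assuming a hypothetical StagingLayout that records per-peer capacities. The class, its field names, and the function signature are illustrative assumptions, not the API in transport.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagingLayout:
    """Hypothetical per-peer capacities; field names are assumptions."""
    staging_numel_per_peer: int  # elements reserved for each peer's staging region
    signal_slots_per_peer: int   # signal-pad slots reserved for each peer


def check_transfer(layout: StagingLayout, numel: int, num_blocks: int) -> None:
    """Raise before launch if a transfer would not fit its per-peer region."""
    if numel < 0 or num_blocks <= 0:
        raise ValueError(f"invalid transfer: numel={numel}, num_blocks={num_blocks}")
    if numel > layout.staging_numel_per_peer:
        # Writing past this bound would land in the next peer's staging region.
        raise ValueError(
            f"numel={numel} exceeds per-peer staging capacity "
            f"{layout.staging_numel_per_peer}"
        )
    if num_blocks > layout.signal_slots_per_peer:
        # Each block is assumed to use one signal slot.
        raise ValueError(
            f"num_blocks={num_blocks} exceeds signal-pad slots "
            f"{layout.signal_slots_per_peer}"
        )


layout = StagingLayout(staging_numel_per_peer=1024, signal_slots_per_peer=8)
check_transfer(layout, numel=1024, num_blocks=8)  # fits exactly, so no error
```

Failing on the host is the point of the guard: the overrun it prevents would otherwise happen silently on the remote rank.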

Design notes:

  • The hook seam is full-leg (one produce/consume per direction) — minimal
    yet sufficient: it subsumes address-gather, value-transform, packed staging,
    accumulate, and compute-into-send. The addr_fn+transform form will be a
    composer on top (follow-up).
  • Hooks take a single opaque Ctx, so the contract is churn-proof: future needs
    (pipeline slot/step, MoE expert_counts, FP8 scales, warp-spec role, ...)
    become Ctx fields and never change a hook signature. Ctx is also
    schedule-agnostic, so the same hooks serve a user-written schedule today and a
    library schedule skeleton later.
  • Transport choice binds per call-site (ops as constexpr), which is what
    lets one kernel mix NVLink and IB once the IB ops/transport are filled in — no
    change to send/recv, hooks, or Ctx.
  • CUDA-graph-ready (architecture). The collective rendezvous + symm-mem
    allocation run at setup (outside capture); the captured region is pure kernel
    launches over persistent, user-held buffers. Verified on 2x H100 for the
    Triton path: send/recv captures and a single replay is correct (after a
    host-side signal-pad reset; kernels must be warmed up before capture). NOTE:
    repeated replay is not yet gated — the minimal single-shot signal (seq=1) is
    not re-armed — so guaranteed repeat-replay support lands with the
    monotonic-counter pipelined follow-up. (CuTe graph capture is a separate
    follow-up — see D107283879.)
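
The hook contract in the notes above can be modelled on the host in plain Python. This is only a sketch under stated assumptions: the Ctx fields, the hook bodies, and run_leg are hypothetical stand-ins, the real Ctx is a Triton _aggregate, and the transport step that sits between produce and consume is elided.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ctx:
    """Host-side stand-in for the hook context; all field names are assumptions."""
    src: List[float]
    dst: List[float]
    offset: int
    numel: int
    scale: Optional[float] = None  # a later-added field; hook signatures are unchanged


Regs = List[float]
Produce = Callable[[Ctx], Regs]
Consume = Callable[[Ctx, Regs], None]


def copy_produce(ctx: Ctx) -> Regs:
    """Default send-side hook: load the tile as-is."""
    return ctx.src[ctx.offset:ctx.offset + ctx.numel]


def copy_consume(ctx: Ctx, regs: Regs) -> None:
    """Default receive-side hook: store the tile as-is."""
    ctx.dst[ctx.offset:ctx.offset + ctx.numel] = regs


def scaled_produce(ctx: Ctx) -> Regs:
    """Value-transform on the send side, reading the extra Ctx field."""
    return [x * ctx.scale for x in copy_produce(ctx)]


def accumulate_consume(ctx: Ctx, regs: Regs) -> None:
    """Accumulate into the destination instead of overwriting it."""
    for i, r in enumerate(regs):
        ctx.dst[ctx.offset + i] += r


def run_leg(ctx: Ctx, produce: Produce, consume: Consume) -> None:
    """Stand-in schedule: one produce, then one consume (transport elided)."""
    consume(ctx, produce(ctx))


ctx = Ctx(src=[1.0, 2.0, 3.0, 4.0], dst=[0.0] * 4, offset=1, numel=2)
run_leg(ctx, copy_produce, copy_consume)
print(ctx.dst)  # [0.0, 2.0, 3.0, 0.0]

ctx2 = Ctx(src=[1.0, 2.0, 3.0, 4.0], dst=[10.0] * 4, offset=0, numel=4, scale=0.5)
run_leg(ctx2, scaled_produce, accumulate_consume)
print(ctx2.dst)  # [10.5, 11.0, 11.5, 12.0]
```

Swapping hooks changes what a leg does without touching run_leg, and adding scale to Ctx did not change either hook signature.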

Follow-up stacks: (1) Triton — real pipelined send/recv + port
all_to_all_single; (2) CuTe — real CuTe send/recv on the same contract +
cross-DSL interop test; (3) IB + mixed transport.
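
The replay caveat in the design notes (the single-shot seq=1 signal is not re-armed) can be illustrated with a toy host-side model. This is a sketch, not the device code: SignalPad and both replay functions are hypothetical, and "gated" here only means that the receiver's wait could not have been satisfied by a value left over from an earlier replay.

```python
class SignalPad:
    """Toy one-slot signal pad; stands in for a flag the receiver polls."""

    def __init__(self) -> None:
        self.value = 0  # what the receiver reads
        self.step = 0   # per-replay counter, used only by the monotonic scheme

    def reset(self) -> None:
        self.value = 0


def single_shot_replay(pad: SignalPad) -> bool:
    """Sender signals seq=1; receiver waits for value == 1."""
    gated = pad.value != 1  # a stale 1 would let the wait pass immediately
    pad.value = 1
    return gated


def monotonic_replay(pad: SignalPad) -> bool:
    """Sender signals an increasing step; receiver waits for value >= step."""
    pad.step += 1
    gated = pad.value < pad.step  # older signals are always below the new step
    pad.value = pad.step
    return gated


pad = SignalPad()
print([single_shot_replay(pad) for _ in range(3)])  # [True, False, False]

pad = SignalPad()
results = []
for _ in range(3):
    pad.reset()  # host-side reset before every replay
    results.append(single_shot_replay(pad))
print(results)  # [True, True, True]

pad = SignalPad()
print([monotonic_replay(pad) for _ in range(3)])  # [True, True, True]
```

In this model the single-shot scheme is gated only on the first replay unless the pad is reset each time, while the monotonic counter stays gated without a reset. A real implementation would have to keep the counter somewhere the captured kernels can update it, which this sketch does not address.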

Differential Revision: D107172780

@meta-cla meta-cla Bot added the CLA Signed label Jun 8, 2026
meta-codesync Bot commented Jun 8, 2026

@ChenYuHo has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107172780.

@meta-codesync meta-codesync Bot changed the title from "Composable send/recv interface (transport + hooks + minimal NVLink path)" to "Composable send/recv interface (transport + hooks + minimal NVLink path) (#2829)" Jun 8, 2026
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 8, 2026
…th) (meta-pytorch#2829)

Summary:

Design doc: P2365280005

Code walk through: P2365344262

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from 1a57b88 to 74a4d4f Compare June 8, 2026 23:41
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 8, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from 4f7337f to c09f92f Compare June 9, 2026 17:53
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 9, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from c09f92f to f22b024 Compare June 9, 2026 17:54
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 9, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from f22b024 to 46c244d Compare June 9, 2026 17:55
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 9, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from 6a874e2 to 93e11a4 Compare June 9, 2026 18:47
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 9, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from 93e11a4 to f87199d Compare June 9, 2026 18:48
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 11, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from 71ab468 to d712f15 Compare June 11, 2026 21:36
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 11, 2026
…th) (meta-pytorch#2829)

ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 11, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from d712f15 to b562ac4 Compare June 11, 2026 21:37
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 11, 2026
…th) (meta-pytorch#2829)

ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch 2 times, most recently from c7db6ed to aabc6f1 Compare June 16, 2026 00:23
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
…th) (meta-pytorch#2829)

@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from aabc6f1 to 6bd1698 Compare June 16, 2026 00:50
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
…th) (meta-pytorch#2829)

Elton Ho added 2 commits June 16, 2026 00:24
…torch#2858)

Summary:

Overall design doc (README) for the comms/dsl framework, landing first in the stack so the
design is reviewed before the code. It covers the design principles: the framework owns the
generic 95% (schedule, multi-peer addressing, signal/wait) while the kernel owner writes the
remaining 5% (a per-tile hook + a transport); performance is autotuned, not hand-tuned; and
there is a spectrum of control. It also includes two worked examples:
1) a custom collective (a2a non-contig) built on the framework in ~10 lines, and
2) the autotuner workflow that populates the optimal kernel config and re-tunes on shape
   changes with no kernel edits (target: ~5 min when TBD change their shapes/dimensions).
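
One way to picture the re-tune-on-shape-change workflow from example 2 is a config cache keyed by shape. The sketch below is an assumption-laden illustration, not the README's autotuner: `ShapeAutotuner`, the `(block_size, num_blocks)` config tuple, and the toy cost function are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

Shape = Tuple[int, ...]
Config = Tuple[int, int]  # hypothetical (block_size, num_blocks)


class ShapeAutotuner:
    """Cache the best config per shape; a new shape triggers a new search."""

    def __init__(self, candidates: Iterable[Config],
                 bench: Callable[[Shape, Config], float]) -> None:
        self._candidates = list(candidates)
        self._bench = bench            # returns a cost; lower is better
        self._best: Dict[Shape, Config] = {}
        self.tune_count = 0            # number of searches actually run

    def config_for(self, shape: Shape) -> Config:
        if shape not in self._best:    # unseen shape: re-tune, no kernel edits
            self.tune_count += 1
            self._best[shape] = min(
                self._candidates, key=lambda cfg: self._bench(shape, cfg))
        return self._best[shape]


def toy_cost(shape: Shape, cfg: Config) -> float:
    """Stand-in for a real benchmark: prefer configs that cover numel exactly."""
    numel = 1
    for dim in shape:
        numel *= dim
    block_size, num_blocks = cfg
    return abs(block_size * num_blocks - numel)


tuner = ShapeAutotuner([(128, 4), (256, 8), (512, 16)], toy_cost)
print(tuner.config_for((2048,)))    # (256, 8)
print(tuner.config_for((2048,)))    # cached, no second search
print(tuner.config_for((64, 128)))  # (512, 16)
print(tuner.tune_count)             # 2
```

A real tuner would replace `toy_cost` with timed kernel launches; the cache structure is the only point being illustrated here.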

Reviewed By: cenzhaometa

Differential Revision: D108105252
…th) (meta-pytorch#2829)

Summary:

First, interface-only diff for the composable device send/recv framework that
fused compute/comm kernels (all-to-all, allgather, ...) will build on. No
production pipeline is wired; review overhead is intentionally low.

New package `comms/dsl/`:

* `ctx.py` — `Ctx`, the DSL-agnostic field-set spec of the hook contract,
  realized on-device per DSL (Triton `_aggregate` in `triton/ctx.py`).
* `ops.py` — the device transport-ops seam (`put` / `get` /
  `signal` / `wait`): the only transport-specific device primitives.
* `transport.py` — user-owned p2p transport objects:
  - `P2pTransport` Protocol + `PeerEndpoint` (host-resolved per-peer device state),
  - real, minimal `NvlTransport` + `nvl_rendezvous` (one collective rendezvous;
    no hidden cache — the user holds the object),
  - reserved `IbTransport` / `ib_rendezvous`, and `MeshTransport` which routes
    per peer via `link_kind` so a single collective can mix NVLink (intra-domain)
    and IB (inter-domain) later,
  - `check_transfer()` — fail-loud guard that numel fits the per-peer staging
    region and num_blocks fits the signal-pad slots (a violation would silently
    overrun the next peer's region on the remote rank).
* `triton/` — minimal, real (no-pipeline, single-shot) device kernels `send_tiles`/`recv_tiles` written
  against the ops seam; hooks take a single `Ctx` aggregate
  (`produce(ctx) -> regs`, `consume(ctx, regs)`) — `triton/ctx.py` realizes `Ctx`
  as a Triton `_aggregate`. `nvl_ops` (real, over the framework's own self-contained PTX in
  `triton/device_utils.py`, no `comms/pipes` dep); `ib_ops` (reserved); `copy_*` default hooks; `send`/`recv`/
  `sendrecv` launchers + a commented mixed-transport sketch.
* `cute/` — `send`/`recv` interface stubs reserved for the CuTe backend (made
  real in the next diff).
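
The `check_transfer()` guard described in the list above can be sketched on the host as two bounds checks. This is a minimal sketch, assuming a hypothetical `PeerRegion` record and argument order; the real `PeerEndpoint` fields and the actual signature in `transport.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerRegion:
    """Hypothetical per-peer capacities (not the real PeerEndpoint layout)."""
    staging_numel: int  # elements available in this peer's staging region
    signal_slots: int   # signal-pad slots reserved for this peer


def check_transfer(region: PeerRegion, numel: int, num_blocks: int) -> None:
    """Fail loudly instead of letting a transfer overrun the next peer's region."""
    if numel < 0 or num_blocks <= 0:
        raise ValueError(
            f"invalid transfer: numel={numel}, num_blocks={num_blocks}")
    if numel > region.staging_numel:
        raise ValueError(
            f"numel={numel} exceeds per-peer staging capacity "
            f"{region.staging_numel}")
    if num_blocks > region.signal_slots:
        raise ValueError(
            f"num_blocks={num_blocks} exceeds signal-pad slots "
            f"{region.signal_slots}")


region = PeerRegion(staging_numel=1024, signal_slots=8)
check_transfer(region, numel=1024, num_blocks=8)  # fits exactly: no error

try:
    check_transfer(region, numel=1025, num_blocks=8)
except ValueError as err:
    print("rejected:", err)
```

Running the check on the host before launch keeps the failure visible at the call site, which is the point of a fail-loud guard.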

Design notes:

* The hook seam is **full-leg** (one `produce`/`consume` per direction) — minimal
  yet sufficient: it subsumes address-gather, value-transform, packed staging,
  accumulate, and compute-into-send. The `addr_fn`+`transform` form will be a
  composer on top (follow-up).
* Hooks take a single opaque `Ctx`, so the contract is churn-proof: future needs
  (pipeline `slot`/`step`, MoE `expert_counts`, FP8 `scales`, warp-spec role, ...)
  become `Ctx` fields and never change a hook signature. `Ctx` is also
  schedule-agnostic, so the same hooks serve a user-written schedule today and a
  library schedule skeleton later.
* Transport choice binds **per call-site** (ops as `constexpr`), which is what
  lets one kernel mix NVLink and IB once the IB ops/transport are filled in — no
  change to `send`/`recv`, hooks, or `Ctx`.
* **CUDA-graph-ready (architecture).** The collective rendezvous + symm-mem
  allocation run at setup (outside capture); the captured region is pure kernel
  launches over persistent, user-held buffers. Verified on 2x H100 for the
  **Triton** path: send/recv captures and a single replay is correct (after a
  host-side signal-pad reset; kernels must be warmed up before capture). NOTE:
  repeated replay is not yet gated — the minimal single-shot signal (`seq=1`) is
  not re-armed — so guaranteed repeat-replay support lands with the
  monotonic-counter pipelined follow-up. (CuTe graph capture is a separate
  follow-up — see D107283879.)
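
The hook contract from the design notes can be modelled on the host to show why a single `Ctx` argument keeps signatures stable. This is a plain-Python stand-in, not the Triton `_aggregate`: the `Ctx` field names, the `transfer` driver, and the `scales` field are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ctx:
    """Host-side stand-in for the hook context; field names are illustrative."""
    src: List[float]
    dst: List[float]
    offset: int
    tile: int
    scales: Optional[List[float]] = None  # added later; hook signatures unchanged


def copy_produce(ctx: Ctx) -> List[float]:
    """Default send hook: load one tile from the source."""
    return ctx.src[ctx.offset:ctx.offset + ctx.tile]


def scaled_produce(ctx: Ctx) -> List[float]:
    """Compute-into-send: scale the tile using a field the default hook ignores."""
    scale = ctx.scales[ctx.offset // ctx.tile]
    return [value * scale for value in copy_produce(ctx)]


def copy_consume(ctx: Ctx, regs: List[float]) -> None:
    """Default recv hook: store the received tile into the destination."""
    ctx.dst[ctx.offset:ctx.offset + len(regs)] = regs


def transfer(src: List[float], dst: List[float], tile: int,
             produce: Callable[[Ctx], List[float]] = copy_produce,
             consume: Callable[[Ctx, List[float]], None] = copy_consume,
             scales: Optional[List[float]] = None) -> List[float]:
    staging = [0.0] * len(src)  # stand-in for the peer's staging region
    for offset in range(0, len(src), tile):
        ctx = Ctx(src=src, dst=dst, offset=offset, tile=tile, scales=scales)
        regs = produce(ctx)                                # send leg
        staging[offset:offset + len(regs)] = regs          # stand-in for a put
        consume(ctx, staging[offset:offset + len(regs)])   # recv leg
    return dst


print(transfer([1.0, 2.0, 3.0, 4.0], [0.0] * 4, tile=2))
# [1.0, 2.0, 3.0, 4.0]
print(transfer([1.0, 2.0, 3.0, 4.0], [0.0] * 4, tile=2,
               produce=scaled_produce, scales=[10.0, 100.0]))
# [10.0, 20.0, 300.0, 400.0]
```

Adding `scales` touched only `Ctx` and the hook that reads it; `transfer`, `copy_produce`, and `copy_consume` keep the same signatures.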

Follow-up stacks: (1) Triton — real pipelined `send`/`recv` + port
`all_to_all_single`; (2) CuTe — real CuTe `send`/`recv` on the same contract +
cross-DSL interop test; (3) IB + mixed transport.

Differential Revision: D107172780
@ChenYuHo ChenYuHo force-pushed the export-D107172780 branch from 6bd1698 to 68637eb Compare June 16, 2026 07:25
ChenYuHo pushed a commit to ChenYuHo/torchcomms that referenced this pull request Jun 16, 2026
…th) (meta-pytorch#2829)

Labels

CLA Signed, meta-exported
