feat(sparse): FEM mixin + DSparseMatrix + @distributed assembler + Condenser dispatch#19

Open
walkerchi wants to merge 10 commits into dev/core from feat/fem-mixin-dsparsematrix-v2

@walkerchi

Collaborator

Replaces the closed #18, now with DCO sign-off on every commit. The content is unchanged and no force-push was used: the old branch was rebased onto a fresh branch with git rebase --signoff and pushed clean.

Summary

  • _FEMSparsityMixin with sequence-identity layout_signature (data_ptr + _version + numel + shape). Zero GPU sync, zero data movement, no autograd graph break, correct semantics for the Condenser/AMG position-indexed cache.
  • SparseMatrix uses the mixin. The eager SHA-256 + .cpu().tobytes() in __init__ is gone. layout_hash retained as a DeprecationWarning alias.
  • DSparseMatrix (new) — composition over torch_sla.distributed.DSparseTensor. Carries a 63-bit partition UUID generated on rank 0 with uuid.uuid4 and broadcast through the process group at construction.
  • @distributed class decorator — wraps any Assembler subclass.
  • Condenser — polymorphic dispatch; DSparseMatrix round-trips via .to_single() + re-partition (interim contract; per-rank distributed Condenser is a follow-up PR).

Bug fixes uncovered while making CI green

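This PR depends on three things in torch-sla that did not exist in the 0.2.1 PyPI release:

  • DSparseTensor.partition() classmethod (PR #31's split + #66's DSparseTensor mirror surface)
  • gather_owned_to_global helper (PR #41)
  • SparseTensor.requires_grad_ method delegation

Two packaging / regression issues were also fixed on the torch-sla side (commits on the same feat/gather-owned-to-global branch):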

  • pyproject.toml had a hardcoded packages = ["torch_sla", "torch_sla.backends"] which silently dropped torch_sla.distributed and torch_sla.sparse_tensor from wheels built via pip install git+.... Switched to packages.find include = ["torch_sla*"].
  • Restored SparseTensor.requires_grad_ delegation in core.py next to the other ops-shims.

CI workflow now installs the dev branch over the PyPI release. Once torch-sla cuts 0.3.0, that line can go away and the pyproject pin can bump to >=0.3.0.

Test plan

  • Single-device unit: tests/sparse/test_layout_signature.py (6 cases)
  • Multi-proc Gloo CPU: tests/sparse/test_dsparsematrix.py (5), tests/distributed/test_distributed_assembler_decorator.py, tests/operator/test_condenser_dispatch.py, existing tests/distributed/test_distributed_assemble.py (16) — all pass locally
  • Multi-proc NCCL + CUDA on 2× A100 (autodl): all 6 new worker tests PASS; @distributed(LaplaceElementAssembler) end-to-end max_diff = 1.78e-15 (within a small multiple of fp64 machine epsilon, about 2.2e-16)
  • Full local pytest sweep (clean Python 3.14 venv simulating CI install order): 295 pass, 13 skipped, 1 pre-existing flake (test_quad.py::test_sum[3] — fp32 epsilon on Python 3.14 only; fails identically on main, CI runs 3.10/3.11/3.12)
  • DCO sign-off: all 10 commits carry Signed-off-by: walkerchi <walker.chi.000@gmail.com>

Follow-ups (not in this PR)

  • Native per-rank distributed Condenser (no to_single round-trip)
  • Sweep callers to layout_signature directly; remove layout_hash alias in next minor
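The deprecated-alias arrangement can be sketched as follows. The class `Matrix` is a stand-in, and exposing both names as properties is an assumption; the PR only states that layout_hash routes to layout_signature and emits a DeprecationWarning.

```python
import warnings


class Matrix:
    """Stand-in class; only the aliasing pattern is the point."""

    def __init__(self, signature):
        self._signature = signature

    @property
    def layout_signature(self):
        return self._signature

    @property
    def layout_hash(self):
        # Legacy name: warn, then forward to the new attribute.
        warnings.warn(
            "layout_hash is deprecated; use layout_signature",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.layout_signature


m = Matrix(("ptr", 0, 4, (4,)))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = m.layout_hash
```

Old callers keep getting the same value, and the warning gives them one release cycle to switch before the alias is removed.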

🤖 Generated with Claude Code

walkerchi and others added 10 commits June 16, 2026 01:50
torch-sla PR #31 removed the legacy DSparseTensor(values, row, col,
shape=..., num_partitions=..., coords=..., partition_method=...) ctor
in favour of the classmethod constructors (partition / from_local /
from_sparse_local). The old single-process simulator path is gone.

distributed_element_assemble was calling the dead ctor, so every
distributed assembly path errored out with TypeError on import.

Switched to ``DSparseTensor.partition(A_global, mesh, partition_method,
coords)`` via a small ``_build_dsparse`` helper:

  * torch.distributed initialised -> real distributed-shard
    DSparseTensor on the live world (one rank, one row-shard).
  * Not initialised (single-process driver + multi-thread assembly,
    or unit tests) -> mesh=None gives world=1; the DSparseTensor
    holds the entire global matrix locally. matvec / solve still
    compose with the rest of torch-sla.

Added test_multiproc_distributed_assemble_matvec_2procs: spawns 2
gloo ranks, each runs TensorMesh distributed assembly, builds a
DSparseTensor, runs matvec in Shard(0) space (x sliced to owned),
allgathers the result, compares to reference single-mesh assembly.
Rel err 0 vs reference on a triangle rectangle mesh.
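The shape of that test can be illustrated with a single-process simulation in plain Python. This is a sketch only: the two "ranks" are dictionary entries, reading the full `x` stands in for the halo exchange a real distributed matvec performs, and concatenation stands in for the allgather.

```python
def matvec_rows(A, x, rows):
    """Compute y[i] = sum_j A[i][j] * x[j] for the given row indices only."""
    return [sum(a * b for a, b in zip(A[i], x)) for i in rows]


A = [
    [2.0, -1.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 2.0],
]
x = [1.0, 2.0, 3.0, 4.0]

# Two simulated ranks, each owning a contiguous block of rows.
owned = {0: [0, 1], 1: [2, 3]}

# Each "rank" computes only the rows it owns.
y_owned = {rank: matvec_rows(A, x, rows) for rank, rows in owned.items()}

# Simulated allgather: concatenate the owned slices in rank order.
y_gathered = [v for rank in sorted(y_owned) for v in y_owned[rank]]

# Single-process reference over all rows.
y_reference = matvec_rows(A, x, range(len(A)))
```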

All 11 distributed tests pass.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

…t tests

The original PR only had a multi-proc *matvec* test. Now adds three
real end-to-end tests that exercise the full distributed-solve path
TensorMesh -> DSparseTensor -> cg_shard -> allgather -> compare:

  * test_multiproc_distributed_solve_2procs: assemble Mass matrix on
    2 gloo ranks, build b = M @ x_ref from a single-process reference,
    run distributed CG, allgather x, verify x_dist == x_ref to 1e-6.
    Mass instead of Laplace because Laplace has a constant null space
    that distributed CG drifts along (separate issue).

  * test_multiproc_distributed_solve_4procs: same with world=4 to
    exercise more partitions + more halo edges.

  * test_multiproc_poisson_dirichlet_2procs: full physics path on
    2 ranks (assemble + Condenser for Dirichlet BCs + single-process
    SparseMatrix solve via distributed_element_assemble_to_sparse,
    compare to single-process reference). The to_sparse path returns
    a tensormesh SparseMatrix, not a DSparseTensor -- this test
    verifies the distributed assembly produces the same global matrix
    as single-process even under multi-process orchestration.
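The null-space remark can be checked on a small 1D analogue. The sketch assumes linear elements on a uniform 1D grid with no boundary conditions applied, which is an illustrative choice rather than the triangle mesh the tests use. The stiffness matrix maps the constant vector to zero, so it is singular; the mass matrix does not.

```python
def assemble_1d(n_elems, elem):
    """Assemble a dense (n_elems + 1)^2 matrix from one 2x2 element matrix."""
    n = n_elems + 1
    A = [[0.0] * n for _ in range(n)]
    for e in range(n_elems):
        for a in range(2):
            for b in range(2):
                A[e + a][e + b] += elem[a][b]
    return A


n_elems, h = 4, 0.25

# Element stiffness (Laplace) and element mass matrices for linear elements.
K = assemble_1d(n_elems, [[1 / h, -1 / h], [-1 / h, 1 / h]])
M = assemble_1d(n_elems, [[h / 3, h / 6], [h / 6, h / 3]])

# Multiplying by the all-ones vector is the same as taking row sums.
K_times_ones = [sum(row) for row in K]
M_times_ones = [sum(row) for row in M]
```

Every entry of `K_times_ones` is zero, while the entries of `M_times_ones` are positive and sum to the domain length, so a solve against the mass matrix has a unique solution to compare against.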

Implementation note: the solve worker calls cg_shard directly (raw
tensor in / out) instead of the higher-level torch_sla.solve(D, b_dt)
wrapper. The wrapper expects DTensor[Shard(0)] for b; building one
from a manually-sliced owned tensor via DTensor.from_local works but
adds DTensor wrapping overhead unrelated to the test goal.

14/14 distributed tests pass.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

The previous test_multiproc_poisson_dirichlet_2procs only exercised
distributed_element_assemble_to_sparse which returns a SparseMatrix
(single-process) -- never touched the actual DSparseTensor distributed
solve path. Adds two tests that close that gap:

  test_multiproc_poisson_dirichlet_dsparse_2procs
  test_multiproc_poisson_dirichlet_dsparse_4procs

Pipeline per rank:
  1. TensorMesh distributed assembly -> global Laplace K + load f.
  2. Single-process Condenser strips Dirichlet DOFs -> K_inner, f_inner
     (cheap at this size; the partitioning happens after).
  3. Wrap K_inner as torch_sla.SparseTensor + DSparseTensor.partition
     (partition_method=metis, since Condenser breaks the 1:1 mesh-point
     mapping that RCB/coordinate methods would need).
  4. Slice f_inner to this rank's owned rows.
  5. Run distributed CG via cg_shard with M_apply=identity.
  6. Allgather x_owned -> u_inner_global -> condenser.recover() -> full u.
  7. Compare to single-process reference u: rel-err < 1e-6 required.
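Steps 2, 5 and 6 can be sketched in one process with plain Python lists. This is an illustration under assumptions, not the Condenser or cg_shard implementation: the problem is a 1D Laplace system with u(0) = 0 and u(1) = 1, the matrices are dense, and `cg` is a textbook unpreconditioned conjugate gradient.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg(A, b, tol=1e-12, max_iter=100):
    """Unpreconditioned conjugate gradient (identity preconditioner)."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A @ x for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# 1D Laplace stiffness matrix: 4 linear elements of length h = 0.25.
K = [
    [4.0, -4.0, 0.0, 0.0, 0.0],
    [-4.0, 8.0, -4.0, 0.0, 0.0],
    [0.0, -4.0, 8.0, -4.0, 0.0],
    [0.0, 0.0, -4.0, 8.0, -4.0],
    [0.0, 0.0, 0.0, -4.0, 4.0],
]
f = [0.0] * 5                 # zero load
dirichlet = {0: 0.0, 4: 1.0}  # prescribed values at the two end nodes
inner = [i for i in range(5) if i not in dirichlet]

# Condense: keep inner rows/cols, move known boundary values to the RHS.
K_inner = [[K[i][j] for j in inner] for i in inner]
f_inner = [f[i] - sum(K[i][j] * u for j, u in dirichlet.items()) for i in inner]

u_inner = cg(K_inner, f_inner)

# Recover: scatter inner solution and boundary values into the full vector.
u = [0.0] * 5
for i, val in zip(inner, u_inner):
    u[i] = val
for j, val in dirichlet.items():
    u[j] = val
```

With zero load the exact solution is the straight line u(x) = x, so the recovered vector should match the node coordinates 0, 0.25, 0.5, 0.75, 1 up to rounding.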

This is the actual integration deliverable: real PDE, real distributed
matrix, real distributed solver, end-to-end matches single-process.

16/16 distributed tests pass on CPU (2 new + 14 existing).

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

Adds distributed_element_assemble_per_rank: each rank assembles ONLY
its own submesh (no threading across other ranks' submeshes), then
all_to_all_single ships row-shared contributions to the row-owning
rank. Total compute is roughly the single-process work split N ways.
The old pseudo-distributed path instead has every rank assemble all N
submeshes and then partition the result: O(N) submesh assemblies per
rank, O(N^2) across the cluster.

Pipeline:
  1. Compute partition_ids deterministically (lowest rank wins for
     shared boundary nodes). No communication.
  2. Each rank: assemble own submesh -> local COO in global coords.
  3. Route by destination rank (partition_ids[g_row]); all_to_all_single
     exchange (NCCL on CUDA, gloo on CPU).
  4. Coalesce incoming triples (duplicate (g_row, g_col) sum).
  5. Discover halo columns from coalesced pattern, build Partition,
     extract local SparseTensor in local coords, wrap as DSparseTensor
     via from_sparse_local.
  6. On NCCL backend, .cuda() the result so callers don't have to.
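Steps 3 and 4 can be sketched as a single-process simulation. The data, the 6-row partition, and the helper names `route` and `coalesce` are hypothetical; list shuffling stands in for all_to_all_single, and a dictionary stands in for sparse coalescing.

```python
from collections import defaultdict

# partition_ids[g] = rank that owns global row g (hypothetical 6-row example).
partition_ids = [0, 0, 0, 1, 1, 1]
world_size = 2

# Local COO triples (g_row, g_col, value) from each rank's submesh.
# Rows 2 and 3 sit on the interface, so both ranks contribute to them.
local_coo = {
    0: [(0, 0, 1.0), (2, 2, 0.5), (3, 3, 0.25), (2, 3, -0.5)],
    1: [(2, 2, 0.5), (3, 3, 0.75), (5, 5, 1.0), (2, 3, -0.5)],
}


def route(triples):
    """Bucket triples by the rank that owns their row."""
    send = [[] for _ in range(world_size)]
    for g_row, g_col, val in triples:
        send[partition_ids[g_row]].append((g_row, g_col, val))
    return send


send_buffers = {rank: route(t) for rank, t in local_coo.items()}

# Simulated all-to-all: rank d receives send_buffers[s][d] from every rank s.
received = {
    d: [t for s in range(world_size) for t in send_buffers[s][d]]
    for d in range(world_size)
}


def coalesce(triples):
    """Sum values that share the same (g_row, g_col) position."""
    acc = defaultdict(float)
    for g_row, g_col, val in triples:
        acc[(g_row, g_col)] += val
    return dict(sorted(acc.items()))


owned = {rank: coalesce(t) for rank, t in received.items()}
```

After the exchange each rank holds only rows it owns, and interface entries that were assembled in halves on two ranks have been summed into one value.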

Verified on autodl 2x A100 (NCCL):
  correctness: matvec match rel-err 1.7e-16 (fp64 eps)
  end-to-end Poisson-Dirichlet via DSparseTensor: rel-err 2.3e-16
  speed: 1.76x per-rank speedup on small mesh (n=510)

Existing distributed_element_assemble (pseudo-distributed) is kept
as a back-compat default; users opt into the new path explicitly.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

`mesh.partition(num_parts=4)` is now equivalent to
`DistributedMesh(mesh, num_partitions=4)` -- thin wrapper, no
algorithm change. The point is API cleanliness:

  mesh = tm.Mesh.gen_rectangle(...)
  dmesh = mesh.partition(num_parts=4, method='coordinate')

Pair with distributed_element_assemble_per_rank inside a
torch.distributed context for **truly distributed** assembly (each
rank only does its own submesh). The single-process pseudo-distributed
path via distributed_element_assemble still works with the same dmesh.

Verified on autodl 2x A100 NCCL:
  - mesh.partition() output equivalent to DistributedMesh() (same n_points,
    num_parts, and orig_nid mapping per submesh)
  - mesh.partition() + per_rank assembly + matvec match vs reference K:
    rel-err 1.7e-16 (fp64 machine eps)

Existing DistributedMesh stays as the underlying class -- nothing to
migrate. Just gives users a more Pythonic entry point.

Algorithm consolidation (moving partition to torch-sla so we get hilbert/
patoh for free) is intentionally NOT in this PR. It is roughly 3.5 days
of real work, including element-vs-node ownership rules and high-order
edge cases. Tracked separately.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

The two end-to-end distributed solve tests called the internal
cg_shard primitive directly. Tests represent the user surface, so they
should go through the public torch_sla.solve API with iterative
defaults scoped via SolverConfig. Same dispatch path under the hood
(SolverConfig -> solve -> cg_shard for DSparseTensor+CG) but the
example now reads like the documented usage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>

…atter

Three sites in test_distributed_assemble.py each open-coded the same
pad -> 3x dist.all_gather -> Python for-loop scatter pattern (~14 lines
each). The torch-sla side just landed a vectorised public helper
(sparsexlab/torch-sla#41 -- gather_owned_to_global with one index_put_
over an all_gather_into_tensor buffer), so swap each site to a single
call.

-44 lines, +6 lines, same semantics.
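The pad, gather, scatter pattern being replaced can be simulated in one process. This sketch does not reproduce the signature of the real gather_owned_to_global helper, which is not shown here; `gather_owned_sim`, the padding sentinel, and the list-of-lists input are assumptions for illustration.

```python
def gather_owned_sim(owned_ids, owned_vals, n_global):
    """Simulate pad -> all_gather -> scatter for ranks held as plain lists."""
    pad = -1  # sentinel marking padded slots
    max_len = max(len(ids) for ids in owned_ids)

    # Pad every rank's slice to a common length and lay the slices out
    # back to back, as a flat all-gather buffer would be.
    ids_buf, vals_buf = [], []
    for ids, vals in zip(owned_ids, owned_vals):
        fill = max_len - len(ids)
        ids_buf.extend(list(ids) + [pad] * fill)
        vals_buf.extend(list(vals) + [0.0] * fill)

    # Scatter every non-padded entry to its global position.
    out = [0.0] * n_global
    for g, v in zip(ids_buf, vals_buf):
        if g != pad:
            out[g] = v
    return out


# Rank 0 owns the even rows, rank 1 the odd rows.
x_global = gather_owned_sim(
    owned_ids=[[0, 2, 4], [1, 3]],
    owned_vals=[[10.0, 12.0, 14.0], [11.0, 13.0]],
    n_global=5,
)
```

In a tensor implementation the final loop would be a single masked indexed write over the gathered buffer, which is the vectorisation the commit message describes.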

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>

…er, Condenser dispatch

Five-part refactor that closes out the design review on
SparseMatrix / DSparseMatrix / Assembler / Condenser:

* sparse/mixin.py (new) -- _FEMSparsityMixin with sequence-identity
  layout_signature. Zero GPU sync, zero data movement, no autograd
  graph break. Composes (data_ptr, _version, numel, shape) so two
  matrices sharing row/col tensor refs hit the cache, in-place
  modification (PyTorch _version bump) invalidates correctly, and
  cloned-but-content-equal matrices intentionally miss (the cached
  FEM buffers index by COO position, so set-identity would silently
  corrupt them).
* sparse/matrix.py -- SparseMatrix now mixes in _FEMSparsityMixin.
  Drops the eager SHA-256 hashlib call in __init__ and the GPU-to-CPU
  byte copy it required. layout_hash kept as a deprecated alias that
  routes to layout_signature; old callers continue to function with
  a DeprecationWarning.
* sparse/dmatrix.py (new) -- DSparseMatrix via composition over
  torch_sla.distributed.DSparseTensor. Carries a 63-bit partition
  UUID, generated on rank 0 with uuid.uuid4 and broadcast through
  the active process group at construction. The UUID enters
  layout_signature so two DSparseMatrix instances on the same
  partition share caches even if local layouts coincidentally match.
  Arithmetic / device methods propagate the parent UUID. to_single()
  bridges to single-device SparseMatrix for callers (like Condenser)
  that have not yet learned the distributed form.
* distributed/assembler.py (new) -- @distributed class decorator.
  Wraps any ElementAssembler / NodeAssembler subclass by overriding
  from_mesh (accept DistributedMesh) and __call__ (run per-rank
  submesh assembly and return DSparseMatrix or per-rank tensor).
  Avoids defining a parallel DAssembler hierarchy; covers third-party
  custom assemblers automatically.
* operator/condense.py -- Condenser now dispatches: SparseMatrix
  takes the original path (now using layout_signature instead of the
  retired SHA-256 hash), DSparseMatrix routes through _call_distributed
  which round-trips via .to_single() + re-partition. The round-trip
  is documented as the interim contract; the per-rank distributed
  Condenser is a follow-up PR.
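A minimal sketch of an identity-based signature with the cache behaviour described for the mixin. The free function and the exact tuple layout (one triple per index tensor plus the shape) are assumptions; the commit only names the four ingredients.

```python
import torch


def layout_signature(row: torch.Tensor, col: torch.Tensor, shape) -> tuple:
    """Identity-based key: reads tensor metadata only, never tensor contents."""

    def ident(t: torch.Tensor) -> tuple:
        # data_ptr: storage address, _version: in-place modification counter.
        return (t.data_ptr(), t._version, t.numel())

    return (ident(row), ident(col), tuple(shape))


row = torch.tensor([0, 0, 1, 2])
col = torch.tensor([0, 1, 1, 2])
shape = (3, 3)

sig_a = layout_signature(row, col, shape)

# Same tensor objects: the signature matches, so a cache keyed on it hits.
sig_shared = layout_signature(row, col, shape)

# Equal content but fresh storage: intentionally a different signature.
row_copy, col_copy = row.clone(), col.clone()
sig_clone = layout_signature(row_copy, col_copy, shape)

# An in-place op bumps _version even when the values do not change.
row.add_(0)
sig_after_inplace = layout_signature(row, col, shape)
```

Because the key is a tuple of Python integers, it is hashable and can be used directly as a dictionary key for cached buffers.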

Tests
* tests/sparse/test_layout_signature.py -- exercises shared-storage
  hit, different-storage miss, in-place version invalidation,
  hashable dict-key usage, deprecation warning on legacy layout_hash,
  signature-tuple acceptance.
* Existing tests/distributed/test_distributed_assemble.py still all
  pass (16/16, Gloo, world=2 and world=4) verifying the Condenser
  layout_hash -> layout_signature switch is non-breaking.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>

…Condenser

Adds the multi-process coverage flagged in review:

- tests/sparse/test_dsparsematrix.py -- UUID broadcast consistency
  across ranks, layout_signature includes the UUID, arithmetic
  preserves it, to_single() round-trip, matvec delegation.
- tests/distributed/test_distributed_assembler_decorator.py --
  @distributed(LaplaceElementAssembler).from_mesh(dmesh)() end-to-
  end vs the single-device reference (matvec equality via
  gather_owned_to_global).
- tests/operator/test_condenser_dispatch.py -- SparseMatrix path
  still works after the layout_signature switch; DSparseMatrix path
  routes through _call_distributed with the documented warning, and
  emits a DSparseMatrix back at the same shape.
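The class-decorator approach can be sketched with a toy assembler. Everything here is hypothetical: `ToyAssembler` stands in for an Assembler subclass, a list of lists stands in for DistributedMesh, and the explicit `rank` argument replaces the rank a real process group would supply.

```python
class ToyAssembler:
    """Stand-in for an Assembler subclass; 'assembles' by summing a mesh."""

    def __init__(self, mesh):
        self.mesh = mesh

    @classmethod
    def from_mesh(cls, mesh):
        return cls(mesh)

    def __call__(self):
        return sum(self.mesh)


def distributed(base):
    """Return a subclass of `base` that works on one rank's submesh."""

    class Wrapped(base):
        @classmethod
        def from_mesh(cls, dmesh, rank):
            # Build the base assembler on this rank's submesh only.
            obj = super().from_mesh(dmesh[rank])
            obj.rank = rank
            return obj

        def __call__(self):
            # Reuse the base assembly logic, then tag the result by rank.
            return {"rank": self.rank, "local": super().__call__()}

    Wrapped.__name__ = Wrapped.__qualname__ = f"Distributed{base.__name__}"
    return Wrapped


DistToy = distributed(ToyAssembler)
dmesh = [[1, 2], [3, 4, 5]]  # two submeshes
results = [DistToy.from_mesh(dmesh, rank=r)() for r in range(2)]
```

Because the wrapper subclasses whatever class it is given, a third-party assembler gets the distributed behaviour without a parallel class hierarchy.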

Two fixes uncovered during NCCL+CUDA verification on 2x A100:

- operator/condense.py: torch.distributed.tensor.DTensor moved in
  torch >= 2.2; fall back to torch.distributed._tensor (torch
  2.0-2.1) so the round-trip path works on older runtimes.
- operator/condense.py: Condenser stores dirichlet_mask on its
  construction device (typically CPU). In the DSparseMatrix path,
  to_single() returns a CUDA SparseMatrix; the cached path's
  is_inner_dof[edge_u] then trips a cross-device index. Force CPU
  through the single-device round-trip and move the result back to
  the input's device before re-partitioning.
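The import fallback in the first fix follows a common pattern, sketched here with the standard library so it runs without torch. `import_first` is a hypothetical helper, not a function from this PR.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that imports successfully."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also catches ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {module_names} could be imported") from last_error


# Demonstration with a first name that is deliberately missing.
mod = import_first("module_that_does_not_exist_xyz", "json")
```

Applied to the case in the commit, the call would try `torch.distributed.tensor` first and `torch.distributed._tensor` second, then read `DTensor` from whichever module was returned.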

Verified on autodl 2x A100 NCCL: all 6 worker tests PASS, including
end-to-end @distributed-decorated assembly with max_diff=1.78e-15
(within a small multiple of fp64 machine epsilon).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>

This PR depends on three things added in torch-sla after the 0.2.1
PyPI release:
  * DSparseTensor.partition() classmethod (PR #31's split + #66's
    DSparseTensor mirror surface)
  * gather_owned_to_global helper (PR #41)
  * SparseTensor.requires_grad_ method delegation (fixed on the same
    branch alongside a pyproject.toml ``packages = find`` fix)

CI used to pull torch-sla from PyPI via ``pip install -e ".[test]"``,
which pulled in 0.2.1 and broke every test that uses
``DSparseTensor.partition`` with AttributeError. CI now force-reinstalls
the ``feat/gather-owned-to-global`` branch over the PyPI release after
the editable install, so tests run against the dev API surface this PR
needs.

Once torch-sla cuts 0.3.0, remove the --force-reinstall line and bump
the pyproject pin to ``torch-sla>=0.3.0``.

Verified locally in a clean venv (Python 3.14, CPU torch): 295 passed,
13 skipped, 1 pre-existing flake (test_quad.py::test_sum[3] on Python
3.14 only -- main fails the same way; CI runs 3.10/3.11/3.12 where it
passes).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>