feat(sparse): FEM mixin + DSparseMatrix + @distributed assembler + Condenser dispatch by walkerchi · Pull Request #19 · camlab-ethz/TensorMesh

walkerchi · 2026-06-15T17:51:16Z

Replaces the closed #18, now with DCO sign-off on every commit. The content is unchanged and no force-push was used: the old branch was rebased onto a fresh branch with git rebase --signoff and pushed clean.

Summary

_FEMSparsityMixin with sequence-identity layout_signature (data_ptr + _version + numel + shape). Zero GPU sync, zero data movement, no autograd graph break, correct semantics for the Condenser/AMG position-indexed cache.
SparseMatrix uses the mixin. The eager SHA-256 + .cpu().tobytes() in __init__ is gone. layout_hash retained as a DeprecationWarning alias.
DSparseMatrix (new) — composition over torch_sla.distributed.DSparseTensor. Carries a 63-bit partition UUID generated on rank 0 with uuid.uuid4 and broadcast through the process group at construction.
@distributed class decorator — wraps any Assembler subclass.
Condenser — polymorphic dispatch; DSparseMatrix round-trips via .to_single() + re-partition (interim contract; per-rank distributed Condenser is a follow-up PR).
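The identity-based signature from the first bullet can be illustrated without torch. The sketch below is a plain-Python model, not the mixin's code: `VersionedBuffer` is a hypothetical stand-in for a tensor exposing `data_ptr()`, `_version`, `numel()` and `shape`, and the exact tuple layout is an assumption.

```python
from array import array


class VersionedBuffer:
    """Hypothetical stand-in for a torch.Tensor (illustration only)."""

    def __init__(self, values):
        self._data = array("q", values)  # owns its own storage
        self._version = 0                # bumped on every in-place write

    def data_ptr(self):
        # Address of the underlying buffer, analogous to Tensor.data_ptr().
        return self._data.buffer_info()[0]

    def numel(self):
        return len(self._data)

    @property
    def shape(self):
        return (len(self._data),)

    def write_(self, index, value):
        # In-place modification: same storage, new version.
        self._data[index] = value
        self._version += 1

    def clone(self):
        # Equal content, but a fresh buffer at a different address.
        return VersionedBuffer(self._data)


def layout_signature(row, col):
    """Compose (data_ptr, _version, numel, shape) for row and col.

    Only metadata is read, so nothing is copied or hashed.
    """
    def ident(t):
        return (t.data_ptr(), t._version, t.numel(), t.shape)
    return (ident(row), ident(col))


row = VersionedBuffer([0, 0, 1, 2])
col = VersionedBuffer([0, 1, 1, 2])

sig = layout_signature(row, col)
same_refs_hit = layout_signature(row, col) == sig                 # shared refs
clone_misses = layout_signature(row.clone(), col.clone()) != sig  # equal content
cache = {sig: "cached FEM buffers"}                               # hashable key

row.write_(0, 1)
stale_after_write = layout_signature(row, col) != sig             # version bump
```

The clone miss is deliberate in this model: a content hash would treat the clone as the same layout, whereas an identity signature only matches when the very same row/col storage is reused.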

Bug fixes uncovered while making CI green

This PR depends on three things in torch-sla that did not exist in the 0.2.1 PyPI release:

DSparseTensor.partition() classmethod (already on dev / feat/gather-owned-to-global)
gather_owned_to_global helper (feat(distributed): public gather_owned_to_global, vectorised scatter sparsexlab/torch-sla#41)
SparseTensor.requires_grad_ method delegation (0.3 refactor regression — body moved to ops.py as a free function but never re-exposed on the class)

Fixed two packaging / regression issues on the torch-sla side (commits on the same feat/gather-owned-to-global branch):

pyproject.toml had a hardcoded packages = ["torch_sla", "torch_sla.backends"] which silently dropped torch_sla.distributed and torch_sla.sparse_tensor from wheels built via pip install git+.... Switched to packages.find include = ["torch_sla*"].
Restored SparseTensor.requires_grad_ delegation in core.py next to the other ops-shims.

CI workflow now installs the dev branch over the PyPI release. Once torch-sla cuts 0.3.0, that line can go away and the pyproject pin can bump to >=0.3.0.

Test plan

Single-device unit: tests/sparse/test_layout_signature.py (6 cases)
Multi-proc Gloo CPU: tests/sparse/test_dsparsematrix.py (5), tests/distributed/test_distributed_assembler_decorator.py, tests/operator/test_condenser_dispatch.py, existing tests/distributed/test_distributed_assemble.py (16) — all pass locally
Multi-proc NCCL + CUDA on 2× A100 (autodl): all 6 new worker tests PASS; @distributed(LaplaceElementAssembler) end-to-end max_diff = 1.78e-15 (on the order of fp64 machine epsilon)
Full local pytest sweep (clean Python 3.14 venv simulating CI install order): 295 pass, 13 skipped, 1 pre-existing flake (test_quad.py::test_sum[3] — fp32 epsilon on Python 3.14 only; fails identically on main, CI runs 3.10/3.11/3.12)
DCO sign-off: all 10 commits carry Signed-off-by: walkerchi <walker.chi.000@gmail.com>

Follow-ups (not in this PR)

Native per-rank distributed Condenser (no to_single round-trip)
Sweep callers to layout_signature directly; remove layout_hash alias in next minor

🤖 Generated with Claude Code

torch-sla PR #31 removed the legacy DSparseTensor(values, row, col, shape=..., num_partitions=..., coords=..., partition_method=...) ctor in favour of the classmethod constructors (partition / from_local / from_sparse_local). The old single-process simulator path is gone. distributed_element_assemble was calling the dead ctor, so every distributed assembly path errored out with TypeError on import.

Switched to ``DSparseTensor.partition(A_global, mesh, partition_method, coords)`` via a small ``_build_dsparse`` helper:

* torch.distributed initialised -> real distributed-shard DSparseTensor on the live world (one rank, one row-shard).
* Not initialised (single-process driver + multi-thread assembly, or unit tests) -> mesh=None gives world=1; the DSparseTensor holds the entire global matrix locally. matvec / solve still compose with the rest of torch-sla.

Added test_multiproc_distributed_assemble_matvec_2procs: spawns 2 gloo ranks, each runs TensorMesh distributed assembly, builds a DSparseTensor, runs matvec in Shard(0) space (x sliced to owned), allgathers the result, compares to reference single-mesh assembly. Rel err 0 vs reference on a triangle rectangle mesh.

All 11 distributed tests pass.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

…t tests

The original PR only had a multi-proc *matvec* test. Now adds three real end-to-end tests that exercise the full distributed-solve path TensorMesh -> DSparseTensor -> cg_shard -> allgather -> compare:

* test_multiproc_distributed_solve_2procs: assemble Mass matrix on 2 gloo ranks, build b = M @ x_ref from a single-process reference, run distributed CG, allgather x, verify x_dist == x_ref to 1e-6. Mass instead of Laplace because Laplace has a constant null space that distributed CG drifts along (separate issue).
* test_multiproc_distributed_solve_4procs: same with world=4 to exercise more partitions + more halo edges.
* test_multiproc_poisson_dirichlet_2procs: full physics path on 2 ranks (assemble + Condenser for Dirichlet BCs + single-process SparseMatrix solve via distributed_element_assemble_to_sparse, compare to single-process reference). The to_sparse path returns a tensormesh SparseMatrix, not a DSparseTensor -- this test verifies the distributed assembly produces the same global matrix as single-process even under multi-process orchestration.

Implementation note: the solve worker calls cg_shard directly (raw tensor in / out) instead of the higher-level torch_sla.solve(D, b_dt) wrapper. The wrapper expects DTensor[Shard(0)] for b; building one from a manually-sliced owned tensor via DTensor.from_local works but adds DTensor wrapping overhead unrelated to the test goal.

14/14 distributed tests pass.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

The previous test_multiproc_poisson_dirichlet_2procs only exercised distributed_element_assemble_to_sparse which returns a SparseMatrix (single-process) -- never touched the actual DSparseTensor distributed solve path.

Adds two tests that close that gap:

    test_multiproc_poisson_dirichlet_dsparse_2procs
    test_multiproc_poisson_dirichlet_dsparse_4procs

Pipeline per rank:

1. TensorMesh distributed assembly -> global Laplace K + load f.
2. Single-process Condenser strips Dirichlet DOFs -> K_inner, f_inner (cheap at this size; the partitioning happens after).
3. Wrap K_inner as torch_sla.SparseTensor + DSparseTensor.partition (partition_method=metis, since Condenser breaks the 1:1 mesh-point mapping that RCB/coordinate methods would need).
4. Slice f_inner to this rank's owned rows.
5. Run distributed CG via cg_shard with M_apply=identity.
6. Allgather x_owned -> u_inner_global -> condenser.recover() -> full u.
7. Compare to single-process reference u: rel-err < 1e-6 required.

This is the actual integration deliverable: real PDE, real distributed matrix, real distributed solver, end-to-end matches single-process.

16/16 distributed tests pass on CPU (2 new + 14 existing).

Signed-off-by: walkerchi <walker.chi.000@gmail.com>
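Steps 2, 5 and 6 of that pipeline can be followed on a problem small enough to check by hand. The sketch below is a single-process, pure-Python illustration of Dirichlet condensation, unpreconditioned CG on the inner system, and recovery of the full solution. It does not use TensorMesh or torch-sla, and the helper names `condense`, `cg` and `recover` are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def condense(K, f, bc):
    """Strip Dirichlet DOFs. bc maps DOF index -> prescribed value."""
    inner = [i for i in range(len(f)) if i not in bc]
    K_inner = [[K[i][j] for j in inner] for i in inner]
    # Move the known boundary values to the right-hand side.
    f_inner = [f[i] - sum(K[i][j] * g for j, g in bc.items()) for i in inner]
    return K_inner, f_inner, inner


def cg(A, b, tol=1e-12, max_iter=100):
    """Unpreconditioned conjugate gradient for a small SPD system."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def recover(u_inner, inner, bc, n):
    """Scatter the inner solution back and re-insert the boundary values."""
    u = [0.0] * n
    for i, value in zip(inner, u_inner):
        u[i] = value
    for i, g in bc.items():
        u[i] = g
    return u


# 1D Laplace stiffness matrix on 5 nodes with unit spacing, zero load.
n = 5
K = [[0.0] * n for _ in range(n)]
for e in range(n - 1):
    K[e][e] += 1.0
    K[e + 1][e + 1] += 1.0
    K[e][e + 1] -= 1.0
    K[e + 1][e] -= 1.0
f = [0.0] * n
bc = {0: 0.0, 4: 1.0}  # u(0) = 0, u(4) = 1

K_inner, f_inner, inner = condense(K, f, bc)
u = recover(cg(K_inner, f_inner), inner, bc, n)
```

With zero load the exact solution is linear between the two boundary values, so `u` should be close to `[0, 0.25, 0.5, 0.75, 1]`. Removing the Dirichlet rows and columns is also what makes the inner Laplace system non-singular here.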

Adds distributed_element_assemble_per_rank: each rank assembles ONLY its own submesh (no threading across other ranks' submeshes), then all_to_all_single ships row-shared contributions to the row-owning rank. Total compute is roughly single-process work split N ways, instead of the old pseudo-distributed path which has every rank do the full N-submesh work and then partitions the result (O(N) compute per rank, O(N^2) total cluster compute).

Pipeline:

1. Compute partition_ids deterministically (lowest rank wins for shared boundary nodes). No communication.
2. Each rank: assemble own submesh -> local COO in global coords.
3. Route by destination rank (partition_ids[g_row]); all_to_all_single exchange (NCCL on CUDA, gloo on CPU).
4. Coalesce incoming triples (duplicate (g_row, g_col) sum).
5. Discover halo columns from coalesced pattern, build Partition, extract local SparseTensor in local coords, wrap as DSparseTensor via from_sparse_local.
6. On NCCL backend, .cuda() the result so callers don't have to.

Verified on autodl 2x A100 (NCCL):

* correctness: matvec match rel-err 1.7e-16 (fp64 eps); end-to-end Poisson-Dirichlet via DSparseTensor: rel-err 2.3e-16
* speed: 1.76x per-rank speedup on small mesh (n=510)

Existing distributed_element_assemble (pseudo-distributed) is kept as a back-compat default; users opt into the new path explicitly.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>
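Steps 1 to 4 of the per-rank pipeline can be simulated in one process. In the sketch below plain lists stand in for the `all_to_all_single` exchange, the mesh is a 1D chain of two-node elements, and every function name is hypothetical; it only illustrates the routing and coalescing logic, not the actual implementation.

```python
from collections import defaultdict


def owner_of_nodes(submesh_nodes):
    """Step 1: a node is owned by the lowest rank whose submesh contains it."""
    partition_ids = {}
    for rank, nodes in enumerate(submesh_nodes):
        for g in nodes:
            partition_ids.setdefault(g, rank)  # first (lowest) rank wins
    return partition_ids


def assemble(elements):
    """Step 2: element stiffness [[1, -1], [-1, 1]] as global COO triples."""
    triples = []
    for a, b in elements:
        triples += [(a, a, 1.0), (a, b, -1.0), (b, a, -1.0), (b, b, 1.0)]
    return triples


def route(local_triples, partition_ids, world_size):
    """Step 3: bucket (g_row, g_col, value) triples by the row-owning rank."""
    outbox = [[] for _ in range(world_size)]
    for g_row, g_col, value in local_triples:
        outbox[partition_ids[g_row]].append((g_row, g_col, value))
    return outbox


def coalesce(triples):
    """Step 4: sum duplicate (g_row, g_col) entries."""
    acc = defaultdict(float)
    for g_row, g_col, value in triples:
        acc[(g_row, g_col)] += value
    return dict(acc)


# Chain 0-1-2-3 split over two ranks; node 2 sits on the shared boundary.
elements_per_rank = [[(0, 1), (1, 2)], [(2, 3)]]
submesh_nodes = [[0, 1, 2], [2, 3]]
world_size = 2

partition_ids = owner_of_nodes(submesh_nodes)
outboxes = [route(assemble(e), partition_ids, world_size)
            for e in elements_per_rank]

# Stand-in for all_to_all_single: rank dst receives outboxes[src][dst].
inbox = [[t for src in range(world_size) for t in outboxes[src][dst]]
         for dst in range(world_size)]
owned_rows = [coalesce(triples) for triples in inbox]
```

Rank 1 contributes to row 2 but does not own it, so that contribution is shipped to rank 0 and summed there. Rank 1's own row 3 references column 2, which it does not own; in the real pipeline that is the kind of column step 5 would classify as halo.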

`mesh.partition(num_parts=4)` is now equivalent to `DistributedMesh(mesh, num_partitions=4)` -- thin wrapper, no algorithm change. The point is API cleanliness:

    mesh = tm.Mesh.gen_rectangle(...)
    dmesh = mesh.partition(num_parts=4, method='coordinate')

Pair with distributed_element_assemble_per_rank inside a torch.distributed context for **truly distributed** assembly (each rank only does its own submesh). The single-process pseudo-distributed path via distributed_element_assemble still works with the same dmesh.

Verified on autodl 2x A100 NCCL:

- mesh.partition() output equivalent to DistributedMesh() (same n_points, num_parts, and orig_nid mapping per submesh)
- mesh.partition() + per_rank assembly + matvec match vs reference K: rel-err 1.7e-16 (fp64 machine eps)

Existing DistributedMesh stays as the underlying class -- nothing to migrate. Just gives users a more Pythonic entry point.

Algorithm consolidation (moving partition to torch-sla so we get hilbert/patoh for free) is intentionally NOT in this PR -- 3.5d of real work including element-vs-node ownership rules and high-order edge cases. Tracked separately.

Signed-off-by: walkerchi <walker.chi.000@gmail.com>

The two end-to-end distributed solve tests called the internal cg_shard primitive directly. Tests represent the user surface, so they should go through the public torch_sla.solve API with iterative defaults scoped via SolverConfig. Same dispatch path under the hood (SolverConfig -> solve -> cg_shard for DSparseTensor+CG) but the example now reads like the documented usage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>

…atter

Three sites in test_distributed_assemble.py each open-coded the same pad -> 3x dist.all_gather -> Python for-loop scatter pattern (~14 lines each). The torch-sla side just landed a vectorised public helper (sparsexlab/torch-sla#41 -- gather_owned_to_global with one index_put_ over an all_gather_into_tensor buffer), so swap each site to a single call. -44 lines, +6 lines, same semantics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
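For readers unfamiliar with the pattern being replaced, here is a single-process model of pad -> gather -> scatter using plain lists. It is not torch-sla's `gather_owned_to_global`; the name `gather_owned_to_global_sim` is hypothetical, and the three per-rank quantities it handles (values, global indices, counts) are an assumption about what the three `all_gather` calls carried.

```python
def gather_owned_to_global_sim(owned_values, owned_ids, n_global, fill=0.0):
    """Single-process model of pad -> all-gather -> scatter.

    owned_values[r] and owned_ids[r] hold rank r's owned entries and their
    global indices. Every rank would end up with the same global vector.
    """
    counts = [len(v) for v in owned_values]
    max_len = max(counts)

    # Pad so that every rank contributes a buffer of equal length,
    # which is what a fixed-size collective needs.
    padded_vals = [v + [fill] * (max_len - len(v)) for v in owned_values]
    padded_ids = [i + [-1] * (max_len - len(i)) for i in owned_ids]

    # Iterating over the per-rank buffers stands in for the gathered result.
    x_global = [fill] * n_global
    for vals, ids, count in zip(padded_vals, padded_ids, counts):
        for value, g in zip(vals[:count], ids[:count]):  # skip the padding
            x_global[g] = value
    return x_global


x = gather_owned_to_global_sim(
    owned_values=[[10.0, 12.0, 13.0], [11.0]],
    owned_ids=[[0, 2, 3], [1]],
    n_global=4,
)
```

The inner loop is the part the commit describes as a Python for-loop scatter; the vectorised helper replaces it with a single indexed write over the gathered buffer.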

…er, Condenser dispatch

Five-part refactor that closes out the design review on SparseMatrix / DSparseMatrix / Assembler / Condenser:

* sparse/mixin.py (new) -- _FEMSparsityMixin with sequence-identity layout_signature. Zero GPU sync, zero data movement, no autograd graph break. Composes (data_ptr, _version, numel, shape) so two matrices sharing row/col tensor refs hit the cache, in-place modification (PyTorch _version bump) invalidates correctly, and cloned-but-content-equal matrices intentionally miss (the cached FEM buffers index by COO position, so set-identity would silently corrupt them).
* sparse/matrix.py -- SparseMatrix now mixes in _FEMSparsityMixin. Drops the eager SHA-256 hashlib call in __init__ and the GPU-to-CPU byte copy it required. layout_hash kept as a deprecated alias that routes to layout_signature; old callers continue to function with a DeprecationWarning.
* sparse/dmatrix.py (new) -- DSparseMatrix via composition over torch_sla.distributed.DSparseTensor. Carries a 63-bit partition UUID, generated on rank 0 with uuid.uuid4 and broadcast through the active process group at construction. The UUID enters layout_signature so two DSparseMatrix instances on the same partition share caches even if local layouts coincidentally match. Arithmetic / device methods propagate the parent UUID. to_single() bridges to single-device SparseMatrix for callers (like Condenser) that have not yet learned the distributed form.
* distributed/assembler.py (new) -- @distributed class decorator. Wraps any ElementAssembler / NodeAssembler subclass by overriding from_mesh (accept DistributedMesh) and __call__ (run per-rank submesh assembly and return DSparseMatrix or per-rank tensor). Avoids defining a parallel DAssembler hierarchy; covers third-party custom assemblers automatically.
* operator/condense.py -- Condenser now dispatches: SparseMatrix takes the original path (now using layout_signature instead of the retired SHA-256 hash), DSparseMatrix routes through _call_distributed which round-trips via .to_single() + re-partition. The round-trip is documented as the interim contract; the per-rank distributed Condenser is a follow-up PR.

Tests

* tests/sparse/test_layout_signature.py -- exercises shared-storage hit, different-storage miss, in-place version invalidation, hashable dict-key usage, deprecation warning on legacy layout_hash, signature-tuple acceptance.
* Existing tests/distributed/test_distributed_assemble.py still all pass (16/16, Gloo, world=2 and world=4) verifying the Condenser layout_hash -> layout_signature switch is non-breaking.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
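The class-decorator approach described for distributed/assembler.py can be sketched with toy classes. Everything below is hypothetical (`ToyAssembler`, `ToyDistributedMesh`, and the dictionary return values), and implementing the wrapper as a generated subclass is an assumption; the sketch only shows how overriding `from_mesh` and `__call__` lets one decorator cover any assembler subclass.

```python
class ToyAssembler:
    """Hypothetical single-device assembler; not TensorMesh's Assembler."""

    def __init__(self, mesh):
        self.mesh = mesh

    @classmethod
    def from_mesh(cls, mesh):
        return cls(mesh)

    def __call__(self):
        # "Assemble": one unit diagonal entry per node, keyed by node id.
        return {node: 1.0 for node in self.mesh}


class ToyDistributedMesh:
    """Hypothetical partitioned mesh: one list of node ids per rank."""

    def __init__(self, submeshes, rank):
        self.submeshes = submeshes
        self.rank = rank


def distributed(cls):
    """Class decorator: subclass cls, overriding from_mesh and __call__."""

    class Distributed(cls):
        @classmethod
        def from_mesh(klass, dmesh):
            # Build the assembler on this rank's submesh only.
            obj = klass(dmesh.submeshes[dmesh.rank])
            obj.rank = dmesh.rank
            return obj

        def __call__(self):
            # Reuse the wrapped class's assembly unchanged, then tag it
            # with the rank (a stand-in for wrapping as a distributed matrix).
            local = super().__call__()
            return {"rank": self.rank, "local": local}

    Distributed.__name__ = f"Distributed{cls.__name__}"
    return Distributed


class ThirdPartyAssembler(ToyAssembler):
    """A custom subclass the decorator has never seen before."""

    def __call__(self):
        return {node: 2.0 for node in self.mesh}


dmesh = ToyDistributedMesh(submeshes=[[0, 1, 2], [3, 4]], rank=1)
result = distributed(ToyAssembler).from_mesh(dmesh)()
custom = distributed(ThirdPartyAssembler).from_mesh(dmesh)()
```

Because the wrapper is generated from whatever class it is given, `ThirdPartyAssembler` gets the distributed behaviour without a parallel class hierarchy.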

…Condenser

Adds the multi-process coverage flagged in review:

- tests/sparse/test_dsparsematrix.py -- UUID broadcast consistency across ranks, layout_signature includes the UUID, arithmetic preserves it, to_single() round-trip, matvec delegation.
- tests/distributed/test_distributed_assembler_decorator.py -- @distributed(LaplaceElementAssembler).from_mesh(dmesh)() end-to-end vs the single-device reference (matvec equality via gather_owned_to_global).
- tests/operator/test_condenser_dispatch.py -- SparseMatrix path still works after the layout_signature switch; DSparseMatrix path routes through _call_distributed with the documented warning, and emits a DSparseMatrix back at the same shape.

Two fixes uncovered during NCCL+CUDA verification on 2x A100:

- operator/condense.py: torch.distributed.tensor.DTensor moved in torch >= 2.2; fall back to torch.distributed._tensor (torch 2.0-2.1) so the round-trip path works on older runtimes.
- operator/condense.py: Condenser stores dirichlet_mask on its construction device (typically CPU). In the DSparseMatrix path, to_single() returns a CUDA SparseMatrix; the cached path's is_inner_dof[edge_u] then trips a cross-device index. Force CPU through the single-device round-trip and move the result back to the input's device before re-partitioning.

Verified on autodl 2x A100 NCCL: all 6 worker tests PASS, including end-to-end @distributed-decorated assembly with max_diff=1.78e-15 (machine epsilon).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
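The DTensor import fallback from the first fix can be written as a small helper. This is a sketch of the general pattern, not the code in operator/condense.py: `import_first` and `load_dtensor` are hypothetical names, and the only module paths used are the two the commit message mentions.

```python
import collections
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first module in `candidates` that provides it."""
    errors = []
    for name in candidates:
        try:
            return getattr(importlib.import_module(name), attr)
        except (ImportError, AttributeError) as exc:
            # ImportError: the module is missing on this runtime.
            # AttributeError: the module exists but does not export `attr`.
            errors.append(f"{name}: {exc!r}")
    raise ImportError(f"could not import {attr}; tried " + "; ".join(errors))


def load_dtensor():
    # Try the newer location first, then the older private one.
    return import_first(
        ["torch.distributed.tensor", "torch.distributed._tensor"], "DTensor"
    )


# Demonstrated on stdlib modules so the sketch runs without torch installed:
# the first name does not exist, the second lacks the attribute.
found = import_first(["no_such_module_xyz", "math", "collections"], "OrderedDict")
```

Catching `AttributeError` as well as `ImportError` matters when a package path exists on an older runtime but does not yet export the name being looked up.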

This PR depends on three things added in torch-sla after the 0.2.1 PyPI release:

* DSparseTensor.partition() classmethod (PR #31's split + #66's DSparseTensor mirror surface)
* gather_owned_to_global helper (PR #41)
* SparseTensor.requires_grad_ method delegation (fixed on the same branch alongside a pyproject.toml ``packages = find`` fix)

CI used to pull torch-sla from PyPI via ``pip install -e ".[test]"``, which dragged 0.2.1 and broke every test that imports ``DSparseTensor.partition`` with AttributeError. Force-reinstall the ``feat/gather-owned-to-global`` branch over the PyPI release after the editable install so we use the as-needed dev surface.

Once torch-sla cuts 0.3.0, remove the --force-reinstall line and bump the pyproject pin to ``torch-sla>=0.3.0``.

Verified locally in a clean venv (Python 3.14, CPU torch): 295 passed, 13 skipped, 1 pre-existing flake (test_quad.py::test_sum[3] on Python 3.14 only -- main fails the same way; CI runs 3.10/3.11/3.12 where it passes).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>

walkerchi and others added 10 commits June 16, 2026 01:50

walkerchi mentioned this pull request Jun 15, 2026

feat(sparse): FEM sparsity mixin + DSparseMatrix + @distributed assembler + Condenser dispatch #18

Closed

walkerchi wants to merge 10 commits into dev/core from feat/fem-mixin-dsparsematrix-v2
