feat(sparse): FEM mixin + DSparseMatrix + @distributed assembler + Condenser dispatch#19
Open
walkerchi wants to merge 10 commits into
torch-sla PR #31 removed the legacy DSparseTensor(values, row, col,
shape=..., num_partitions=..., coords=..., partition_method=...) ctor
in favour of the classmethod constructors (partition / from_local /
from_sparse_local). The old single-process simulator path is gone.
distributed_element_assemble was calling the dead ctor, so every
distributed assembly path errored out with TypeError on import.
Switched to ``DSparseTensor.partition(A_global, mesh, partition_method,
coords)`` via a small ``_build_dsparse`` helper:
* torch.distributed initialised -> real distributed-shard
DSparseTensor on the live world (one rank, one row-shard).
* Not initialised (single-process driver + multi-thread assembly,
or unit tests) -> mesh=None gives world=1; the DSparseTensor
holds the entire global matrix locally. matvec / solve still
compose with the rest of torch-sla.
Added test_multiproc_distributed_assemble_matvec_2procs: spawns 2
gloo ranks, each runs TensorMesh distributed assembly, builds a
DSparseTensor, runs matvec in Shard(0) space (x sliced to owned),
allgathers the result, compares to reference single-mesh assembly.
Relative error is 0 against the reference on a triangulated rectangle mesh.
All 11 distributed tests pass.
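To make the test's check concrete, here is a minimal pure-Python simulation of a row-sharded matvec followed by an allgather. It does not use torch or torch-sla: the "ranks" are loop iterations, the halo fetch is a plain list lookup, and all function names are hypothetical.

```python
# Simulated Shard(0) matvec: each "rank" owns a contiguous block of rows and
# the matching slice of x; off-block x entries are fetched from their owner.

def shard_rows(n_rows, world_size):
    """Split rows 0..n_rows-1 into contiguous (start, stop) blocks per rank."""
    base, extra = divmod(n_rows, world_size)
    bounds, start = [], 0
    for rank in range(world_size):
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def owner_of(index, bounds):
    """Return the rank whose block contains the global index."""
    for rank, (start, stop) in enumerate(bounds):
        if start <= index < stop:
            return rank
    raise IndexError(index)

def local_matvec(A, x_slices, bounds, rank):
    """Compute y for the rows owned by `rank`, reading halo x from owners."""
    start, stop = bounds[rank]
    y_owned = []
    for i in range(start, stop):
        acc = 0.0
        for j, a_ij in enumerate(A[i]):
            if a_ij == 0:
                continue  # only nonzero columns need an x value
            src = owner_of(j, bounds)
            acc += a_ij * x_slices[src][j - bounds[src][0]]
        y_owned.append(acc)
    return y_owned

def distributed_matvec(A, x, world_size):
    bounds = shard_rows(len(A), world_size)
    x_slices = [x[s:e] for s, e in bounds]          # x sliced to owned
    pieces = [local_matvec(A, x_slices, bounds, r) for r in range(world_size)]
    return [v for piece in pieces for v in piece]   # simulated allgather

A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
x = [1.0, 2.0, 3.0, 4.0]
y_ref = [sum(a * xj for a, xj in zip(row, x)) for row in A]
y_dist = distributed_matvec(A, x, world_size=2)
```

The comparison the test makes corresponds to checking `y_dist` against `y_ref` here.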
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
…t tests
The original PR only had a multi-proc *matvec* test. Now adds three
real end-to-end tests that exercise the full distributed-solve path
TensorMesh -> DSparseTensor -> cg_shard -> allgather -> compare:
* test_multiproc_distributed_solve_2procs: assemble Mass matrix on
2 gloo ranks, build b = M @ x_ref from a single-process reference,
run distributed CG, allgather x, verify x_dist == x_ref to 1e-6.
Mass instead of Laplace because Laplace has a constant null space
that distributed CG drifts along (separate issue).
* test_multiproc_distributed_solve_4procs: same with world=4 to
exercise more partitions + more halo edges.
* test_multiproc_poisson_dirichlet_2procs: full physics path on
2 ranks (assemble + Condenser for Dirichlet BCs + single-process
SparseMatrix solve via distributed_element_assemble_to_sparse,
compare to single-process reference). The to_sparse path returns
a tensormesh SparseMatrix, not a DSparseTensor -- this test
verifies the distributed assembly produces the same global matrix
as single-process even under multi-process orchestration.
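The choice of Mass over Laplace can be illustrated with a small pure-Python check. This is a sketch on a 1D mesh of linear elements with no Dirichlet rows (an assumption about the setting; the helper names are hypothetical): the stiffness matrix maps the constant vector to zero, while the mass matrix does not.

```python
# 1D linear (P1) elements: compare K @ 1 and M @ 1 to show the constant
# null space of the un-constrained Laplace (stiffness) matrix.

def assemble_1d(n_elements, length, local_matrix):
    """Assemble a dense (n_elements + 1)^2 matrix from 2x2 element matrices."""
    h = length / n_elements
    n = n_elements + 1
    A = [[0.0] * n for _ in range(n)]
    local = local_matrix(h)
    for e in range(n_elements):
        for a in range(2):
            for b in range(2):
                A[e + a][e + b] += local[a][b]
    return A

def stiffness(h):
    return [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]

def mass(h):
    return [[h / 3.0, h / 6.0], [h / 6.0, h / 3.0]]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

K = assemble_1d(4, 1.0, stiffness)
M = assemble_1d(4, 1.0, mass)
ones = [1.0] * 5

K_ones = matvec(K, ones)  # every entry vanishes: constants are in the null space
M_ones = matvec(M, ones)  # strictly positive entries that sum to the domain length
```

Because `K` is singular in this setting, `K x = b` fixes `x` only up to an additive constant, which is why a comparison against `x_ref` is cleaner with the mass matrix.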
Implementation note: the solve worker calls cg_shard directly (raw
tensor in / out) instead of the higher-level torch_sla.solve(D, b_dt)
wrapper. The wrapper expects DTensor[Shard(0)] for b; building one
from a manually-sliced owned tensor via DTensor.from_local works but
adds DTensor wrapping overhead unrelated to the test goal.
14/14 distributed tests pass.
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
The previous test_multiproc_poisson_dirichlet_2procs only exercised
distributed_element_assemble_to_sparse which returns a SparseMatrix
(single-process) -- never touched the actual DSparseTensor distributed
solve path. Adds two tests that close that gap:
test_multiproc_poisson_dirichlet_dsparse_2procs
test_multiproc_poisson_dirichlet_dsparse_4procs
Pipeline per rank:
1. TensorMesh distributed assembly -> global Laplace K + load f.
2. Single-process Condenser strips Dirichlet DOFs -> K_inner, f_inner
(cheap at this size; the partitioning happens after).
3. Wrap K_inner as torch_sla.SparseTensor + DSparseTensor.partition
(partition_method=metis, since Condenser breaks the 1:1 mesh-point
mapping that RCB/coordinate methods would need).
4. Slice f_inner to this rank's owned rows.
5. Run distributed CG via cg_shard with M_apply=identity.
6. Allgather x_owned -> u_inner_global -> condenser.recover() -> full u.
7. Compare to single-process reference u: rel-err < 1e-6 required.
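The condense, solve, recover part of this pipeline (steps 2, 5, 6 and 7) can be sketched in a single process with plain Python lists. This is not the Condenser or cg_shard implementation: `condense`, `recover`, and `cg` below are hypothetical stand-ins, and partitioning and allgather are left out.

```python
# Dirichlet condensation + unpreconditioned CG + recovery on a 1D Laplace problem.

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, tol=1e-12, max_iter=200):
    """Conjugate gradient with identity preconditioner, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def condense(K, f, bc):
    """Drop Dirichlet rows/cols; move known values to the right-hand side."""
    inner = [i for i in range(len(K)) if i not in bc]
    K_inner = [[K[i][j] for j in inner] for i in inner]
    f_inner = [f[i] - sum(K[i][j] * val for j, val in bc.items()) for i in inner]
    return inner, K_inner, f_inner

def recover(n, inner, u_inner, bc):
    """Scatter the inner solution and the Dirichlet values into the full vector."""
    u = [0.0] * n
    for i, val in zip(inner, u_inner):
        u[i] = val
    for i, val in bc.items():
        u[i] = val
    return u

n = 5
K = [[0.0] * n for _ in range(n)]
for e in range(n - 1):  # unit-length linear elements
    K[e][e] += 1.0
    K[e][e + 1] -= 1.0
    K[e + 1][e] -= 1.0
    K[e + 1][e + 1] += 1.0
f = [0.0] * n
bc = {0: 0.0, n - 1: 1.0}  # u = 0 on the left end, u = 1 on the right end

inner, K_inner, f_inner = condense(K, f, bc)
u = recover(n, inner, cg(K_inner, f_inner), bc)
```

With zero load, the recovered `u` is the linear interpolant between the two boundary values, which makes the example easy to check by hand.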
This is the actual integration deliverable: real PDE, real distributed
matrix, real distributed solver, end-to-end matches single-process.
16/16 distributed tests pass on CPU (2 new + 14 existing).
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
Adds distributed_element_assemble_per_rank: each rank assembles ONLY
its own submesh (no threading across other ranks' submeshes), then
all_to_all_single ships row-shared contributions to the row-owning
rank. Total compute is roughly the single-process work split N ways.
The old pseudo-distributed path instead has every rank assemble all N
submeshes and then partition the result (O(N) submesh assemblies per
rank, O(N^2) across the cluster).
Pipeline:
1. Compute partition_ids deterministically (lowest rank wins for
shared boundary nodes). No communication.
2. Each rank: assemble own submesh -> local COO in global coords.
3. Route by destination rank (partition_ids[g_row]); all_to_all_single
exchange (NCCL on CUDA, gloo on CPU).
4. Coalesce incoming triples (duplicate (g_row, g_col) sum).
5. Discover halo columns from coalesced pattern, build Partition,
extract local SparseTensor in local coords, wrap as DSparseTensor
via from_sparse_local.
6. On NCCL backend, .cuda() the result so callers don't have to.
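Steps 2 to 5 can be simulated without any process group. The sketch below uses a three-node 1D mesh split across two "ranks"; the all_to_all exchange is replaced by list concatenation, and every function name is hypothetical rather than part of tensormesh or torch-sla.

```python
# Route per-rank COO triples to the row-owning rank, then coalesce duplicates.
from collections import defaultdict

def element_triples(nodes):
    """COO triples (row, col, val) of one unit-length 1D linear element."""
    a, b = nodes
    return [(a, a, 1.0), (a, b, -1.0), (b, a, -1.0), (b, b, 1.0)]

def route(triples, partition_ids, world_size):
    """Bucket triples by the rank that owns their global row."""
    outbox = [[] for _ in range(world_size)]
    for row, col, val in triples:
        outbox[partition_ids[row]].append((row, col, val))
    return outbox

def coalesce(triples):
    """Sum values that share the same (row, col)."""
    acc = defaultdict(float)
    for row, col, val in triples:
        acc[(row, col)] += val
    return dict(sorted(acc.items()))

world_size = 2
partition_ids = [0, 0, 1]                  # node 1 is shared; lowest rank wins
submesh_elements = {0: [(0, 1)], 1: [(1, 2)]}

# Step 2: each rank assembles only its own submesh, in global coordinates.
local = {rank: [t for el in els for t in element_triples(el)]
         for rank, els in submesh_elements.items()}
# Step 3: route by destination rank; the exchange is simulated by concatenation.
outboxes = {rank: route(local[rank], partition_ids, world_size) for rank in local}
inbox = {dst: [t for src in range(world_size) for t in outboxes[src][dst]]
         for dst in range(world_size)}
# Step 4: coalesce the incoming triples.
owned = {dst: coalesce(inbox[dst]) for dst in range(world_size)}
# Step 5 (first half): halo columns are columns owned by another rank.
halo = {dst: sorted({c for (_, c) in owned[dst] if partition_ids[c] != dst})
        for dst in range(world_size)}
```

Rank 1's contribution to row 1 is shipped to rank 0, where the two diagonal entries for node 1 sum to 2.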
Verified on autodl 2x A100 (NCCL):
correctness: matvec match rel-err 1.7e-16 (fp64 eps)
end-to-end Poisson-Dirichlet via DSparseTensor: rel-err 2.3e-16
speed: 1.76x per-rank speedup on small mesh (n=510)
Existing distributed_element_assemble (pseudo-distributed) is kept
as a back-compat default; users opt into the new path explicitly.
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
`mesh.partition(num_parts=4)` is now equivalent to
`DistributedMesh(mesh, num_partitions=4)` -- thin wrapper, no
algorithm change. The point is API cleanliness:
mesh = tm.Mesh.gen_rectangle(...)
dmesh = mesh.partition(num_parts=4, method='coordinate')
Pair with distributed_element_assemble_per_rank inside a
torch.distributed context for **truly distributed** assembly (each
rank only does its own submesh). The single-process pseudo-distributed
path via distributed_element_assemble still works with the same dmesh.
Verified on autodl 2x A100 NCCL:
- mesh.partition() output equivalent to DistributedMesh() (same n_points,
num_parts, and orig_nid mapping per submesh)
- mesh.partition() + per_rank assembly + matvec match vs reference K:
rel-err 1.7e-16 (fp64 machine eps)
Existing DistributedMesh stays as the underlying class -- nothing to
migrate. Just gives users a more Pythonic entry point.
Algorithm consolidation (moving partitioning into torch-sla so we get
hilbert / patoh for free) is intentionally NOT in this PR: it is about
3.5 days of real work, including element-vs-node ownership rules and
high-order edge cases. Tracked separately.
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
The two end-to-end distributed solve tests called the internal
cg_shard primitive directly. Tests represent the user surface, so they
should go through the public torch_sla.solve API with iterative
defaults scoped via SolverConfig. Same dispatch path under the hood
(SolverConfig -> solve -> cg_shard for DSparseTensor+CG) but the
example now reads like the documented usage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
…atter
Three sites in test_distributed_assemble.py each open-coded the same
pad -> 3x dist.all_gather -> Python for-loop scatter pattern (~14 lines
each). The torch-sla side just landed a vectorised public helper
(sparsexlab/torch-sla#41 -- gather_owned_to_global with one index_put_
over an all_gather_into_tensor buffer), so swap each site to a single
call. -44 lines, +6 lines, same semantics.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
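The semantics of the pad, gather, scatter pattern can be shown with a pure-Python simulation. This is not the torch-sla helper and does not reflect its signature: `gather_owned_to_global_sim` is a hypothetical name, the gathered buffer is a list of lists, and the final loop stands in for the single index-based scatter.

```python
# Simulated "gather owned values into a global vector" for uneven ownership.
PAD = -1  # sentinel global index marking padded slots

def gather_owned_to_global_sim(owned_ids, owned_vals, n_global):
    """owned_ids[r] / owned_vals[r] are the global indices and values on rank r."""
    max_len = max(len(ids) for ids in owned_ids)
    # Pad every rank's buffers to the same length so they can be gathered
    # into one rectangular buffer.
    ids_buf = [ids + [PAD] * (max_len - len(ids)) for ids in owned_ids]
    val_buf = [vals + [0.0] * (max_len - len(vals)) for vals in owned_vals]
    # Flatten the gathered buffer and scatter the valid entries in one pass.
    flat_ids = [i for row in ids_buf for i in row]
    flat_vals = [v for row in val_buf for v in row]
    out = [0.0] * n_global
    for i, v in zip(flat_ids, flat_vals):
        if i != PAD:
            out[i] = v
    return out

owned_ids = [[0, 2, 4], [1, 3]]             # interleaved ownership, uneven sizes
owned_vals = [[10.0, 12.0, 14.0], [11.0, 13.0]]
x_global = gather_owned_to_global_sim(owned_ids, owned_vals, n_global=5)
```

Each global index is owned by exactly one rank, so the scatter order does not matter.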
…er, Condenser dispatch
Five-part refactor that closes out the design review on SparseMatrix /
DSparseMatrix / Assembler / Condenser:
* sparse/mixin.py (new) -- _FEMSparsityMixin with sequence-identity
layout_signature. Zero GPU sync, zero data movement, no autograd graph
break. Composes (data_ptr, _version, numel, shape) so two matrices
sharing row/col tensor refs hit the cache, in-place modification
(PyTorch _version bump) invalidates correctly, and
cloned-but-content-equal matrices intentionally miss (the cached FEM
buffers index by COO position, so set-identity would silently corrupt
them).
* sparse/matrix.py -- SparseMatrix now mixes in _FEMSparsityMixin.
Drops the eager SHA-256 hashlib call in __init__ and the GPU-to-CPU
byte copy it required. layout_hash kept as a deprecated alias that
routes to layout_signature; old callers continue to function with a
DeprecationWarning.
* sparse/dmatrix.py (new) -- DSparseMatrix via composition over
torch_sla.distributed.DSparseTensor. Carries a 63-bit partition UUID,
generated on rank 0 with uuid.uuid4 and broadcast through the active
process group at construction. The UUID enters layout_signature so two
DSparseMatrix instances on the same partition share caches even if
local layouts coincidentally match. Arithmetic / device methods
propagate the parent UUID. to_single() bridges to single-device
SparseMatrix for callers (like Condenser) that have not yet learned the
distributed form.
* distributed/assembler.py (new) -- @distributed class decorator. Wraps
any ElementAssembler / NodeAssembler subclass by overriding from_mesh
(accept DistributedMesh) and __call__ (run per-rank submesh assembly
and return DSparseMatrix or per-rank tensor). Avoids defining a
parallel DAssembler hierarchy; covers third-party custom assemblers
automatically.
* operator/condense.py -- Condenser now dispatches: SparseMatrix takes
the original path (now using layout_signature instead of the retired
SHA-256 hash), DSparseMatrix routes through _call_distributed which
round-trips via .to_single() + re-partition. The round-trip is
documented as the interim contract; the per-rank distributed Condenser
is a follow-up PR.
Tests
* tests/sparse/test_layout_signature.py -- exercises shared-storage
hit, different-storage miss, in-place version invalidation, hashable
dict-key usage, deprecation warning on legacy layout_hash,
signature-tuple acceptance.
* Existing tests/distributed/test_distributed_assemble.py still all
pass (16/16, Gloo, world=2 and world=4) verifying the Condenser
layout_hash -> layout_signature switch is non-breaking.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
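The identity-based signature can be illustrated without torch. In the sketch below, `FakeIndexTensor` is a hypothetical stand-in that mimics the four tensor attributes the commit names (`data_ptr()`, `_version`, `numel()`, `shape`), and `layout_signature` is an assumed composition of them, not the code in sparse/mixin.py.

```python
# Identity-based layout signature: cheap to compute, hashable, and sensitive
# to in-place edits, but deliberately blind to content equality.
import itertools

class FakeIndexTensor:
    """Stand-in for an index tensor (hypothetical, for illustration only)."""
    _ptr_source = itertools.count(4096, 256)  # fake storage addresses

    def __init__(self, values):
        self._values = list(values)
        self._ptr = next(FakeIndexTensor._ptr_source)
        self._version = 0

    def data_ptr(self):
        return self._ptr

    def numel(self):
        return len(self._values)

    @property
    def shape(self):
        return (len(self._values),)

    def clone(self):
        # New storage, same content.
        return FakeIndexTensor(self._values)

    def write_(self, index, value):
        # In-place edit: bump the version counter, as in-place ops do.
        self._values[index] = value
        self._version += 1

def _identity(t):
    return (t.data_ptr(), t._version, t.numel(), tuple(t.shape))

def layout_signature(row, col, shape):
    """Hashable tuple built from identity metadata only; no data is read."""
    return (_identity(row), _identity(col), tuple(shape))

row = FakeIndexTensor([0, 0, 1])
col = FakeIndexTensor([0, 1, 1])
sig = layout_signature(row, col, (2, 2))
cache = {sig: "cached FEM buffers"}           # usable as a dict key

shared_hit = layout_signature(row, col, (2, 2)) in cache
clone_hit = layout_signature(row.clone(), col.clone(), (2, 2)) in cache
row.write_(2, 0)                              # in-place change to the layout
stale_hit = layout_signature(row, col, (2, 2)) in cache
```

Only the shared-reference lookup hits the cache; the clone and the in-place edit both miss, matching the three behaviours the commit describes.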
…Condenser
Adds the multi-process coverage flagged in review:
- tests/sparse/test_dsparsematrix.py -- UUID broadcast consistency
across ranks, layout_signature includes the UUID, arithmetic preserves
it, to_single() round-trip, matvec delegation.
- tests/distributed/test_distributed_assembler_decorator.py --
@distributed(LaplaceElementAssembler).from_mesh(dmesh)() end-to-end vs
the single-device reference (matvec equality via
gather_owned_to_global).
- tests/operator/test_condenser_dispatch.py -- SparseMatrix path still
works after the layout_signature switch; DSparseMatrix path routes
through _call_distributed with the documented warning, and emits a
DSparseMatrix back at the same shape.
Two fixes uncovered during NCCL+CUDA verification on 2x A100:
- operator/condense.py: torch.distributed.tensor.DTensor moved in
torch >= 2.2; fall back to torch.distributed._tensor (torch 2.0-2.1) so
the round-trip path works on older runtimes.
- operator/condense.py: Condenser stores dirichlet_mask on its
construction device (typically CPU). In the DSparseMatrix path,
to_single() returns a CUDA SparseMatrix; the cached path's
is_inner_dof[edge_u] then trips a cross-device index. Force CPU through
the single-device round-trip and move the result back to the input's
device before re-partitioning.
Verified on autodl 2x A100 NCCL: all 6 worker tests PASS, including
end-to-end @distributed-decorated assembly with max_diff=1.78e-15
(machine epsilon).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
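The import fallback for DTensor described in this commit can be written without checking version numbers. This is a minimal sketch, not the code in operator/condense.py; the final `None` branch is an added assumption so that the snippet also runs where torch is absent.

```python
# Version-agnostic DTensor import: try the public module path first, then the
# private one, and degrade to None if neither is importable.
try:
    from torch.distributed.tensor import DTensor  # public location
except ImportError:
    try:
        from torch.distributed._tensor import DTensor  # private location
    except ImportError:
        DTensor = None  # torch missing, or built without distributed support

def is_dtensor(obj):
    """True only when DTensor is importable and obj is an instance of it."""
    return DTensor is not None and isinstance(obj, DTensor)
```

Catching `ImportError` keeps the check independent of the exact release in which the module moved, so the version boundary stated in the commit message does not need to be encoded in the code.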
This PR depends on three things added in torch-sla after the 0.2.1
PyPI release:
* DSparseTensor.partition() classmethod (PR #31's split + #66's
DSparseTensor mirror surface)
* gather_owned_to_global helper (PR #41)
* SparseTensor.requires_grad_ method delegation (fixed on the same
branch alongside a pyproject.toml ``packages = find`` fix)
CI used to pull torch-sla from PyPI via ``pip install -e ".[test]"``,
which installed 0.2.1 and broke every test that calls
``DSparseTensor.partition`` with AttributeError. CI now force-reinstalls
the ``feat/gather-owned-to-global`` branch over the PyPI release after
the editable install, so tests run against the dev surface this PR
needs.
Once torch-sla cuts 0.3.0, remove the --force-reinstall line and bump
the pyproject pin to ``torch-sla>=0.3.0``.
Verified locally in a clean venv (Python 3.14, CPU torch): 295 passed,
13 skipped, 1 pre-existing flake (test_quad.py::test_sum[3] on Python
3.14 only -- main fails the same way; CI runs 3.10/3.11/3.12 where it
passes).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
Replacement of the closed #18 with DCO sign-off on every commit (same content, no `--force-push` was used; the old branch was rebased onto a fresh branch with `git rebase --signoff` and pushed clean).
Summary
* `_FEMSparsityMixin` with sequence-identity `layout_signature` (`data_ptr` + `_version` + `numel` + `shape`). Zero GPU sync, zero data movement, no autograd graph break, correct semantics for the Condenser/AMG position-indexed cache.
* `SparseMatrix` uses the mixin. The eager SHA-256 + `.cpu().tobytes()` in `__init__` is gone. `layout_hash` retained as a `DeprecationWarning` alias.
* `DSparseMatrix` (new) -- composition over `torch_sla.distributed.DSparseTensor`. Carries a 63-bit partition UUID generated on rank 0 with `uuid.uuid4` and broadcast through the process group at construction.
* `@distributed` class decorator -- wraps any Assembler subclass.
* `Condenser` -- polymorphic dispatch; `DSparseMatrix` round-trips via `.to_single()` + re-partition (interim contract; per-rank distributed Condenser is a follow-up PR).
Bug fixes uncovered while making CI green
This PR depends on three things in torch-sla that did not exist in the 0.2.1 PyPI release:
* `DSparseTensor.partition()` classmethod (already on dev / `feat/gather-owned-to-global`)
* `gather_owned_to_global` helper (feat(distributed): public gather_owned_to_global, vectorised scatter sparsexlab/torch-sla#41)
* `SparseTensor.requires_grad_` method delegation (0.3 refactor regression -- body moved to `ops.py` as a free function but never re-exposed on the class)
Fixed two packaging / regression issues on the torch-sla side (commits on the same `feat/gather-owned-to-global` branch):
* `pyproject.toml` had a hardcoded `packages = ["torch_sla", "torch_sla.backends"]` which silently dropped `torch_sla.distributed` and `torch_sla.sparse_tensor` from wheels built via `pip install git+...`. Switched to `packages.find include = ["torch_sla*"]`.
* `SparseTensor.requires_grad_` delegation in `core.py` next to the other ops-shims.
CI workflow now installs the dev branch over the PyPI release. Once torch-sla cuts 0.3.0, that line can go away and the `pyproject` pin can bump to `>=0.3.0`.
Test plan
* `tests/sparse/test_layout_signature.py` (6 cases)
* `tests/sparse/test_dsparsematrix.py` (5), `tests/distributed/test_distributed_assembler_decorator.py`, `tests/operator/test_condenser_dispatch.py`, existing `tests/distributed/test_distributed_assemble.py` (16) -- all pass locally
* `@distributed(LaplaceElementAssembler)` end-to-end max_diff = 1.78e-15 (machine epsilon)
* `test_quad.py::test_sum[3]` -- fp32 epsilon on Python 3.14 only; fails identically on `main`, CI runs 3.10/3.11/3.12)
Signed-off-by: walkerchi <walker.chi.000@gmail.com>
Follow-ups (not in this PR)
* `to_single` round-trip)
* `layout_signature` directly; remove `layout_hash` alias in next minor
🤖 Generated with Claude Code