fix(distributed): wire to torch-sla DSparseTensor API (post PR #31) #16
Closed
walkerchi wants to merge 5 commits into
torch-sla PR #31 removed the legacy DSparseTensor(values, row, col,
shape=..., num_partitions=..., coords=..., partition_method=...) ctor
in favour of the classmethod constructors (partition / from_local /
from_sparse_local). The old single-process simulator path is gone.
distributed_element_assemble was calling the dead ctor, so every
distributed assembly path errored out with TypeError on import.
Switched to ``DSparseTensor.partition(A_global, mesh, partition_method,
coords)`` via a small ``_build_dsparse`` helper:
* torch.distributed initialised -> real distributed-shard
DSparseTensor on the live world (one rank, one row-shard).
* Not initialised (single-process driver + multi-thread assembly,
or unit tests) -> mesh=None gives world=1; the DSparseTensor
holds the entire global matrix locally. matvec / solve still
compose with the rest of torch-sla.
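A minimal sketch of the branch the helper takes, under stated assumptions. The real `_build_dsparse` is not shown here, so `build_dsparse`, `fake_partition`, and the injected `is_initialized` / `make_mesh` callables are hypothetical stand-ins. In the real code they would presumably be `torch.distributed.is_initialized` and `DSparseTensor.partition`, called with the positional order quoted above.

```python
from typing import Any, Callable, Optional


def build_dsparse(
    A_global: Any,
    partition: Callable[..., Any],
    is_initialized: Callable[[], bool],
    make_mesh: Callable[[], Any],
    partition_method: str,
    coords: Optional[Any] = None,
) -> Any:
    """Pick the mesh for the current process layout, then partition."""
    # Live process group: build a mesh over the real world so each rank
    # ends up with one row-shard. Otherwise pass mesh=None (world of 1).
    mesh = make_mesh() if is_initialized() else None
    return partition(A_global, mesh, partition_method, coords)


# Stand-in for DSparseTensor.partition that only records its arguments.
def fake_partition(A, mesh, method, coords):
    return {"A": A, "mesh": mesh, "method": method, "coords": coords}


single = build_dsparse("A", fake_partition, lambda: False, lambda: "world-mesh", "metis")
multi = build_dsparse("A", fake_partition, lambda: True, lambda: "world-mesh", "metis")
```

Injecting the two callables keeps the sketch runnable without torch or torch-sla installed; it says nothing about how the real helper builds its mesh.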
Added test_multiproc_distributed_assemble_matvec_2procs: spawns 2
gloo ranks, each runs TensorMesh distributed assembly, builds a
DSparseTensor, runs matvec in Shard(0) space (x sliced to owned),
allgathers the result, compares to reference single-mesh assembly.
Relative error is 0 vs the reference on a triangulated rectangle mesh.
All 11 distributed tests pass.
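The check performed by the matvec test can be sketched in a single process. This is a pure-Python model, not the torch-sla implementation: it assumes contiguous row blocks per rank, stores A as dense lists, and replaces the halo exchange with a direct lookup into the owning rank's shard of x. All function names are hypothetical.

```python
def row_bounds(n_rows, world):
    """Contiguous (start, stop) row ranges, one per rank."""
    base, extra = divmod(n_rows, world)
    bounds, start = [], 0
    for rank in range(world):
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def owner_of(j, bounds):
    """Rank that owns global index j."""
    for rank, (start, stop) in enumerate(bounds):
        if start <= j < stop:
            return rank
    raise IndexError(j)


def shard_matvec(A, x_shards, bounds, rank):
    """y_owned = A[owned rows, :] @ x, reading x entries from their owners."""
    start, stop = bounds[rank]
    y_owned = []
    for i in range(start, stop):
        acc = 0.0
        for j, a in enumerate(A[i]):
            if a != 0.0:
                r = owner_of(j, bounds)  # r != rank means a halo entry
                acc += a * x_shards[r][j - bounds[r][0]]
        y_owned.append(acc)
    return y_owned


n, world = 5, 2
# Tridiagonal test matrix (2 on the diagonal, -1 next to it).
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

bounds = row_bounds(n, world)
x_shards = [x[s:e] for s, e in bounds]  # x sliced to owned entries

# "allgather": concatenate the owned results in rank order.
y_gathered = [v for r in range(world) for v in shard_matvec(A, x_shards, bounds, r)]
y_ref = [sum(a * xj for a, xj in zip(row, x)) for row in A]
```

Every product and partial sum in this example is an exactly representable float, so the gathered result matches the reference exactly.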
…t tests
The original PR only had a multi-proc *matvec* test. This commit adds
three real end-to-end tests that exercise the full distributed-solve
path, TensorMesh -> DSparseTensor -> cg_shard -> allgather -> compare:
* test_multiproc_distributed_solve_2procs: assemble Mass matrix on
2 gloo ranks, build b = M @ x_ref from a single-process reference,
run distributed CG, allgather x, verify x_dist == x_ref to 1e-6.
Mass instead of Laplace because Laplace has a constant null space
that distributed CG drifts along (separate issue).
* test_multiproc_distributed_solve_4procs: same with world=4 to
exercise more partitions + more halo edges.
* test_multiproc_poisson_dirichlet_2procs: full physics path on
2 ranks (assemble + Condenser for Dirichlet BCs + single-process
SparseMatrix solve via distributed_element_assemble_to_sparse,
compare to single-process reference). The to_sparse path returns
a tensormesh SparseMatrix, not a DSparseTensor -- this test
verifies the distributed assembly produces the same global matrix
as single-process even under multi-process orchestration.
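The reason for solving with Mass rather than Laplace can be seen on a small case. The sketch below assembles 1D linear (P1) element matrices by hand; it is an illustration only, not the TensorMesh assembly, and `assemble_1d` is a hypothetical helper. With no boundary conditions the stiffness matrix maps the constant vector to zero, so it is singular, while the mass matrix does not.

```python
def assemble_1d(n_elem, local):
    """Assemble a 2x2 element matrix over a 1D chain of n_elem elements."""
    n = n_elem + 1
    A = [[0.0] * n for _ in range(n)]
    for e in range(n_elem):
        for a in range(2):
            for b in range(2):
                A[e + a][e + b] += local[a][b]
    return A


n_elem = 4
h = 1.0 / n_elem

# P1 element matrices on an interval of length h.
K_local = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]  # stiffness (Laplace)
M_local = [[h / 3.0, h / 6.0], [h / 6.0, h / 3.0]]    # mass

K = assemble_1d(n_elem, K_local)
M = assemble_1d(n_elem, M_local)

# Apply both matrices to the constant vector of ones (row sums).
K_ones = [sum(row) for row in K]
M_ones = [sum(row) for row in M]
```

Here `K_ones` is identically zero, so `K x = b` only determines x up to an additive constant, whereas every entry of `M_ones` is positive and `M x = b` has a unique solution to compare against.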
Implementation note: the solve worker calls cg_shard directly (raw
tensor in / out) instead of the higher-level torch_sla.solve(D, b_dt)
wrapper. The wrapper expects DTensor[Shard(0)] for b; building one
from a manually-sliced owned tensor via DTensor.from_local works but
adds DTensor wrapping overhead unrelated to the test goal.
14/14 distributed tests pass.
The previous test_multiproc_poisson_dirichlet_2procs only exercised
distributed_element_assemble_to_sparse, which returns a SparseMatrix
(single-process), so it never touched the actual DSparseTensor
distributed solve path. This commit adds two tests that close that gap:
test_multiproc_poisson_dirichlet_dsparse_2procs
test_multiproc_poisson_dirichlet_dsparse_4procs
Pipeline per rank:
1. TensorMesh distributed assembly -> global Laplace K + load f.
2. Single-process Condenser strips Dirichlet DOFs -> K_inner, f_inner
(cheap at this size; the partitioning happens after).
3. Wrap K_inner as torch_sla.SparseTensor + DSparseTensor.partition
(partition_method=metis, since Condenser breaks the 1:1 mesh-point
mapping that RCB/coordinate methods would need).
4. Slice f_inner to this rank's owned rows.
5. Run distributed CG via cg_shard with M_apply=identity.
6. Allgather x_owned -> u_inner_global -> condenser.recover() -> full u.
7. Compare to single-process reference u: rel-err < 1e-6 required.
This is the actual integration deliverable: real PDE, real distributed
matrix, real distributed solver, end-to-end matches single-process.
16/16 distributed tests pass on CPU (2 new + 14 existing).
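Steps 2 and 5 to 7 of the pipeline can be sketched on a 1D Poisson problem in a single process. This is a hand-written stand-in: it does not use tensormesh's Condenser or torch-sla's cg_shard, it leaves out the partitioning and allgather steps, and it compares against the exact solution rather than a single-process reference. The problem (-u'' = 1 on (0, 1), u(0) = 0, u(1) = 1) is an assumption chosen because linear elements reproduce its nodal values exactly.

```python
def cg(matvec, b, tol=1e-12, max_iter=200):
    """Conjugate gradients with the identity as preconditioner."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 <= tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


n_elem = 8
n = n_elem + 1
h = 1.0 / n_elem

# Step 1 stand-in: global Laplace K and load f for -u'' = 1.
K = [[0.0] * n for _ in range(n)]
for e in range(n_elem):
    K[e][e] += 1.0 / h
    K[e][e + 1] -= 1.0 / h
    K[e + 1][e] -= 1.0 / h
    K[e + 1][e + 1] += 1.0 / h
f = [h] * n
f[0] = f[-1] = h / 2.0

# Step 2: strip Dirichlet DOFs, moving their known values to the right side.
bc = {0: 0.0, n - 1: 1.0}
inner = [i for i in range(n) if i not in bc]
K_inner = [[K[i][j] for j in inner] for i in inner]
f_inner = [f[i] - sum(K[i][j] * g for j, g in bc.items()) for i in inner]

# Step 5: CG on the condensed system.
u_inner = cg(lambda v: [sum(a * vj for a, vj in zip(row, v)) for row in K_inner],
             f_inner)

# Step 6: recover the full solution vector.
u = [0.0] * n
for i, val in zip(inner, u_inner):
    u[i] = val
for j, g in bc.items():
    u[j] = g

# Step 7: compare against the exact solution u(x) = 1.5 x - x^2 / 2.
exact = [1.5 * (i * h) - 0.5 * (i * h) ** 2 for i in range(n)]
max_err = max(abs(a - b) for a, b in zip(u, exact))
```

Condensing before partitioning, as the commit does, means the distributed solver only ever sees the symmetric positive definite inner system.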
The two end-to-end distributed solve tests called the internal cg_shard
primitive directly. Tests represent the user surface, so they should go
through the public torch_sla.solve API with iterative defaults scoped via
SolverConfig. Same dispatch path under the hood (SolverConfig -> solve ->
cg_shard for DSparseTensor+CG) but the example now reads like the
documented usage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
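The general pattern of scoping solver defaults with a context manager and dispatching from a public entry point can be sketched as follows. This is not torch-sla's SolverConfig or solve: every name here (`solver_defaults`, `solve`, the `method` and `tol` keys, and "direct" as the fallback) is hypothetical, and the returned tuples stand in for calls to actual solver primitives.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical ambient defaults; "direct" as the fallback is an assumption.
_DEFAULTS = ContextVar("solver_defaults", default={"method": "direct", "tol": None})


@contextmanager
def solver_defaults(**overrides):
    """Override solver defaults for the duration of a with-block."""
    token = _DEFAULTS.set({**_DEFAULTS.get(), **overrides})
    try:
        yield
    finally:
        _DEFAULTS.reset(token)  # restore whatever was active before


def solve(A, b):
    """Public entry point: read the scoped defaults, then dispatch."""
    cfg = _DEFAULTS.get()
    if cfg["method"] == "cg":
        return ("cg", cfg["tol"])  # stand-in for calling the CG primitive
    return ("direct", cfg["tol"])


outside = solve("A", "b")
with solver_defaults(method="cg", tol=1e-8):
    inside = solve("A", "b")
after = solve("A", "b")
```

The test body then reads as a plain `solve(A, b)` call, with the iterative settings confined to the enclosing block.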
…atter
Three sites in test_distributed_assemble.py each open-coded the same
pad -> 3x dist.all_gather -> Python for-loop scatter pattern (~14 lines
each). The torch-sla side just landed a vectorised public helper
(sparsexlab/torch-sla#41 -- gather_owned_to_global with one index_put_
over an all_gather_into_tensor buffer), so swap each site to a single
call. -44 lines, +6 lines, same semantics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
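The semantics being preserved can be modelled in one process. The sketch below is pure Python and does not call torch.distributed or the torch-sla helper, whose signature is not shown here. Which three quantities the original code gathered is an assumption (lengths, padded indices, padded values), and both function names are hypothetical.

```python
def pad(seq, length, fill):
    """Right-pad seq to a fixed length so every rank sends the same size."""
    return list(seq) + [fill] * (length - len(seq))


def gather_with_loops(owned_idx, owned_val, n_global):
    """Open-coded form: pad, gather three buffers, scatter rank by rank."""
    max_len = max(len(idx) for idx in owned_idx)
    lengths = [len(idx) for idx in owned_idx]              # gather 1
    idx_buf = [pad(idx, max_len, -1) for idx in owned_idx]   # gather 2
    val_buf = [pad(val, max_len, 0.0) for val in owned_val]  # gather 3
    out = [0.0] * n_global
    for rank, count in enumerate(lengths):
        for k in range(count):
            out[idx_buf[rank][k]] = val_buf[rank][k]
    return out


def gather_flat(owned_idx, owned_val, n_global):
    """Flat form: one buffer, one masked scatter over all entries."""
    max_len = max(len(idx) for idx in owned_idx)
    flat_idx = [i for idx in owned_idx for i in pad(idx, max_len, -1)]
    flat_val = [v for val in owned_val for v in pad(val, max_len, 0.0)]
    out = [0.0] * n_global
    for i, v in zip(flat_idx, flat_val):
        if i >= 0:  # padding slots carry index -1 and are skipped
            out[i] = v
    return out


# Three ranks owning different numbers of global entries.
owned_idx = [[0, 2, 5], [1, 3], [4]]
owned_val = [[10.0, 12.0, 15.0], [11.0, 13.0], [14.0]]

looped = gather_with_loops(owned_idx, owned_val, 6)
flat = gather_flat(owned_idx, owned_val, 6)
```

On tensors, the masked scatter in `gather_flat` is the kind of operation a single `index_put_` call can express, which is presumably what removes the Python loop in the helper.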
This was referenced Jun 15, 2026
Re-opens #14 against the new dev/core integration branch (the original dev branch was retired during the rename; GitHub auto-closed the PR and refused to reopen against a deleted base).
Summary
* Wires distributed assembly to the torch-sla DSparseTensor API surface (post sparsexlab/torch-sla#31, "feat(distributed): DSparseTensor shard-space stack + DSparseMatrix dissolution -- final", the split of DSparseMatrix -> DSparseTensor).
* Tests call solve(D, b) under a SolverConfig context instead of the internal cg_shard primitive (the user-facing dispatch path).
* Tests use torch_sla.distributed.gather_owned_to_global (sparsexlab/torch-sla#41, "feat(distributed): public gather_owned_to_global, vectorised scatter") in place of the open-coded pad + allgather + Python for-loop scatter.
Status
Multi-proc tests pass under Gloo (CPU, 16/16, ~30s on autodl 2x A100).
Test plan
pytest tests/distributed/test_distributed_assemble.py on autodl, Gloo backend, world=2 and world=4
(continuation of closed PR #14)