DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717
Draft
DLWoodruff wants to merge 15 commits into
Draft
DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717DLWoodruff wants to merge 15 commits into
DLWoodruff wants to merge 15 commits into
Conversation
A spurious shutdown is firing on every xhatter rank despite no rank
having written 1.0 to the SHUTDOWN buffer; replacing
Allreduce(LOR) with Allreduce(SUM) returns a stable-by-pattern
nonzero (~69), and the input local_val has been verified zero on
the xhatter ranks themselves. Four hypotheses remain:
(i) self.mpicomm has wider membership than the xhatter cylinder
(ii) buffer memory underneath local_val is not 0 when MPI reads it
(iii) the Allreduce reducer path is malfunctioning
(iv) duplicate rank participation in self.mpicomm
This patch packs four diagnostic axes into a single Allreduce call:
1. an Allgather of (world_rk, cyl_rk, local_int) - shows exactly
which world ranks participate and what each one contributed
2. parallel SUM, MAX, LOR reductions - MAX distinguishes "many
small contributions" from "few large ones," LOR confirms
observed call-site behavior
3. a rank-sum sanity reduction (each rank contributes its
mpicomm rank), expected to equal n*(n-1)/2; mismatch flags
a corrupt SUM reducer
4. a comparison between the Allgather-summed values and the
Allreduce(SUM) result; divergence isolates the bug to the
reducer path
Output is printed on cyl_rk == 0 with the call counter, class
name, host, pid, comm name, and world-rank min/max/count/unique;
on any anomaly it also lists nonzero rows and the full participant
list. One Allreduce call now does 4 reductions + 1 Allgather; cost
is dominated by run-launch overhead in the target environment, so
the extra collectives are acceptable.
Revert before merging to main.
Reading the output (greps):
# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head
# 2. What does each cylinder think its comm size is?
# (one printer per cylinder; size should be that cylinder's rank count)
# If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u
# 3. World-rank membership of each cylinder.
# count==unique is required; unique<count means duplicate participation (iv).
# The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log
# 4. Sanity check that SUM works at all on this comm.
# rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
# SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head
# 5. The main reduce result vs. the gather-truth.
# sum == gather_sum is required. Divergence pins the bug to the
# reducer path (gather honest, reduce wrong).
# max>1 means at least one rank contributed >1 (not boolean) -
# hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log
# 6. Which ranks actually contributed nonzero (only printed on anomaly).
# Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log
# 7. Full participant list (only printed on anomaly).
# Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #717 +/- ##
==========================================
- Coverage 72.82% 72.78% -0.04%
==========================================
Files 164 165 +1
Lines 20959 21108 +149
==========================================
+ Hits 15263 15364 +101
- Misses 5696 5744 +48 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Initial trigger fired on any nonzero reduce result, which caught legitimate shutdown signals (sum=lor=1, gather_sum=1, all consistent) and would flood logs on every cylinder finalization. Real-bug signature is invariant-violating, not just nonzero: - rank_sum != expected_rank_sum -> SUM broken on this comm - sum != gather_sum -> reducer disagrees with gather - unique != count -> duplicate world ranks - max > 1 -> non-boolean input on some rank Verified: 0 false positives across ~440k allreduce_or calls in a sizes 3-scen 3-rank xhatshuffle+lagrangian run (sizes_cylinders.py --num-scens 3 --xhatshuffle --lagrangian).
Parses a log containing [LOR_bug ...] blocks emitted by the instrumented SPBase.allreduce_or and writes a short per-hypothesis report (H1 wider membership, H2 buffer aliasing, H3 reducer malfunction, H4 duplicate ranks). Intended to live and die with this debug branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fault comm names - Example lines now include the values that tripped each detector (max/gather_sum for H2, sum/gather_sum for H3 reducer, rank_sum/ expected for H3 sanity, count/unique for H4) so the triager doesn't have to grep back into the raw log. - H1 emits a NOTE when any comm has a default/empty Get_name() value, since (cls, name) grouping can collapse distinct physical comms and spuriously trip the "varying size" check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These are real (mergeable) fixes, not diagnostics. Committed onto the LOR_bug branch so they can be validated on the cluster alongside the Pyomo#717 instrumentation; extract to a clean PR once confirmed. Candidate fix for the spurious-shutdown / tcache heap-corruption bug ("malloc(): unaligned tcache chunk detected") investigated via PR Pyomo#717. The H4 diagnostic showed zeroed slots in a freshly-allocated collective buffer with a clean reducer, pointing at recycled freed memory rather than a bad reduction. 1. SPWindow.free(): drop the numpy view (self.buff) before Win.Free(). self.buff aliases the window memory via window.tomemory(); freeing the window first leaves it dangling over released memory. 2. Deterministic MPI-memory teardown: communicator_array now returns the MPI.Alloc_mem handle; FieldArray stores it and gains free() (drops the numpy views, then MPI.Free_mem, idempotent); free_windows() frees every send/recv buffer before the window. This avoids MPI.Free_mem running after MPI_Finalize during GC at interpreter shutdown, the classic mpi4py heap-corruption-at-exit signature. 3. spin_the_wheel: add a fullcomm.Barrier() between hub_finalize() (which may issue RMA gets) and free_windows(). Verified locally: ruff clean; test_buffer_inspect (42), spwindow multisource (np=6) and partial_get (np=4) pass; FieldArray.free idempotency + alloc/free stress pass. Not yet validated end-to-end under valgrind/ASan (no solver in the dev env). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a guard region (8 doubles, sentinel -123456789.0) immediately after
each FieldArray's padded window region. The window/RMA view stays exactly
padded_len, so MPI transfers are unchanged, but any write that runs past a
field's end lands in the guard instead of silently corrupting the adjacent
glibc tcache.
The guard is checked:
- right after every window.put (put_send_buffer) and window.get
(get_receive_buffer), to pin the exact RMA that breaches;
- once per iteration at the top of Spoke.got_kill_signal (before
allreduce_or, whose allocations otherwise trip the already-corrupted
heap), via SPCommunicator.check_buffer_canaries().
A breach prints "[LOR_bug CANARY BREACH] ... field=... origin=..." naming
the over-written field/rank/iteration at the moment of corruption, rather
than at a later unrelated malloc (which is all faulthandler/MALLOC_CHECK_
have been able to show: detection != cause).
Goal: localize the mid-run heap-corruption that surfaces as the tcache
abort inside allreduce_or (run 3, rank 85, XhatShuffle, mid-run).
test_buffer_inspect spoke stub updated to bind the new methods. Verified:
ruff clean; test_buffer_inspect 42 pass; spwindow multisource (np=6) and
partial_get (np=4) pass; canary trips on a 1-past-end over-write and stays
clean on a legit full-padded write.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
free_windows() (added with the deterministic MPI teardown) calls FieldArray.free(), which set self._array = None. But spokes keep a _bound reference to that FieldArray, and callers read wheel.spcomm.bound after WheelSpinner.spin() returns -- e.g. test_with_cylinders, test_cg_with_cylinders. The null _array made spcomm.bound raise "TypeError: 'NoneType' object is not subscriptable", failing the cylinder and column-generation CI jobs. Copy the logical view onto the regular Python heap before MPI.Free_mem instead of nulling it. MPI-allocated memory is still released deterministically before MPI_Finalize (the teardown's purpose), while post-spin reads of the final values keep working. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude and the cluster experiments run on different machines; print the short git hash (with -dirty flag) of the running checkout via global_toc so output can be matched to a version. Remove with the rest of the LOR_bug instrumentation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
watson61 can only run what is in the PR, so the control path (plain Allreduce(LOR), no instrumentation) now lives in allreduce_or behind an env var. Set MPISPPY_LOR_CONTROL=1 in the sbatch to run the control; unset leaves the existing instrumentation untouched. Lets one branch produce both the instrumented and control data points without re-pushing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Env-gated (MPISPPY_LOR_HEAP_PROBE=1) glibc heap walk at the phase boundaries the XhatShuffle spoke crosses before the iter-1 gurobi set_objective abort (slurm-6278803 control == 6278806 instrumented: both "unaligned tcache chunk detected"). heap_probe() bursts-drains every tcache/fastbin size class then malloc_trims, so an already- poisoned chunk aborts AT the probe with the same signature; the last "[LOR_bug HEAP PROBE OK]" marker before the abort localizes the corrupting phase. Validated: clean heap -> OK; poisoned tcache entry -> reproduces the exact "malloc(): unaligned tcache chunk detected". Probes: after-opt-create (scenario creation + SPFBBT), after-make- windows (RMA alloc), after-setup-hub, pre-main, and in the spoke loop main-enter / post-update_nonants / pre-try_scenario_dict. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The git_commit_hash() announce (22ee393) used comm_world.rank, which runs before the apportion_ranks ValueError in run(); test_flexible_rank _ratios' _StubComm provides Get_rank() but no .rank property, so the announce raised AttributeError before the expected ValueError. Use Get_rank() to match the stub and mpi4py. Pre-existing CI failure, not from the heap probes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Do not merge. Diagnostic patch for a spurious-shutdown bug under investigation. Revert before merging anything else from this branch.
How to use
Run your case with this branch and capture stdout to a file, then summarize with the bundled helper:
The script parses every
[LOR_bug ...]block emitted by the instrumentedSPBase.allreduce_orand prints:(cls, mpicomm name)),A clean run prints
Clean log: no anomalies on any of the four hypotheses.(or notes that nonzero results were observed but consistent with legitimate shutdown signals). Any other verdict is the lead to chase.What this does
Replaces
SPBase.allreduce_orwith an instrumented variant that, on every call, does four collectives instead of one and dumps a one-block diagnostic to stdout fromcyl_rk == 0of the cylinder'smpicomm.Originally the function was:
Now each call also does:
(world_rk, cyl_rk, local_int)— shows exactly which world ranks participate and what each one packed.mpicommrank; expected sum isn*(n-1)/2. Mismatch flags a corrupt SUM reducer.gather_sum(sum of locally-reported values from the Allgather) vsreduce_sum(the Allreduce SUM). Divergence isolates the bug to the reducer path.Hypotheses being tested (in priority order)
self.mpicommhas wider membership than the cylinder it should — i.e., includes ranks from another cylinder. Detected ifmpicomm sizeexceeds the cylinder's rank count, or ifworld_ranksincludes ranks outside the cylinder's slice.local_valis not 0 when MPI reads it. Detected ifgather_sum > 0(some rank reports nonzero) ANDmax > 0(the rank's value was nonzero from MPI's perspective).gather_sum == 0butreduce_sum != 0, OR ifrank_sum != expected_rank_sum.self.mpicomm. Detected ifworld_ranks: unique < count.Reading the output
Operator-friendly greps:
Cost / overhead
One Allreduce call now does 4 reductions + 1 Allgather. In the target environment the run-launch overhead dominates the per-call cost, so this is acceptable. Output is only printed on
cyl_rk == 0(one line block per cylinder per call), so log volume is bounded.Control run (
MPISPPY_LOR_CONTROL)SPBase.allreduce_orhas an env toggle so this single branch produces boththe instrumented and the control data point (no re-push needed):
Allreduce(LOR), no extracollectives, no per-call numpy allocations. Canary guards and the buffer/window
teardown fixes are unaffected either way.
Run the same case twice (toggle on vs. off) and compare:
collective volume; real but volume-dependent. Next: shrink and run ASan/valgrind.
Allgather) → the bug is independent of the diagnostic; go straight at the RMA
window / teardown code.
Reproduction notes
PYTHONMALLOC=debug(no valgrind) reproduces the failure deterministically atIter 0 at 150 ranks. The instrumented run aborts inside
got_kill_signal -> allreduce_or -> Allgatherwith MVAPICH2MPIC_Sendrecv: Negative countfromMPIR_Allgather_RD_MV2— MPI's internal count arithmetic went negative, i.e.its internal heap was already corrupted. The per-iteration field-buffer canary
did not fire, so the OOB write lands in MPI-internal allocations, outside
the guarded field buffers.
(
_make_comms) with libpsm2psm_ep_connect ... Operation timed out, becausevalgrind's slowdown blows past PSM connection timeouts. Use
PYTHONMALLOC=debug,or shrink to the smallest repro before adding valgrind/ASan (and set
PYTHONMALLOC=mallocunder valgrind).Followups
🤖 Generated with Claude Code