DEBUG: instrument SPBase.allreduce_or to localize LOR_bug by DLWoodruff · Pull Request #717 · Pyomo/mpi-sppy

DLWoodruff · 2026-05-20T23:06:06Z

Do not merge. Diagnostic patch for a spurious-shutdown bug under investigation. Revert before merging anything else from this branch.

How to use

Run your case with this branch and capture stdout to a file, then summarize with the bundled helper:

python lor_bug_report.py <stdout_file>

The script parses every [LOR_bug ...] block emitted by the instrumented SPBase.allreduce_or and prints:

a per-comm summary (one row per (cls, mpicomm name)),
a section per hypothesis (H1–H4) with counts and a few example offending calls,
a one-line verdict naming which hypotheses (if any) were triggered.

A clean run prints "Clean log: no anomalies on any of the four hypotheses." (or notes that nonzero results were observed but are consistent with legitimate shutdown signals). Any other verdict is the lead to chase.

What this does

Replaces SPBase.allreduce_or with an instrumented variant that, on every call, does five collectives (four reductions plus one Allgather) instead of one and dumps a one-block diagnostic to stdout from cyl_rk == 0 of the cylinder's mpicomm.

Originally the function was:

def allreduce_or(self, val):
    local_val = np.array([val], dtype='int8')
    global_val = np.zeros(1, dtype='int8')
    self.mpicomm.Allreduce(local_val, global_val, op=MPI.LOR)
    return bool(global_val[0])

Now each call also does:

Allgather of (world_rk, cyl_rk, local_int) — shows exactly which world ranks participate and what each one packed.
SUM, MAX, LOR reductions in parallel — MAX distinguishes "many small" from "few large"; LOR confirms call-site behavior.
Rank-sum sanity — each rank contributes its mpicomm rank; expected sum is n*(n-1)/2. Mismatch flags a corrupt SUM reducer.
Cross-check: gather_sum (sum of locally-reported values from the Allgather) vs reduce_sum (the Allreduce SUM). Divergence isolates the bug to the reducer path.
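
As a rough illustration of what those collectives compute, the sketch below simulates them in plain Python over a list of per-rank values, with no MPI involved. The function name simulate_diagnostics, its arguments, and the dictionary keys are hypothetical and only mirror the names used in this description; the actual patch runs mpi4py collectives on self.mpicomm.

```python
def simulate_diagnostics(local_ints, world_ranks):
    """Mimic the instrumented collectives for one call, without MPI.

    local_ints[i] is the value cylinder rank i packs; world_ranks[i] is
    that rank's world rank. Both layouts are assumptions for illustration.
    """
    n = len(local_ints)
    # Allgather stand-in: one (world_rk, cyl_rk, local_int) row per rank.
    rows = [(world_ranks[i], i, local_ints[i]) for i in range(n)]
    return {
        "rows": rows,
        "sum": sum(local_ints),                     # Allreduce(SUM)
        "max": max(local_ints),                     # Allreduce(MAX)
        "lor": any(v != 0 for v in local_ints),     # Allreduce(LOR)
        "rank_sum": sum(range(n)),                  # each rank contributes its rank
        "expected_rank_sum": n * (n - 1) // 2,      # closed form for 0 + 1 + ... + (n-1)
        "gather_sum": sum(row[2] for row in rows),  # sum of gathered local values
    }
```

In a healthy run the simulated "sum" and "gather_sum" agree by construction; the instrumentation exists because, on the real communicator, they might not.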

Hypotheses being tested (in priority order)

H1 (i): self.mpicomm has wider membership than the cylinder it should cover, i.e., it includes ranks from another cylinder. Detected if mpicomm size exceeds the cylinder's rank count, or if world_ranks includes ranks outside the cylinder's slice.
H2 (ii): Buffer memory underneath local_val is not 0 when MPI reads it. Detected if gather_sum > 0 (some rank reports nonzero) AND max > 0 (the rank's value was nonzero from MPI's perspective).
H3 (iii): The Allreduce reducer path is malfunctioning. Detected if gather_sum == 0 but reduce_sum != 0, OR if rank_sum != expected_rank_sum.
H4 (iv): Duplicate rank participation in self.mpicomm. Detected if world_ranks: unique < count.
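
The two membership checks (wider membership and duplicate participation) can be sketched as a small pure-Python function. This is a minimal sketch, not the helper's actual code: the name membership_findings and the idea of passing the expected cylinder slice as an argument are assumptions made for illustration.

```python
def membership_findings(world_ranks, cylinder_slice):
    """Return which membership hypotheses a participant list triggers.

    world_ranks: world rank reported by each participant in the gather.
    cylinder_slice: world ranks that should make up this cylinder
    (hypothetical input; the real patch may derive this differently).
    """
    findings = []
    expected = set(cylinder_slice)
    seen = set(world_ranks)
    # Wider membership: comm larger than the cylinder, or a rank from
    # outside the cylinder's slice takes part.
    if len(world_ranks) > len(expected) or not seen <= expected:
        findings.append("H1")
    # Duplicate participation: unique < count.
    if len(seen) < len(world_ranks):
        findings.append("H4")
    return findings
```

For example, a three-rank cylinder occupying world ranks 3..5 yields no findings for [3, 4, 5], "H1" for [3, 4, 5, 6], and "H4" for [3, 4, 4].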

Reading the output

Operator-friendly greps:

# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head

# 2. What does each cylinder think its comm size is?
#    (one printer per cylinder; size should be that cylinder's rank count)
#    If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u

# 3. World-rank membership of each cylinder.
#    count==unique is required; unique<count means duplicate participation (iv).
#    The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log

# 4. Sanity check that SUM works at all on this comm.
#    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
#    SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head

# 5. The main reduce result vs. the gather-truth.
#    sum == gather_sum is required. Divergence pins the bug to the
#    reducer path (gather honest, reduce wrong).
#    max>1 means at least one rank contributed >1 (not boolean) -
#    hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log

# 6. Which ranks actually contributed nonzero (only printed on anomaly).
#    Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log

# 7. Full participant list (only printed on anomaly).
#    Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log
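
For anyone who prefers Python to grep, check 2 can be approximated as follows. This is a sketch under one assumption: that the integer follows "mpicomm size=" immediately, as the size=150 example above suggests. The function name distinct_comm_sizes is hypothetical.

```python
import re

# Assumed log shape: the literal text "mpicomm size=" followed by an integer.
SIZE_RE = re.compile(r"mpicomm size=(\d+)")


def distinct_comm_sizes(lines):
    """Return the sorted distinct comm sizes found in an iterable of log lines."""
    sizes = set()
    for line in lines:
        match = SIZE_RE.search(line)
        if match:
            sizes.add(int(match.group(1)))
    return sorted(sizes)
```

Usage would be along the lines of opening out.log and passing the file object to distinct_comm_sizes; a size equal to the full world size would point at the wider-membership hypothesis.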

Cost / overhead

One allreduce_or call now does 4 reductions + 1 Allgather. In the target environment the run-launch overhead dominates the per-call cost, so this is acceptable. Output is printed only on cyl_rk == 0 (one block per cylinder per call), so log volume scales with the number of cylinders rather than the number of ranks.

Control run (`MPISPPY_LOR_CONTROL`)

SPBase.allreduce_or has an env toggle so this single branch produces both
the instrumented and the control data point (no re-push needed):

export MPISPPY_LOR_CONTROL=1   # in the sbatch, alongside PYTHONMALLOC=debug

unset (default): the instrumented path above (4 reductions + 1 Allgather per call).
set: the original minimal path only — a single Allreduce(LOR), no extra
collectives, no per-call numpy allocations. Canary guards and the buffer/window
teardown fixes are unaffected either way.
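
A toggle of this kind might be read as in the sketch below. How the patch actually tests the variable is not shown here, so treating any non-empty value as "set" is an assumption, and the function name use_control_path is hypothetical.

```python
import os


def use_control_path(environ=None):
    """True when MPISPPY_LOR_CONTROL holds a non-empty value.

    Assumption: any non-empty value selects the control path; the real
    patch may require a specific value such as "1".
    """
    env = os.environ if environ is None else environ
    return bool(env.get("MPISPPY_LOR_CONTROL", ""))
```

Passing the environment as an argument keeps the check testable without mutating os.environ.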

Run the same case twice (toggle on vs. off) and compare:

control survives Iter 0 → the corruption tracks the instrumentation's extra
collective volume; real but volume-dependent. Next: shrink and run ASan/valgrind.
control also aborts at Iter 0 (in the real RMA path, not the diagnostic
Allgather) → the bug is independent of the diagnostic; go straight at the RMA
window / teardown code.

Reproduction notes

PYTHONMALLOC=debug (no valgrind) reproduces the failure deterministically at
Iter 0 at 150 ranks. The instrumented run aborts inside got_kill_signal -> allreduce_or -> Allgather with MVAPICH2 MPIC_Sendrecv: Negative count from
MPIR_Allgather_RD_MV2 — MPI's internal count arithmetic went negative, i.e.
its internal heap was already corrupted. The per-iteration field-buffer canary
did not fire, so the OOB write lands in MPI-internal allocations, outside
the guarded field buffers.
valgrind at 150 ranks does not work: every rank aborts during comm setup
(_make_comms) with libpsm2 psm_ep_connect ... Operation timed out, because
valgrind's slowdown blows past PSM connection timeouts. Use PYTHONMALLOC=debug,
or shrink to the smallest repro before adding valgrind/ASan (and set
PYTHONMALLOC=malloc under valgrind).

Followups

Once the hypothesis is pinned, fix in a separate PR.
Revert this file change before any non-diagnostic merge to main.

🤖 Generated with Claude Code

A spurious shutdown is firing on every xhatter rank despite no rank having written 1.0 to the SHUTDOWN buffer; replacing Allreduce(LOR) with Allreduce(SUM) returns a stable-by-pattern nonzero (~69), and the input local_val has been verified zero on the xhatter ranks themselves.

Four hypotheses remain:

(i) self.mpicomm has wider membership than the xhatter cylinder
(ii) buffer memory underneath local_val is not 0 when MPI reads it
(iii) the Allreduce reducer path is malfunctioning
(iv) duplicate rank participation in self.mpicomm

This patch packs four diagnostic axes into a single Allreduce call:

1. an Allgather of (world_rk, cyl_rk, local_int) - shows exactly which world ranks participate and what each one contributed
2. parallel SUM, MAX, LOR reductions - MAX distinguishes "many small contributions" from "few large ones," LOR confirms observed call-site behavior
3. a rank-sum sanity reduction (each rank contributes its mpicomm rank), expected to equal n*(n-1)/2; mismatch flags a corrupt SUM reducer
4. a comparison between the Allgather-summed values and the Allreduce(SUM) result; divergence isolates the bug to the reducer path

Output is printed on cyl_rk == 0 with the call counter, class name, host, pid, comm name, and world-rank min/max/count/unique; on any anomaly it also lists nonzero rows and the full participant list.

One Allreduce call now does 4 reductions + 1 Allgather; cost is dominated by run-launch overhead in the target environment, so the extra collectives are acceptable. Revert before merging to main.

Reading the output (greps):

# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head

# 2. What does each cylinder think its comm size is?
#    (one printer per cylinder; size should be that cylinder's rank count)
#    If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u

# 3. World-rank membership of each cylinder.
#    count==unique is required; unique<count means duplicate participation (iv).
#    The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log

# 4. Sanity check that SUM works at all on this comm.
#    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
#    SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head

# 5. The main reduce result vs. the gather-truth.
#    sum == gather_sum is required. Divergence pins the bug to the
#    reducer path (gather honest, reduce wrong).
#    max>1 means at least one rank contributed >1 (not boolean) -
#    hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log

# 6. Which ranks actually contributed nonzero (only printed on anomaly).
#    Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log

# 7. Full participant list (only printed on anomaly).
#    Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-20T23:16:22Z

Codecov Report

❌ Patch coverage is 71.25000% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.78%. Comparing base (ade8219) to head (a8de453).

Files with missing lines	Patch %	Lines
mpisppy/debug_utils/heap_probe.py	36.36%	21 Missing ⚠️
mpisppy/spbase.py	78.57%	12 Missing ⚠️
mpisppy/cylinders/spcommunicator.py	76.59%	11 Missing ⚠️
mpisppy/__init__.py	80.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
- Coverage   72.82%   72.78%   -0.04%     
==========================================
  Files         164      165       +1     
  Lines       20959    21108     +149     
==========================================
+ Hits        15263    15364     +101     
- Misses       5696     5744      +48


Initial trigger fired on any nonzero reduce result, which caught legitimate shutdown signals (sum=lor=1, gather_sum=1, all consistent) and would flood logs on every cylinder finalization. Real-bug signature is invariant-violating, not just nonzero:

- rank_sum != expected_rank_sum -> SUM broken on this comm
- sum != gather_sum -> reducer disagrees with gather
- unique != count -> duplicate world ranks
- max > 1 -> non-boolean input on some rank

Verified: 0 false positives across ~440k allreduce_or calls in a sizes 3-scen 3-rank xhatshuffle+lagrangian run (sizes_cylinders.py --num-scens 3 --xhatshuffle --lagrangian).
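
The four invariants in this commit message can be written down as a single check. The sketch below is illustrative only: the function name violated_invariants and its argument list are hypothetical, and the returned strings simply reuse the wording of the commit message.

```python
def violated_invariants(rank_sum, expected_rank_sum, reduce_sum,
                        gather_sum, count, unique, reduce_max):
    """Return the invariant violations for one allreduce_or call.

    An empty list means the call looks consistent, even if the reduce
    result is nonzero (for example a legitimate shutdown signal).
    """
    violations = []
    if rank_sum != expected_rank_sum:
        violations.append("SUM broken on this comm")
    if reduce_sum != gather_sum:
        violations.append("reducer disagrees with gather")
    if unique != count:
        violations.append("duplicate world ranks")
    if reduce_max > 1:
        violations.append("non-boolean input on some rank")
    return violations
```

A legitimate shutdown on three ranks (sum=1, gather_sum=1, max=1) returns an empty list, while a reduce sum that disagrees with a gather sum of 0 is reported as a reducer disagreement.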

Parses a log containing [LOR_bug ...] blocks emitted by the instrumented SPBase.allreduce_or and writes a short per-hypothesis report (H1 wider membership, H2 buffer aliasing, H3 reducer malfunction, H4 duplicate ranks). Intended to live and die with this debug branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fault comm names

- Example lines now include the values that tripped each detector (max/gather_sum for H2, sum/gather_sum for H3 reducer, rank_sum/expected for H3 sanity, count/unique for H4) so the triager doesn't have to grep back into the raw log.
- H1 emits a NOTE when any comm has a default/empty Get_name() value, since (cls, name) grouping can collapse distinct physical comms and spuriously trip the "varying size" check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These are real (mergeable) fixes, not diagnostics. Committed onto the LOR_bug branch so they can be validated on the cluster alongside the Pyomo#717 instrumentation; extract to a clean PR once confirmed.

Candidate fix for the spurious-shutdown / tcache heap-corruption bug ("malloc(): unaligned tcache chunk detected") investigated via PR Pyomo#717. The H4 diagnostic showed zeroed slots in a freshly-allocated collective buffer with a clean reducer, pointing at recycled freed memory rather than a bad reduction.

1. SPWindow.free(): drop the numpy view (self.buff) before Win.Free(). self.buff aliases the window memory via window.tomemory(); freeing the window first leaves it dangling over released memory.
2. Deterministic MPI-memory teardown: communicator_array now returns the MPI.Alloc_mem handle; FieldArray stores it and gains free() (drops the numpy views, then MPI.Free_mem, idempotent); free_windows() frees every send/recv buffer before the window. This avoids MPI.Free_mem running after MPI_Finalize during GC at interpreter shutdown, the classic mpi4py heap-corruption-at-exit signature.
3. spin_the_wheel: add a fullcomm.Barrier() between hub_finalize() (which may issue RMA gets) and free_windows().

Verified locally: ruff clean; test_buffer_inspect (42), spwindow multisource (np=6) and partial_get (np=4) pass; FieldArray.free idempotency + alloc/free stress pass. Not yet validated end-to-end under valgrind/ASan (no solver in the dev env).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds a guard region (8 doubles, sentinel -123456789.0) immediately after each FieldArray's padded window region. The window/RMA view stays exactly padded_len, so MPI transfers are unchanged, but any write that runs past a field's end lands in the guard instead of silently corrupting the adjacent glibc tcache.

The guard is checked:

- right after every window.put (put_send_buffer) and window.get (get_receive_buffer), to pin the exact RMA that breaches;
- once per iteration at the top of Spoke.got_kill_signal (before allreduce_or, whose allocations otherwise trip the already-corrupted heap), via SPCommunicator.check_buffer_canaries().

A breach prints "[LOR_bug CANARY BREACH] ... field=... origin=..." naming the over-written field/rank/iteration at the moment of corruption, rather than at a later unrelated malloc (which is all faulthandler/MALLOC_CHECK_ have been able to show: detection != cause).

Goal: localize the mid-run heap-corruption that surfaces as the tcache abort inside allreduce_or (run 3, rank 85, XhatShuffle, mid-run).

test_buffer_inspect spoke stub updated to bind the new methods.

Verified: ruff clean; test_buffer_inspect 42 pass; spwindow multisource (np=6) and partial_get (np=4) pass; canary trips on a 1-past-end over-write and stays clean on a legit full-padded write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
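
The guard-region idea can be shown in isolation with the standard-library array module. This is a minimal sketch and not the FieldArray code: the function names alloc_guarded and guard_breached are hypothetical, and a plain Python array stands in for MPI-allocated memory. Only the guard length (8 doubles) and the sentinel value come from the commit message.

```python
from array import array

GUARD_LEN = 8              # doubles in the guard region (from the commit message)
SENTINEL = -123456789.0    # guard fill value (from the commit message)


def alloc_guarded(padded_len):
    """Allocate padded_len doubles followed by a sentinel-filled guard."""
    return array("d", [0.0] * padded_len + [SENTINEL] * GUARD_LEN)


def guard_breached(buf, padded_len):
    """True if any slot past padded_len no longer holds the sentinel."""
    return any(x != SENTINEL for x in buf[padded_len:])
```

Filling all padded_len slots leaves the guard intact, while a single write at index padded_len (one past the end of the field) is reported as a breach, matching the two cases the commit message says were verified.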

free_windows() (added with the deterministic MPI teardown) calls FieldArray.free(), which set self._array = None. But spokes keep a _bound reference to that FieldArray, and callers read wheel.spcomm.bound after WheelSpinner.spin() returns -- e.g. test_with_cylinders, test_cg_with_cylinders. The null _array made spcomm.bound raise "TypeError: 'NoneType' object is not subscriptable", failing the cylinder and column-generation CI jobs. Copy the logical view onto the regular Python heap before MPI.Free_mem instead of nulling it. MPI-allocated memory is still released deterministically before MPI_Finalize (the teardown's purpose), while post-spin reads of the final values keep working. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Claude and the cluster experiments run on different machines; print the short git hash (with -dirty flag) of the running checkout via global_toc so output can be matched to a version. Remove with the rest of the LOR_bug instrumentation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

watson61 can only run what is in the PR, so the control path (plain Allreduce(LOR), no instrumentation) now lives in allreduce_or behind an env var. Set MPISPPY_LOR_CONTROL=1 in the sbatch to run the control; unset leaves the existing instrumentation untouched. Lets one branch produce both the instrumented and control data points without re-pushing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Env-gated (MPISPPY_LOR_HEAP_PROBE=1) glibc heap walk at the phase boundaries the XhatShuffle spoke crosses before the iter-1 gurobi set_objective abort (slurm-6278803 control == 6278806 instrumented: both "unaligned tcache chunk detected"). heap_probe() bursts-drains every tcache/fastbin size class then malloc_trims, so an already-poisoned chunk aborts AT the probe with the same signature; the last "[LOR_bug HEAP PROBE OK]" marker before the abort localizes the corrupting phase. Validated: clean heap -> OK; poisoned tcache entry -> reproduces the exact "malloc(): unaligned tcache chunk detected". Probes: after-opt-create (scenario creation + SPFBBT), after-make-windows (RMA alloc), after-setup-hub, pre-main, and in the spoke loop main-enter / post-update_nonants / pre-try_scenario_dict. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The git_commit_hash() announce (22ee393) used comm_world.rank, which runs before the apportion_ranks ValueError in run(); test_flexible_rank_ratios' _StubComm provides Get_rank() but no .rank property, so the announce raised AttributeError before the expected ValueError. Use Get_rank() to match the stub and mpi4py. Pre-existing CI failure, not from the heap probes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DLWoodruff and others added 14 commits May 20, 2026 16:21

Merge remote-tracking branch 'upstream/main' into LOR_bug

9475da1

DEBUG(LOR_bug): rename log_file -> stdout_file in usage text

6578aef

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'main' into LOR_bug

ad0ae57

Merge remote-tracking branch 'upstream/main' into LOR_bug

c81d9f1

DLWoodruff wants to merge 15 commits into Pyomo:main from DLWoodruff:LOR_bug
