Skip to content

DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717

Draft
DLWoodruff wants to merge 15 commits into
Pyomo:mainfrom
DLWoodruff:LOR_bug
Draft

DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717
DLWoodruff wants to merge 15 commits into
Pyomo:mainfrom
DLWoodruff:LOR_bug

Conversation

@DLWoodruff

@DLWoodruff DLWoodruff commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Do not merge. Diagnostic patch for a spurious-shutdown bug under investigation. Revert before merging anything else from this branch.

How to use

Run your case with this branch and capture stdout to a file, then summarize with the bundled helper:

python lor_bug_report.py <stdout_file>

The script parses every [LOR_bug ...] block emitted by the instrumented SPBase.allreduce_or and prints:

  • a per-comm summary (one row per (cls, mpicomm name)),
  • a section per hypothesis (H1–H4) with counts and a few example offending calls,
  • a one-line verdict naming which hypotheses (if any) were triggered.

A clean run prints Clean log: no anomalies on any of the four hypotheses. (or notes that nonzero results were observed but consistent with legitimate shutdown signals). Any other verdict is the lead to chase.

What this does

Replaces SPBase.allreduce_or with an instrumented variant that, on every call, does four collectives instead of one and dumps a one-block diagnostic to stdout from cyl_rk == 0 of the cylinder's mpicomm.

Originally the function was:

def allreduce_or(self, val):
    local_val = np.array([val], dtype='int8')
    global_val = np.zeros(1, dtype='int8')
    self.mpicomm.Allreduce(local_val, global_val, op=MPI.LOR)
    return bool(global_val[0])

Now each call also does:

  1. Allgather of (world_rk, cyl_rk, local_int) — shows exactly which world ranks participate and what each one packed.
  2. SUM, MAX, LOR reductions in parallel — MAX distinguishes "many small" from "few large"; LOR confirms call-site behavior.
  3. Rank-sum sanity — each rank contributes its mpicomm rank; expected sum is n*(n-1)/2. Mismatch flags a corrupt SUM reducer.
  4. Cross-check: gather_sum (sum of locally-reported values from the Allgather) vs reduce_sum (the Allreduce SUM). Divergence isolates the bug to the reducer path.

Hypotheses being tested (in priority order)

  1. self.mpicomm has wider membership than the cylinder it should — i.e., includes ranks from another cylinder. Detected if mpicomm size exceeds the cylinder's rank count, or if world_ranks includes ranks outside the cylinder's slice.
  2. Buffer memory underneath local_val is not 0 when MPI reads it. Detected if gather_sum > 0 (some rank reports nonzero) AND max > 0 (the rank's value was nonzero from MPI's perspective).
  3. The Allreduce reducer path is malfunctioning. Detected if gather_sum == 0 but reduce_sum != 0, OR if rank_sum != expected_rank_sum.
  4. Duplicate rank participation in self.mpicomm. Detected if world_ranks: unique < count.

Reading the output

Operator-friendly greps:

# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head

# 2. What does each cylinder think its comm size is?
#    (one printer per cylinder; size should be that cylinder's rank count)
#    If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u

# 3. World-rank membership of each cylinder.
#    count==unique is required; unique<count means duplicate participation (iv).
#    The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log

# 4. Sanity check that SUM works at all on this comm.
#    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
#    SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head

# 5. The main reduce result vs. the gather-truth.
#    sum == gather_sum is required. Divergence pins the bug to the
#    reducer path (gather honest, reduce wrong).
#    max>1 means at least one rank contributed >1 (not boolean) -
#    hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log

# 6. Which ranks actually contributed nonzero (only printed on anomaly).
#    Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log

# 7. Full participant list (only printed on anomaly).
#    Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log

Cost / overhead

One Allreduce call now does 4 reductions + 1 Allgather. In the target environment the run-launch overhead dominates the per-call cost, so this is acceptable. Output is only printed on cyl_rk == 0 (one line block per cylinder per call), so log volume is bounded.

Control run (MPISPPY_LOR_CONTROL)

SPBase.allreduce_or has an env toggle so this single branch produces both
the instrumented and the control data point (no re-push needed):

export MPISPPY_LOR_CONTROL=1   # in the sbatch, alongside PYTHONMALLOC=debug
  • unset (default): the instrumented path above (4 reductions + 1 Allgather per call).
  • set: the original minimal path only — a single Allreduce(LOR), no extra
    collectives, no per-call numpy allocations. Canary guards and the buffer/window
    teardown fixes are unaffected either way.

Run the same case twice (toggle on vs. off) and compare:

  • control survives Iter 0 → the corruption tracks the instrumentation's extra
    collective volume; real but volume-dependent. Next: shrink and run ASan/valgrind.
  • control also aborts at Iter 0 (in the real RMA path, not the diagnostic
    Allgather) → the bug is independent of the diagnostic; go straight at the RMA
    window / teardown code.

Reproduction notes

  • PYTHONMALLOC=debug (no valgrind) reproduces the failure deterministically at
    Iter 0 at 150 ranks. The instrumented run aborts inside got_kill_signal -> allreduce_or -> Allgather with MVAPICH2 MPIC_Sendrecv: Negative count from
    MPIR_Allgather_RD_MV2 — MPI's internal count arithmetic went negative, i.e.
    its internal heap was already corrupted. The per-iteration field-buffer canary
    did not fire, so the OOB write lands in MPI-internal allocations, outside
    the guarded field buffers.
  • valgrind at 150 ranks does not work: every rank aborts during comm setup
    (_make_comms) with libpsm2 psm_ep_connect ... Operation timed out, because
    valgrind's slowdown blows past PSM connection timeouts. Use PYTHONMALLOC=debug,
    or shrink to the smallest repro before adding valgrind/ASan (and set
    PYTHONMALLOC=malloc under valgrind).

Followups

  • Once the hypothesis is pinned, fix in a separate PR.
  • Revert this file change before any non-diagnostic merge to main.

🤖 Generated with Claude Code

A spurious shutdown is firing on every xhatter rank despite no rank
having written 1.0 to the SHUTDOWN buffer; replacing
Allreduce(LOR) with Allreduce(SUM) returns a stable-by-pattern
nonzero (~69), and the input local_val has been verified zero on
the xhatter ranks themselves. Four hypotheses remain:

  (i)   self.mpicomm has wider membership than the xhatter cylinder
  (ii)  buffer memory underneath local_val is not 0 when MPI reads it
  (iii) the Allreduce reducer path is malfunctioning
  (iv)  duplicate rank participation in self.mpicomm

This patch packs four diagnostic axes into a single Allreduce call:

  1. an Allgather of (world_rk, cyl_rk, local_int) - shows exactly
     which world ranks participate and what each one contributed
  2. parallel SUM, MAX, LOR reductions - MAX distinguishes "many
     small contributions" from "few large ones," LOR confirms
     observed call-site behavior
  3. a rank-sum sanity reduction (each rank contributes its
     mpicomm rank), expected to equal n*(n-1)/2; mismatch flags
     a corrupt SUM reducer
  4. a comparison between the Allgather-summed values and the
     Allreduce(SUM) result; divergence isolates the bug to the
     reducer path

Output is printed on cyl_rk == 0 with the call counter, class
name, host, pid, comm name, and world-rank min/max/count/unique;
on any anomaly it also lists nonzero rows and the full participant
list. One Allreduce call now does 4 reductions + 1 Allgather; cost
is dominated by run-launch overhead in the target environment, so
the extra collectives are acceptable.

Revert before merging to main.

Reading the output (greps):

  # 1. Did anything print at all?
  grep '^\[LOR_bug' out.log | head

  # 2. What does each cylinder think its comm size is?
  #    (one printer per cylinder; size should be that cylinder's rank count)
  #    If you see size=150 here, hypothesis (i) is the bug.
  grep 'mpicomm size=' out.log | sort -u

  # 3. World-rank membership of each cylinder.
  #    count==unique is required; unique<count means duplicate participation (iv).
  #    The range [min..max] tells you which slice of WORLD is in this comm.
  grep 'world_ranks:' out.log

  # 4. Sanity check that SUM works at all on this comm.
  #    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
  #    SUM reducer itself is broken on this comm - hypothesis (iii).
  grep 'rank_sum=' out.log | head

  # 5. The main reduce result vs. the gather-truth.
  #    sum == gather_sum is required. Divergence pins the bug to the
  #    reducer path (gather honest, reduce wrong).
  #    max>1 means at least one rank contributed >1 (not boolean) -
  #    hypothesis (ii) on that rank specifically.
  grep -E 'reductions:|gather:' out.log

  # 6. Which ranks actually contributed nonzero (only printed on anomaly).
  #    Tells you world_rk + cyl_rk of every guilty contributor.
  grep 'nonzero: world_rk' out.log

  # 7. Full participant list (only printed on anomaly).
  #    Use this if comm membership is suspicious.
  grep 'ALL world_ranks:' out.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 71.25000% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.78%. Comparing base (ade8219) to head (a8de453).

Files with missing lines Patch % Lines
mpisppy/debug_utils/heap_probe.py 36.36% 21 Missing ⚠️
mpisppy/spbase.py 78.57% 12 Missing ⚠️
mpisppy/cylinders/spcommunicator.py 76.59% 11 Missing ⚠️
mpisppy/__init__.py 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
- Coverage   72.82%   72.78%   -0.04%     
==========================================
  Files         164      165       +1     
  Lines       20959    21108     +149     
==========================================
+ Hits        15263    15364     +101     
- Misses       5696     5744      +48     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

DLWoodruff and others added 14 commits May 20, 2026 16:21
Initial trigger fired on any nonzero reduce result, which caught
legitimate shutdown signals (sum=lor=1, gather_sum=1, all
consistent) and would flood logs on every cylinder finalization.
Real-bug signature is invariant-violating, not just nonzero:

  - rank_sum != expected_rank_sum  -> SUM broken on this comm
  - sum != gather_sum              -> reducer disagrees with gather
  - unique != count                -> duplicate world ranks
  - max > 1                        -> non-boolean input on some rank

Verified: 0 false positives across ~440k allreduce_or calls in a
sizes 3-scen 3-rank xhatshuffle+lagrangian run (sizes_cylinders.py
--num-scens 3 --xhatshuffle --lagrangian).
Parses a log containing [LOR_bug ...] blocks emitted by the instrumented
SPBase.allreduce_or and writes a short per-hypothesis report (H1 wider
membership, H2 buffer aliasing, H3 reducer malfunction, H4 duplicate
ranks). Intended to live and die with this debug branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fault comm names

- Example lines now include the values that tripped each detector
  (max/gather_sum for H2, sum/gather_sum for H3 reducer, rank_sum/
  expected for H3 sanity, count/unique for H4) so the triager doesn't
  have to grep back into the raw log.
- H1 emits a NOTE when any comm has a default/empty Get_name() value,
  since (cls, name) grouping can collapse distinct physical comms and
  spuriously trip the "varying size" check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These are real (mergeable) fixes, not diagnostics. Committed onto the
LOR_bug branch so they can be validated on the cluster alongside the
Pyomo#717 instrumentation; extract to a clean PR once confirmed.

Candidate fix for the spurious-shutdown / tcache heap-corruption bug
("malloc(): unaligned tcache chunk detected") investigated via PR Pyomo#717.
The H4 diagnostic showed zeroed slots in a freshly-allocated collective
buffer with a clean reducer, pointing at recycled freed memory rather
than a bad reduction.

1. SPWindow.free(): drop the numpy view (self.buff) before Win.Free().
   self.buff aliases the window memory via window.tomemory(); freeing the
   window first leaves it dangling over released memory.

2. Deterministic MPI-memory teardown: communicator_array now returns the
   MPI.Alloc_mem handle; FieldArray stores it and gains free() (drops the
   numpy views, then MPI.Free_mem, idempotent); free_windows() frees every
   send/recv buffer before the window. This avoids MPI.Free_mem running
   after MPI_Finalize during GC at interpreter shutdown, the classic
   mpi4py heap-corruption-at-exit signature.

3. spin_the_wheel: add a fullcomm.Barrier() between hub_finalize() (which
   may issue RMA gets) and free_windows().

Verified locally: ruff clean; test_buffer_inspect (42), spwindow
multisource (np=6) and partial_get (np=4) pass; FieldArray.free
idempotency + alloc/free stress pass. Not yet validated end-to-end under
valgrind/ASan (no solver in the dev env).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a guard region (8 doubles, sentinel -123456789.0) immediately after
each FieldArray's padded window region. The window/RMA view stays exactly
padded_len, so MPI transfers are unchanged, but any write that runs past a
field's end lands in the guard instead of silently corrupting the adjacent
glibc tcache.

The guard is checked:
  - right after every window.put (put_send_buffer) and window.get
    (get_receive_buffer), to pin the exact RMA that breaches;
  - once per iteration at the top of Spoke.got_kill_signal (before
    allreduce_or, whose allocations otherwise trip the already-corrupted
    heap), via SPCommunicator.check_buffer_canaries().

A breach prints "[LOR_bug CANARY BREACH] ... field=... origin=..." naming
the over-written field/rank/iteration at the moment of corruption, rather
than at a later unrelated malloc (which is all faulthandler/MALLOC_CHECK_
have been able to show: detection != cause).

Goal: localize the mid-run heap-corruption that surfaces as the tcache
abort inside allreduce_or (run 3, rank 85, XhatShuffle, mid-run).

test_buffer_inspect spoke stub updated to bind the new methods. Verified:
ruff clean; test_buffer_inspect 42 pass; spwindow multisource (np=6) and
partial_get (np=4) pass; canary trips on a 1-past-end over-write and stays
clean on a legit full-padded write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
free_windows() (added with the deterministic MPI teardown) calls
FieldArray.free(), which set self._array = None. But spokes keep a
_bound reference to that FieldArray, and callers read wheel.spcomm.bound
after WheelSpinner.spin() returns -- e.g. test_with_cylinders,
test_cg_with_cylinders. The null _array made spcomm.bound raise
"TypeError: 'NoneType' object is not subscriptable", failing the
cylinder and column-generation CI jobs.

Copy the logical view onto the regular Python heap before MPI.Free_mem
instead of nulling it. MPI-allocated memory is still released
deterministically before MPI_Finalize (the teardown's purpose), while
post-spin reads of the final values keep working.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude and the cluster experiments run on different machines; print the
short git hash (with -dirty flag) of the running checkout via global_toc
so output can be matched to a version. Remove with the rest of the
LOR_bug instrumentation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
watson61 can only run what is in the PR, so the control path (plain
Allreduce(LOR), no instrumentation) now lives in allreduce_or behind an
env var. Set MPISPPY_LOR_CONTROL=1 in the sbatch to run the control;
unset leaves the existing instrumentation untouched. Lets one branch
produce both the instrumented and control data points without re-pushing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Env-gated (MPISPPY_LOR_HEAP_PROBE=1) glibc heap walk at the phase
boundaries the XhatShuffle spoke crosses before the iter-1 gurobi
set_objective abort (slurm-6278803 control == 6278806 instrumented:
both "unaligned tcache chunk detected"). heap_probe() bursts-drains
every tcache/fastbin size class then malloc_trims, so an already-
poisoned chunk aborts AT the probe with the same signature; the last
"[LOR_bug HEAP PROBE OK]" marker before the abort localizes the
corrupting phase. Validated: clean heap -> OK; poisoned tcache entry
-> reproduces the exact "malloc(): unaligned tcache chunk detected".

Probes: after-opt-create (scenario creation + SPFBBT), after-make-
windows (RMA alloc), after-setup-hub, pre-main, and in the spoke loop
main-enter / post-update_nonants / pre-try_scenario_dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The git_commit_hash() announce (22ee393) used comm_world.rank, which
runs before the apportion_ranks ValueError in run(); test_flexible_rank
_ratios' _StubComm provides Get_rank() but no .rank property, so the
announce raised AttributeError before the expected ValueError. Use
Get_rank() to match the stub and mpi4py. Pre-existing CI failure, not
from the heap probes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant