SST Master Branch Merger: Auto Create Pull Request to Promote from devel to master - All Tests Ran Clean #2693

Merged
sst-autotester merged 6 commits into master from devel
Jun 29, 2026

@sst-autotester

Pull request created to promote the devel branch to master after the following Jenkins jobs passed:
JENKINS_SRN/SST__Nightly_OSX-15-XC16_OMPI-4.1.6_PY3.10_Mainline : Build 954
JENKINS_SRN/SST__Nightly_OSX-15-XC16_OMPI-4.1.6_PY3.10_Mainline_MR-2 : Build 778
JENKINS_SRN/SST__Nightly_OSX-15-XC16_OMPI-4.1.6_PY3.10_Mainline_MT-2 : Build 777
JENKINS_SRN/SST__Nightly_OSX-15-XC16_OMPI-4.1.6_PY3.10_Mainline_OutOfSource : Build 775
JENKINS_SRN/SST__Nightly_OSX-15-XC16_OMPI-4.1.6_PY3.10_SST-Macro_NoCore : Build 778
JENKINS_SRN/SST__Nightly_OSX-15-XC16_OMPI-4.1.6_PY3.10_SST-Macro_WithCore : Build 775
JENKINS_SRN/SST__Nightly_OSX-26-XC26_OMPI-4.1.4_PY3.10_Mainline : Build 1634
JENKINS_SRN/SST__Nightly_OSX-26-XC26_OMPI-4.1.4_PY3.10_Mainline_MR-2 : Build 1622
JENKINS_SRN/SST__Nightly_OSX-26-XC26_OMPI-4.1.4_PY3.10_Mainline_MT-2 : Build 1614
JENKINS_SRN/SST__Nightly_OSX-26-XC26_OMPI-4.1.4_PY3.10_Mainline_OutOfSource : Build 1615
JENKINS_SRN/SST__Nightly_OSX-26-XC26_OMPI-4.1.4_PY3.10_SST-Macro_NoCore : Build 1455
JENKINS_SRN/SST__Nightly_OSX-26-XC26_OMPI-4.1.4_PY3.10_SST-Macro_WithCore : Build 1485
JENKINS_SRN/SST__Nightly_sst-test_clang18_OMPI-NONE_PY3.13_Mainline : Build 434
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline : Build 1937
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_memH-A_Sweep-1 : Build 1909
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_memH-A_Sweep-2 : Build 1917
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_memH-A_Sweep-3 : Build 1910
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_memH-A_Sweep-4 : Build 1910
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_MR-2 : Build 1910
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_MT-2 : Build 1908
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_MT-4 : Build 1908
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_MT-2_MR-2 : Build 381
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Mainline_OutOfSource : Build 1906
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_Make-Dist : Build 1913
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_SST-Macro_NoCore : Build 1910
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_SST-Macro_WithCore : Build 1912
JENKINS_SRN/SST__Nightly_sst-test_OMPI-4.1.4_PY3.9_SST_Macro_Make-Dist : Build 1916
JENKINS_SRN/SST__Nightly_sst-test_OMPI-NONE_PY3.9_Mainline : Build 1895
JENKINS_SRN/SST__Nightly_sst-test_OMPI-NONE_PY3.9_Mainline_MT-2 : Build 1914
JENKINS_SRN/SST__Nightly_Ubuntu-24.04_OMPI-4.1.6_PY3.12_Mainline : Build 832
JENKINS_SRN/SST__Nightly_Ubuntu-24.04_OMPI-4.1.6_PY3.12_Mainline_MR-2 : Build 763
JENKINS_SRN/SST__Nightly_Ubuntu-24.04_OMPI-4.1.6_PY3.12_Mainline_MT-2 : Build 776
JENKINS_SRN/SST__Nightly_Ubuntu-24.04_OMPI-4.1.6_PY3.12_Make-Dist : Build 777
JENKINS_SRN/SST__Nightly_Ubuntu-24.04_OMPI-4.1.6_PY3.12_SST-Macro_NoCore : Build 768
JENKINS_SRN/SST__Nightly_Ubuntu-24.04_OMPI-4.1.6_PY3.12_SST-Macro_WithCore : Build 763
JENKINS_SRN/SST__Nightly_Ubuntu-26.04_OMPI-5.0.10_PY3.14_Mainline : Build 90
JENKINS_SRN/SST__Nightly_Ubuntu-26.04_Doxygen : Build 65
JENKINS_SRN/SST__Nightly_TOSS_4.8_OMPI-4.1.6_PY3.12_Mainline : Build 803
JENKINS_SRN/SST__Nightly_Rocky-10_OMPI-5.0.2_PY3.12_Mainline : Build 261
JENKINS_SRN/SST__Nightly_Rocky-9_OMPI-4.1.6_PY3.9_Mainline : Build 832
JENKINS_SRN/SST__Nightly_COERHEL-9_OMPI-4.1.6_PY3.9_Mainline : Build 851

nab880 and others added 6 commits June 8, 2026 02:01
Format into the caller-supplied buf and return it instead of leaking a
new[]-allocated buffer; report required size via *len.
Use snprintf instead of unbounded strcpy to respect the caller buffer size.
Return NULL on missing addr or len pointer before dereferencing *len.
…2687)

The intra-node loopback delivery path truncated multi-segment sends,
causing ProcessQueuesState::copyIoVec() to abort on assert(copied == len)
when an Allgather-class collective runs with multiple ranks bound to a
single endpoint NIC.

Root cause
----------
For on-node peers, processSendLoop() packs the MatchHdr plus *every* data
segment of the sender's I/O vector into a single vector (vec[0] = MatchHdr,
vec[1..N] = data segments). The LoopReq constructor, however, kept only the
first segment (vec[1]) and discarded vec[2..N], while the MatchHdr it
carries still reports the *total* byte count across all segments. At
delivery, copyIoVec() is asked to copy the full multi-segment length but
only has the first segment as source, so copied stalls below len and the
assertion fires (or, under NDEBUG, a short/incorrect receive buffer is
produced silently).
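
The mismatch described above can be reproduced in miniature. The sketch below uses simplified stand-in types (`Segment`, `Header`, `LoopMsg`) and a hypothetical `pack_first_only()` function; none of these are the SST classes, they only model a header that reports the total byte count while the payload keeps just the first data segment.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for illustration; not the SST types.
struct Segment { const char* ptr; std::size_t len; };
struct Header  { std::size_t totalBytes; };  // plays the role of MatchHdr
struct LoopMsg { Header hdr; std::vector<Segment> data; };

// Models the pre-fix behaviour: the header counts every segment,
// but only the first data segment is retained.
LoopMsg pack_first_only(const std::vector<Segment>& segs)
{
    LoopMsg m{Header{0}, {}};
    for (const Segment& s : segs) m.hdr.totalBytes += s.len;
    if (!segs.empty()) m.data.push_back(segs.front());
    return m;
}

// Bytes actually available to a receiver-side copy.
std::size_t available_bytes(const LoopMsg& m)
{
    std::size_t n = 0;
    for (const Segment& s : m.data) n += s.len;
    return n;
}
```

For a two-segment send of 4 and 2 bytes, `hdr.totalBytes` is 6 while `available_bytes()` is 4, which is the gap that leaves `copied` below `len`.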

Multiple segments arise because Allgather::initIoVec() coalesces only
physically contiguous chunks: a recursive-doubling stage whose send window
wraps the modular buffer is split into two or more non-contiguous runs.
This is why the failure reproduces reliably at high PPN (almost every
neighbour is on-node) and even at 2 ranks/node with a large payload, but
never for point-to-point or small single-segment collectives.
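
To see why a wrapped send window produces more than one segment, consider a simplified model of a circular buffer divided into equal chunks. The helper `contiguous_runs()` below is hypothetical and is not `Allgather::initIoVec()`; it assumes a single window of consecutive chunks and therefore yields at most two runs, whereas the real code may produce more.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns (byteOffset, byteLength) runs covering
// `count` chunks starting at chunk `start` in a circular buffer of
// `numChunks` chunks, each `chunkBytes` long.
std::vector<std::pair<std::size_t, std::size_t>>
contiguous_runs(std::size_t start, std::size_t count,
                std::size_t numChunks, std::size_t chunkBytes)
{
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    if (numChunks == 0 || count == 0) return runs;

    count = std::min(count, numChunks);
    start %= numChunks;

    // Chunks from `start` up to the end of the buffer form the first run.
    const std::size_t first = std::min(count, numChunks - start);
    runs.emplace_back(start * chunkBytes, first * chunkBytes);

    // Anything left wraps to offset 0 and is not contiguous with the first run.
    if (first < count) runs.emplace_back(0, (count - first) * chunkBytes);
    return runs;
}
```

With 4 chunks of 8 bytes, a window of 2 chunks starting at chunk 1 stays in one run, while the same window starting at chunk 3 wraps and is split into two runs.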

Fix
---
* LoopReq (ctrlMsgProcessQueuesState.h): copy every data segment
  (vec[1..N]) instead of only vec[1], so multi-segment intra-node sends are
  delivered intact.

* copyIoVec (ctrlMsgProcessQueuesState.cc): harden as defence-in-depth.
  Bound the copy loops on `rV < dst.size()` in the loop conditions (instead
  of an assert that is compiled out under NDEBUG), and replace the bare
  assert(copied == len) with a dbg().fatal() that reports copied, len,
  src/dst segment counts and total byte counts. A mismatch is an internal
  defect between sender and receiver of the same collective algorithm, so
  it is surfaced loudly rather than silently truncated.
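
The hardening in the second bullet can be sketched as a scatter/gather copy whose loop conditions bound both segment indices and which fails loudly on a length mismatch. This is an illustration under stated assumptions, not the SST implementation: `Seg` and `copy_io_vec()` are hypothetical names, and a `std::runtime_error` stands in for the `dbg().fatal()` call.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

struct Seg { void* ptr; std::size_t len; };  // stand-in I/O vector entry

// Copies `len` bytes from src segments into dst segments.
std::size_t copy_io_vec(std::vector<Seg>& dst, const std::vector<Seg>& src,
                        std::size_t len)
{
    std::size_t copied = 0;
    std::size_t sV = 0, sOff = 0;  // current source segment and offset
    std::size_t rV = 0, rOff = 0;  // current destination segment and offset

    // Index bounds live in the loop condition, so they hold under NDEBUG too.
    while (copied < len && sV < src.size() && rV < dst.size()) {
        if (sOff == src[sV].len) { ++sV; sOff = 0; continue; }
        if (rOff == dst[rV].len) { ++rV; rOff = 0; continue; }

        const std::size_t n = std::min({src[sV].len - sOff,
                                        dst[rV].len - rOff,
                                        len - copied});
        std::memcpy(static_cast<char*>(dst[rV].ptr) + rOff,
                    static_cast<const char*>(src[sV].ptr) + sOff, n);
        copied += n; sOff += n; rOff += n;
    }

    // Report the mismatch instead of silently truncating.
    if (copied != len) {
        throw std::runtime_error(
            "copy_io_vec: copied " + std::to_string(copied) + " of " +
            std::to_string(len) + " bytes (src segments " +
            std::to_string(src.size()) + ", dst segments " +
            std::to_string(dst.size()) + ")");
    }
    return copied;
}
```

With a complete two-segment source the copy returns `len`; with only the first segment present, as in the pre-fix loopback path, the loop ends early and the error is raised rather than a short buffer being returned.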

Verified by reproducing the abort with the original code at
numCores >= 8, then confirming clean completion after the fix across
numCores = 2/8/16/56 and count = 1/131072 (Allgather).

Fixes #2686
sst-autotester merged commit 4c663e9 into master on Jun 29, 2026
10 checks passed

4 participants