Skip to content

Update to NVIDIA nccl-tests v2.18.3 + GPU-serial JSON output (Anton)#1

Open
slucascore wants to merge 39 commits into
coreweave:masterfrom
slucascore:serial-json-on-2.18.3
Open

Update to NVIDIA nccl-tests v2.18.3 + GPU-serial JSON output (Anton)#1
slucascore wants to merge 39 commits into
coreweave:masterfrom
slucascore:serial-json-on-2.18.3

Conversation

@slucascore

Copy link
Copy Markdown
Collaborator

What

Brings nccl-tests-json master from NVIDIA v2.17.6 up to v2.18.3 (f727aa2) and layers CoreWeave's GPU-serial JSON enrichment on top.

The v2.18.3 base (f727aa2) is the same commit already shipped to prod coreweave/nccl-tests in NVIDIA#91 — so the upstream bump here isn't new risk, it just re-syncs this repo's master to match prod.

The CoreWeave delta (what to actually review)

Everything except the last two functional commits is stock NVIDIA — review per-commit. The non-upstream changes are only:

  • 0f70559 (Anton Gunnarsson) — GPU serial in JSON. Adds getGPUSerial()/nvmlDeviceGetSerial(), emits "serial" per device in the -J output. src/Makefile now links -lnvidia-ml + -luuid.
  • e4b8711 (slucas) — env-iteration buffer-overflow fix in jsonOutputInit (unbounded snprintf → segfault on env values ≥ 2048 B; release blocker for the JSON path).
  • 78b5633 (slucas) — trivial .gitignore for build dirs.

Combined non-upstream delta vs the 2.18.3 base: 2 files, +77 / −10 (src/util.cu, src/Makefile).

Evidence

Built from this branch and run on a real B200 via SUNK (slurm-b200-193-213): the -J JSON correctly emits a live GPU serial — data.config.devices[0].serial = 1652025084924 — and survives long env values (the e4b8711 fix).

Merge note

Please merge, not squash, so Anton's authorship on 0f70559 is preserved.

Follow-up

A separate PR against coreweave/nccl-tests will repoint its Dockerfile to build from this commit (currently pinned to the bare NVIDIA f727aa2), so prod picks up the serial field.

AddyLaddy and others added 30 commits November 3, 2025 11:23
Use -M 1 to dump library memory usage information
…TIIALIZER, query properties to check for device api support
Signed-off-by: David Addison <daddison@nvidia.com>
Signed-off-by: David Addison <daddison@nvidia.com>
Based on the changes in NCCL v2.29.3, update the alltoall test to
either provide a ginConnectionType or set ginForceEnable to true.

Signed-off-by: Ahsan Pervaiz <apervaiz@nvidia.com>
Signed-off-by: Theofilos Ioannis Manitaras <tmanitaras@nvidia.com>
Add optional testEngine.initCommConfig, invoked from initComms
after the shared ncclConfig_t setup.

sendrecv registers SendRecvInitCommConfig to set maxP2pPeers=2

Signed-off-by: David Addison <daddison@nvidia.com>
Signed-off-by: David Addison <daddison@nvidia.com>
strncpy(value, ptr+1, MAX_LINE) does not null-terminate when the source
is >= MAX_LINE bytes; the subsequent jsonStr(value) then reads past the
buffer until it hits an unmapped page. Segfaults reliably on interactive
shells with long FPATH/PATH values (e.g. FPATH=1995 chars triggers it
with MAX_LINE=2048).

Replace both strncpy calls with snprintf, which guarantees null
termination and honors the destination size precisely. Drop the now-
redundant memsets and intermediate `token` variable while we're here.

Verified: nccl-tests-json/build-v2.30u1/all_reduce_perf -J ... runs to
completion against full shell env (109 vars, FPATH=1796 chars), produces
valid JSON, no segfault.
When verifying against multiple NCCL versions, builds land in
build-2.19.3/, build-v2.30u1/, etc. Glob /build-* keeps the working
tree clean.
@slucascore

Copy link
Copy Markdown
Collaborator Author

@Eta0 — could you review when you have a cycle? This re-syncs master to NVIDIA v2.18.3 (the same base you shipped to prod coreweave/nccl-tests in NVIDIA#91) and adds Anton's GPU-serial JSON output (0f70559) plus an env-overflow fix (e4b8711). The CoreWeave delta to actually review is small — 2 files, +77/−10 in src/util.cu + src/Makefile; the rest is the stock NVIDIA bump. Verified emitting a live GPU serial on a real B200 over SUNK. Please merge, not squash, so Anton's authorship is preserved. A follow-up PR against coreweave/nccl-tests will repoint its Dockerfile here. Thanks!

@slucascore

Copy link
Copy Markdown
Collaborator Author

Closing for now — Anton (the likely merger on this repo) isn't available, and my account lacks write here. Pivoting to a self-contained patch-based PR against coreweave/nccl-tests that applies Anton's GPU-serial change on top of the existing NVIDIA v2.18.3 base (no dependency on landing it here first). The branch stays on my fork (serial-json-on-2.18.3) if we want to land it here later for commit-level attribution. cc @Eta0 — will tag you on the coreweave/nccl-tests PR.

@slucascore slucascore closed this Jun 5, 2026
@slucascore slucascore reopened this Jun 5, 2026
@slucascore

Copy link
Copy Markdown
Collaborator Author

Reopened — disregard the close note above; we're proceeding with this PR after all. @Eta0 review whenever you have a cycle. Summary unchanged: re-syncs master to NVIDIA v2.18.3 (same base as prod coreweave/nccl-tests NVIDIA#91) + Anton's GPU-serial JSON (0f70559) + env-overflow fix (e4b8711); CoreWeave delta to review is just src/util.cu + src/Makefile (+77/−10). Please merge, not squash, to preserve Anton's authorship.

Comment thread src/util.cu
Comment on lines 355 to 366
for(char **e = envp; *e; e++) {
jsonStr(*e);
char key[MAX_LINE];
char value[MAX_LINE];
char *ptr = strchr(*e, '=');
if(ptr != NULL) {
// snprintf null-terminates; strncpy did not, segfaulting jsonStr on
// env values approaching MAX_LINE bytes (e.g. long FPATH/PATH).
snprintf(key, sizeof(key), "%.*s", (int)(ptr - *e), *e);
snprintf(value, sizeof(value), "%s", ptr + 1);
jsonKey(key); jsonStr(value);
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value doesn't necessarily need to have its own buffer here, since the suffix of *e starting at ptr + 1 is already a null-terminated string. Although the JSON functions in here will eventually truncate it anyway. (Also, that comment probably doesn't need to be there, since it's talking about code that isn't there.)

Suggested change
for(char **e = envp; *e; e++) {
jsonStr(*e);
char key[MAX_LINE];
char value[MAX_LINE];
char *ptr = strchr(*e, '=');
if(ptr != NULL) {
// snprintf null-terminates; strncpy did not, segfaulting jsonStr on
// env values approaching MAX_LINE bytes (e.g. long FPATH/PATH).
snprintf(key, sizeof(key), "%.*s", (int)(ptr - *e), *e);
snprintf(value, sizeof(value), "%s", ptr + 1);
jsonKey(key); jsonStr(value);
}
}
for (char **e = envp; *e; e++) {
char *ptr = strchr(*e, '=');
if (!ptr) continue;
char key[MAX_LINE];
snprintf(key, sizeof(key), "%.*s", (size_t)(ptr - *e), *e);
jsonKey(key);
jsonStr(ptr + 1);
}

Comment thread src/util.cu Outdated
Comment thread src/util.cu Outdated
return testNotImplemented;
}
CUDACHECK(cudaGetDeviceProperties(&prop, cudaDev));
getGPUSerial(cudaDev, gpuSerial);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to check the return value of getGPUSerial here, since if this failed, gpuSerial may contain uninitialized or stale bytes. Handing those to the snprintf call unchecked is suspicious.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, also, this ends up initializing and shutting down NVML once per loop iteration. You could probably handle errors more precisely and be more efficient about it too if you hoist that aspect of it out of the loop. E.g. a failure from NVML failing to shut down wouldn't mean that the contents of gpuSerial need to be discarded.

Comment thread src/util.cu Outdated
slucascore and others added 7 commits June 5, 2026 15:53
Co-authored-by: Eta <24918963+Eta0@users.noreply.github.com>
Co-authored-by: Eta <24918963+Eta0@users.noreply.github.com>
- env JSON: drop the value[MAX_LINE] buffer + stale comment; jsonStr reads
  ptr+1 (a NUL-terminated suffix of *e) directly, avoiding truncation. (Eta)
- writeDeviceReport: check getGPUSerial() return; on NVML failure write
  "unknown" instead of emitting uninitialized/stale serial bytes. (Eta)
- jsonDouble: emit "inf"/"-inf" string for +/-inf instead of a bare
  inf token, which is invalid JSON (mirrors existing "nan" handling).
- getFloatStr: short-circuit +/-inf; +inf otherwise loops forever as the
  uint64_t magnitude counter wraps. Guarded with isinf (<math.h>).

These are pre-existing NCCL-upstream bugs, not part of the GPU-serial change.
The sscanf in parseRankInfo hardcodes the serial field width as %29; assert
that rankInfo_t::gpuSerial is 30 bytes so a change to NVML_DEVICE_SERIAL_BUFFER_SIZE
triggers a compile-time reminder to update it.
getGPUSerial used to nvmlInit()+nvmlShutdown() on every call, so NVML was
initialized and torn down once per GPU. Move the lifecycle into
writeDeviceReport: init once before the device loop, shut down once after.

getGPUSerial now only fetches the handle + serial (NVML pre-initialized by the
caller), so its nonzero return means a genuine serial lookup failure -> the
device gets "unknown". An NVML shutdown failure after the loop no longer
discards the serials already collected; it is just logged.
The hoisted nvmlInit() leaked the NVML handle if the in-loop
CUDACHECK(cudaGetDeviceProperties) bailed early. Capture the result, shut
NVML down on failure, then let CUDACHECK do its standard logging+return so
every post-init return path now releases NVML.
@slucascore

Copy link
Copy Markdown
Collaborator Author

@Eta0 ready for re-review when you have a moment. All your review points are addressed (pushed up to 3b6578f):

  • env loop (:368) — dropped the extra buffer; jsonStr(ptr+1) reads the value directly. Kept the (int) cast on %.*s (its precision arg must be int, not size_t).
  • getGPUSerial (:629) — now checks the return value (failed lookup → "unknown"), and NVML init/shutdown are hoisted out of the per-GPU loop (one lifecycle per report). A shutdown failure no longer discards already-collected serials.
  • sscanf widths + uuid stack alloc — applied.
  • static_assert on sizeof(rankInfo_t::gpuSerial)==30 added above parseRankInfo.
  • inf handling (the NCCL-inherited ones you flagged) — jsonDouble emits "inf"/"-inf" strings; getFloatStr short-circuits ±inf.
  • + follow-up: shut NVML down on the cudaGetDeviceProperties early-return path so the hoisted init can't leak the handle.

Compile-tested locally (nvcc -DMPI_SUPPORT, sm_75) and the serial was verified emitting on a real B200 over SUNK. Please merge, not squash to preserve Anton's authorship on 0f70559. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants