Update to NVIDIA nccl-tests v2.18.3 + GPU-serial JSON output (Anton) by slucascore · Pull Request #1 · coreweave/nccl-tests-json

slucascore · 2026-06-05T18:04:40Z

What

Brings nccl-tests-json master from NVIDIA v2.17.6 up to v2.18.3 (f727aa2) and layers CoreWeave's GPU-serial JSON enrichment on top.

The v2.18.3 base (f727aa2) is the same commit already shipped to prod coreweave/nccl-tests in NVIDIA#91 — so the upstream bump here isn't new risk, it just re-syncs this repo's master to match prod.

The CoreWeave delta (what to actually review)

Everything except the last two functional commits is stock NVIDIA — review per-commit. The non-upstream changes are only:

0f70559 (Anton Gunnarsson) — GPU serial in JSON. Adds getGPUSerial()/nvmlDeviceGetSerial(), emits "serial" per device in the -J output. src/Makefile now links -lnvidia-ml + -luuid.
e4b8711 (slucas) — env-iteration buffer-overflow fix in jsonOutputInit (strncpy left the value buffer without a terminating NUL → segfault on env values ≥ 2048 B; release blocker for the JSON path).
78b5633 (slucas) — trivial .gitignore for build dirs.

Combined non-upstream delta vs the 2.18.3 base: 2 files, +77 / −10 (src/util.cu, src/Makefile).

Evidence

Built from this branch and run on a real B200 via SUNK (slurm-b200-193-213): the -J JSON correctly emits a live GPU serial — data.config.devices[0].serial = 1652025084924 — and survives long env values (the e4b8711 fix).

Merge note

Please merge, not squash, so Anton's authorship on 0f70559 is preserved.

Follow-up

A separate PR against coreweave/nccl-tests will repoint its Dockerfile to build from this commit (currently pinned to the bare NVIDIA f727aa2), so prod picks up the serial field.


strncpy(value, ptr+1, MAX_LINE) does not null-terminate when the source is >= MAX_LINE bytes; the subsequent jsonStr(value) then reads past the buffer until it hits an unmapped page. Segfaults reliably on interactive shells with long FPATH/PATH values (e.g. FPATH=1995 chars triggers it with MAX_LINE=2048).

Replace both strncpy calls with snprintf, which guarantees null termination and honors the destination size precisely. Drop the now-redundant memsets and intermediate `token` variable while we're here.

Verified: nccl-tests-json/build-v2.30u1/all_reduce_perf -J ... runs to completion against full shell env (109 vars, FPATH=1796 chars), produces valid JSON, no segfault.
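For reviewers who want to see the termination difference in isolation, here is a minimal sketch. It is not code from this repo: DEMO_MAX is a hypothetical small stand-in for MAX_LINE, and both helper names are made up for illustration. Each helper pre-fills the destination with non-zero bytes to imitate stale stack contents, copies the source, and reports whether a NUL ended up inside the buffer.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_MAX 8 /* hypothetical small stand-in for MAX_LINE */

/* Returns 1 if strncpy left a NUL inside the DEMO_MAX-byte buffer, else 0. */
int strncpy_terminates(const char *src) {
  char dst[DEMO_MAX];
  memset(dst, 'x', sizeof(dst)); /* imitate stale stack bytes */
  strncpy(dst, src, sizeof(dst));
  return memchr(dst, '\0', sizeof(dst)) != NULL;
}

/* Same check for snprintf, which truncates but always terminates. */
int snprintf_terminates(const char *src) {
  char dst[DEMO_MAX];
  memset(dst, 'x', sizeof(dst));
  snprintf(dst, sizeof(dst), "%s", src);
  return memchr(dst, '\0', sizeof(dst)) != NULL;
}
```

With a source of exactly DEMO_MAX characters, strncpy fills the buffer and writes no terminator, while snprintf keeps the first DEMO_MAX - 1 characters and appends a NUL. Compilers may warn about the intentional truncation in both helpers.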


slucascore · 2026-06-05T18:05:13Z

@Eta0 — could you review when you have a cycle? This re-syncs master to NVIDIA v2.18.3 (the same base you shipped to prod coreweave/nccl-tests in NVIDIA#91) and adds Anton's GPU-serial JSON output (0f70559) plus an env-overflow fix (e4b8711). The CoreWeave delta to actually review is small — 2 files, +77/−10 in src/util.cu + src/Makefile; the rest is the stock NVIDIA bump. Verified emitting a live GPU serial on a real B200 over SUNK. Please merge, not squash, so Anton's authorship is preserved. A follow-up PR against coreweave/nccl-tests will repoint its Dockerfile here. Thanks!

slucascore · 2026-06-05T18:51:09Z

Closing for now — Anton (the likely merger on this repo) isn't available, and my account lacks write here. Pivoting to a self-contained patch-based PR against coreweave/nccl-tests that applies Anton's GPU-serial change on top of the existing NVIDIA v2.18.3 base (no dependency on landing it here first). The branch stays on my fork (serial-json-on-2.18.3) if we want to land it here later for commit-level attribution. cc @Eta0 — will tag you on the coreweave/nccl-tests PR.

slucascore · 2026-06-05T18:57:57Z

Reopened — disregard the close note above; we're proceeding with this PR after all. @Eta0 review whenever you have a cycle. Summary unchanged: re-syncs master to NVIDIA v2.18.3 (same base as prod coreweave/nccl-tests NVIDIA#91) + Anton's GPU-serial JSON (0f70559) + env-overflow fix (e4b8711); CoreWeave delta to review is just src/util.cu + src/Makefile (+77/−10). Please merge, not squash, to preserve Anton's authorship.

Eta0 · 2026-06-05T19:15:43Z

  for(char **e = envp; *e; e++) {
-    jsonStr(*e);
+    char key[MAX_LINE];
+    char value[MAX_LINE];
+    char *ptr = strchr(*e, '=');
+    if(ptr != NULL) {
+      // snprintf null-terminates; strncpy did not, segfaulting jsonStr on
+      // env values approaching MAX_LINE bytes (e.g. long FPATH/PATH).
+      snprintf(key, sizeof(key), "%.*s", (int)(ptr - *e), *e);
+      snprintf(value, sizeof(value), "%s", ptr + 1);
+      jsonKey(key); jsonStr(value);
+    }
  }


value doesn't necessarily need to have its own buffer here, since the suffix of *e starting at ptr + 1 is already a null-terminated string. Although the JSON functions in here will eventually truncate it anyway. (Also, that comment probably doesn't need to be there, since it's talking about code that isn't there.)

Suggested change

-  for(char **e = envp; *e; e++) {
-    jsonStr(*e);
-    char key[MAX_LINE];
-    char value[MAX_LINE];
-    char *ptr = strchr(*e, '=');
-    if(ptr != NULL) {
-      // snprintf null-terminates; strncpy did not, segfaulting jsonStr on
-      // env values approaching MAX_LINE bytes (e.g. long FPATH/PATH).
-      snprintf(key, sizeof(key), "%.*s", (int)(ptr - *e), *e);
-      snprintf(value, sizeof(value), "%s", ptr + 1);
-      jsonKey(key); jsonStr(value);
-    }
-  }
+  for (char **e = envp; *e; e++) {
+    char *ptr = strchr(*e, '=');
+    if (!ptr) continue;
+    char key[MAX_LINE];
+    snprintf(key, sizeof(key), "%.*s", (size_t)(ptr - *e), *e);
+    jsonKey(key);
+    jsonStr(ptr + 1);
+  }
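A standalone sketch of the split this suggestion is aiming for, in case it helps the discussion. The function name split_env_entry is hypothetical and does not exist in src/util.cu; it only illustrates copying the key into a bounded buffer while returning the value as a pointer into the original entry. The precision argument of "%.*s" is cast to int here, since the C standard specifies int for that argument.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits "KEY=VALUE" without copying the value.
 * Writes the key into key_buf (truncated to key_size - 1 bytes) and
 * returns a pointer to the value inside entry, or NULL if entry has no '='. */
const char *split_env_entry(const char *entry, char *key_buf, size_t key_size) {
  const char *eq = strchr(entry, '=');
  if (eq == NULL) return NULL;
  /* "%.*s" reads its precision as an int, so cast the pointer difference. */
  snprintf(key_buf, key_size, "%.*s", (int)(eq - entry), entry);
  return eq + 1; /* already NUL-terminated: it is a suffix of entry */
}
```

Only the first '=' splits the entry, so a value that itself contains '=' comes back intact.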

Eta0 · 2026-06-05T19:47:58Z

      return testNotImplemented;
    }
    CUDACHECK(cudaGetDeviceProperties(&prop, cudaDev));
+    getGPUSerial(cudaDev, gpuSerial);


You need to check the return value of getGPUSerial here, since if this failed, gpuSerial may contain uninitialized or stale bytes. Handing those to the snprintf call unchecked is suspicious.

Oh, also, this ends up initializing and shutting down NVML once per loop iteration. You could probably handle errors more precisely and be more efficient about it too if you hoist that aspect of it out of the loop. E.g. a failure from NVML failing to shut down wouldn't mean that the contents of gpuSerial need to be discarded.


- env JSON: drop the value[MAX_LINE] buffer + stale comment; jsonStr reads ptr+1 (a NUL-terminated suffix of *e) directly, avoiding truncation. (Eta)
- writeDeviceReport: check getGPUSerial() return; on NVML failure write "unknown" instead of emitting uninitialized/stale serial bytes. (Eta)

- jsonDouble: emit "inf"/"-inf" string for +/-inf instead of a bare inf token, which is invalid JSON (mirrors existing "nan" handling).
- getFloatStr: short-circuit +/-inf; +inf otherwise loops forever as the uint64_t magnitude counter wraps. Guarded with isinf (<math.h>).

These are pre-existing NCCL-upstream bugs, not part of the GPU-serial change.
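To illustrate the non-finite handling described above, a minimal sketch follows. format_json_double is a hypothetical name and is not the repo's jsonDouble; the "%g" formatting for finite values is also an assumption made only to keep the example short. The point is that NaN and the infinities are written as quoted strings, because the JSON number grammar has no tokens for them.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical helper (not the repo's jsonDouble): writes d into out as a
 * JSON value. Non-finite values become quoted strings so the output stays
 * valid JSON. Returns the snprintf result. */
int format_json_double(char *out, size_t out_size, double d) {
  if (isnan(d)) return snprintf(out, out_size, "\"nan\"");
  if (isinf(d)) return snprintf(out, out_size, "%s", d > 0 ? "\"inf\"" : "\"-inf\"");
  return snprintf(out, out_size, "%g", d); /* assumed format for finite values */
}
```

Consumers of the JSON then need to accept a string where they usually get a number, which is the same trade-off the existing "nan" handling already makes.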

The sscanf in parseRankInfo hardcodes the serial field width as %29; assert that rankInfo_t::gpuSerial is 30 bytes so a change to NVML_DEVICE_SERIAL_BUFFER_SIZE triggers a compile-time reminder to update it.
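A small sketch of the guard this commit describes, written in plain C11 so it can be compiled without CUDA. demo_rank_info_t, SERIAL_BUF_SIZE, parse_demo_rank_info and the "rank serial" line layout are all hypothetical stand-ins; the real rankInfo_t and parseRankInfo in src/util.cu may look different. The idea is only that the hardcoded scan width and the field size are tied together at compile time.

```c
#include <assert.h> /* static_assert (C11) */
#include <stdio.h>

#define SERIAL_BUF_SIZE 30 /* assumed, per the "30 bytes" in the commit message */

typedef struct {
  int rank;
  char gpuSerial[SERIAL_BUF_SIZE];
} demo_rank_info_t; /* hypothetical stand-in for rankInfo_t */

/* "%29s" stores at most 29 characters plus a NUL, i.e. exactly 30 bytes.
 * If the field size changes, this fails the build until the width is updated. */
static_assert(sizeof(((demo_rank_info_t *)0)->gpuSerial) == 30,
              "update the %29s width in parse_demo_rank_info");

/* Parses a hypothetical "<rank> <serial>" line; returns 1 on success. */
int parse_demo_rank_info(const char *line, demo_rank_info_t *out) {
  return sscanf(line, "%d %29s", &out->rank, out->gpuSerial) == 2;
}
```

An overlong serial is truncated to 29 characters instead of overrunning the field.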

getGPUSerial used to nvmlInit()+nvmlShutdown() on every call, so NVML was initialized and torn down once per GPU. Move the lifecycle into writeDeviceReport: init once before the device loop, shut down once after. getGPUSerial now only fetches the handle + serial (NVML pre-initialized by the caller), so its nonzero return means a genuine serial lookup failure -> the device gets "unknown". An NVML shutdown failure after the loop no longer discards the serials already collected; it is just logged.

The hoisted nvmlInit() leaked the NVML handle if the in-loop CUDACHECK(cudaGetDeviceProperties) bailed early. Capture the result, shut NVML down on failure, then let CUDACHECK do its standard logging+return so every post-init return path now releases NVML.
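The lifecycle described in the last two commit messages can be sketched without a GPU by replacing the NVML and CUDA calls with stubs. Everything prefixed fake_ below, the two global counters, and collect_serials itself are hypothetical and exist only for this illustration; they are not the functions in src/util.cu. The sketch shows one init before the loop, "unknown" on a per-device lookup failure, a shutdown on the early-return path, and a shutdown failure that is logged without discarding results.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the NVML/CUDA calls so this runs without a GPU. */
int g_lib_open = 0;   /* 1 while the fake library is initialized */
int g_init_calls = 0; /* how many times init ran */

static int fake_lib_init(void) { g_lib_open = 1; g_init_calls++; return 0; }
static int fake_lib_shutdown(void) { g_lib_open = 0; return 0; }

/* Fails for odd device indices to exercise the "unknown" path. */
static int fake_get_serial(int dev, char *buf, size_t n) {
  if (dev % 2) return 1;
  snprintf(buf, n, "SERIAL%04d", dev);
  return 0;
}

/* Fails for device index fail_at (pass -1 for "never fails"). */
static int fake_get_props(int dev, int fail_at) { return dev == fail_at; }

/* Fills serials[i] for ndev devices. Returns 0 on success, nonzero if it
 * had to bail out early; the library is released on every return path. */
int collect_serials(int ndev, int fail_at, char serials[][32]) {
  if (fake_lib_init() != 0) return 1;
  for (int i = 0; i < ndev; i++) {
    if (fake_get_props(i, fail_at) != 0) {
      fake_lib_shutdown(); /* release before the early return */
      return 2;
    }
    if (fake_get_serial(i, serials[i], 32) != 0)
      snprintf(serials[i], 32, "unknown"); /* lookup failed for this device */
  }
  if (fake_lib_shutdown() != 0)
    fprintf(stderr, "shutdown failed; keeping collected serials\n");
  return 0;
}
```

Calling collect_serials(3, -1, serials) runs init once for all three devices, and collect_serials(3, 1, serials) bails out at device 1 with the library already shut down.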

slucascore · 2026-06-06T00:57:44Z

@Eta0 ready for re-review when you have a moment. All your review points are addressed (pushed up to 3b6578f):

env loop (:368) — dropped the extra buffer; jsonStr(ptr+1) reads the value directly. Kept the (int) cast on %.*s (its precision arg must be int, not size_t).
getGPUSerial (:629) — now checks the return value (failed lookup → "unknown"), and NVML init/shutdown are hoisted out of the per-GPU loop (one lifecycle per report). A shutdown failure no longer discards already-collected serials.
sscanf widths + uuid stack alloc — applied.
static_assert on sizeof(rankInfo_t::gpuSerial)==30 added above parseRankInfo.
inf handling (the NCCL-inherited ones you flagged) — jsonDouble emits "inf"/"-inf" strings; getFloatStr short-circuits ±inf.
+ follow-up: shut NVML down on the cudaGetDeviceProperties early-return path so the hoisted init can't leak the handle.

Compile-tested locally (nvcc -DMPI_SUPPORT, sm_75), and the serial was verified emitting on a real B200 over SUNK. Please merge, not squash, to preserve Anton's authorship on 0f70559. Thanks!

AddyLaddy and others added 30 commits November 3, 2025 11:23

Remove trailing WS when timestamp option not used

51f2e7e

Add README.md text for -J option

4bc314a

Add memory usage report option

760c467

Use -M 1 to dump library memory usage information

Add include of <limits> due to compilation error

7106245

Compatibility with 2.29 device API: use NCCL_DEV_COMM_REQUIREMENTS_IN…

24874bd

…TIIALIZER, query properties to check for device api support

device api 2.28 is not compatible with 2.29. Check versions and print…

332e618

… error if there is a mismatch

refactor comm init

070d175

NCCL_TESTS_VERSION 2.17.7

2656c58

Clarified use of Mebibytes and Gibibytes for sizes

7278698

NCCL_TESTS_VERSION 2.17.8

81463c5

Add -M memory report option to README.md

88d7e33

Fix: corrected typos in the JSON output

85ca91d

Signed-off-by: David Addison <daddison@nvidia.com>

NCCL_TESTS_VERSION 2.17.9

2535da8

Fix compilation issues with latest NCCL release headers

9938d5a

Add --extended-lambda to NVCUFLAGS

Fix Clang compilation errors with VLA initialization

ae98985

Signed-off-by: David Addison <daddison@nvidia.com>

Request GIN to be explicitly enabled in all to all test

db221de

Based on the changes in NCCL v2.29.3, update the alltoall test to either provide a ginConnectionType or set ginForceEnable to true. Signed-off-by: Ahsan Pervaiz <apervaiz@nvidia.com>

NCCL_TESTS_VERSION 2.17.10

c379e19

Add -u <index> to force unaligned buffer addresses

e986a61

Add new unalign flag to README.md and update help text

115fb09

NCCL_TESTS_VERSION 2.18.0

dd0bafd

Allocate buffers during thread initialization

8d26b23

Signed-off-by: Theofilos Ioannis Manitaras <tmanitaras@nvidia.com>

Allow blocking collectives without MPI_Barrier in timing loop

ba52a70

Update -z option description in README.md

c1af7df

NCCL_TESTS_VERSION 2.18.1

e02c20b

Display unalign setting in output

eb0d3d2

NCCL_TESTS_VERSION 2.18.2

af1dcac

Add maxP2pPeers comm config for sendrecv

5dc0670

Add optional testEngine.initCommConfig, invoked from initComms after the shared ncclConfig_t setup. sendrecv registers SendRecvInitCommConfig to set maxP2pPeers=2 Signed-off-by: David Addison <daddison@nvidia.com>

NCCL_TESTS_VERSION 2.18.3

f727aa2

Signed-off-by: David Addison <daddison@nvidia.com>

feat: improveed JSON output

0f70559

gitignore: ignore alternate build-*/ directories

78b5633

When verifying against multiple NCCL versions, builds land in build-2.19.3/, build-v2.30u1/, etc. Glob /build-* keeps the working tree clean.

slucascore closed this Jun 5, 2026

slucascore reopened this Jun 5, 2026

Eta0 requested changes Jun 5, 2026


slucascore and others added 7 commits June 5, 2026 15:53

Update src/util.cu

fb9d0bd

Co-authored-by: Eta <24918963+Eta0@users.noreply.github.com>

Update src/util.cu

d9d9d53

Co-authored-by: Eta <24918963+Eta0@users.noreply.github.com>

Add static_assert guarding gpuSerial field width (Eta)

45b5419

The sscanf in parseRankInfo hardcodes the serial field width as %29; assert that rankInfo_t::gpuSerial is 30 bytes so a change to NVML_DEVICE_SERIAL_BUFFER_SIZE triggers a compile-time reminder to update it.

style: Clean up comments about old code

c582b4d

slucascore wants to merge 39 commits into coreweave:master from slucascore:serial-json-on-2.18.3


8 participants
