Update to NVIDIA nccl-tests v2.18.3 + GPU-serial JSON output (Anton)#1
Update to NVIDIA nccl-tests v2.18.3 + GPU-serial JSON output (Anton)#1slucascore wants to merge 39 commits into
Conversation
Use -M 1 to dump library memory usage information
…TIIALIZER, query properties to check for device api support
… error if there is a mismatch
Signed-off-by: David Addison <daddison@nvidia.com>
Add --extended-lambda to NVCUFLAGS
Signed-off-by: David Addison <daddison@nvidia.com>
Based on the changes in NCCL v2.29.3, update the alltoall test to either provide a ginConnectionType or set ginForceEnable to true. Signed-off-by: Ahsan Pervaiz <apervaiz@nvidia.com>
Signed-off-by: Theofilos Ioannis Manitaras <tmanitaras@nvidia.com>
Add optional testEngine.initCommConfig, invoked from initComms after the shared ncclConfig_t setup. sendrecv registers SendRecvInitCommConfig to set maxP2pPeers=2 Signed-off-by: David Addison <daddison@nvidia.com>
Signed-off-by: David Addison <daddison@nvidia.com>
strncpy(value, ptr+1, MAX_LINE) does not null-terminate when the source is >= MAX_LINE bytes; the subsequent jsonStr(value) then reads past the buffer until it hits an unmapped page. Segfaults reliably on interactive shells with long FPATH/PATH values (e.g. FPATH=1995 chars triggers it with MAX_LINE=2048). Replace both strncpy calls with snprintf, which guarantees null termination and honors the destination size precisely. Drop the now- redundant memsets and intermediate `token` variable while we're here. Verified: nccl-tests-json/build-v2.30u1/all_reduce_perf -J ... runs to completion against full shell env (109 vars, FPATH=1796 chars), produces valid JSON, no segfault.
When verifying against multiple NCCL versions, builds land in build-2.19.3/, build-v2.30u1/, etc. Glob /build-* keeps the working tree clean.
|
@Eta0 — could you review when you have a cycle? This re-syncs master to NVIDIA v2.18.3 (the same base you shipped to prod |
|
Closing for now — Anton (the likely merger on this repo) isn't available, and my account lacks write here. Pivoting to a self-contained patch-based PR against |
|
Reopened — disregard the close note above; we're proceeding with this PR after all. @Eta0 review whenever you have a cycle. Summary unchanged: re-syncs master to NVIDIA v2.18.3 (same base as prod |
| for(char **e = envp; *e; e++) { | ||
| jsonStr(*e); | ||
| char key[MAX_LINE]; | ||
| char value[MAX_LINE]; | ||
| char *ptr = strchr(*e, '='); | ||
| if(ptr != NULL) { | ||
| // snprintf null-terminates; strncpy did not, segfaulting jsonStr on | ||
| // env values approaching MAX_LINE bytes (e.g. long FPATH/PATH). | ||
| snprintf(key, sizeof(key), "%.*s", (int)(ptr - *e), *e); | ||
| snprintf(value, sizeof(value), "%s", ptr + 1); | ||
| jsonKey(key); jsonStr(value); | ||
| } | ||
| } |
There was a problem hiding this comment.
value doesn't necessarily need to have its own buffer here, since the suffix of *e starting at ptr + 1 is already a null-terminated string. Although the JSON functions in here will eventually truncate it anyway. (Also, that comment probably doesn't need to be there, since it's talking about code that isn't there.)
| for(char **e = envp; *e; e++) { | |
| jsonStr(*e); | |
| char key[MAX_LINE]; | |
| char value[MAX_LINE]; | |
| char *ptr = strchr(*e, '='); | |
| if(ptr != NULL) { | |
| // snprintf null-terminates; strncpy did not, segfaulting jsonStr on | |
| // env values approaching MAX_LINE bytes (e.g. long FPATH/PATH). | |
| snprintf(key, sizeof(key), "%.*s", (int)(ptr - *e), *e); | |
| snprintf(value, sizeof(value), "%s", ptr + 1); | |
| jsonKey(key); jsonStr(value); | |
| } | |
| } | |
| for (char **e = envp; *e; e++) { | |
| char *ptr = strchr(*e, '='); | |
| if (!ptr) continue; | |
| char key[MAX_LINE]; | |
| snprintf(key, sizeof(key), "%.*s", (size_t)(ptr - *e), *e); | |
| jsonKey(key); | |
| jsonStr(ptr + 1); | |
| } |
| return testNotImplemented; | ||
| } | ||
| CUDACHECK(cudaGetDeviceProperties(&prop, cudaDev)); | ||
| getGPUSerial(cudaDev, gpuSerial); |
There was a problem hiding this comment.
You need to check the return value of getGPUSerial here, since if this failed, gpuSerial may contain uninitialized or stale bytes. Handing those to the snprintf call unchecked is suspicious.
There was a problem hiding this comment.
Oh, also, this ends up initializing and shutting down NVML once per loop iteration. You could probably handle errors more precisely and be more efficient about it too if you hoist that aspect of it out of the loop. E.g. a failure from NVML failing to shut down wouldn't mean that the contents of gpuSerial need to be discarded.
Co-authored-by: Eta <24918963+Eta0@users.noreply.github.com>
Co-authored-by: Eta <24918963+Eta0@users.noreply.github.com>
- env JSON: drop the value[MAX_LINE] buffer + stale comment; jsonStr reads ptr+1 (a NUL-terminated suffix of *e) directly, avoiding truncation. (Eta) - writeDeviceReport: check getGPUSerial() return; on NVML failure write "unknown" instead of emitting uninitialized/stale serial bytes. (Eta)
- jsonDouble: emit "inf"/"-inf" string for +/-inf instead of a bare inf token, which is invalid JSON (mirrors existing "nan" handling). - getFloatStr: short-circuit +/-inf; +inf otherwise loops forever as the uint64_t magnitude counter wraps. Guarded with isinf (<math.h>). These are pre-existing NCCL-upstream bugs, not part of the GPU-serial change.
The sscanf in parseRankInfo hardcodes the serial field width as %29; assert that rankInfo_t::gpuSerial is 30 bytes so a change to NVML_DEVICE_SERIAL_BUFFER_SIZE triggers a compile-time reminder to update it.
getGPUSerial used to nvmlInit()+nvmlShutdown() on every call, so NVML was initialized and torn down once per GPU. Move the lifecycle into writeDeviceReport: init once before the device loop, shut down once after. getGPUSerial now only fetches the handle + serial (NVML pre-initialized by the caller), so its nonzero return means a genuine serial lookup failure -> the device gets "unknown". An NVML shutdown failure after the loop no longer discards the serials already collected; it is just logged.
The hoisted nvmlInit() leaked the NVML handle if the in-loop CUDACHECK(cudaGetDeviceProperties) bailed early. Capture the result, shut NVML down on failure, then let CUDACHECK do its standard logging+return so every post-init return path now releases NVML.
|
@Eta0 ready for re-review when you have a moment. All your review points are addressed (pushed up to
Compile-tested locally ( |
What
Brings
nccl-tests-jsonmaster from NVIDIA v2.17.6 up to v2.18.3 (f727aa2) and layers CoreWeave's GPU-serial JSON enrichment on top.The v2.18.3 base (
f727aa2) is the same commit already shipped to prodcoreweave/nccl-testsin NVIDIA#91 — so the upstream bump here isn't new risk, it just re-syncs this repo's master to match prod.The CoreWeave delta (what to actually review)
Everything except the last two functional commits is stock NVIDIA — review per-commit. The non-upstream changes are only:
0f70559(Anton Gunnarsson) — GPU serial in JSON. AddsgetGPUSerial()/nvmlDeviceGetSerial(), emits"serial"per device in the-Joutput.src/Makefilenow links-lnvidia-ml+-luuid.e4b8711(slucas) — env-iteration buffer-overflow fix injsonOutputInit(unbounded snprintf → segfault on env values ≥ 2048 B; release blocker for the JSON path).78b5633(slucas) — trivial.gitignorefor build dirs.Combined non-upstream delta vs the 2.18.3 base: 2 files, +77 / −10 (
src/util.cu,src/Makefile).Evidence
Built from this branch and run on a real B200 via SUNK (
slurm-b200-193-213): the-JJSON correctly emits a live GPU serial —data.config.devices[0].serial = 1652025084924— and survives long env values (thee4b8711fix).Merge note
Please merge, not squash, so Anton's authorship on
0f70559is preserved.Follow-up
A separate PR against
coreweave/nccl-testswill repoint its Dockerfile to build from this commit (currently pinned to the bare NVIDIAf727aa2), so prod picks up the serial field.