feat(validators): RoCE NET variant for nccl-all-reduce-bw#1428
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds RoCE fabric support for the NCCL all-reduce bandwidth NET validator. Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/validator/v1/job_plan_internal.go`:
- Around line 166-173: The ncclFabricEnv variable is being forwarded to the NCCL
check pod but is not protected from being overridden by catalog entries. Locate
the skip loop around line 194 that filters entry.Env variables (where
hfTokenEnvVar, requireScopedInferenceGatewayEnv, and inferencePerfNoCleanupEnv
are currently being skipped to prevent catalog overrides), and add ncclFabricEnv
to that same condition. This ensures that the orchestrator-selected fabric value
forwarded in the earlier block cannot be silently overridden by a duplicate
entry from the catalog, maintaining the trust boundary as documented in the
adjacent comments.
In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 684-700: The RoCE ResourceClaimTemplate named "nccl-roce-rct"
created in the applyNCCLResources function when fabric equals fabricRoCE is
never deleted in the cleanupNCCLResources function. Since the namespace is
persistent and reused across test runs, this causes subsequent RoCE test runs to
fail with AlreadyExists errors. Add cleanup logic in cleanupNCCLResources to
delete the ResourceClaimTemplate using resourceClaimTemplateGVR with the name
"nccl-roce-rct" and config.Namespace. Reference the cleanup pattern used in
inference_perf_constraint.go (lines 1494–1496) as an example of the correct
deletion approach.
In `@validators/performance/nccl_test.go`:
- Line 942: The templatePath call at line 942 hardcodes fabricEFA as the
parameter, which means the RoCE runtime template path is never parsed or
validated by this test. Add a new RoCE NET subtest that iterates through
roceNETSupportedServices and calls templatePath with fabricRoCE instead of
fabricEFA to ensure RoCE runtime templates are caught for any malformation
before deployment.
- Around line 617-627: The test case for RoCE NET starting at the "eks gb200 net
roce is accelerator-agnostic" entry claims accelerator-agnostic behavior in its
comment, but only validates this with the gb200 accelerator. Add another
identical test case entry to the same test table with a different accelerator
value (such as h100) while keeping the same service (EKS), variant (NET), fabric
(RoCE), filename, and expected path ("testdata/roce/eks/runtime-net.yaml") to
actually verify that the RoCE path is independent of the accelerator choice and
prove the accelerator-agnostic contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 804d4334-57bb-46b7-90f1-0993909631b4
📒 Files selected for processing (5)
pkg/validator/v1/job_plan_internal.govalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_test.govalidators/performance/testdata/roce/eks/roce-claim.yamlvalidators/performance/testdata/roce/eks/runtime-net.yaml
0cd1326 to
66ca130
Compare
66ca130 to
b5f55cd
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
validators/performance/nccl_test.go (1)
924-945: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winParse the standalone RoCE claim template in this guard too.
This loop now covers
runtime-net.yaml, butapplyNCCLResourcesalso parsestestdata/roce/{service}/roce-claim.yamlbefore the runtime. A malformed claim template or missingROCE_DEVICE_COUNTplaceholder would still slip past this test and fail only during validation.Suggested extension
data := map[string]string{ "NAMESPACE": "aicr-validation", "WORKER_COUNT": "2", "GPU_COUNT_PER_NODE": "8", "GPU_COUNT": "16", + "ROCE_DEVICE_COUNT": "8", "TEST_TYPE": testType, "MIN_MESSAGE_SIZE": minMessageSize, "MAX_MESSAGE_SIZE": maxMessageSize, "EFA_RESOURCE_LIMITS": buildEFAResourceLine(1, efaIndent), "EFA_RESOURCE_REQUESTS": buildEFAResourceLine(1, efaIndent), @@ for service := range roceNETSupportedServices { name := strings.Join([]string{string(variantNET), string(service), string(fabricRoCE)}, "/") t.Run(name, func(t *testing.T) { + claimPath := filepath.Join("testdata", string(fabricRoCE), string(service), "roce-claim.yaml") + if _, err := parseYAMLTemplate(claimPath, data); err != nil { + t.Fatalf("supported RoCE NET combination has no parseable claim template %s: %v", claimPath, err) + } + path := templatePath(recipe.CriteriaAcceleratorH100, service, variantNET, fabricRoCE, "runtime.yaml") if _, err := parseYAMLTemplate(path, data); err != nil { t.Fatalf("supported RoCE NET combination has no parseable runtime template %s: %v", path, err) } }) }Also applies to: 961-972
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/nccl_test.go` around lines 924 - 945, Extend the `applyNCCLResources` test coverage to also parse the standalone RoCE claim template under `testdata/roce/{service}/roce-claim.yaml` before the runtime template. Reuse the existing validation flow around `buildEFAResourceLine`, `buildGKENetworkInterfacesAnnotation`, and `buildNRIDeviceAnnotation`, but add the RoCE claim path so a malformed claim or missing `ROCE_DEVICE_COUNT` placeholder is caught by this test instead of only at validation time. Also apply the same update to the other duplicated block referenced by the `Also applies to` range.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 145-152: ncclFabric() currently falls back to EFA for any unknown
value and lets RoCE flow into non-NET paths, so tighten validation and scope it
to NET-only cases. Update ncclFabric and the templatePath routing logic to
accept RoCE only when the selected variant is NET, treat unsupported env values
as invalid instead of defaulting silently, and keep non-NET/NVLS variants pinned
to EFA. Use the existing ncclFabricType/fabricRoCE/fabricEFA symbols and the
templatePath selector to make the validation and branching explicit.
- Around line 350-353: Register cleanup before calling applyNCCLResources so any
partial NCCL/RoCE objects are removed even if resource application fails partway
through. Update the flow around applyNCCLResources in the performance NCCL
bandwidth constraint path to defer cleanupNCCLResources immediately after the
validation namespace is known, and keep the existing error wrap on apply
failure; apply the same change in the additional NCCL validation path referenced
by the comment.
In `@validators/performance/testdata/roce/eks/runtime-net.yaml`:
- Around line 165-179: The node initContainer in the performance test YAML is
doing unnecessary package installation and sshd setup that will not persist into
the later node container. Remove the apt-get/openSSH-server and /var/run/sshd
steps from the fix-ssh-perms initContainer, and keep only the authorized_keys
copy into the shared ssh-config volume so the startup path is simpler and avoids
extra failure points. Use the fix-ssh-perms initContainer and its ssh-config
volume mount as the identifiers when updating this block.
---
Outside diff comments:
In `@validators/performance/nccl_test.go`:
- Around line 924-945: Extend the `applyNCCLResources` test coverage to also
parse the standalone RoCE claim template under
`testdata/roce/{service}/roce-claim.yaml` before the runtime template. Reuse the
existing validation flow around `buildEFAResourceLine`,
`buildGKENetworkInterfacesAnnotation`, and `buildNRIDeviceAnnotation`, but add
the RoCE claim path so a malformed claim or missing `ROCE_DEVICE_COUNT`
placeholder is caught by this test instead of only at validation time. Also
apply the same update to the other duplicated block referenced by the `Also
applies to` range.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 67d576ef-d07d-478d-82d8-b6d03dd9c58d
📒 Files selected for processing (5)
pkg/validator/v1/job_plan_internal.govalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_test.govalidators/performance/testdata/roce/eks/roce-claim.yamlvalidators/performance/testdata/roce/eks/runtime-net.yaml
b5f55cd to
757b977
Compare
|
🌿 Preview your docs: https://nvidia-preview-feat-nccl-roce-fabric.docs.buildwithfern.com/aicr |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/user/validation.md`:
- Line 51: Update the `nccl-all-reduce-bw-net` validation row in
`docs/user/validation.md` to describe NET transport verification rather than
EFA-only behavior. Keep the `NET`/`AICR_NCCL_FABRIC=roce` distinction clear, and
change the description so it says the check verifies NET traffic on the
supported fabric paths, with the EFA silent-fallback caveat noted as the
default-path example when the NVIDIA driver preflight is relevant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6eb1c6ab-06eb-4032-a408-2b7da668da09
📒 Files selected for processing (7)
docs/user/validation.mdpkg/validator/v1/job_plan_internal.gorecipes/validators/README.mdvalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_test.govalidators/performance/testdata/roce/eks/roce-claim.yamlvalidators/performance/testdata/roce/eks/runtime-net.yaml
757b977 to
911aff0
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/validators/README.md`:
- Line 50: The `nccl-all-reduce-bw-net` description is updated in the README,
but the source-of-truth catalog entry is still stale. Update the matching entry
in `catalog.yaml` for `nccl-all-reduce-bw-net` so its description matches the
new README wording, keeping the validator metadata consistent across CLI/CTRF
surfaces.
In `@validators/performance/nccl_test.go`:
- Around line 1000-1011: The NCCL wiring test in nccl_test.go only checks the
RoCE runtime template and misses the new roce-claim.yaml generated by
applyNCCLResources. Add a small parse-only loop alongside the existing RoCE NET
template checks that targets roce-claim.yaml for each supported service, using
the same templatePath/parseYAMLTemplate flow and populating ROCE_DEVICE_COUNT in
the test data so malformed claim templates are caught early.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 47ea1ec1-bab8-4867-abaf-2c877a14f2c4
📒 Files selected for processing (7)
docs/user/validation.mdpkg/validator/v1/job_plan_internal.gorecipes/validators/README.mdvalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_test.govalidators/performance/testdata/roce/eks/roce-claim.yamlvalidators/performance/testdata/roce/eks/runtime-net.yaml
9f8adc5 to
7b65e2b
Compare
7b65e2b to
3585dfb
Compare
3585dfb to
0887059
Compare
Confirmed IssuesF1 [medium] — Missing catalog lock test + cross-package constant duplication F2 [minor] — Catalog description stale F3 [minor] — Orchestrator forwards F4 [minor] — F5 [minor] — Empty-env test doesn't exercise the unset branch F7 [cosmetic] — Cleanup unconditionally deletes RoCE RCT F13 [minor / follow-up] — GB300 not in F14 [minor] — Parameter explosion in Suggested refactor (separate PR / follow-up): ```go func templatePath(sel ncclSelection, filename string) string Drops Open Questions
|
0887059 to
9113e09
Compare
|
Thanks for the thorough pass. Addressed in 9113e09:
Deferred (folded into existing trackers rather than new issues):
F7 (unconditional RoCE-RCT delete in cleanup) — leaving as-is: it's NotFound-tolerant and intentionally mirrors the ComputeDomain cleanup pattern, keeping cleanup oblivious to the active variant rather than threading a fabric arg through. |
Recipe evidence checkNo leaf overlays affected by this PR. This gate is warning-only and never blocks merge. |
…FABRIC), default EFA
Adds a ConnectX RoCE path to the nccl-all-reduce-bw-net test, selected by
AICR_NCCL_FABRIC=roce (default efa — every existing EFA recipe is
unchanged). Fabric-keyed and accelerator-agnostic: the RoCE NET template
lives at testdata/roce/{service}/ and is shared across EKS RoCE nodes
rather than per-accelerator, so it does not depend on any specific
accelerator constant.
When fabric=roce, the NET test:
- uses NCCL's built-in IB/verbs transport over the ConnectX RoCE devices
(NCCL_IB_HCA=rocep, NCCL_NET_PLUGIN=none to bypass the bundled aws-ofi
EFA plugin), on a CUDA-13 pytorch image (sm_103-capable; the EFA
hpc-cloud image is CUDA 12 + EFA-only);
- claims RoCE NICs via a DRA ResourceClaimTemplate (roce.networking.k8s.aws)
instead of the vpc.amazonaws.com/efa extended resource;
- disables HPC-X UCC/HCOLL and forces the ob1 PML over TCP (their team-create
/ UCX PML segfault during MPI_Init on this image — MPI is only bootstrap).
The transport assertion (verifyTransportFromLogs) is unchanged: it already
accepts any non-Socket NET plugin, so 'Using network IB' passes.
AICR_NCCL_FABRIC is forwarded to the NET check pod via buildEnv (the test
runs in-Job), scoped to nccl-all-reduce-bw-net.
Validated end-to-end on a ConnectX-RoCE GB300 cluster: NCCL IB over 8
rocep* RoCE devices (GPUDirect RDMA), ~387 GB/s peak busbw. The env knob is
interim; snapshot-based fabric auto-detection is tracked in NVIDIA#1413.
Refs NVIDIA#1413, NVIDIA#1410.
9113e09 to
681760f
Compare
Summary
Adds a ConnectX RoCE path to the
nccl-all-reduce-bw-netperformance test, selected by an env knobAICR_NCCL_FABRIC=roce(defaultefa). Every existing EFA recipe is unchanged (default path untouched). Fabric-keyed and accelerator-agnostic: the RoCE NET template lives attestdata/roce/{service}/and is shared across EKS RoCE nodes, so it does not depend on any accelerator constant.Motivation / Context
nccl-all-reduce-bw-netis EFA-hardwired (forcesFI_PROVIDER=efa, requestsvpc.amazonaws.com/efa, uses an EFA-packaged CUDA-12 image). On a ConnectX RoCE fabric (e.g. DGXC GB300p6e-gb300r) it can't run, so--phase performanceon a RoCE recipe fails. This adds the RoCE variant so RoCE clusters can validate the NET transport.Fixes: N/A
Related: #1413 (RoCE NCCL NET test/validation — tracking issue), #1410 (EFA vs RoCE networking)
Type of Change
Component(s) Affected
pkg/validator)Implementation Notes
When
AICR_NCCL_FABRIC=roce, the NET test:NCCL_IB_HCA=rocep,NCCL_NET_PLUGIN=noneto bypass the bundled aws-ofi/EFA plugin), on a CUDA-13 pytorch image (sm_103-capable);ResourceClaimTemplate(roce.networking.k8s.aws) instead of the EFA extended resource (applied as a standalone object, sinceparseYAMLTemplateis single-document);MPI_Initon this image; MPI is only the launcher/bootstrap, NCCL carries the data).templatePathreturnstestdata/roce/{service}/...forfabric=roce(no per-accelerator dir, nogb300coupling). Support is keyed by service (roceNETSupportedServices).verifyTransportFromLogsis unchanged — it already accepts any non-Socket NET plugin, soUsing network IBpasses.AICR_NCCL_FABRICis forwarded to the NET check pod viabuildEnv(the test runs in-Job), scoped tonccl-all-reduce-bw-net.AICR_NCCL_FABRICis an interim override; snapshot-based fabric auto-detection (so the recipe doesn't need a manual knob) is tracked in #1413.Testing
Validated end-to-end on a ConnectX-RoCE GB300 cluster: NCCL IB over 8
rocep*RoCE devices (GPUDirect RDMA), ~387 GB/s peak busbw (Using network IB, not Socket). Details in #1413.Risk Assessment
efaleaves all existing behavior byte-identical; the RoCE branch is dormant unlessAICR_NCCL_FABRIC=roceis set on a RoCE cluster.Checklist
Follow-ups
Tracked under #1413, not blocking this interim NET variant:
nvcr.io/nvidia/pytorch(CUDA 13 / sm_103),since the common RoCE
nccl-testsimages are CUDA 12. Once a CUDA-13/sm_103nccl-testsimage is available it becomes the preferred base and removes theruntime apt-get-sshd step the pytorch image requires.
NCCL_IB_PCI_RELAXED_ORDERING=1— a RoCE perf knob not set here. Wealready hit ~387 GB/s, but it's a cheap candidate bandwidth gain.
rather than a manual env knob; this snapshot-based detection is the Support RoCE fabric for the nccl-all-reduce-bw NET test/validation #1413
direction that turns
AICR_NCCL_FABRICinto an override.