
feat(validators): RoCE NET variant for nccl-all-reduce-bw #1428

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/nccl-roce-fabric on Jun 26, 2026

Conversation

yuanchen8911 commented Jun 23, 2026


Summary

Adds a ConnectX RoCE path to the nccl-all-reduce-bw-net performance test, selected by an env knob AICR_NCCL_FABRIC=roce (default efa). Every existing EFA recipe is unchanged (default path untouched). Fabric-keyed and accelerator-agnostic: the RoCE NET template lives at testdata/roce/{service}/ and is shared across EKS RoCE nodes, so it does not depend on any accelerator constant.

Motivation / Context

nccl-all-reduce-bw-net is EFA-hardwired (forces FI_PROVIDER=efa, requests vpc.amazonaws.com/efa, uses an EFA-packaged CUDA-12 image). On a ConnectX RoCE fabric (e.g. DGXC GB300 p6e-gb300r) it can't run, so --phase performance on a RoCE recipe fails. This adds the RoCE variant so RoCE clusters can validate the NET transport.

Fixes: N/A
Related: #1413 (RoCE NCCL NET test/validation — tracking issue), #1410 (EFA vs RoCE networking)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Validator (pkg/validator)

Implementation Notes

When AICR_NCCL_FABRIC=roce, the NET test:

  • uses NCCL's built-in IB/verbs transport over the ConnectX RoCE devices (NCCL_IB_HCA=rocep, NCCL_NET_PLUGIN=none to bypass the bundled aws-ofi/EFA plugin), on a CUDA-13 pytorch image (sm_103-capable);
  • claims RoCE NICs via a DRA ResourceClaimTemplate (roce.networking.k8s.aws) instead of the EFA extended resource (applied as a standalone object, since parseYAMLTemplate is single-document);
  • disables HPC-X UCC/HCOLL and forces the ob1 PML over TCP, because UCC/HCOLL team creation and the UCX PML segfault during MPI_Init on this image (MPI is only the launcher/bootstrap; NCCL carries the data).

templatePath returns testdata/roce/{service}/... for fabric=roce (no per-accelerator dir, no gb300 coupling). Support is keyed by service (roceNETSupportedServices). verifyTransportFromLogs is unchanged — it already accepts any non-Socket NET plugin, so Using network IB passes. AICR_NCCL_FABRIC is forwarded to the NET check pod via buildEnv (the test runs in-Job), scoped to nccl-all-reduce-bw-net.
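The fabric-keyed routing described above can be sketched as follows. This is a minimal illustration, not the PR's code: the string-based types stand in for the repo's `recipe.Criteria*` types, the helper `withVariantSuffix` is hypothetical, and the non-RoCE directory layout is an assumption made only so the example has a second branch.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Simplified stand-ins for the repo's criteria types (assumption).
type (
	accelerator string
	service     string
	variant     string
	fabric      string
)

const (
	variantNET variant = "net"
	fabricEFA  fabric  = "efa"
	fabricRoCE fabric  = "roce"
)

// withVariantSuffix is a hypothetical helper: "runtime.yaml" -> "runtime-net.yaml".
func withVariantSuffix(filename string, v variant) string {
	ext := path.Ext(filename)
	return strings.TrimSuffix(filename, ext) + "-" + string(v) + ext
}

// templatePath sketches the routing: a RoCE NET template is keyed by fabric
// and service only, so the accelerator argument is ignored on that branch.
func templatePath(acc accelerator, svc service, v variant, f fabric, filename string) string {
	name := withVariantSuffix(filename, v)
	if v == variantNET && f == fabricRoCE {
		return path.Join("testdata", string(f), string(svc), name)
	}
	// Assumed per-accelerator layout for every other combination.
	return path.Join("testdata", string(acc), string(svc), name)
}

func main() {
	// Two different accelerators resolve to the same RoCE template.
	for _, acc := range []accelerator{"gb200", "h100"} {
		fmt.Println(templatePath(acc, "eks", variantNET, fabricRoCE, "runtime.yaml"))
	}
}
```

Both iterations print `testdata/roce/eks/runtime-net.yaml`, which is the accelerator-agnostic property the PR relies on.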

AICR_NCCL_FABRIC is an interim override; snapshot-based fabric auto-detection (so the recipe doesn't need a manual knob) is tracked in #1413.

Testing

go test ./validators/performance/ -run 'TestTemplatePath|NCCL'   # pass
go test ./pkg/validator/v1/                                       # pass
golangci-lint run ./validators/performance/... ./pkg/validator/v1/...  # 0 issues

Validated end-to-end on a ConnectX-RoCE GB300 cluster: NCCL IB over 8 rocep* RoCE devices (GPUDirect RDMA), ~387 GB/s peak busbw (Using network IB, not Socket). Details in #1413.

Risk Assessment

  • Low — default efa leaves all existing behavior byte-identical; the RoCE branch is dormant unless AICR_NCCL_FABRIC=roce is set on a RoCE cluster.

Checklist

  • Tests pass locally (NCCL + job_plan); lint clean
  • Default-EFA: zero regression for existing recipes

Follow-ups

Tracked under #1413, not blocking this interim NET variant:

  1. Test image — this PR uses nvcr.io/nvidia/pytorch (CUDA 13 / sm_103),
    since the common RoCE nccl-tests images are CUDA 12. Once a CUDA-13/sm_103
    nccl-tests image is available it becomes the preferred base and removes the
    runtime apt-get-sshd step the pytorch image requires.
  2. NCCL_IB_PCI_RELAXED_ORDERING=1 — a RoCE perf knob not set here. We
    already hit ~387 GB/s, but it's a cheap candidate bandwidth gain.
  3. Fabric auto-detection — key fabric selection off DRA DeviceClass presence
    rather than a manual env knob; this snapshot-based detection is the #1413
    direction that turns AICR_NCCL_FABRIC into an override.

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 23, 2026 20:03
@yuanchen8911 yuanchen8911 marked this pull request as draft June 23, 2026 20:07
@yuanchen8911 yuanchen8911 changed the title from "feat(validators): RoCE NET variant for nccl-all-reduce-bw (AICR_NCCL_FABRIC), default EFA" to "WIP: feat(validators): RoCE NET variant for nccl-all-reduce-bw (AICR_NCCL_FABRIC), default EFA" on Jun 23, 2026
coderabbitai Bot commented Jun 23, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR adds RoCE fabric support for the NCCL all-reduce bandwidth NET validator. AICR_NCCL_FABRIC is forwarded for the NET check, parsed at runtime, and used to branch validation, template selection, resource application, and cleanup. The PR also adds RoCE-specific manifests for the EKS runtime path and updates tests and docs for the new fabric handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#640: Modifies the same NCCL all-reduce bandwidth validator flow and template/runtime selection logic, which this PR extends with fabric-aware branching.

Suggested labels

area/tests, theme/validation

Suggested reviewers

  • mchmarny
  • lalitadithya
  • lockwobr
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly reflects the main change: adding RoCE NET support for nccl-all-reduce-bw. |
| Description check | ✅ Passed | The description clearly matches the changes, covering RoCE support, AICR_NCCL_FABRIC, template changes, and the EFA default path. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/validator/v1/job_plan_internal.go`:
- Around line 166-173: The ncclFabricEnv variable is being forwarded to the NCCL
check pod but is not protected from being overridden by catalog entries. Locate
the skip loop around line 194 that filters entry.Env variables (where
hfTokenEnvVar, requireScopedInferenceGatewayEnv, and inferencePerfNoCleanupEnv
are currently being skipped to prevent catalog overrides), and add ncclFabricEnv
to that same condition. This ensures that the orchestrator-selected fabric value
forwarded in the earlier block cannot be silently overridden by a duplicate
entry from the catalog, maintaining the trust boundary as documented in the
adjacent comments.

In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 684-700: The RoCE ResourceClaimTemplate named "nccl-roce-rct"
created in the applyNCCLResources function when fabric equals fabricRoCE is
never deleted in the cleanupNCCLResources function. Since the namespace is
persistent and reused across test runs, this causes subsequent RoCE test runs to
fail with AlreadyExists errors. Add cleanup logic in cleanupNCCLResources to
delete the ResourceClaimTemplate using resourceClaimTemplateGVR with the name
"nccl-roce-rct" and config.Namespace. Reference the cleanup pattern used in
inference_perf_constraint.go (lines 1494–1496) as an example of the correct
deletion approach.

In `@validators/performance/nccl_test.go`:
- Line 942: The templatePath call at line 942 hardcodes fabricEFA as the
parameter, which means the RoCE runtime template path is never parsed or
validated by this test. Add a new RoCE NET subtest that iterates through
roceNETSupportedServices and calls templatePath with fabricRoCE instead of
fabricEFA to ensure RoCE runtime templates are caught for any malformation
before deployment.
- Around line 617-627: The test case for RoCE NET starting at the "eks gb200 net
roce is accelerator-agnostic" entry claims accelerator-agnostic behavior in its
comment, but only validates this with the gb200 accelerator. Add another
identical test case entry to the same test table with a different accelerator
value (such as h100) while keeping the same service (EKS), variant (NET), fabric
(RoCE), filename, and expected path ("testdata/roce/eks/runtime-net.yaml") to
actually verify that the RoCE path is independent of the accelerator choice and
prove the accelerator-agnostic contract.
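The override protection requested in the first comment above (for `job_plan_internal.go`) might look roughly like the sketch below. It assumes a simplified `envVar` type and a hypothetical `mergeEnv` helper; the real `buildEnv` code and its skip condition are not shown in this thread, and `NCCL_DEBUG` appears only as an illustrative catalog entry.

```go
package main

import "fmt"

// envVar is a simplified stand-in for a Kubernetes env var entry (assumption).
type envVar struct{ Name, Value string }

const ncclFabricEnv = "AICR_NCCL_FABRIC"

// reserved lists names the orchestrator owns. Catalog entries with these
// names are dropped so they cannot override the forwarded value.
var reserved = map[string]bool{ncclFabricEnv: true}

// mergeEnv (hypothetical) appends catalog entries after the forwarded ones,
// skipping any entry whose name is reserved.
func mergeEnv(forwarded, catalog []envVar) []envVar {
	out := append([]envVar{}, forwarded...)
	for _, e := range catalog {
		if reserved[e.Name] {
			continue // orchestrator-selected value wins
		}
		out = append(out, e)
	}
	return out
}

func main() {
	forwarded := []envVar{{ncclFabricEnv, "roce"}}
	catalog := []envVar{{ncclFabricEnv, "efa"}, {"NCCL_DEBUG", "INFO"}}
	for _, e := range mergeEnv(forwarded, catalog) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

The duplicate catalog entry is dropped, so the pod would see `AICR_NCCL_FABRIC=roce` exactly once.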
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 804d4334-57bb-46b7-90f1-0993909631b4

📥 Commits

Reviewing files that changed from the base of the PR and between 6663eb5 and 9959c5c.

📒 Files selected for processing (5)
  • pkg/validator/v1/job_plan_internal.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/roce/eks/roce-claim.yaml
  • validators/performance/testdata/roce/eks/runtime-net.yaml

Comment thread pkg/validator/v1/job_plan_internal.go
Comment thread validators/performance/nccl_all_reduce_bw_constraint.go
Comment thread validators/performance/nccl_test.go
Comment thread validators/performance/nccl_test.go

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
validators/performance/nccl_test.go (1)

924-945: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Parse the standalone RoCE claim template in this guard too.

This loop now covers runtime-net.yaml, but applyNCCLResources also parses testdata/roce/{service}/roce-claim.yaml before the runtime. A malformed claim template or missing ROCE_DEVICE_COUNT placeholder would still slip past this test and fail only during validation.

Suggested extension
 	data := map[string]string{
 		"NAMESPACE":             "aicr-validation",
 		"WORKER_COUNT":          "2",
 		"GPU_COUNT_PER_NODE":    "8",
 		"GPU_COUNT":             "16",
+		"ROCE_DEVICE_COUNT":     "8",
 		"TEST_TYPE":             testType,
 		"MIN_MESSAGE_SIZE":      minMessageSize,
 		"MAX_MESSAGE_SIZE":      maxMessageSize,
 		"EFA_RESOURCE_LIMITS":   buildEFAResourceLine(1, efaIndent),
 		"EFA_RESOURCE_REQUESTS": buildEFAResourceLine(1, efaIndent),
@@
 	for service := range roceNETSupportedServices {
 		name := strings.Join([]string{string(variantNET), string(service), string(fabricRoCE)}, "/")
 		t.Run(name, func(t *testing.T) {
+			claimPath := filepath.Join("testdata", string(fabricRoCE), string(service), "roce-claim.yaml")
+			if _, err := parseYAMLTemplate(claimPath, data); err != nil {
+				t.Fatalf("supported RoCE NET combination has no parseable claim template %s: %v", claimPath, err)
+			}
+
 			path := templatePath(recipe.CriteriaAcceleratorH100, service, variantNET, fabricRoCE, "runtime.yaml")
 			if _, err := parseYAMLTemplate(path, data); err != nil {
 				t.Fatalf("supported RoCE NET combination has no parseable runtime template %s: %v", path, err)
 			}
 		})
 	}

Also applies to: 961-972

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/nccl_test.go` around lines 924 - 945, Extend the
`applyNCCLResources` test coverage to also parse the standalone RoCE claim
template under `testdata/roce/{service}/roce-claim.yaml` before the runtime
template. Reuse the existing validation flow around `buildEFAResourceLine`,
`buildGKENetworkInterfacesAnnotation`, and `buildNRIDeviceAnnotation`, but add
the RoCE claim path so a malformed claim or missing `ROCE_DEVICE_COUNT`
placeholder is caught by this test instead of only at validation time. Also
apply the same update to the other duplicated block referenced by the `Also
applies to` range.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 145-152: ncclFabric() currently falls back to EFA for any unknown
value and lets RoCE flow into non-NET paths, so tighten validation and scope it
to NET-only cases. Update ncclFabric and the templatePath routing logic to
accept RoCE only when the selected variant is NET, treat unsupported env values
as invalid instead of defaulting silently, and keep non-NET/NVLS variants pinned
to EFA. Use the existing ncclFabricType/fabricRoCE/fabricEFA symbols and the
templatePath selector to make the validation and branching explicit.
- Around line 350-353: Register cleanup before calling applyNCCLResources so any
partial NCCL/RoCE objects are removed even if resource application fails partway
through. Update the flow around applyNCCLResources in the performance NCCL
bandwidth constraint path to defer cleanupNCCLResources immediately after the
validation namespace is known, and keep the existing error wrap on apply
failure; apply the same change in the additional NCCL validation path referenced
by the comment.

In `@validators/performance/testdata/roce/eks/runtime-net.yaml`:
- Around line 165-179: The node initContainer in the performance test YAML is
doing unnecessary package installation and sshd setup that will not persist into
the later node container. Remove the apt-get/openSSH-server and /var/run/sshd
steps from the fix-ssh-perms initContainer, and keep only the authorized_keys
copy into the shared ssh-config volume so the startup path is simpler and avoids
extra failure points. Use the fix-ssh-perms initContainer and its ssh-config
volume mount as the identifiers when updating this block.
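One reading of the stricter fabric parsing requested in the first inline comment above is sketched here. The function name `parseFabric`, the `"nvls"` variant value, and the choice to silently pin non-NET variants to EFA (rather than reject them) are all assumptions for illustration; the thread does not show what the merged `ncclFabric()` does.

```go
package main

import (
	"fmt"
	"os"
)

type (
	variant string
	fabric  string
)

const (
	variantNET  variant = "net"
	variantNVLS variant = "nvls" // hypothetical value for a non-NET variant
	fabricEFA   fabric  = "efa"
	fabricRoCE  fabric  = "roce"

	ncclFabricEnv = "AICR_NCCL_FABRIC"
)

// parseFabric (hypothetical) rejects unknown values instead of defaulting
// silently, and only honors the knob for the NET variant.
func parseFabric(raw string, v variant) (fabric, error) {
	if v != variantNET {
		return fabricEFA, nil // non-NET variants stay pinned to EFA
	}
	switch raw {
	case "", string(fabricEFA):
		return fabricEFA, nil // unset or explicit default
	case string(fabricRoCE):
		return fabricRoCE, nil
	default:
		return "", fmt.Errorf("unsupported %s=%q (want %q or %q)",
			ncclFabricEnv, raw, fabricEFA, fabricRoCE)
	}
}

func main() {
	f, err := parseFabric(os.Getenv(ncclFabricEnv), variantNET)
	if err != nil {
		fmt.Println("invalid fabric:", err)
		return
	}
	fmt.Println("fabric:", f)
}
```

With this shape, a typo such as `AICR_NCCL_FABRIC=roc` produces an error rather than a quiet fallback to EFA.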

---

Outside diff comments:
In `@validators/performance/nccl_test.go`:
- Around line 924-945: Extend the `applyNCCLResources` test coverage to also
parse the standalone RoCE claim template under
`testdata/roce/{service}/roce-claim.yaml` before the runtime template. Reuse the
existing validation flow around `buildEFAResourceLine`,
`buildGKENetworkInterfacesAnnotation`, and `buildNRIDeviceAnnotation`, but add
the RoCE claim path so a malformed claim or missing `ROCE_DEVICE_COUNT`
placeholder is caught by this test instead of only at validation time. Also
apply the same update to the other duplicated block referenced by the `Also
applies to` range.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 67d576ef-d07d-478d-82d8-b6d03dd9c58d

📥 Commits

Reviewing files that changed from the base of the PR and between 66ca130 and b5f55cd.

📒 Files selected for processing (5)
  • pkg/validator/v1/job_plan_internal.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/roce/eks/roce-claim.yaml
  • validators/performance/testdata/roce/eks/runtime-net.yaml

Comment thread validators/performance/nccl_all_reduce_bw_constraint.go Outdated
Comment thread validators/performance/nccl_all_reduce_bw_constraint.go Outdated
Comment thread validators/performance/testdata/roce/eks/runtime-net.yaml
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/validation.md`:
- Line 51: Update the `nccl-all-reduce-bw-net` validation row in
`docs/user/validation.md` to describe NET transport verification rather than
EFA-only behavior. Keep the `NET`/`AICR_NCCL_FABRIC=roce` distinction clear, and
change the description so it says the check verifies NET traffic on the
supported fabric paths, with the EFA silent-fallback caveat noted as the
default-path example when the NVIDIA driver preflight is relevant.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6eb1c6ab-06eb-4032-a408-2b7da668da09

📥 Commits

Reviewing files that changed from the base of the PR and between b5f55cd and 757b977.

📒 Files selected for processing (7)
  • docs/user/validation.md
  • pkg/validator/v1/job_plan_internal.go
  • recipes/validators/README.md
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/roce/eks/roce-claim.yaml
  • validators/performance/testdata/roce/eks/runtime-net.yaml

Comment thread docs/user/validation.md
@yuanchen8911 yuanchen8911 force-pushed the feat/nccl-roce-fabric branch from 757b977 to 911aff0 on June 25, 2026 15:30

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/validators/README.md`:
- Line 50: The `nccl-all-reduce-bw-net` description is updated in the README,
but the source-of-truth catalog entry is still stale. Update the matching entry
in `catalog.yaml` for `nccl-all-reduce-bw-net` so its description matches the
new README wording, keeping the validator metadata consistent across CLI/CTRF
surfaces.

In `@validators/performance/nccl_test.go`:
- Around line 1000-1011: The NCCL wiring test in nccl_test.go only checks the
RoCE runtime template and misses the new roce-claim.yaml generated by
applyNCCLResources. Add a small parse-only loop alongside the existing RoCE NET
template checks that targets roce-claim.yaml for each supported service, using
the same templatePath/parseYAMLTemplate flow and populating ROCE_DEVICE_COUNT in
the test data so malformed claim templates are caught early.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 47ea1ec1-bab8-4867-abaf-2c877a14f2c4

📥 Commits

Reviewing files that changed from the base of the PR and between 757b977 and 911aff0.

📒 Files selected for processing (7)
  • docs/user/validation.md
  • pkg/validator/v1/job_plan_internal.go
  • recipes/validators/README.md
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/roce/eks/roce-claim.yaml
  • validators/performance/testdata/roce/eks/runtime-net.yaml

Comment thread recipes/validators/README.md
Comment thread validators/performance/nccl_test.go
@yuanchen8911 yuanchen8911 force-pushed the feat/nccl-roce-fabric branch 4 times, most recently from 9f8adc5 to 7b65e2b on June 25, 2026 16:16
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 25, 2026 16:57
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 25, 2026 16:57
@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 June 25, 2026 16:57
@yuanchen8911 yuanchen8911 changed the title from "WIP: feat(validators): RoCE NET variant for nccl-all-reduce-bw (AICR_NCCL_FABRIC), default EFA" to "feat(validators): RoCE NET variant for nccl-all-reduce-bw, default EFA" on Jun 25, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/nccl-roce-fabric branch from 7b65e2b to 3585dfb on June 25, 2026 17:03
@yuanchen8911 yuanchen8911 changed the title from "feat(validators): RoCE NET variant for nccl-all-reduce-bw, default EFA" to "feat(validators): RoCE NET variant for nccl-all-reduce-bw" on Jun 25, 2026
Comment thread pkg/validator/v1/job_plan_internal.go
@yuanchen8911 yuanchen8911 force-pushed the feat/nccl-roce-fabric branch from 3585dfb to 0887059 on June 25, 2026 18:38
@yuanchen8911 yuanchen8911 requested a review from xdu31 June 25, 2026 18:40
xdu31 commented Jun 25, 2026


Confirmed Issues

F1 [medium] — Missing catalog lock test + cross-package constant duplication
pkg/validator/v1/job_plan_internal.go#L57, #L61 + validators/performance/nccl_all_reduce_bw_constraint.go#L131
ncclAllReduceBWNetCheckName is unexported and has no TestEmbeddedCatalog_*EntryExists lock test, unlike the sibling InferencePerfCheckName/InferenceGatewayCheckName constants whose godoc explicitly explains the lock exists to prevent silent forwarding no-ops on catalog rename. Additionally, ncclFabricEnv = "AICR_NCCL_FABRIC" is redeclared independently in two packages with no test asserting equality. Renaming the catalog entry, or fat-fingering one of the two env-string declarations, would silently no-op RoCE forwarding (validator pod sees no env value → ncclFabric() defaults to EFA) with no test failing.

F2 [minor] — Catalog description stale
recipes/validators/catalog.yaml#L234
Still reads "NET transport (EFA on EKS)". README and docs/user/validation.md were updated to mention RoCE; the catalog description (surfaced in CTRF reports) was not.

F3 [minor] — Orchestrator forwards AICR_NCCL_FABRIC without validation
pkg/validator/v1/job_plan_internal.go#L170
A typo (AICR_NCCL_FABRIC=roc) burns a full validator Job spin-up + RBAC + pod-pull before in-Job ncclFabric() rejects it. Compare to the adjacent inferencePerfNoCleanupEnv block at #L183, which gates on strconv.ParseBool before forwarding.

F4 [minor] — ROCE_DEVICE_COUNT only set inside the EKS discovery branch
validators/performance/nccl_all_reduce_bw_constraint.go#L669 vs #L717, #L729
Runtime/claim apply gates on fabric == fabricRoCE, but the template var is set inside if service == CriteriaServiceEKS. Dormant today (only EKS in roceNETSupportedServices). Adding a non-EKS service to that map without mirroring the EKS branch leaves literal ${ROCE_DEVICE_COUNT} in rendered YAML and fails parse. Fix is to move the assignment next to the if fabric == fabricRoCE block at line 717 so it's keyed by fabric, not service.

F5 [minor] — Empty-env test doesn't exercise the unset branch
pkg/validator/v1/job_plan_test.go#L550-L555
Uses t.Setenv("") (LookupEnv returns ok=true, v=""), not os.Unsetenv (ok=false). Behaviorally identical today under the ok && v != "" gate, but the validators-side TestNCCLFabric already calls os.Unsetenv for this reason — the orchestrator test could match.

F7 [cosmetic] — Cleanup unconditionally deletes RoCE RCT
validators/performance/nccl_all_reduce_bw_constraint.go#L1372
NotFound-tolerated; one extra API call per non-RoCE cleanup. Consistent with the pre-existing ComputeDomain cleanup pattern — keeps cleanup oblivious to the active variant rather than threading a fabric arg through.

F13 [minor / follow-up] — GB300 not in CriteriaAcceleratorType enum
pkg/recipe/criteria.go#L113-L120
GB300 is recognized at the GPU fingerprint layer (pkg/collector/gpu/device_ids.go#L37, with PCI IDs 31a1/31c2/31c3) and in measurement context maps, but no CriteriaAcceleratorGB300 constant exists. Recipes cannot declare accelerator: gb300 today — operators targeting the GB300 hardware this PR validates on must fall back to gb200 or any. Independent of this PR's fabric routing; surfaced by it. Per the project's "Enum/constant value added" convention, adding GB300 would touch the criteria type, OpenAPI contract, several doc pages, and issue templates — a separate change.

F14 [minor] — Parameter explosion in applyNCCLResources / runNCCLTrainJob / templatePath
validators/performance/nccl_all_reduce_bw_constraint.go#L184, #L339, #L626
The same (accelerator, service, variant, fabric) tuple now appears in three signatures. applyNCCLResources reaches 7 positional params, one of which (dynamicClient) duplicates ctx.DynamicClient. Each new routing dimension forces an edit to every call site + every test.

Suggested refactor (separate PR / follow-up):

```go
type ncclSelection struct {
	Accelerator recipe.CriteriaAcceleratorType
	Service     recipe.CriteriaServiceType
	Variant     ncclVariant
	Fabric      ncclFabricType
}

func templatePath(sel ncclSelection, filename string) string
func runNCCLTrainJob(ctx *validators.Context, gpuConfig *gpuConfiguration, sel ncclSelection) (string, error)
func applyNCCLResources(ctx *validators.Context, config *gpuConfiguration, sel ncclSelection) error // drop redundant dynamicClient
```

Drops applyNCCLResources from 7 to 3 params; next dimension (e.g. an OS/topology key) lands without touching call sites.

Open Questions

  • F8 — ROCE_DEVICE_COUNT keyed off config.GPUCountPerNode, not the discovered RoCE NIC pool. No preflight if a future RoCE instance type has NICs < GPUs/node.
  • F9 — Claim path hand-built at nccl_all_reduce_bw_constraint.go#L722, bypasses templatePath(). Drift between the two would not be caught by tests.
  • F10 — Skip message at #L256 reports "requires Service + Accelerator" even when the RoCE path explicitly ignores accelerator. Diagnostic clarity only.
  • F11 — roceNETSupportedServices naming doesn't reflect that it's only consulted for variant == variantNET && fabric == fabricRoCE. Consider roceNETVariantSupportedServices.
  • F12 — Worker apt-get install openssh-server has no fast-fail; air-gapped clusters surface as a long launcher SSH-retry timeout (~minutes) instead of a clear "no apt egress" error. Documented as interim in #1413.

@yuanchen8911


Thanks for the thorough pass. Addressed in 9113e09:

  • F1 — exported NCCLAllReduceBWNetCheckName and added TestEmbeddedCatalog_NCCLAllReduceBWNetEntryExists (mirrors the inference-perf/gateway lock tests). Added env-name lock tests pinning AICR_NCCL_FABRIC to its canonical literal in both the orchestrator and validator-pod packages, so a fat-finger in either redeclaration fails CI.
  • F2 — catalog description now mentions the RoCE path.
  • F4 — moved the ROCE_DEVICE_COUNT assignment into the fabric == fabricRoCE block so it's keyed by fabric, not service; a future non-EKS RoCE service no longer leaves ${ROCE_DEVICE_COUNT} unrendered.
  • F5 — added an explicit unset (os.Unsetenv) subtest alongside the empty-string case.

Deferred (folded into existing trackers rather than new issues):

  • F8 (NIC-pool preflight) and F12 (apt fast-fail / air-gapped) → #1413, where the fabric-detection and image work already lives.
  • F13 (CriteriaAcceleratorGB300 enum) → #1318 (Add and validate GB300 recipe overlays), since it's a prerequisite for first-class GB300 overlays and independent of this PR's fabric routing.
  • F3 (orchestrator-side validation), F9/F10/F11 (claim-path/templatePath, skip message, naming), F14 (ncclSelection param-struct refactor) — reasonable but low-value or explicitly separate-PR; left out to keep this interim PR tight.

F7 (unconditional RoCE-RCT delete in cleanup) — leaving as-is: it's NotFound-tolerant and intentionally mirrors the ComputeDomain cleanup pattern, keeping cleanup oblivious to the active variant rather than threading a fabric arg through.

@github-actions


Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

…FABRIC), default EFA

Adds a ConnectX RoCE path to the nccl-all-reduce-bw-net test, selected by
AICR_NCCL_FABRIC=roce (default efa — every existing EFA recipe is
unchanged). Fabric-keyed and accelerator-agnostic: the RoCE NET template
lives at testdata/roce/{service}/ and is shared across EKS RoCE nodes
rather than per-accelerator, so it does not depend on any specific
accelerator constant.

When fabric=roce, the NET test:
- uses NCCL's built-in IB/verbs transport over the ConnectX RoCE devices
  (NCCL_IB_HCA=rocep, NCCL_NET_PLUGIN=none to bypass the bundled aws-ofi
  EFA plugin), on a CUDA-13 pytorch image (sm_103-capable; the EFA
  hpc-cloud image is CUDA 12 + EFA-only);
- claims RoCE NICs via a DRA ResourceClaimTemplate (roce.networking.k8s.aws)
  instead of the vpc.amazonaws.com/efa extended resource;
- disables HPC-X UCC/HCOLL and forces the ob1 PML over TCP (their team-create
  / UCX PML segfault during MPI_Init on this image — MPI is only bootstrap).

The transport assertion (verifyTransportFromLogs) is unchanged: it already
accepts any non-Socket NET plugin, so 'Using network IB' passes.

AICR_NCCL_FABRIC is forwarded to the NET check pod via buildEnv (the test
runs in-Job), scoped to nccl-all-reduce-bw-net.

Validated end-to-end on a ConnectX-RoCE GB300 cluster: NCCL IB over 8
rocep* RoCE devices (GPUDirect RDMA), ~387 GB/s peak busbw. The env knob is
interim; snapshot-based fabric auto-detection is tracked in NVIDIA#1413.

Refs NVIDIA#1413, NVIDIA#1410.
@yuanchen8911 yuanchen8911 force-pushed the feat/nccl-roce-fabric branch from 9113e09 to 681760f on June 25, 2026 19:42
@mchmarny mchmarny merged commit 381ba2a into NVIDIA:main Jun 26, 2026
158 checks passed

3 participants