WIP: feat(recipes): GB300 EKS service-bound overlays (re-file of #1319) by yuanchen8911 · Pull Request #1336 · NVIDIA/aicr

yuanchen8911 · 2026-06-12T16:36:19Z

WIP / Draft. Re-files #1319 (reverted in #1328) so the GB300 EKS recipe work isn't lost. The blocking fixes have now landed on this branch and GB300 validates clean (see Validation); the only item still gating the move out of draft is the standalone validator bug #1323.

Summary

Re-introduces the GB300 EKS service-bound overlays + accelerator plumbing from #1319, and lands the fixes needed to make GB300 deploy and validate end-to-end on a real cluster.

Networking fabric (EFA vs RoCE) — RoCE default, EFA opt-in

GB300 EKS networking defaults to ConnectX RoCE via a vendored AWS networking DRA driver (aws-net-k8s-dra-plugin, DeviceClass roce.networking.k8s.aws). The DGXC GB300 instance variant (p6e-gb300r) exposes a ConnectX RoCE fabric with no EFA device, so the legacy aws-efa device plugin crash-loops there (the nodes have no EFA hardware).

Standard EFAv4 GB300 (the AWS-documented default) is supported via an explicit bundle-time opt-in:

aicr bundle ... --set awsefa:enabled=true --set awsnetdra:enabled=false

This is an interim, explicit-intent mechanism with no auto-detection. Automatic fabric detection (EFA vs RoCE), with the full options analysis, is tracked in #1410.
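The opt-in above flips two component switches at once. As a rough mental model only, the sketch below parses overrides of the form component:enabled=<bool> and checks that exactly one fabric component ends up enabled. It does not describe how aicr bundle actually processes --set: the function names applySet and resolveFabric, and the "exactly one fabric" rule, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// applySet applies one hypothetical "component:enabled=<bool>" override
// to a map of component name -> enabled flag.
func applySet(enabled map[string]bool, expr string) error {
	target, value, ok := strings.Cut(expr, "=")
	if !ok {
		return fmt.Errorf("override %q: missing '='", expr)
	}
	component, key, ok := strings.Cut(target, ":")
	if !ok || key != "enabled" {
		return fmt.Errorf("override %q: want component:enabled=<bool>", expr)
	}
	b, err := strconv.ParseBool(value)
	if err != nil {
		return fmt.Errorf("override %q: %w", expr, err)
	}
	enabled[component] = b
	return nil
}

// resolveFabric starts from the GB300 EKS default described in this PR
// (RoCE on, EFA off), applies the overrides, and requires that exactly
// one fabric component stays enabled (an assumed rule, not AICR's).
func resolveFabric(overrides ...string) (string, error) {
	enabled := map[string]bool{"awsnetdra": true, "awsefa": false}
	for _, o := range overrides {
		if err := applySet(enabled, o); err != nil {
			return "", err
		}
	}
	switch {
	case enabled["awsnetdra"] && !enabled["awsefa"]:
		return "roce", nil
	case enabled["awsefa"] && !enabled["awsnetdra"]:
		return "efa", nil
	default:
		return "", fmt.Errorf("exactly one of awsefa/awsnetdra must be enabled")
	}
}

func main() {
	fabric, err := resolveFabric("awsefa:enabled=true", "awsnetdra:enabled=false")
	fmt.Println(fabric, err)
}
```

Under this model, passing only one of the two flags is rejected, which is why the opt-in command sets both.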

RoCE plugin image — AWS-owned ECR, no mirroring needed

The DRA driver image (eks/networking-device-dra-plugin) is pulled from AWS's own EKS-managed ECR — account 602401143452 is an AWS-owned, AWS-chosen registry (one per region), not an NVIDIA/DGXC account. It is digest-pinned in roce.yaml. The image is the general AWS networking-device DRA driver (driver dra.networking.k8s.aws); RoCE-specificity lives only in the DeviceClass CEL we ship, not the image.

A normal EKS GB300 deploy needs no mirroring and no pull secret: the GPU node's IAM role already carries ECR pull permission and AWS's repo policy grants pull to any authenticated AWS account, so kubelet pulls it transparently (verified on aicr-gb300 in us-east-2 pulling the us-west-2 image cross-region). The image is not anonymously pullable (no public.ecr.aws mirror, unlike aws-ebs-csi-driver) — it follows the same auth-gated pattern AICR already uses for the legacy aws-efa plugin. The open-source/transparency caveat (no anonymous pull, no source label/SBOM/signature) and the hardcoded us-west-2/region rigidity are tracked in #1410.

Changes on this branch

GB300 EKS overlays + gb300 accelerator plumbing (re-file of feat(recipes): add concrete GB300 EKS service-bound overlays #1319).
CUDA-13 vLLM runtime for Blackwell sm_103.
RoCE fabric: vendored aws-net-dra (manifest-only) + health check; aws-efa disabled on GB300 overlays (EFA opt-in documented above).
Kernel: nodewright NVIDIA_SETUP_KERNEL_ALLOW_NEWER=true. GB300 Grace nodes need the 64KB-page kernel (6.17.0-1007-aws-64k); the tuning was force-downgrading them to the 4KB-page 6.14.0-1018-aws, which breaks CUDA context creation and the NVLink fabric (nvidia-smi works, but CUDA fails with "device unavailable"). This is a kernel-pin issue independent of GB200/GB300.
dynamo-platform 1.2.0 / grove alpha.8 (align gb300 overlay with main; 1.2.0 serves the v1beta1 CRD the validators require).
BOM regenerated for the new aws-net-dra image.

Validation

aicr validate --phase all on a DGXC GB300 cluster (p6e-gb300r, aicr built from main, validator :edge):

deployment 4/4 · conformance 10/10 · performance 1/1 — all green.

Cluster prerequisite (not in the recipe): dynamo 1.2.0's NATS discovery needs TCP 4222 (and Prometheus 9090) reachable GPU-nodegroup → system-nodegroup. On EKS this is a security-group ingress rule owned by cluster provisioning, not AICR — see docs/integrator/eks-dynamo-networking.md.

Open issues / status

#1326 (EFA on AWS: legacy device plugin vs networking DRA driver): addressed. RoCE is the default via aws-net-dra and the EFA opt-in is documented; the fabric-aware auto-detection follow-up is filed as #1410 (Make EKS networking fabric-aware (EFA vs ConnectX RoCE)).
#1327 (Allocate GPUs via device plugin, not DRA: align with driver default (DRA = ComputeDomain only)): GPUs are allocated via the DRA API (gpu.nvidia.com), and the cuda-validator passes alongside DRA. The earlier "device busy/unavailable" was the kernel issue above (4KB vs 64KB pages), not device-plugin/DRA contention.
#1318 (Add and validate GB300 recipe overlays): the CUDA-13 vLLM runtime (sm_103) is included.
#1323 (Deployment validator stalls ~8m and reports 'other' on absent resources): deployment-validator context-deadline bug (P1, WIP); standalone, still gating out-of-draft.

Tracking issue: #1318. Reverted by: #1328.

github-actions · 2026-06-12T16:38:29Z

Recipe evidence check

Broad impact: recipes/registry.yaml or recipes/overlays/base.yaml changed;
every leaf recipe is potentially affected. The list below covers all of them — each
one would ideally have refreshed evidence before merge.

Affected leaf overlays: 69

Recipe	Pointer	Verify	Digest match
`a100-aks-training`	⚠️ missing	—	—
`a100-aks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`a100-aks-ubuntu-training`	⚠️ missing	—	—
`a100-eks-training`	⚠️ missing	—	—
`a100-eks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`a100-eks-ubuntu-training`	⚠️ missing	—	—
`a100-gke-cos-training-kubeflow`	⚠️ missing	—	—
`a100-gke-cos-training`	⚠️ missing	—	—
`a100-oke-training`	⚠️ missing	—	—
`a100-oke-ubuntu-training-kubeflow`	⚠️ missing	—	—
`a100-oke-ubuntu-training`	⚠️ missing	—	—
`b200-gke-cos-inference-dynamo`	⚠️ missing	—	—
`b200-gke-cos-inference`	⚠️ missing	—	—
`b200-gke-cos-training-kubeflow`	⚠️ missing	—	—
`b200-gke-cos-training`	⚠️ missing	—	—
`gb200-eks-inference`	⚠️ missing	—	—
`gb200-eks-training`	⚠️ missing	—	—
`gb200-eks-ubuntu-inference-dynamo`	⚠️ missing	—	—
`gb200-eks-ubuntu-inference`	⚠️ missing	—	—
`gb200-eks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`gb200-eks-ubuntu-training`	✅ present	✅ passed	⚠️ stale (`9aeea19f5b75…` vs current `07a8ff1e9d4b…`)
`gb200-oke-inference`	⚠️ missing	—	—
`gb200-oke-training`	⚠️ missing	—	—
`gb200-oke-ubuntu-inference-dynamo`	⚠️ missing	—	—
`gb200-oke-ubuntu-inference`	⚠️ missing	—	—
`gb200-oke-ubuntu-training-kubeflow`	⚠️ missing	—	—
`gb200-oke-ubuntu-training`	⚠️ missing	—	—
`gb300-eks-inference`	⚠️ missing	—	—
`gb300-eks-training`	⚠️ missing	—	—
`gb300-eks-ubuntu-inference-dynamo`	⚠️ missing	—	—
`gb300-eks-ubuntu-inference`	⚠️ missing	—	—
`gb300-eks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`gb300-eks-ubuntu-training`	⚠️ missing	—	—
`h100-aks-inference`	⚠️ missing	—	—
`h100-aks-training`	⚠️ missing	—	—
`h100-aks-ubuntu-inference-dynamo`	⚠️ missing	—	—
`h100-aks-ubuntu-inference`	⚠️ missing	—	—
`h100-aks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`h100-aks-ubuntu-training`	⚠️ missing	—	—
`h100-bcm-training`	⚠️ missing	—	—
`h100-bcm-ubuntu-training`	⚠️ missing	—	—
`h100-eks-inference`	⚠️ missing	—	—
`h100-eks-training`	⚠️ missing	—	—
`h100-eks-ubuntu-inference-dynamo`	⚠️ missing	—	—
`h100-eks-ubuntu-inference-nim`	⚠️ missing	—	—
`h100-eks-ubuntu-inference`	⚠️ missing	—	—
`h100-eks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`h100-eks-ubuntu-training-slurm`	⚠️ missing	—	—
`h100-eks-ubuntu-training`	⚠️ missing	—	—
`h100-gke-cos-inference-dynamo`	⚠️ missing	—	—
`h100-gke-cos-inference`	⚠️ missing	—	—
`h100-gke-cos-training-kubeflow`	⚠️ missing	—	—
`h100-gke-cos-training-slurm`	⚠️ missing	—	—
`h100-gke-cos-training`	✅ present	✅ passed	⚠️ stale (`b82cade51fd2…` vs current `064f27fa945d…`)
`h100-kind-inference-dynamo`	⚠️ missing	—	—
`h100-kind-inference`	⚠️ missing	—	—
`h100-kind-training-kubeflow`	⚠️ missing	—	—
`h100-kind-training-slurm`	⚠️ missing	—	—
`h100-kind-training`	⚠️ missing	—	—
`h200-eks-inference`	⚠️ missing	—	—
`h200-eks-training`	⚠️ missing	—	—
`rtx-pro-6000-eks-inference`	⚠️ missing	—	—
`rtx-pro-6000-eks-ubuntu-inference-dynamo`	⚠️ missing	—	—
`rtx-pro-6000-eks-ubuntu-inference-nim`	⚠️ missing	—	—
`rtx-pro-6000-eks-ubuntu-inference`	⚠️ missing	—	—
`rtx-pro-6000-lke-inference`	⚠️ missing	—	—
`rtx-pro-6000-lke-training`	⚠️ missing	—	—
`rtx-pro-6000-lke-ubuntu-inference`	⚠️ missing	—	—
`rtx-pro-6000-lke-ubuntu-training`	⚠️ missing	—	—

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

coderabbitai · 2026-06-12T16:44:35Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

This PR adds GB300 accelerator support end-to-end:

Defines a new exported CriteriaAcceleratorGB300 constant with parsing/validation logic.
Adds GB300 SKU detection in fingerprinting, ordered before B200.
Extends the OpenAPI schemas and the CLI/API reference documentation to include gb300.
Introduces the AWS DRA networking component for RoCE-based fabric connectivity.
Adds seven RecipeMetadata overlays covering GB300 on EKS (training/inference, Ubuntu, Dynamo, and Kubeflow variants).
Extends the NCCL and performance validators with GB300 support (template aliasing, combination-matrix updates, preflight applicability).
Updates the pinned Dynamo vLLM runtime image tags in validator test fixtures.
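The remark about ordering GB300 detection before B200 is easier to see with a first-match rule table. The sketch below is not the code in pkg/fingerprint/gpu_sku.go: the rule list, the substring matching, and the product-name strings are all assumptions for illustration. It shows the general hazard that ordering guards against, namely that "GB200" contains "B200", so a generic rule placed first would shadow the more specific one.

```go
package main

import (
	"fmt"
	"strings"
)

// skuRule maps a substring of a GPU product name to an accelerator slug.
type skuRule struct {
	needle      string
	accelerator string
}

// rules are evaluated top to bottom and the first match wins, so the
// Grace-Blackwell entries sit above the plain B200 entry.
var rules = []skuRule{
	{"GB300", "gb300"},
	{"GB200", "gb200"},
	{"B200", "b200"},
	{"H200", "h200"},
	{"H100", "h100"},
}

// detectAccelerator returns the slug of the first matching rule, or ""
// when no rule matches.
func detectAccelerator(productName string) string {
	upper := strings.ToUpper(productName)
	for _, r := range rules {
		if strings.Contains(upper, r.needle) {
			return r.accelerator
		}
	}
	return ""
}

func main() {
	// Hypothetical product-name strings.
	for _, name := range []string{"NVIDIA GB300", "NVIDIA GB200", "NVIDIA B200"} {
		fmt.Printf("%s -> %s\n", name, detectAccelerator(name))
	}
}
```

With the B200 rule moved to the top of the table, a GB200 product name would be classified as b200, which is the kind of regression a fingerprint test can pin down.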

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Add and validate GB300 recipe overlays #1318: Declares and implements GB300 support aligned with all three core success criteria in this PR—constant definition, recipe overlay manifests, and overlay metadata test coverage.
Make EKS networking fabric-aware (EFA vs ConnectX RoCE) — GB300 is fabric-hardcoded #1410: The interim "Option C" networking solution recommended in this issue is implemented here via the new aws-net-dra RoCE component, disabling aws-efa in GB300 EKS overlays, and establishing GB300 accelerator support layers that fabric-detection work would build upon.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main change as GB300 EKS service-bound overlays with re-filing context; it accurately reflects the primary objective in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description comprehensively details the re-filed GB300 EKS overlays, the changes included, validation results, and relevant issue tracking.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

recipes/overlays/gb300-eks-training.yaml (1)
121-131: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

GB300/EKS still encodes DRA as the required allocation model across recipes/overlays/gb300-eks-training.yaml and recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml. The training recipe enforces dra-support, and the Dynamo leaf hard-codes nvidia-dra-driver-gpu plus the DRA-driven >= 1.34 floor. That shared root cause conflicts with the PR’s own open blockers (#1327/#1326) about GB300/EKS allocation and networking, so these overlays will validate the wrong deployment shape until the device-plugin-vs-DRA decision is resolved.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-training.yaml` around lines 121 - 131, The overlay
still forces DRA by listing the conformance check "dra-support" and hard-coding
the Dynamo leaf to use "nvidia-dra-driver-gpu" with a ">= 1.34" floor; remove or
gate the "dra-support" entry from the conformance.checks array in
recipes/overlays/gb300-eks-training.yaml and remove the hard-coded
"nvidia-dra-driver-gpu" and its ">= 1.34" version constraint from
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (or replace both with a
feature-flag/conditional that defers to the device-plugin vs DRA decision), so
the overlays no longer validate deployments assuming DRA until the
device-plugin/DRA decision in issues #1327/#1326 is resolved.
validators/performance/nccl_all_reduce_bw_constraint.go (1)
138-169: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't bake GB300 into the GB200/EFA NCCL path yet.

Lines 138-169 hard-code GB300 onto the GB200 templates and advertise EKS NET/NVLS support as if the transport/runtime shape were identical. That conflicts with the PR objective's open blocker #1326, which explicitly calls out that GB300 on AWS is RoCE rather than EFA. As written, this can route GB300 validation through the wrong runtime assets and also triggers the GB200-specific NVreg preflight downstream in validators/performance/nccl_preflight_nvreg.go. Either keep GB300 out of the EKS NET/NVLS matrix until that blocker is closed, or add GB300-specific templates/preflight once the transport is validated.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/nccl_all_reduce_bw_constraint.go` around lines 138 -
169, Remove the temporary GB300-to-GB200 alias and any advertised GB300 support
in the NCCL combinations until AWS RoCE behavior is resolved: stop mapping
accelerator == recipe.CriteriaAcceleratorGB300 to
recipe.CriteriaAcceleratorGB200 in the templatePath logic (the block that
mutates the accelerator variable before returning filepath.Join) and remove
recipe.CriteriaAcceleratorGB300 from the lists under variantNET and variantNVLS
in the supportedNCCLCombinations map so GB300 is not listed for
recipe.CriteriaServiceEKS; keep GB300 out of any GB200-specific NVReg/preflight
paths until separate GB300 templates/preflight are added.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 338fc99e-d81b-4381-93f2-0049b29848de

📥 Commits

Reviewing files that changed from the base of the PR and between e6d9301 and 9ad2d74.

📒 Files selected for processing (27)

.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/model_cache.go`:
- Line 95: The code pins cacheWorkerImage to a CUDA 13 vLLM runtime
unconditionally which can break model-cache population on older GPU
accelerators; update buildModelCachePopulateJob (and any code that references
cacheWorkerImage) to select the runtime image based on the cluster/node
accelerator type (e.g., H100, A100, GB200, GB300, H200, B200, RTX Pro) or a
configurable override, and update the worker YAML templates
(validators/performance/testdata/inference/dynamo-deployment*.yaml) to support
multiple runtime images or accept an image parameter; specifically, make
cacheWorkerImage a function or switch keyed by accelerator label, wire that into
ensureModelCache which creates the populate Job, and allow an env/flag to force
a particular image for testing so older device generations use compatible CUDA
runtimes.
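The per-accelerator selection this comment asks for could look roughly like the sketch below. It is not the code in validators/performance/model_cache.go: the function name cacheWorkerImageFor, the override parameter, and the accelerator-to-image mapping are assumptions for illustration. The two image tags are the ones named in this PR's commit messages, with the registry prefix left out because it is not shown here.

```go
package main

import "fmt"

const (
	// Tags as named in this PR; registry prefix intentionally omitted.
	imageDefault = "vllm-runtime:1.2.0"
	imageCUDA13  = "vllm-runtime:1.2.0-cuda13"
)

// cacheWorkerImageFor picks a runtime image for the model-cache populate
// Job. A non-empty override always wins, which gives tests a way to force
// a particular image. The Blackwell-family grouping is an assumption.
func cacheWorkerImageFor(accelerator, override string) string {
	if override != "" {
		return override
	}
	switch accelerator {
	case "gb300", "gb200", "b200", "rtx-pro-6000":
		return imageCUDA13
	default:
		return imageDefault
	}
}

func main() {
	for _, acc := range []string{"gb300", "h100", "a100"} {
		fmt.Printf("%s -> %s\n", acc, cacheWorkerImageFor(acc, ""))
	}
}
```

The trade-off is an extra mapping to maintain per accelerator generation, versus a single global image that has to work everywhere.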

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c03a6f6f-b82b-459c-859e-63ae0caa61f3

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad2d74 and a0e4f2c.

📒 Files selected for processing (3)

validators/performance/model_cache.go
validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
validators/performance/testdata/inference/dynamo-deployment.yaml

github-actions · 2026-06-22T21:58:34Z

🌿 Preview your docs: https://nvidia-preview-feat-gb300-eks-overlays-v2.docs.buildwithfern.com/aicr

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

validators/performance/nccl_test.go (1)

601-691: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an explicit GB300 template-alias test case.

templatePath now aliases gb300 to the gb200 template directory, but this table currently doesn’t assert that contract directly. A focused GB300 case would prevent silent regressions in future refactors.

💡 Suggested diff

@@
 		{
 			name:        "gb200 eks NET variant",
 			accelerator: recipe.CriteriaAcceleratorGB200,
 			service:     recipe.CriteriaServiceEKS,
 			variant:     variantNET,
 			filename:    "runtime.yaml",
 			expected:    filepath.Join("testdata", "gb200", "eks", "runtime-net.yaml"),
 		},
+		{
+			name:        "gb300 eks NET variant aliases gb200 template",
+			accelerator: recipe.CriteriaAcceleratorGB300,
+			service:     recipe.CriteriaServiceEKS,
+			variant:     variantNET,
+			filename:    "runtime.yaml",
+			expected:    filepath.Join("testdata", "gb200", "eks", "runtime-net.yaml"),
+		},

As per coding guidelines, “Check test coverage on affected packages before pushing a PR with Go changes.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/nccl_test.go` around lines 601 - 691, The
TestTemplatePath function is missing a test case for the GB300 accelerator
alias. Add a new test case struct to the tests slice with the accelerator set to
recipe.CriteriaAcceleratorGB300, choose any appropriate service type like
recipe.CriteriaServiceEKS, set variant to variantDefault, set filename to
"runtime.yaml", and set the expected path to filepath.Join("testdata", "gb200",
"eks", "runtime.yaml") to verify that GB300 correctly aliases to the GB200
template directory. This ensures the gb300 aliasing contract is explicitly
tested and prevents silent regressions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/aws-net-dra/manifests/roce.yaml`:
- Around line 110-131: The network-plugin container in the DaemonSet
specification lacks an explicit seccomp profile in its securityContext, which
relies on runtime defaults and weakens pod confinement. Add a seccompProfile
field to the securityContext of the network-plugin container to explicitly
define the seccomp profile (such as RuntimeDefault or a custom profile name)
rather than relying on environment-dependent defaults, ensuring consistent
security enforcement across different Kubernetes clusters.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3ca51af9-828c-41ee-a960-4f267f802d20

📥 Commits

Reviewing files that changed from the base of the PR and between a0e4f2c and 1493012.

📒 Files selected for processing (35)

.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
docs/user/container-images.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
recipes/checks/aws-net-dra/health-check.yaml
recipes/components/aws-net-dra/manifests/roce.yaml
recipes/components/nodewright-customizations/manifests/tuning.yaml
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
recipes/registry.yaml
validators/performance/model_cache.go
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
validators/performance/testdata/inference/dynamo-deployment.yaml

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nodewright-customizations/manifests/tuning.yaml`:
- Line 82: The `NVIDIA_SETUP_KERNEL_ALLOW_NEWER` setting in the shared
tuning.yaml manifest is applying the "true" value globally across all 16
overlays and GPU families (GB300, H100, H200, A100, GB200, B200, RTX Pro, etc.).
If this kernel allow override is intended only for GB300 GPU families, move the
`NVIDIA_SETUP_KERNEL_ALLOW_NEWER: "true"` configuration from the shared
tuning.yaml manifest to a GB300-specific overlay file instead. If this change is
intentionally global across all GPU families, document this decision in the pull
request. Verify which GPU families actually require this kernel setting to avoid
unintended side effects on other hardware.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 16e87c61-744c-41d8-8861-368ef97637df

📥 Commits

Reviewing files that changed from the base of the PR and between fc3303f and 739e543.

…1319)

The default vllm-runtime:1.2.0 is a CUDA 12.9 build whose flashinfer kernels lack sm_103 (GB300/Blackwell Ultra), so the inference-perf worker crash-loops with "no kernel image is available for execution on the device". Switch the inference workload + model-cache populate image to vllm-runtime:1.2.0-cuda13 (CUDA 13.0), which covers the Blackwell family. Verified end-to-end: GB300 (sm_103) and RTX Pro 6000 (sm_120) both serve Qwen3-8B and pass the inference-perf gate with the cuda13 image. Refs NVIDIA#1318

GB300 (p6e-gb300r) exposes a ConnectX RoCE fabric, not AWS EFA. The legacy aws-efa device plugin crash-loops on these nodes (no EFA hardware). Replace it on GB300 EKS overlays with a vendored AWS networking DRA plugin (aws-net-k8s-dra-plugin v1.0.0, DeviceClass roce.networking.k8s.aws), mirroring the DGXC reference GB300 fabric config.

- Vendor aws-net-dra as a manifest-only component (chart is private on NGC; image is public ECR) + health check.
- gb300-eks-{inference,training}: disable aws-efa, add aws-net-dra.

Refs NVIDIA#1326

…Grace 64k kernel)

The nodewright nvidia-setup-kernel package ran with ALLOW_NEWER=false, which force-downgrades a node to the exact pinned 6.14.0-1018-aws (4KB-page) kernel even when it already booted a newer one. On GB300/GB200 Grace (ARM) nodes the AMI ships 6.17.0-1007-aws-64k (64KB page); the downgrade to the 4KB kernel breaks CUDA context creation and NVLink fabric on Grace (nvidia-smi works, CUDA fails with 'device unavailable'). Set ALLOW_NEWER=true so the floor check (>= 6.14.0-1018) is kept but a newer, working kernel is accepted instead of downgraded. Verified on a GB300 EKS cluster: skyhook reconciles complete=2/2, keeps 6.17.0-1007-aws-64k, and the GPU operator CUDA validator passes.
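The difference between the two ALLOW_NEWER settings comes down to a version comparison. The sketch below models that decision under stated assumptions: it is not the nodewright nvidia-setup-kernel logic, the function names are hypothetical, and it compares only the leading numeric fields of the kernel release string, ignoring flavor suffixes such as aws or 64k.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelVersion extracts the leading numeric fields of a release string,
// e.g. "6.17.0-1007-aws-64k" -> [6 17 0 1007].
func kernelVersion(release string) []int {
	var nums []int
	isSep := func(r rune) bool { return r == '.' || r == '-' }
	for _, f := range strings.FieldsFunc(release, isSep) {
		n, err := strconv.Atoi(f)
		if err != nil {
			break // stop at the first non-numeric field ("aws", "64k", ...)
		}
		nums = append(nums, n)
	}
	return nums
}

// compare returns -1, 0, or 1; missing trailing fields count as zero.
func compare(a, b []int) int {
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// kernelAction keeps the running kernel when it equals the pin, or when it
// is newer and allowNewer is set; otherwise the pinned kernel is installed.
func kernelAction(running, pinned string, allowNewer bool) string {
	c := compare(kernelVersion(running), kernelVersion(pinned))
	if c == 0 || (c > 0 && allowNewer) {
		return "keep"
	}
	return "install-pinned"
}

func main() {
	running, pinned := "6.17.0-1007-aws-64k", "6.14.0-1018-aws"
	fmt.Println("allowNewer=false:", kernelAction(running, pinned, false))
	fmt.Println("allowNewer=true: ", kernelAction(running, pinned, true))
}
```

In this model a kernel older than the pin is still replaced when allowNewer is true, which corresponds to the floor check being kept.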

…n with main)

Match the dynamo-platform/grove versions main uses for the other dynamo overlays. dynamo 1.2.0 serves the DynamoGraphDeployment CRD at nvidia.com/v1beta1, which the current performance/conformance validators (inference-perf, robust-controller) expect; the stale 1.0.2 pin only served v1alpha1.

New manifest-only aws-net-dra component adds one image (eks/networking-device-dra-plugin:v1.0.0). Components 27->28, images 82->83. Per CLAUDE.md, BOM is regenerated and committed in the same PR as the registry/component change.

GB300 overlays default to the ConnectX RoCE fabric (aws-net-dra) for the DGXC p6e-gb300r variant. Standard EFAv4 GB300 is supported via an explicit bundle-time opt-in (--set awsefa:enabled=true --set awsnetdra:enabled=false), verified to swap the bundle to aws-efa. Automatic fabric detection is tracked in NVIDIA#1410.

Address CodeRabbit: the aws-net-dra (RoCE DRA) pod runs privileged (SYS_ADMIN, hostNetwork, host mounts) but inherited the runtime's default (Unconfined) seccomp. Pin spec.template.spec.securityContext.seccompProfile=RuntimeDefault for defense-in-depth; compatible with the SYS_ADMIN capability the plugin needs. (Trivy KSV-0105 runAsUser:0 is not actionable here — the plugin requires root to set up /dev/infiniband devices, matching the upstream chart.)

CodeRabbit review on NVIDIA#1336:

- NVIDIA_SETUP_KERNEL_ALLOW_NEWER was flipped to true in the shared nodewright-customizations tuning manifest, which is consumed by 9 non-GB300 overlays (h100/h200/a100/gb200/b200). Only GB300's Grace ARM64 host needs the newer 64KB-page kernel. Template the value from $cust.kernelAllowNewer (default false, restoring prior behavior for x86 families) and opt in only from the two GB300 overlays.
- Document why cacheWorkerImage stays a single global CUDA 13 image: CUDA 13 is required by GB300 (sm_103) and backward-compatible to sm_50, and the populate Job downloads weights without running GPU kernels, so it does not gate older generations.

TestComponentManifestImagesAreDigestPinned (ADR-006 layer 2) requires every vendored component-manifest image to be digest-pinned or exempted. The RoCE DRA plugin image was tag-pinned (:v1.0.0). Pin it to its resolved digest (sha256:7c87f397...) — it is a plain DaemonSet image, so the CRD-rejects-digests exemption does not apply.

github-actions · 2026-06-26T12:30:47Z

@yuanchen8911 this PR now has merge conflicts with main. Please rebase to resolve them.

yuanchen8911 added area/recipes area/validator labels Jun 12, 2026

github-actions Bot added area/ci area/docs area/cli area/api size/XL and removed area/validator labels Jun 12, 2026

yuanchen8911 mentioned this pull request Jun 12, 2026

Add and validate GB300 recipe overlays #1318

Open

coderabbitai Bot reviewed Jun 12, 2026

Comment thread validators/performance/model_cache.go

mchmarny assigned yuanchen8911 Jun 12, 2026

yuanchen8911 mentioned this pull request Jun 22, 2026

EFA on AWS: legacy device plugin vs networking DRA driver — should we migrate? #1326

Open

yuanchen8911 force-pushed the feat/gb300-eks-overlays-v2 branch from a0e4f2c to 1493012 Compare June 22, 2026 21:57

coderabbitai Bot reviewed Jun 22, 2026

Comment thread recipes/components/aws-net-dra/manifests/roce.yaml

yuanchen8911 mentioned this pull request Jun 22, 2026

Make EKS networking fabric-aware (EFA vs ConnectX RoCE) — GB300 is fabric-hardcoded #1410

Closed

yuanchen8911 force-pushed the feat/gb300-eks-overlays-v2 branch from fc3303f to 739e543 Compare June 23, 2026 00:40

coderabbitai Bot reviewed Jun 23, 2026

Comment thread recipes/components/nodewright-customizations/manifests/tuning.yaml Outdated

yuanchen8911 added 9 commits June 22, 2026 19:30

feat(recipes): add concrete GB300 EKS service-bound overlays (NVIDIA#1319)

13d2571

yuanchen8911 force-pushed the feat/gb300-eks-overlays-v2 branch from 739e543 to 785df42 Compare June 23, 2026 02:30

github-actions Bot added the needs-rebase label Jun 26, 2026

yuanchen8911 wants to merge 10 commits into NVIDIA:main from yuanchen8911:feat/gb300-eks-overlays-v2
