WIP: feat(recipes): GB300 EKS service-bound overlays (re-file of #1319) #1336

Draft
yuanchen8911 wants to merge 10 commits into NVIDIA:main from yuanchen8911:feat/gb300-eks-overlays-v2

Conversation

@yuanchen8911 (Contributor) commented Jun 12, 2026

WIP / Draft. Re-files #1319 (reverted in #1328) so the GB300 EKS recipe work isn't lost. The blocking fixes have now landed on this branch and GB300 validates clean (see Validation); the remaining gate for leaving draft is the standalone validator bug #1323.

Summary

Re-introduces the GB300 EKS service-bound overlays + accelerator plumbing from #1319, and lands the fixes needed to make GB300 deploy and validate end-to-end on a real cluster.

Networking fabric (EFA vs RoCE) — RoCE default, EFA opt-in

GB300 EKS networking defaults to ConnectX RoCE via a vendored AWS networking DRA driver (aws-net-k8s-dra-plugin, DeviceClass roce.networking.k8s.aws). The DGXC GB300 instance variant (p6e-gb300r) exposes a ConnectX RoCE fabric with no EFA device, so the legacy aws-efa device plugin crash-loops there (not EFA enabled).

Standard EFAv4 GB300 (the AWS-documented default) is supported via an explicit bundle-time opt-in:

aicr bundle ... --set awsefa:enabled=true --set awsnetdra:enabled=false

This is the explicit-intent interim (no auto-detection). Automatic fabric detection (EFA vs RoCE), with the full options analysis, is tracked in #1410.
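The opt-in above can be read as two component toggles layered on a RoCE default. The sketch below is illustrative only: `parseSet` and `resolveFabric` are hypothetical names, and rejecting "both enabled" or "neither enabled" is an assumption of this sketch, not documented `aicr bundle` behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSet splits one "component:key=value" override, the shape used by the
// --set flags above, into its three parts.
func parseSet(s string) (component, key, value string, err error) {
	comp, rest, ok := strings.Cut(s, ":")
	if !ok {
		return "", "", "", fmt.Errorf("override %q: missing ':'", s)
	}
	k, v, ok := strings.Cut(rest, "=")
	if !ok {
		return "", "", "", fmt.Errorf("override %q: missing '='", s)
	}
	return comp, k, v, nil
}

// resolveFabric applies overrides on top of the GB300 defaults (RoCE on,
// EFA off). Assumption: exactly one fabric must end up enabled.
func resolveFabric(sets []string) (string, error) {
	enabled := map[string]bool{"awsnetdra": true, "awsefa": false}
	for _, s := range sets {
		comp, key, val, err := parseSet(s)
		if err != nil {
			return "", err
		}
		if _, known := enabled[comp]; !known || key != "enabled" {
			continue // not a fabric toggle; ignored in this sketch
		}
		enabled[comp] = val == "true"
	}
	switch {
	case enabled["awsefa"] && enabled["awsnetdra"]:
		return "", fmt.Errorf("both fabrics enabled; disable one")
	case enabled["awsefa"]:
		return "efa", nil
	case enabled["awsnetdra"]:
		return "roce", nil
	default:
		return "", fmt.Errorf("no fabric enabled")
	}
}

func main() {
	fabric, err := resolveFabric([]string{"awsefa:enabled=true", "awsnetdra:enabled=false"})
	fmt.Println(fabric, err)
}
```

Passing no overrides yields the RoCE default; passing only `awsefa:enabled=true` is rejected in this sketch because the RoCE driver would still be enabled.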

RoCE plugin image — AWS-owned ECR, no mirroring needed

The DRA driver image (eks/networking-device-dra-plugin) is pulled from AWS's own EKS-managed ECR — account 602401143452 is an AWS-owned, AWS-chosen registry (one per region), not an NVIDIA/DGXC account. It is digest-pinned in roce.yaml. The image is the general AWS networking-device DRA driver (driver dra.networking.k8s.aws); RoCE-specificity lives only in the DeviceClass CEL we ship, not the image.

A normal EKS GB300 deploy needs no mirroring and no pull secret: the GPU node's IAM role already carries ECR pull permission and AWS's repo policy grants pull to any authenticated AWS account, so kubelet pulls it transparently (verified on aicr-gb300 in us-east-2 pulling the us-west-2 image cross-region). The image is not anonymously pullable (there is no public.ecr.aws mirror, unlike aws-ebs-csi-driver); it follows the same auth-gated pattern AICR already uses for the legacy aws-efa plugin. Two caveats are tracked in #1410: the open-source/transparency gap (no anonymous pull, no source label, SBOM, or signature) and the hardcoded us-west-2 registry region.

Changes on this branch

  • GB300 EKS overlays + gb300 accelerator plumbing (re-file of #1319, "feat(recipes): add concrete GB300 EKS service-bound overlays").
  • CUDA-13 vLLM runtime for Blackwell sm_103.
  • RoCE fabric: vendored aws-net-dra (manifest-only) + health check; aws-efa disabled on GB300 overlays (EFA opt-in documented above).
  • Kernel: nodewright NVIDIA_SETUP_KERNEL_ALLOW_NEWER=true. GB300 Grace nodes need the 64KB-page kernel (6.17.0-1007-aws-64k); the tuning was force-downgrading them to the 4KB 6.14.0-1018-aws, which breaks CUDA context creation + NVLink fabric (nvidia-smi works, CUDA fails "device unavailable"). This is a kernel-pin issue, not a GB200-vs-GB300 difference.
  • dynamo-platform 1.2.0 / grove alpha.8 (align gb300 overlay with main; 1.2.0 serves the v1beta1 CRD the validators require).
  • BOM regenerated for the new aws-net-dra image.

Validation

aicr validate --phase all on a DGXC GB300 cluster (p6e-gb300r, aicr built from main, validator :edge):

  • deployment 4/4 · conformance 10/10 · performance 1/1 — all green.

Cluster prerequisite (not in the recipe): dynamo 1.2.0's NATS discovery needs TCP 4222 (and Prometheus 9090) reachable GPU-nodegroup → system-nodegroup. On EKS this is a security-group ingress rule owned by cluster provisioning, not AICR — see docs/integrator/eks-dynamo-networking.md.

Open issues / status

Tracking issue: #1318. Reverted by: #1328.

@github-actions (bot, Contributor) commented Jun 12, 2026

Recipe evidence check

Broad impact: recipes/registry.yaml or recipes/overlays/base.yaml changed;
every leaf recipe is potentially affected. The list below covers all of them — each
one would ideally have refreshed evidence before merge.

Affected leaf overlays: 69

| Recipe | Pointer | Verify | Digest match |
|---|---|---|---|
| a100-aks-training | ⚠️ missing | | |
| a100-aks-ubuntu-training-kubeflow | ⚠️ missing | | |
| a100-aks-ubuntu-training | ⚠️ missing | | |
| a100-eks-training | ⚠️ missing | | |
| a100-eks-ubuntu-training-kubeflow | ⚠️ missing | | |
| a100-eks-ubuntu-training | ⚠️ missing | | |
| a100-gke-cos-training-kubeflow | ⚠️ missing | | |
| a100-gke-cos-training | ⚠️ missing | | |
| a100-oke-training | ⚠️ missing | | |
| a100-oke-ubuntu-training-kubeflow | ⚠️ missing | | |
| a100-oke-ubuntu-training | ⚠️ missing | | |
| b200-gke-cos-inference-dynamo | ⚠️ missing | | |
| b200-gke-cos-inference | ⚠️ missing | | |
| b200-gke-cos-training-kubeflow | ⚠️ missing | | |
| b200-gke-cos-training | ⚠️ missing | | |
| gb200-eks-inference | ⚠️ missing | | |
| gb200-eks-training | ⚠️ missing | | |
| gb200-eks-ubuntu-inference-dynamo | ⚠️ missing | | |
| gb200-eks-ubuntu-inference | ⚠️ missing | | |
| gb200-eks-ubuntu-training-kubeflow | ⚠️ missing | | |
| gb200-eks-ubuntu-training | ✅ present | ✅ passed | ⚠️ stale (9aeea19f5b75… vs current 07a8ff1e9d4b…) |
| gb200-oke-inference | ⚠️ missing | | |
| gb200-oke-training | ⚠️ missing | | |
| gb200-oke-ubuntu-inference-dynamo | ⚠️ missing | | |
| gb200-oke-ubuntu-inference | ⚠️ missing | | |
| gb200-oke-ubuntu-training-kubeflow | ⚠️ missing | | |
| gb200-oke-ubuntu-training | ⚠️ missing | | |
| gb300-eks-inference | ⚠️ missing | | |
| gb300-eks-training | ⚠️ missing | | |
| gb300-eks-ubuntu-inference-dynamo | ⚠️ missing | | |
| gb300-eks-ubuntu-inference | ⚠️ missing | | |
| gb300-eks-ubuntu-training-kubeflow | ⚠️ missing | | |
| gb300-eks-ubuntu-training | ⚠️ missing | | |
| h100-aks-inference | ⚠️ missing | | |
| h100-aks-training | ⚠️ missing | | |
| h100-aks-ubuntu-inference-dynamo | ⚠️ missing | | |
| h100-aks-ubuntu-inference | ⚠️ missing | | |
| h100-aks-ubuntu-training-kubeflow | ⚠️ missing | | |
| h100-aks-ubuntu-training | ⚠️ missing | | |
| h100-bcm-training | ⚠️ missing | | |
| h100-bcm-ubuntu-training | ⚠️ missing | | |
| h100-eks-inference | ⚠️ missing | | |
| h100-eks-training | ⚠️ missing | | |
| h100-eks-ubuntu-inference-dynamo | ⚠️ missing | | |
| h100-eks-ubuntu-inference-nim | ⚠️ missing | | |
| h100-eks-ubuntu-inference | ⚠️ missing | | |
| h100-eks-ubuntu-training-kubeflow | ⚠️ missing | | |
| h100-eks-ubuntu-training-slurm | ⚠️ missing | | |
| h100-eks-ubuntu-training | ⚠️ missing | | |
| h100-gke-cos-inference-dynamo | ⚠️ missing | | |
| h100-gke-cos-inference | ⚠️ missing | | |
| h100-gke-cos-training-kubeflow | ⚠️ missing | | |
| h100-gke-cos-training-slurm | ⚠️ missing | | |
| h100-gke-cos-training | ✅ present | ✅ passed | ⚠️ stale (b82cade51fd2… vs current 064f27fa945d…) |
| h100-kind-inference-dynamo | ⚠️ missing | | |
| h100-kind-inference | ⚠️ missing | | |
| h100-kind-training-kubeflow | ⚠️ missing | | |
| h100-kind-training-slurm | ⚠️ missing | | |
| h100-kind-training | ⚠️ missing | | |
| h200-eks-inference | ⚠️ missing | | |
| h200-eks-training | ⚠️ missing | | |
| rtx-pro-6000-eks-inference | ⚠️ missing | | |
| rtx-pro-6000-eks-ubuntu-inference-dynamo | ⚠️ missing | | |
| rtx-pro-6000-eks-ubuntu-inference-nim | ⚠️ missing | | |
| rtx-pro-6000-eks-ubuntu-inference | ⚠️ missing | | |
| rtx-pro-6000-lke-inference | ⚠️ missing | | |
| rtx-pro-6000-lke-training | ⚠️ missing | | |
| rtx-pro-6000-lke-ubuntu-inference | ⚠️ missing | | |
| rtx-pro-6000-lke-ubuntu-training | ⚠️ missing | | |

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai (bot) commented Jun 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR adds GB300 accelerator support end-to-end:

  • defines a new exported CriteriaAcceleratorGB300 constant with parsing/validation logic;
  • adds GB300 SKU detection in fingerprinting with proper ordering before B200;
  • extends OpenAPI schemas and CLI/API reference documentation to include gb300;
  • introduces the AWS DRA networking component for RoCE-based fabric connectivity;
  • adds seven RecipeMetadata overlays covering GB300 on EKS (with training/inference, Ubuntu, Dynamo, and Kubeflow variants);
  • extends NCCL and performance validators with GB300 support (template aliasing, combination matrix updates, preflight applicability);
  • updates pinned Dynamo vLLM runtime image tags in validator test fixtures.
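Why detection order matters can be shown with a first-match-wins table. The sketch assumes SKU detection is a substring match on the GPU product name, which the walkthrough does not confirm; `skuPatterns` and `detectAccelerator` are hypothetical names, not the contents of pkg/fingerprint/gpu_sku.go.

```go
package main

import (
	"fmt"
	"strings"
)

// skuPatterns is checked in order; the first substring match wins.
// More specific names come first because "GB200" also contains "B200".
// The entries and their order are assumptions for illustration.
var skuPatterns = []struct{ substr, accelerator string }{
	{"GB300", "gb300"},
	{"GB200", "gb200"},
	{"B200", "b200"},
	{"H200", "h200"},
	{"H100", "h100"},
}

// detectAccelerator returns the first matching accelerator name, or "" if
// the product name matches nothing in the table.
func detectAccelerator(productName string) string {
	upper := strings.ToUpper(productName)
	for _, p := range skuPatterns {
		if strings.Contains(upper, p.substr) {
			return p.accelerator
		}
	}
	return ""
}

func main() {
	for _, name := range []string{"NVIDIA GB300", "NVIDIA GB200", "NVIDIA B200"} {
		fmt.Printf("%s -> %s\n", name, detectAccelerator(name))
	}
}
```

If the `B200` entry were listed ahead of the Grace-Blackwell entries, a GB200 node would be misclassified as b200 under this matching scheme.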

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main change as GB300 EKS service-bound overlays with re-filing context; it accurately reflects the primary objective in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description comprehensively details the re-filed GB300 EKS overlays, the changes included, validation results, and relevant issue tracking. |

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
recipes/overlays/gb300-eks-training.yaml (1)

121-131: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

GB300/EKS still encodes DRA as the required allocation model across recipes/overlays/gb300-eks-training.yaml and recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml. The training recipe enforces dra-support, and the Dynamo leaf hard-codes nvidia-dra-driver-gpu plus the DRA-driven >= 1.34 floor. That shared root cause conflicts with the PR’s own open blockers (#1327/#1326) about GB300/EKS allocation and networking, so these overlays will validate the wrong deployment shape until the device-plugin-vs-DRA decision is resolved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-training.yaml` around lines 121 - 131, The overlay
still forces DRA by listing the conformance check "dra-support" and hard-coding
the Dynamo leaf to use "nvidia-dra-driver-gpu" with a ">= 1.34" floor; remove or
gate the "dra-support" entry from the conformance.checks array in
recipes/overlays/gb300-eks-training.yaml and remove the hard-coded
"nvidia-dra-driver-gpu" and its ">= 1.34" version constraint from
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (or replace both with a
feature-flag/conditional that defers to the device-plugin vs DRA decision), so
the overlays no longer validate deployments assuming DRA until the
device-plugin/DRA decision in PRs #1327/#1326 is resolved.
validators/performance/nccl_all_reduce_bw_constraint.go (1)

138-169: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't bake GB300 into the GB200/EFA NCCL path yet.

Lines 138-169 hard-code GB300 onto the GB200 templates and advertise EKS NET/NVLS support as if the transport/runtime shape were identical. That conflicts with the PR objective's open blocker #1326, which explicitly calls out that GB300 on AWS is RoCE rather than EFA. As written, this can route GB300 validation through the wrong runtime assets and also triggers the GB200-specific NVreg preflight downstream in validators/performance/nccl_preflight_nvreg.go. Either keep GB300 out of the EKS NET/NVLS matrix until that blocker is closed, or add GB300-specific templates/preflight once the transport is validated.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/nccl_all_reduce_bw_constraint.go` around lines 138 -
169, Remove the temporary GB300-to-GB200 alias and any advertised GB300 support
in the NCCL combinations until AWS RoCE behavior is resolved: stop mapping
accelerator == recipe.CriteriaAcceleratorGB300 to
recipe.CriteriaAcceleratorGB200 in the templatePath logic (the block that
mutates the accelerator variable before returning filepath.Join) and remove
recipe.CriteriaAcceleratorGB300 from the lists under variantNET and variantNVLS
in the supportedNCCLCombinations map so GB300 is not listed for
recipe.CriteriaServiceEKS; keep GB300 out of any GB200-specific NVReg/preflight
paths until separate GB300 templates/preflight are added.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 338fc99e-d81b-4381-93f2-0049b29848de

📥 Commits

Reviewing files that changed from the base of the PR and between e6d9301 and 9ad2d74.

📒 Files selected for processing (27)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/model_cache.go`:
- Line 95: The code pins cacheWorkerImage to a CUDA 13 vLLM runtime
unconditionally which can break model-cache population on older GPU
accelerators; update buildModelCachePopulateJob (and any code that references
cacheWorkerImage) to select the runtime image based on the cluster/node
accelerator type (e.g., H100, A100, GB200, GB300, H200, B200, RTX Pro) or a
configurable override, and update the worker YAML templates
(validators/performance/testdata/inference/dynamo-deployment*.yaml) to support
multiple runtime images or accept an image parameter; specifically, make
cacheWorkerImage a function or switch keyed by accelerator label, wire that into
ensureModelCache which creates the populate Job, and allow an env/flag to force
a particular image for testing so older device generations use compatible CUDA
runtimes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c03a6f6f-b82b-459c-859e-63ae0caa61f3

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad2d74 and a0e4f2c.

📒 Files selected for processing (3)
  • validators/performance/model_cache.go
  • validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread validators/performance/model_cache.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
validators/performance/nccl_test.go (1)

601-691: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an explicit GB300 template-alias test case.

templatePath now aliases gb300 to the gb200 template directory, but this table currently doesn’t assert that contract directly. A focused GB300 case would prevent silent regressions in future refactors.

💡 Suggested diff
@@
 		{
 			name:        "gb200 eks NET variant",
 			accelerator: recipe.CriteriaAcceleratorGB200,
 			service:     recipe.CriteriaServiceEKS,
 			variant:     variantNET,
 			filename:    "runtime.yaml",
 			expected:    filepath.Join("testdata", "gb200", "eks", "runtime-net.yaml"),
 		},
+		{
+			name:        "gb300 eks NET variant aliases gb200 template",
+			accelerator: recipe.CriteriaAcceleratorGB300,
+			service:     recipe.CriteriaServiceEKS,
+			variant:     variantNET,
+			filename:    "runtime.yaml",
+			expected:    filepath.Join("testdata", "gb200", "eks", "runtime-net.yaml"),
+		},

As per coding guidelines, “Check test coverage on affected packages before pushing a PR with Go changes.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/nccl_test.go` around lines 601 - 691, The
TestTemplatePath function is missing a test case for the GB300 accelerator
alias. Add a new test case struct to the tests slice with the accelerator set to
recipe.CriteriaAcceleratorGB300, choose any appropriate service type like
recipe.CriteriaServiceEKS, set variant to variantDefault, set filename to
"runtime.yaml", and set the expected path to filepath.Join("testdata", "gb200",
"eks", "runtime.yaml") to verify that GB300 correctly aliases to the GB200
template directory. This ensures the gb300 aliasing contract is explicitly
tested and prevents silent regressions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/aws-net-dra/manifests/roce.yaml`:
- Around line 110-131: The network-plugin container in the DaemonSet
specification lacks an explicit seccomp profile in its securityContext, which
relies on runtime defaults and weakens pod confinement. Add a seccompProfile
field to the securityContext of the network-plugin container to explicitly
define the seccomp profile (such as RuntimeDefault or a custom profile name)
rather than relying on environment-dependent defaults, ensuring consistent
security enforcement across different Kubernetes clusters.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3ca51af9-828c-41ee-a960-4f267f802d20

📥 Commits

Reviewing files that changed from the base of the PR and between a0e4f2c and 1493012.

📒 Files selected for processing (35)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • docs/user/container-images.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • recipes/checks/aws-net-dra/health-check.yaml
  • recipes/components/aws-net-dra/manifests/roce.yaml
  • recipes/components/nodewright-customizations/manifests/tuning.yaml
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • recipes/registry.yaml
  • validators/performance/model_cache.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread recipes/components/aws-net-dra/manifests/roce.yaml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nodewright-customizations/manifests/tuning.yaml`:
- Line 82: The `NVIDIA_SETUP_KERNEL_ALLOW_NEWER` setting in the shared
tuning.yaml manifest is applying the "true" value globally across all 16
overlays and GPU families (GB300, H100, H200, A100, GB200, B200, RTX Pro, etc.).
If this kernel allow override is intended only for GB300 GPU families, move the
`NVIDIA_SETUP_KERNEL_ALLOW_NEWER: "true"` configuration from the shared
tuning.yaml manifest to a GB300-specific overlay file instead. If this change is
intentionally global across all GPU families, document this decision in the pull
request. Verify which GPU families actually require this kernel setting to avoid
unintended side effects on other hardware.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 16e87c61-744c-41d8-8861-368ef97637df

📥 Commits

Reviewing files that changed from the base of the PR and between fc3303f and 739e543.

📒 Files selected for processing (35)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • docs/user/container-images.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • recipes/checks/aws-net-dra/health-check.yaml
  • recipes/components/aws-net-dra/manifests/roce.yaml
  • recipes/components/nodewright-customizations/manifests/tuning.yaml
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • recipes/registry.yaml
  • validators/performance/model_cache.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread recipes/components/nodewright-customizations/manifests/tuning.yaml Outdated

The default vllm-runtime:1.2.0 is a CUDA 12.9 build whose flashinfer
kernels lack sm_103 (GB300/Blackwell Ultra), so the inference-perf worker
crash-loops with "no kernel image is available for execution on the
device". Switch the inference workload + model-cache populate image to
vllm-runtime:1.2.0-cuda13 (CUDA 13.0), which covers the Blackwell family.

Verified end-to-end: GB300 (sm_103) and RTX Pro 6000 (sm_120) both serve
Qwen3-8B and pass the inference-perf gate with the cuda13 image.

Refs NVIDIA#1318

GB300 (p6e-gb300r) exposes a ConnectX RoCE fabric, not AWS EFA. The legacy
aws-efa device plugin crash-loops on these nodes (no EFA hardware). Replace it
on GB300 EKS overlays with a vendored AWS networking DRA plugin
(aws-net-k8s-dra-plugin v1.0.0, DeviceClass roce.networking.k8s.aws), mirroring
the DGXC reference GB300 fabric config.

- Vendor aws-net-dra as a manifest-only component (chart is private on NGC;
  image is pulled from AWS's EKS-managed ECR) + health check.
- gb300-eks-{inference,training}: disable aws-efa, add aws-net-dra.

Refs NVIDIA#1326

…Grace 64k kernel)

The nodewright nvidia-setup-kernel package ran with ALLOW_NEWER=false, which
force-downgrades a node to the exact pinned 6.14.0-1018-aws (4KB-page) kernel
even when it already booted a newer one. On GB300/GB200 Grace (ARM) nodes the
AMI ships 6.17.0-1007-aws-64k (64KB page); the downgrade to the 4KB kernel
breaks CUDA context creation and NVLink fabric on Grace (nvidia-smi works, CUDA
fails with 'device unavailable'). Set ALLOW_NEWER=true so the floor check
(>= 6.14.0-1018) is kept but a newer, working kernel is accepted instead of
downgraded. Verified on a GB300 EKS cluster: skyhook reconciles complete=2/2,
keeps 6.17.0-1007-aws-64k, and the GPU operator CUDA validator passes.
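The difference between the two ALLOW_NEWER modes can be sketched as a version comparison. This is an illustration of the behavior described above, not the nvidia-setup-kernel implementation; `parseKernel`, `compareKernel`, and `keepRunningKernel` are hypothetical, and the comparison only looks at the leading numeric fields of the release string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernel extracts the leading numeric fields of a release string,
// e.g. "6.17.0-1007-aws-64k" -> [6 17 0 1007]. Suffixes are ignored.
func parseKernel(release string) []int {
	fields := strings.FieldsFunc(release, func(r rune) bool { return r == '.' || r == '-' })
	var nums []int
	for _, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			break // stop at the first non-numeric field ("aws", "64k", ...)
		}
		nums = append(nums, n)
	}
	return nums
}

// compareKernel returns -1, 0, or 1 by comparing numeric fields in order.
func compareKernel(a, b string) int {
	x, y := parseKernel(a), parseKernel(b)
	for i := 0; i < len(x) && i < len(y); i++ {
		if x[i] < y[i] {
			return -1
		}
		if x[i] > y[i] {
			return 1
		}
	}
	return 0
}

// keepRunningKernel reports whether the booted kernel is left in place.
// allowNewer=false: only the exact pinned release is accepted.
// allowNewer=true: the pin acts as a floor, so newer kernels are kept.
func keepRunningKernel(pinned, running string, allowNewer bool) bool {
	if !allowNewer {
		return running == pinned
	}
	return compareKernel(running, pinned) >= 0
}

func main() {
	pinned, running := "6.14.0-1018-aws", "6.17.0-1007-aws-64k"
	fmt.Println("allowNewer=false keep:", keepRunningKernel(pinned, running, false))
	fmt.Println("allowNewer=true  keep:", keepRunningKernel(pinned, running, true))
}
```

With the pin and AMI kernel from this commit, the strict mode replaces the running kernel while the floor mode keeps it, which matches the described fix.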
…n with main)

Match the dynamo-platform/grove versions main uses for the other dynamo
overlays. dynamo 1.2.0 serves the DynamoGraphDeployment CRD at nvidia.com/v1beta1,
which the current performance/conformance validators (inference-perf,
robust-controller) expect; the stale 1.0.2 pin only served v1alpha1.

New manifest-only aws-net-dra component adds one image
(eks/networking-device-dra-plugin:v1.0.0). Components 27->28, images 82->83.
Per CLAUDE.md, BOM is regenerated and committed in the same PR as the
registry/component change.

GB300 overlays default to the ConnectX RoCE fabric (aws-net-dra) for the DGXC
p6e-gb300r variant. Standard EFAv4 GB300 is supported via an explicit bundle-time
opt-in (--set awsefa:enabled=true --set awsnetdra:enabled=false), verified to
swap the bundle to aws-efa. Automatic fabric detection is tracked in NVIDIA#1410.

Address CodeRabbit: the aws-net-dra (RoCE DRA) pod runs privileged (SYS_ADMIN,
hostNetwork, host mounts) but inherited the runtime's default (Unconfined)
seccomp. Pin spec.template.spec.securityContext.seccompProfile=RuntimeDefault
for defense-in-depth; compatible with the SYS_ADMIN capability the plugin needs.
(Trivy KSV-0105 runAsUser:0 is not actionable here — the plugin requires root
to set up /dev/infiniband devices, matching the upstream chart.)

CodeRabbit review on NVIDIA#1336:

- NVIDIA_SETUP_KERNEL_ALLOW_NEWER was flipped to true in the shared
  nodewright-customizations tuning manifest, which is consumed by 9
  non-GB300 overlays (h100/h200/a100/gb200/b200). Only GB300's Grace
  ARM64 host needs the newer 64KB-page kernel. Template the value from
  $cust.kernelAllowNewer (default false, restoring prior behavior for
  x86 families) and opt in only from the two GB300 overlays.
- Document why cacheWorkerImage stays a single global CUDA 13 image:
  CUDA 13 is required by GB300 (sm_103) and still supports sm_75 and
  newer (only the older sm_50 to sm_70 generations were dropped), which
  covers A100 (sm_80) and later; the populate Job also downloads weights
  without running GPU kernels, so it does not gate older generations.
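The "default false, opt in per overlay" pattern can be illustrated with Go's text/template. This is only an analogy for the `$cust.kernelAllowNewer` templating mentioned above: the `Customization` struct and the template string are hypothetical, and the real manifest may use a different template dialect.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Customization carries per-overlay values. The zero value (false) mirrors
// the "default false" behavior, so overlays that set nothing stay strict.
type Customization struct {
	KernelAllowNewer bool
}

// envTemplate renders the setting as a quoted string, as env values are in YAML.
const envTemplate = `NVIDIA_SETUP_KERNEL_ALLOW_NEWER: "{{ .KernelAllowNewer }}"`

// render executes the template against one overlay's customization values.
func render(c Customization) (string, error) {
	tmpl, err := template.New("tuning").Parse(envTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, c); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// First value: an overlay that sets nothing. Second: a GB300 overlay opting in.
	for _, c := range []Customization{{}, {KernelAllowNewer: true}} {
		out, err := render(c)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```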
@yuanchen8911 force-pushed the feat/gb300-eks-overlays-v2 branch from 739e543 to 785df42 on June 23, 2026 02:30

TestComponentManifestImagesAreDigestPinned (ADR-006 layer 2) requires
every vendored component-manifest image to be digest-pinned or exempted.
The RoCE DRA plugin image was tag-pinned (:v1.0.0). Pin it to its
resolved digest (sha256:7c87f397...) — it is a plain DaemonSet image, so
the CRD-rejects-digests exemption does not apply.
@github-actions

@yuanchen8911 this PR now has merge conflicts with main. Please rebase to resolve them.
