WIP: feat(recipes): GB300 EKS service-bound overlays (re-file of #1319) #1336
yuanchen8911 wants to merge 10 commits into
Recipe evidence check
Affected leaf overlays: 69
How to refresh evidence
Run on a cluster matching the recipe:
aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml
This gate is warning-only and never blocks merge. See ADR-007 for the trust model.
📝 Walkthrough
This PR adds GB300 accelerator support end-to-end: defines a new exported
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
recipes/overlays/gb300-eks-training.yaml (1)
121-131: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
GB300/EKS still encodes DRA as the required allocation model across `recipes/overlays/gb300-eks-training.yaml` and `recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml`. The training recipe enforces `dra-support`, and the Dynamo leaf hard-codes `nvidia-dra-driver-gpu` plus the DRA-driven `>= 1.34` floor. That shared root cause conflicts with the PR's own open blockers (#1327/#1326) about GB300/EKS allocation and networking, so these overlays will validate the wrong deployment shape until the device-plugin-vs-DRA decision is resolved.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/gb300-eks-training.yaml` around lines 121 - 131, The overlay still forces DRA by listing the conformance check "dra-support" and hard-coding the Dynamo leaf to use "nvidia-dra-driver-gpu" with a ">= 1.34" floor; remove or gate the "dra-support" entry from the conformance.checks array in recipes/overlays/gb300-eks-training.yaml and remove the hard-coded "nvidia-dra-driver-gpu" and its ">= 1.34" version constraint from recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (or replace both with a feature-flag/conditional that defers to the device-plugin vs DRA decision), so the overlays no longer validate deployments assuming DRA until the device-plugin/DRA decision in PRs #1327/#1326 is resolved.
validators/performance/nccl_all_reduce_bw_constraint.go (1)
138-169: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Don't bake GB300 into the GB200/EFA NCCL path yet.
Lines 138-169 hard-code GB300 onto the GB200 templates and advertise EKS NET/NVLS support as if the transport/runtime shape were identical. That conflicts with the PR objective's open blocker #1326, which explicitly calls out that GB300 on AWS is RoCE rather than EFA. As written, this can route GB300 validation through the wrong runtime assets and also triggers the GB200-specific NVreg preflight downstream in `validators/performance/nccl_preflight_nvreg.go`. Either keep GB300 out of the EKS NET/NVLS matrix until that blocker is closed, or add GB300-specific templates/preflight once the transport is validated.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/nccl_all_reduce_bw_constraint.go` around lines 138 - 169, Remove the temporary GB300-to-GB200 alias and any advertised GB300 support in the NCCL combinations until AWS RoCE behavior is resolved: stop mapping accelerator == recipe.CriteriaAcceleratorGB300 to recipe.CriteriaAcceleratorGB200 in the templatePath logic (the block that mutates the accelerator variable before returning filepath.Join) and remove recipe.CriteriaAcceleratorGB300 from the lists under variantNET and variantNVLS in the supportedNCCLCombinations map so GB300 is not listed for recipe.CriteriaServiceEKS; keep GB300 out of any GB200-specific NVReg/preflight paths until separate GB300 templates/preflight are added.
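A minimal sketch of what the second finding asks for, assuming a simplified shape for the support matrix. The `variant` type, the string keys, and `isSupported` are hypothetical stand-ins for illustration; the real `supportedNCCLCombinations` map keys on `recipe.Criteria*` constants and may be structured differently.

```go
package main

import "fmt"

// variant is a hypothetical stand-in for the validator's NCCL variant type.
type variant string

const (
	variantNET  variant = "net"
	variantNVLS variant = "nvls"
)

// supported maps service -> variant -> accelerators (assumed shape).
// GB300 is deliberately absent from the EKS NET/NVLS lists, as the
// review comment suggests, until its transport is validated.
var supported = map[string]map[variant][]string{
	"eks": {
		variantNET:  {"gb200"},
		variantNVLS: {"gb200"},
	},
}

// isSupported reports whether the accelerator is listed for the
// given service and variant. Unknown services or variants yield false.
func isSupported(service string, v variant, accelerator string) bool {
	for _, a := range supported[service][v] {
		if a == accelerator {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("gb200 eks net:", isSupported("eks", variantNET, "gb200"))
	fmt.Println("gb300 eks net:", isSupported("eks", variantNET, "gb300"))
}
```

With this shape, a GB300 request on EKS is rejected at the matrix lookup rather than being routed through GB200 runtime assets.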
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 338fc99e-d81b-4381-93f2-0049b29848de
📒 Files selected for processing (27)
.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/model_cache.go`:
- Line 95: The code pins cacheWorkerImage to a CUDA 13 vLLM runtime
unconditionally which can break model-cache population on older GPU
accelerators; update buildModelCachePopulateJob (and any code that references
cacheWorkerImage) to select the runtime image based on the cluster/node
accelerator type (e.g., H100, A100, GB200, GB300, H200, B200, RTX Pro) or a
configurable override, and update the worker YAML templates
(validators/performance/testdata/inference/dynamo-deployment*.yaml) to support
multiple runtime images or accept an image parameter; specifically, make
cacheWorkerImage a function or switch keyed by accelerator label, wire that into
ensureModelCache which creates the populate Job, and allow an env/flag to force
a particular image for testing so older device generations use compatible CUDA
runtimes.
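One way the reviewer's suggestion could look, as a sketch only. The two image tags come from the commit message in this PR (registry prefix omitted); the accelerator label strings, the `AICR_CACHE_WORKER_IMAGE` variable name, and the function `cacheWorkerImageFor` are hypothetical and not taken from the codebase.

```go
package main

import (
	"fmt"
	"os"
)

// Image tags as named in this PR's commit message; registry prefix omitted.
const (
	defaultImage = "vllm-runtime:1.2.0"        // CUDA 12.9 build
	cuda13Image  = "vllm-runtime:1.2.0-cuda13" // CUDA 13.0 build
)

// overrideEnv is a hypothetical environment variable used to force an image.
const overrideEnv = "AICR_CACHE_WORKER_IMAGE"

// cacheWorkerImageFor picks a runtime image for the given accelerator label.
// getenv is injected so the override can be exercised without touching the
// process environment. The label strings are assumptions for illustration.
func cacheWorkerImageFor(accelerator string, getenv func(string) string) string {
	if img := getenv(overrideEnv); img != "" {
		return img // explicit override wins
	}
	switch accelerator {
	case "gb300", "rtx-pro-6000":
		return cuda13Image
	default:
		return defaultImage
	}
}

func main() {
	for _, acc := range []string{"h100", "gb300"} {
		fmt.Printf("%s -> %s\n", acc, cacheWorkerImageFor(acc, os.Getenv))
	}
}
```

Injecting `getenv` keeps the selection logic a pure function, which makes the override path straightforward to cover in a table-driven test.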
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c03a6f6f-b82b-459c-859e-63ae0caa61f3
📒 Files selected for processing (3)
validators/performance/model_cache.go
validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
validators/performance/testdata/inference/dynamo-deployment.yaml
Force-pushed from a0e4f2c to 1493012.
🌿 Preview your docs: https://nvidia-preview-feat-gb300-eks-overlays-v2.docs.buildwithfern.com/aicr
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
validators/performance/nccl_test.go (1)
601-691: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Add an explicit GB300 template-alias test case.
`templatePath` now aliases `gb300` to the `gb200` template directory, but this table currently doesn't assert that contract directly. A focused GB300 case would prevent silent regressions in future refactors.
💡 Suggested diff
@@
 		{
 			name:        "gb200 eks NET variant",
 			accelerator: recipe.CriteriaAcceleratorGB200,
 			service:     recipe.CriteriaServiceEKS,
 			variant:     variantNET,
 			filename:    "runtime.yaml",
 			expected:    filepath.Join("testdata", "gb200", "eks", "runtime-net.yaml"),
 		},
+		{
+			name:        "gb300 eks NET variant aliases gb200 template",
+			accelerator: recipe.CriteriaAcceleratorGB300,
+			service:     recipe.CriteriaServiceEKS,
+			variant:     variantNET,
+			filename:    "runtime.yaml",
+			expected:    filepath.Join("testdata", "gb200", "eks", "runtime-net.yaml"),
+		},
As per coding guidelines, "Check test coverage on affected packages before pushing a PR with Go changes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/nccl_test.go` around lines 601 - 691, The TestTemplatePath function is missing a test case for the GB300 accelerator alias. Add a new test case struct to the tests slice with the accelerator set to recipe.CriteriaAcceleratorGB300, choose any appropriate service type like recipe.CriteriaServiceEKS, set variant to variantDefault, set filename to "runtime.yaml", and set the expected path to filepath.Join("testdata", "gb200", "eks", "runtime.yaml") to verify that GB300 correctly aliases to the GB200 template directory. This ensures the gb300 aliasing contract is explicitly tested and prevents silent regressions.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/components/aws-net-dra/manifests/roce.yaml`:
- Around line 110-131: The network-plugin container in the DaemonSet
specification lacks an explicit seccomp profile in its securityContext, which
relies on runtime defaults and weakens pod confinement. Add a seccompProfile
field to the securityContext of the network-plugin container to explicitly
define the seccomp profile (such as RuntimeDefault or a custom profile name)
rather than relying on environment-dependent defaults, ensuring consistent
security enforcement across different Kubernetes clusters.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3ca51af9-828c-41ee-a960-4f267f802d20
📒 Files selected for processing (35)
.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
docs/user/container-images.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
recipes/checks/aws-net-dra/health-check.yaml
recipes/components/aws-net-dra/manifests/roce.yaml
recipes/components/nodewright-customizations/manifests/tuning.yaml
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
recipes/registry.yaml
validators/performance/model_cache.go
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
validators/performance/testdata/inference/dynamo-deployment.yaml
Force-pushed from fc3303f to 739e543.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/components/nodewright-customizations/manifests/tuning.yaml`:
- Line 82: The `NVIDIA_SETUP_KERNEL_ALLOW_NEWER` setting in the shared
tuning.yaml manifest is applying the "true" value globally across all 16
overlays and GPU families (GB300, H100, H200, A100, GB200, B200, RTX Pro, etc.).
If this kernel allow override is intended only for GB300 GPU families, move the
`NVIDIA_SETUP_KERNEL_ALLOW_NEWER: "true"` configuration from the shared
tuning.yaml manifest to a GB300-specific overlay file instead. If this change is
intentionally global across all GPU families, document this decision in the pull
request. Verify which GPU families actually require this kernel setting to avoid
unintended side effects on other hardware.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 16e87c61-744c-41d8-8861-368ef97637df
📒 Files selected for processing (35)
.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
docs/user/container-images.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
recipes/checks/aws-net-dra/health-check.yaml
recipes/components/aws-net-dra/manifests/roce.yaml
recipes/components/nodewright-customizations/manifests/tuning.yaml
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
recipes/registry.yaml
validators/performance/model_cache.go
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
validators/performance/testdata/inference/dynamo-deployment.yaml
The default vllm-runtime:1.2.0 is a CUDA 12.9 build whose flashinfer kernels lack sm_103 (GB300/Blackwell Ultra), so the inference-perf worker crash-loops with "no kernel image is available for execution on the device". Switch the inference workload + model-cache populate image to vllm-runtime:1.2.0-cuda13 (CUDA 13.0), which covers the Blackwell family. Verified end-to-end: GB300 (sm_103) and RTX Pro 6000 (sm_120) both serve Qwen3-8B and pass the inference-perf gate with the cuda13 image. Refs NVIDIA#1318
GB300 (p6e-gb300r) exposes a ConnectX RoCE fabric, not AWS EFA. The legacy
aws-efa device plugin crash-loops on these nodes (no EFA hardware). Replace it
on GB300 EKS overlays with a vendored AWS networking DRA plugin
(aws-net-k8s-dra-plugin v1.0.0, DeviceClass roce.networking.k8s.aws), mirroring
the DGXC reference GB300 fabric config.
- Vendor aws-net-dra as a manifest-only component (chart is private on NGC;
image is public ECR) + health check.
- gb300-eks-{inference,training}: disable aws-efa, add aws-net-dra.
Refs NVIDIA#1326
…Grace 64k kernel)
The nodewright nvidia-setup-kernel package ran with ALLOW_NEWER=false, which force-downgrades a node to the exact pinned 6.14.0-1018-aws (4KB-page) kernel even when it has already booted a newer one. On GB300/GB200 Grace (ARM) nodes the AMI ships 6.17.0-1007-aws-64k (64KB-page); the downgrade to the 4KB-page kernel breaks CUDA context creation and NVLink fabric on Grace (nvidia-smi works, CUDA fails with 'device unavailable'). Set ALLOW_NEWER=true so the floor check (>= 6.14.0-1018) is kept but a newer, working kernel is accepted instead of downgraded. Verified on a GB300 EKS cluster: skyhook reconciles complete=2/2, keeps 6.17.0-1007-aws-64k, and the GPU operator CUDA validator passes.
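The floor-versus-exact-pin behavior described in this commit can be sketched as follows. This is an illustration under assumptions, not the nodewright package's actual logic: `parseKernel`, `compare`, and `decide` are hypothetical helpers, and the release strings are assumed to follow the `X.Y.Z-ABI-flavor` shape seen in the message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernel extracts [major, minor, patch, abi] from a release string
// such as "6.17.0-1007-aws-64k". Anything after the ABI number is ignored.
func parseKernel(release string) ([4]int, error) {
	var out [4]int
	parts := strings.SplitN(release, "-", 3)
	if len(parts) < 2 {
		return out, fmt.Errorf("unexpected kernel release %q", release)
	}
	nums := strings.Split(parts[0], ".")
	if len(nums) != 3 {
		return out, fmt.Errorf("unexpected kernel version %q", parts[0])
	}
	nums = append(nums, parts[1])
	for i, s := range nums {
		n, err := strconv.Atoi(s)
		if err != nil {
			return out, fmt.Errorf("non-numeric component %q in %q", s, release)
		}
		out[i] = n
	}
	return out, nil
}

// compare returns -1, 0, or 1 when a is older than, equal to, or newer than b.
func compare(a, b [4]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// decide returns "keep" when the running kernel may stay, otherwise
// "install-pinned". With allowNewer=false only an exact match is kept.
func decide(running, pinned string, allowNewer bool) (string, error) {
	r, err := parseKernel(running)
	if err != nil {
		return "", err
	}
	p, err := parseKernel(pinned)
	if err != nil {
		return "", err
	}
	c := compare(r, p)
	if c == 0 || (c > 0 && allowNewer) {
		return "keep", nil
	}
	return "install-pinned", nil
}

func main() {
	for _, allow := range []bool{false, true} {
		action, err := decide("6.17.0-1007-aws-64k", "6.14.0-1018-aws", allow)
		fmt.Printf("allowNewer=%t -> %s (err=%v)\n", allow, action, err)
	}
}
```

A kernel older than the pin still resolves to "install-pinned" in both modes, which is the floor check the message says is preserved.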
…n with main)
Match the dynamo-platform/grove versions that main uses for the other dynamo overlays. dynamo 1.2.0 serves the DynamoGraphDeployment CRD at nvidia.com/v1beta1, which the current performance/conformance validators (inference-perf, robust-controller) expect; the stale 1.0.2 pin only served v1alpha1.
New manifest-only aws-net-dra component adds one image (eks/networking-device-dra-plugin:v1.0.0). Components 27->28, images 82->83. Per CLAUDE.md, BOM is regenerated and committed in the same PR as the registry/component change.
GB300 overlays default to the ConnectX RoCE fabric (aws-net-dra) for the DGXC p6e-gb300r variant. Standard EFAv4 GB300 is supported via an explicit bundle-time opt-in (--set awsefa:enabled=true --set awsnetdra:enabled=false), verified to swap the bundle to aws-efa. Automatic fabric detection is tracked in NVIDIA#1410.
Address CodeRabbit: the aws-net-dra (RoCE DRA) pod runs privileged (SYS_ADMIN, hostNetwork, host mounts) but inherited the runtime's default (Unconfined) seccomp. Pin spec.template.spec.securityContext.seccompProfile=RuntimeDefault for defense-in-depth; compatible with the SYS_ADMIN capability the plugin needs. (Trivy KSV-0105 runAsUser:0 is not actionable here — the plugin requires root to set up /dev/infiniband devices, matching the upstream chart.)
CodeRabbit review on NVIDIA#1336:
- NVIDIA_SETUP_KERNEL_ALLOW_NEWER was flipped to true in the shared nodewright-customizations tuning manifest, which is consumed by 9 non-GB300 overlays (h100/h200/a100/gb200/b200). Only GB300's Grace ARM64 host needs the newer 64KB-page kernel. Template the value from $cust.kernelAllowNewer (default false, restoring prior behavior for x86 families) and opt in only from the two GB300 overlays.
- Document why cacheWorkerImage stays a single global CUDA 13 image: CUDA 13 is required by GB300 (sm_103) and backward-compatible to sm_50, and the populate Job downloads weights without running GPU kernels, so it does not gate older generations.
Force-pushed from 739e543 to 785df42.
TestComponentManifestImagesAreDigestPinned (ADR-006 layer 2) requires every vendored component-manifest image to be digest-pinned or exempted. The RoCE DRA plugin image was tag-pinned (:v1.0.0). Pin it to its resolved digest (sha256:7c87f397...) — it is a plain DaemonSet image, so the CRD-rejects-digests exemption does not apply.
@yuanchen8911 this PR now has merge conflicts with
Summary
Re-introduces the GB300 EKS service-bound overlays + accelerator plumbing from #1319, and lands the fixes needed to make GB300 deploy and validate end-to-end on a real cluster.
Networking fabric (EFA vs RoCE): RoCE default, EFA opt-in
GB300 EKS networking defaults to ConnectX RoCE via a vendored AWS networking DRA driver (`aws-net-k8s-dra-plugin`, DeviceClass `roce.networking.k8s.aws`). The DGXC GB300 instance variant (`p6e-gb300r`) exposes a ConnectX RoCE fabric with no EFA device, so the legacy `aws-efa` device plugin crash-loops there (not EFA enabled).
Standard EFAv4 GB300 (the AWS-documented default) is supported via an explicit bundle-time opt-in:
--set awsefa:enabled=true --set awsnetdra:enabled=false
This is the explicit-intent interim (no auto-detection). Automatic fabric detection (EFA vs RoCE), with the full options analysis, is tracked in #1410.
RoCE plugin image: AWS-owned ECR, no mirroring needed
The DRA driver image (`eks/networking-device-dra-plugin`) is pulled from AWS's own EKS-managed ECR: account `602401143452` is an AWS-owned, AWS-chosen registry (one per region), not an NVIDIA/DGXC account. It is digest-pinned in `roce.yaml`. The image is the general AWS networking-device DRA driver (driver `dra.networking.k8s.aws`); RoCE-specificity lives only in the DeviceClass CEL we ship, not the image.
A normal EKS GB300 deploy needs no mirroring and no pull secret: the GPU node's IAM role already carries ECR pull permission and AWS's repo policy grants pull to any authenticated AWS account, so kubelet pulls it transparently (verified on aicr-gb300 in us-east-2 pulling the us-west-2 image cross-region). The image is not anonymously pullable (no `public.ecr.aws` mirror, unlike `aws-ebs-csi-driver`); it follows the same auth-gated pattern AICR already uses for the legacy `aws-efa` plugin. The open-source/transparency caveat (no anonymous pull, no source label/SBOM/signature) and the hardcoded `us-west-2`/region rigidity are tracked in #1410.
Changes on this branch
- `gb300` accelerator plumbing (re-file of feat(recipes): add concrete GB300 EKS service-bound overlays #1319).
- … `sm_103`.
- … `aws-net-dra` (manifest-only) + health check; `aws-efa` disabled on GB300 overlays (EFA opt-in documented above).
- … `NVIDIA_SETUP_KERNEL_ALLOW_NEWER=true`. GB300 Grace nodes need the 64KB-page kernel (`6.17.0-1007-aws-64k`); the tuning was force-downgrading them to the 4KB-page `6.14.0-1018-aws`, which breaks CUDA context creation + NVLink fabric (nvidia-smi works, CUDA fails "device unavailable"). This is a kernel-pin issue independent of GB200/GB300.
- … `v1beta1` CRD the validators require).
- … `aws-net-dra` image.
Validation
`aicr validate --phase all` on a DGXC GB300 cluster (`p6e-gb300r`, `aicr` built from `main`, validator `:edge`):
Cluster prerequisite (not in the recipe): dynamo 1.2.0's NATS discovery needs TCP 4222 (and Prometheus 9090) reachable GPU-nodegroup → system-nodegroup. On EKS this is a security-group ingress rule owned by cluster provisioning, not AICR; see `docs/integrator/eks-dynamo-networking.md`.
Open issues / status
- … `aws-net-dra`, EFA opt-in documented; fabric-aware auto-detection follow-up filed as Make EKS networking fabric-aware (EFA vs ConnectX RoCE) - GB300 is fabric-hardcoded #1410.
- … `gpu.nvidia.com`); the cuda-validator passes alongside DRA. The earlier "device busy/unavailable" was the kernel issue above (4KB vs 64KB), not device-plugin/DRA contention.
Tracking issue: #1318. Reverted by: #1328.