feat(recipe): add L40S accelerator support by atif1996 · Pull Request #1510 · NVIDIA/aicr

atif1996 · 2026-06-27T22:52:40Z

Summary

Add L40S as a supported accelerator for OKE, introduce ol as the Oracle Linux OS enum, and require explicit OS selection for all OKE overlays (mirrors the GKE/COS model — no os: any fallback).

Motivation / Context

Two gaps addressed:

L40S on OKE had no recipe — the fingerprint engine already mapped PCI 26b9 → "l40s" but there were no OKE overlays for it.
OKE overlays previously had no os: field (implicit any). Oracle's Gen2 GPU OKE image runs Oracle Linux 8.10 with the NVIDIA driver pre-installed at boot (driver.enabled: false). This PR makes the OS assumption explicit: OKE now requires --os ol, the same way GKE requires --os cos. A query without an OS flag returns no match — intentional, not a gap. Operators must confirm which node image they're running before a recipe is issued.

Fixes: #1003
Fixes: #1517
Related: N/A

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
CLI (cmd/aicr, pkg/cli)
Docs/examples (docs/, examples/)
Other: fingerprint (pkg/fingerprint), overlays (recipes/), UAT tests

Implementation Notes

ol OS enum: Oracle Linux reports ID=ol in /etc/os-release; mirrors cos (short OS name). Added to oskind, criteria.go, normalizeOSID(), OpenAPI spec, CLI flags, and all doc enum lists.
OKE OS model: All non-ubuntu OKE overlays (oke, oke-training, oke-inference, a100-oke-training, gb200-oke-{training,inference}, l40s-oke-{training,inference}) now carry os: ol. Ubuntu OKE overlays are unchanged. No os: any fallback — same as GKE requiring --os cos.
l40s-any.yaml: Cross-cutting deployment-phase floor (4 checks + gpu-operator >= v24.6.0), mirroring a100-any.yaml.
cert-manager disabled in oke.yaml via overrides.enabled: false — OKE pre-installs it as a managed addon.
driver.enabled: false / toolkit.enabled: false in values-oke.yaml — Oracle Linux Gen2 GPU OKE image pre-installs both at the OS level.

Testing

unset GITLAB_TOKEN && make qualify  # all tests, lint, e2e, scan — clean pass

UAT test suite added at tests/uat/oci/ for CUJ1 (L40S OKE Oracle Linux training).

Risk Assessment

Low — Additive only. Callers already passing --os ol or using snapshot-driven criteria (fingerprinted Oracle Linux nodes) are unaffected. Callers passing no --os for OKE now get no match instead of the Oracle Linux recipe — this is the intended behavior.

Rollout notes: N/A

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-06-27T23:01:22Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds L40S as a supported GPU accelerator and Oracle Linux (ol) as a supported OS family across the AICR stack. It updates GPU SKU parsing and OS normalization, adds new criteria constants and parsers, expands OpenAPI enums and documentation, introduces new L40S/OKE recipe overlays, updates existing OKE overlays to target ol, and adds OCI UAT cluster, runner, and Chainsaw validation fixtures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

area/validator, enhancement

Suggested reviewers

mchmarny
njhensley
xdu31

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	`#1003` is only partially met: the PR adds OKE L40S overlays, but it also adds deferred training overlays and a separate l40s enum split.	Remove the training overlay and avoid splitting the l40/L40S criteria unless you open a separate issue for that follow-up.
Out of Scope Changes check	⚠️ Warning	The PR includes broad Oracle Linux, fingerprinting, collector, docs, and UAT changes that are not required by `#1003`'s overlay scope.	Split the non-overlay work into separate PRs or explicitly expand the linked issue scope before merging.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding L40S accelerator support.
Description check	✅ Passed	The description clearly matches the changeset, covering L40S support, Oracle Linux enums, OKE overlays, and fingerprinting updates.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/oke-l40s-support

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.claude/skills/aicr-analyzing-snapshots/SKILL.md (1)

152-158: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Resolve the merge conflict before merge.

The conflict markers make this Markdown invalid, and the branch version also drops ocp, which is still part of the supported service enum in api/aicr/v1/server.yaml. Keep this table in sync with the API contract.

Proposed cleanup

-<<<<<<< HEAD
-| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` |
-| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` |
-=======
-| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm` |
-| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |
->>>>>>> 6c7ce4c2 (feat(recipe): add L40S accelerator support (OKE overlays + fingerprint fix))
+| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` |
+| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/aicr-analyzing-snapshots/SKILL.md around lines 152 - 158,
Remove the merge conflict markers from the Markdown table in the snapshot
analysis skill document and reconcile the branch/base content into valid table
rows. While fixing the table, make sure the supported service list stays aligned
with the service enum in api/aicr/v1/server.yaml by preserving the ocp entry
alongside the other supported services.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/recipe.md`:
- Around line 152-158: Resolve the merge conflict in the criteria table by
removing the conflict markers and merging both valid updates into a single
Markdown row set. Keep the `service` values aligned with the API contract by
preserving `ocp`, and keep the `accelerator` values aligned with the latest
support by including `l40s`; verify the `CriteriaServiceType` and
`CriteriaAcceleratorType` entries remain consistent with the documented schema.

In `@docs/user/api-reference.md`:
- Around line 121-127: Resolve the merge conflict in the API reference table by
removing the conflict markers and choosing the entries that match the API
schema. In the table for the documented accelerator/service fields, keep the
`service` list aligned with `api/aicr/v1/server.yaml` by retaining `ocp`, and
keep the `accelerator` list consistent with the supported values by including
`l40s` from the feature branch. Use the `service` and `accelerator` rows in the
affected table as the anchors when updating the docs.

In `@docs/user/cli-reference.md`:
- Around line 389-395: Resolve the merge conflict in the CLI reference table by
removing the conflict markers and merging both intended updates. Keep
`--service` aligned with `api/aicr/v1/server.yaml` by preserving `ocp`, and
merge the `--accelerator` values so the list includes the new `l40s` entry
alongside the existing GPU types. Verify the final table formatting in
`docs/user/cli-reference.md` remains valid after the conflict is removed.

In `@pkg/recipe/doc.go`:
- Line 29: The godoc in doc.go is inconsistent because the detailed accelerator
list under the Criteria types section still omits CriteriaAcceleratorL40S even
though the summary comment already mentions l40s. Update the comment block
around the CriteriaAcceleratorType / CriteriaAcceleratorL40 bullet list to
include CriteriaAcceleratorL40S alongside CriteriaAcceleratorL40, keeping the
documented accelerator set aligned with the API changes.

In `@tests/uat/oci/run`:
- Around line 125-130: The helm-diff installation step is unpinned, so UAT runs
can change when upstream releases change. Update the install logic in the run
script’s helm-diff plugin block to request a specific helm-diff version instead
of the default latest release, and apply the same version pinning pattern in the
matching helm-diff install blocks in tests/uat/aws/run and tests/uat/gcp/run so
all UAT environments behave consistently.

In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml`:
- Around line 36-38: Update the CUJ1 training assertion fixture to verify the
OKE cert-manager addon is explicitly disabled, not just present in
componentRefs. In assert-recipe.yaml, extend the existing cert-manager check so
the rendered recipe/manifest is validated against the disabled state for
cert-manager, using the cert-manager and componentRefs assertions together to
catch any regression that re-enables the managed addon.

In `@tests/uat/oci/tests/l40s-training-config.yaml`:
- Around line 77-81: The validate agent block only sets tolerations, so the
validation Jobs can still schedule onto the system pool; update
spec.validate.agent to also mirror the GPU node selector used by
spec.snapshot.agent and keep the existing namespace/tolerations in place. Use
the agent stanza under spec.validate as the target and align it with the
nodeSelector settings from spec.snapshot.agent so validator jobs are constrained
to the L40S GPU pool.

---

Outside diff comments:
In @.claude/skills/aicr-analyzing-snapshots/SKILL.md:
- Around line 152-158: Remove the merge conflict markers from the Markdown table
in the snapshot analysis skill document and reconcile the branch/base content
into valid table rows. While fixing the table, make sure the supported service
list stays aligned with the service enum in api/aicr/v1/server.yaml by
preserving the ocp entry alongside the other supported services.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 78c5e234-83dd-4d27-947f-8a1c6c1d0054

📥 Commits

Reviewing files that changed from the base of the PR and between 9be382b and 0769c3a.

📒 Files selected for processing (31)

.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/from_measurements_test.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/yaml_test.go
recipes/overlays/l40s-any.yaml
recipes/overlays/l40s-oke-inference.yaml
recipes/overlays/l40s-oke-training.yaml
recipes/overlays/l40s-oke-ubuntu-inference.yaml
recipes/overlays/l40s-oke-ubuntu-training-kubeflow.yaml
recipes/overlays/l40s-oke-ubuntu-training.yaml
recipes/overlays/oke.yaml
tests/uat/oci/cluster-config.yaml
tests/uat/oci/run
tests/uat/oci/tests/cuj1-training/assert-recipe.yaml
tests/uat/oci/tests/cuj1-training/assert-validate-deployment.yaml
tests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yaml
tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml
tests/uat/oci/tests/l40s-training-config.yaml

coderabbitai · 2026-06-27T23:01:26Z

+  echo "::group::Install helm-diff plugin (idempotent)"
+  if ! helm plugin list 2>/dev/null | awk '{print $1}' | grep -qx diff; then
+    helm plugin install https://github.com/databus23/helm-diff
+  else
+    echo "helm-diff already installed"
+  fi


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Inspect existing repo pins/usages for helm-diff:" fd -HI 'uat-oci\.ya?ml|\.settings\.yaml|Makefile' . .github tests 2>/dev/null | sort rg -n "helm-diff|plugin install|HELM_DIFF_VERSION" .github tests .settings.yaml Makefile

Repository: NVIDIA/aicr

Length of output: 1870

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Check repo pins for helm-diff and related tool setup:" rg -n "helm-diff|HELM_DIFF_VERSION|plugin install .*databus23/helm-diff|helm plugin install" .settings.yaml Makefile tests/uat -g '!vendor/**' echo echo "Inspect relevant sections:" sed -n '1,220p' .settings.yaml echo sed -n '1,260p' Makefile echo sed -n '110,165p' tests/uat/oci/run echo sed -n '130,170p' tests/uat/aws/run echo sed -n '130,170p' tests/uat/gcp/run

Repository: NVIDIA/aicr

Length of output: 24716

Pin the helm-diff plugin version in this script. This installs whatever upstream release is current at runtime, so UAT behavior can drift across runs; the same pattern also appears in tests/uat/aws/run and tests/uat/gcp/run.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/oci/run` around lines 125 - 130, The helm-diff installation step is unpinned, so UAT runs can change when upstream releases change. Update the install logic in the run script’s helm-diff plugin block to request a specific helm-diff version instead of the default latest release, and apply the same version pinning pattern in the matching helm-diff install blocks in tests/uat/aws/run and tests/uat/gcp/run so all UAT environments behave consistently.

Source: Coding guidelines

coderabbitai · 2026-06-27T23:01:26Z

+componentRefs: ## alphabetically sorted
+  - name: cert-manager
+  - name: gpu-operator


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Assert the OKE cert-manager disablement explicitly.

This fixture only checks that cert-manager is present. It will not catch a regression that re-enables the managed addon, which is one of the key behaviors introduced in this PR. Please assert the exact rendered disabled state as well.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml` around lines 36 - 38, Update the CUJ1 training assertion fixture to verify the OKE cert-manager addon is explicitly disabled, not just present in componentRefs. In assert-recipe.yaml, extend the existing cert-manager check so the rendered recipe/manifest is validated against the disabled state for cert-manager, using the cert-manager and componentRefs assertions together to catch any regression that re-enables the managed addon.

coderabbitai · 2026-06-27T23:01:26Z

+    agent:
+      namespace: aicr-validation
+      tolerations:
+        - skyhook.nvidia.com=runtime-required:NoSchedule
+        - nvidia.com/gpu=present:NoSchedule


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Constrain validator jobs to the GPU node pool.

spec.validate.agent only adds tolerations. That still allows the validation Jobs to land on the untainted system pool instead of the L40S nodes, so this UAT can miss the GPU-specific path entirely. Mirror the GPU node selector used under spec.snapshot.agent.

Proposed fix

validate: execution: failFast: true input: recipe: recipe.yaml snapshot: snapshot.yaml agent: namespace: aicr-validation + nodeSelector: + node.dgxc.nvidia.com/has-gpu: "true" tolerations: - skyhook.nvidia.com=runtime-required:NoSchedule - nvidia.com/gpu=present:NoSchedule

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

agent:

namespace: aicr-validation

tolerations:

- skyhook.nvidia.com=runtime-required:NoSchedule

- nvidia.com/gpu=present:NoSchedule

agent:

namespace: aicr-validation

nodeSelector:

node.dgxc.nvidia.com/has-gpu: "true"

tolerations:

- skyhook.nvidia.com=runtime-required:NoSchedule

- nvidia.com/gpu=present:NoSchedule

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/oci/tests/l40s-training-config.yaml` around lines 77 - 81, The validate agent block only sets tolerations, so the validation Jobs can still schedule onto the system pool; update spec.validate.agent to also mirror the GPU node selector used by spec.snapshot.agent and keep the existing namespace/tolerations in place. Use the agent stanza under spec.validate as the target and align it with the nodeSelector settings from spec.snapshot.agent so validator jobs are constrained to the L40S GPU pool.

- Add l40s-oke-training and l40s-oke-inference overlays for OKE - Add `ol` OS enum (Oracle Linux, maps ID=ol from /etc/os-release) - Require explicit `os: ol` on all non-ubuntu OKE overlays; mirrors GKE/COS model — no `os: any` fallback (query must specify OS) - Disable cert-manager in the OKE base overlay via enabled: false - Add fingerprint normalization for Oracle Linux (normalizeOSID) - Update OpenAPI spec, CLI flags, and all doc enum lists with `ol` - Add OCI OKE L40S training UAT test suite Signed-off-by: Atif Mahmood <amahmood@nvidia.com> Signed-off-by: Atif Mahmood <atif1996@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/recipe/metadata_test.go (1)
2193-2210: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the new L40S/OKE leaves to this coverage table.

This PR adds l40s-oke-training and l40s-oke-inference, but TestNFDTopologyUpdater_OverlayCoverage still doesn't exercise them. A topology-updater regression on the new OKE/L40S inheritance path would currently pass unnoticed.
Suggested additions
+		{"l40s-oke-training", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentTraining, ""}, true},
+		{"l40s-oke-inference", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentInference, ""}, true},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/metadata_test.go` around lines 2193 - 2210,
TestNFDTopologyUpdater_OverlayCoverage is missing the new OKE/L40S cases, so add
coverage entries for the new l40s-oke-training and l40s-oke-inference leaves in
the existing criteria table. Update the table in metadata_test.go alongside the
other real-cluster GPU leaves so the overlay inheritance path for
CriteriaServiceOKE with CriteriaAcceleratorL40S and the training/inference
intents is exercised and regressions in the new leaf coverage are caught.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml`:
- Around line 21-29: Update the CUJ1 Chainsaw UAT to exercise the new Oracle
Linux OKE path instead of the old Ubuntu selector. In the test spec’s
descriptive text and any recipe/bundle generation steps that still reference
`ubuntu`, change them to `ol` and ensure the `generate recipe`/`generate bundle`
flow explicitly passes `--os ol` so it matches the new OKE overlays.
Double-check the CUJ1 scenario steps and any related assertions in
`chainsaw-test.yaml` so the test validates the OKE/L40S/training path with
Oracle Linux throughout.

---

Outside diff comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 2193-2210: TestNFDTopologyUpdater_OverlayCoverage is missing the
new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and
l40s-oke-inference leaves in the existing criteria table. Update the table in
metadata_test.go alongside the other real-cluster GPU leaves so the overlay
inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the
training/inference intents is exercised and regressions in the new leaf coverage
are caught.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 470dc7bb-f5ae-44d8-a673-3e9ab22d8098

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4e842 and ee608da.

📒 Files selected for processing (38)

.claude/skills/aicr-analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/cli/snapshot.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/from_measurements.go
pkg/fingerprint/from_measurements_test.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
pkg/recipe/oskind/oskind.go
pkg/recipe/oskind/oskind_test.go
pkg/recipe/yaml_test.go
recipes/overlays/a100-oke-training.yaml
recipes/overlays/gb200-oke-inference.yaml
recipes/overlays/gb200-oke-training.yaml
recipes/overlays/l40s-any.yaml
recipes/overlays/l40s-oke-inference.yaml
recipes/overlays/l40s-oke-training.yaml
recipes/overlays/oke-inference.yaml
recipes/overlays/oke-training.yaml
recipes/overlays/oke.yaml
tests/uat/oci/cluster-config.yaml
tests/uat/oci/run
tests/uat/oci/tests/cuj1-training/assert-recipe.yaml
tests/uat/oci/tests/cuj1-training/assert-validate-deployment.yaml
tests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yaml
tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml
tests/uat/oci/tests/l40s-training-config.yaml

coderabbitai · 2026-06-28T01:32:51Z

+  description: |
+    UAT CUJ1: Training workload on live OKE cluster with L40S GPU nodes.
+    Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4):
+      Step 1: Snapshot the live cluster
+      Step 2: Generate recipe (OKE/L40S/training/ubuntu)
+      Step 3: Validate deployment against live snapshot
+      Step 4: Generate bundle with node scheduling
+      Step 5: Validate bundle structure
+      Step 6: Multi-phase validation


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Switch this UAT to Oracle Linux.

Lines 25 and 60 still document/use ubuntu, but this PR moves OKE overlays to os: ol and requires explicit --os ol. As written, this test is no longer exercising the new OKE path and can fail once the old selector stops resolving.

Suggested fix

- Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4): + Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4): Step 1: Snapshot the live cluster - Step 2: Generate recipe (OKE/L40S/training/ubuntu) + Step 2: Generate recipe (OKE/L40S/training/ol) Step 3: Validate deployment against live snapshot @@ - --os ubuntu \ + --os ol \

Also applies to: 56-61

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml` around lines 21 - 29, Update the CUJ1 Chainsaw UAT to exercise the new Oracle Linux OKE path instead of the old Ubuntu selector. In the test spec’s descriptive text and any recipe/bundle generation steps that still reference `ubuntu`, change them to `ol` and ensure the `generate recipe`/`generate bundle` flow explicitly passes `--os ol` so it matches the new OKE overlays. Double-check the CUJ1 scenario steps and any related assertions in `chainsaw-test.yaml` so the test validates the OKE/L40S/training path with Oracle Linux throughout.

OKE nodes set spec.providerID to a bare OCID ("ocid1.instance.oc1...") with no scheme prefix, so the existing "oci://" split never matched and the raw OCID leaked into the service fingerprint. Fix in two layers: - pkg/collector/k8s: parseProvider checks strings.HasPrefix("ocid1.") before the "://" split; removes the unreachable "oci" case. - pkg/fingerprint: normalizeProviderID applied at read time so snapshots produced by older agent images are also corrected. Signed-off-by: Atif Mahmood <amahmood@nvidia.com> Signed-off-by: Atif Mahmood <atif1996@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/fingerprint/from_measurements.go`:
- Around line 323-333: Add a regression test for the raw-OCID compatibility path
in TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a34e66a6-5302-453d-865f-b0296bd697b9

📥 Commits

Reviewing files that changed from the base of the PR and between ee608da and ff8442b.

📒 Files selected for processing (3)

pkg/collector/k8s/node.go
pkg/collector/k8s/node_test.go
pkg/fingerprint/from_measurements.go

coderabbitai · 2026-06-28T14:26:03Z

+// normalizeProviderID maps a raw Kubernetes spec.providerID (or an already-
+// normalized name stored by the collector) to the service type string used in
+// recipe criteria. OKE nodes emit a bare OCID ("ocid1.instance.oc1...") with
+// no scheme prefix; other providers use "<scheme>://..." which the collector
+// already normalizes, but we handle both here for resilience.
+func normalizeProviderID(v string) string {
+	if strings.HasPrefix(v, "ocid1.") {
+		return "oke"
+	}
+	return v
+}


📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a raw-OCID regression case for this compatibility path.

normalizeProviderID was added specifically to recover older measurements that still carry a raw ocid1.* value, but the shown TestFromMeasurements_ServiceDetection table only exercises already-normalized provider names. Please add a raw-OCID case there so this backward-compat branch cannot regress silently. As per coding guidelines, "Write tests for new functionality."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/fingerprint/from_measurements.go` around lines 323 - 333, Add a regression test for the raw-OCID compatibility path in TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for older measurements that still carry a bare ocid1.* providerID. Extend the existing table-driven cases with a raw OCI OCID input and assert it resolves to the oke service type, alongside the already-normalized provider name cases. Use the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection table as the main anchors when updating the test.

Source: Coding guidelines

atif1996 requested review from a team as code owners June 27, 2026 22:52

github-actions Bot added area/recipes area/ci area/tests area/docs area/cli area/api size/XL labels Jun 27, 2026

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

atif1996 force-pushed the feat/oke-l40s-support branch from 0769c3a to 287af6a Compare June 28, 2026 00:41

atif1996 force-pushed the feat/oke-l40s-support branch from 8c4e842 to ee608da Compare June 28, 2026 01:26

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

mchmarny assigned atif1996 Jun 28, 2026

github-actions Bot added the area/collector label Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

atif1996 mentioned this pull request Jun 28, 2026

Snapshot fingerprint reports raw OCI OCID instead of "oke" for OKE clusters #1517

Open

Uh oh!

Conversation

atif1996 commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

atif1996 commented Jun 27, 2026 •

edited

Loading

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading