Skip to content

feat(recipe): add L40S accelerator support#1510

Open
atif1996 wants to merge 2 commits into
mainfrom
feat/oke-l40s-support
Open

feat(recipe): add L40S accelerator support#1510
atif1996 wants to merge 2 commits into
mainfrom
feat/oke-l40s-support

Conversation

@atif1996

@atif1996 atif1996 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Add L40S as a supported accelerator for OKE, introduce ol as the Oracle Linux OS enum, and require explicit OS selection for all OKE overlays (mirrors the GKE/COS model — no os: any fallback).

Motivation / Context

Two gaps addressed:

  1. L40S on OKE had no recipe — the fingerprint engine already mapped PCI 26b9 → "l40s" but there were no OKE overlays for it.
  2. OKE overlays previously had no os: field (implicit any). Oracle's Gen2 GPU OKE image runs Oracle Linux 8.10 with the NVIDIA driver pre-installed at boot (driver.enabled: false). This PR makes the OS assumption explicit: OKE now requires --os ol, the same way GKE requires --os cos. A query without an OS flag returns no match — intentional, not a gap. Operators must confirm which node image they're running before a recipe is issued.

Fixes: #1003
Fixes: #1517
Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • CLI (cmd/aicr, pkg/cli)
  • Docs/examples (docs/, examples/)
  • Other: fingerprint (pkg/fingerprint), overlays (recipes/), UAT tests

Implementation Notes

  • ol OS enum: Oracle Linux reports ID=ol in /etc/os-release; mirrors cos (short OS name). Added to oskind, criteria.go, normalizeOSID(), OpenAPI spec, CLI flags, and all doc enum lists.
  • OKE OS model: All non-ubuntu OKE overlays (oke, oke-training, oke-inference, a100-oke-training, gb200-oke-{training,inference}, l40s-oke-{training,inference}) now carry os: ol. Ubuntu OKE overlays are unchanged. No os: any fallback — same as GKE requiring --os cos.
  • l40s-any.yaml: Cross-cutting deployment-phase floor (4 checks + gpu-operator >= v24.6.0), mirroring a100-any.yaml.
  • cert-manager disabled in oke.yaml via overrides.enabled: false — OKE pre-installs it as a managed addon.
  • driver.enabled: false / toolkit.enabled: false in values-oke.yaml — Oracle Linux Gen2 GPU OKE image pre-installs both at the OS level.

Testing

unset GITLAB_TOKEN && make qualify  # all tests, lint, e2e, scan — clean pass

UAT test suite added at tests/uat/oci/ for CUJ1 (L40S OKE Oracle Linux training).

Risk Assessment

  • Low — Additive only. Callers already passing --os ol or using snapshot-driven criteria (fingerprinted Oracle Linux nodes) are unaffected. Callers passing no --os for OKE now get no match instead of the Oracle Linux recipe — this is the intended behavior.

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds L40S as a supported GPU accelerator and Oracle Linux (ol) as a supported OS family across the AICR stack. It updates GPU SKU parsing and OS normalization, adds new criteria constants and parsers, expands OpenAPI enums and documentation, introduces new L40S/OKE recipe overlays, updates existing OKE overlays to target ol, and adds OCI UAT cluster, runner, and Chainsaw validation fixtures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

area/validator, enhancement

Suggested reviewers

  • mchmarny
  • njhensley
  • xdu31
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning #1003 is only partially met: the PR adds OKE L40S overlays, but it also adds deferred training overlays and a separate l40s enum split. Remove the training overlay and avoid splitting the l40/L40S criteria unless you open a separate issue for that follow-up.
Out of Scope Changes check ⚠️ Warning The PR includes broad Oracle Linux, fingerprinting, collector, docs, and UAT changes that are not required by #1003's overlay scope. Split the non-overlay work into separate PRs or explicitly expand the linked issue scope before merging.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding L40S accelerator support.
Description check ✅ Passed The description clearly matches the changeset, covering L40S support, Oracle Linux enums, OKE overlays, and fingerprinting updates.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/oke-l40s-support

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.claude/skills/aicr-analyzing-snapshots/SKILL.md (1)

152-158: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Resolve the merge conflict before merge.

The conflict markers make this Markdown invalid, and the branch version also drops ocp, which is still part of the supported service enum in api/aicr/v1/server.yaml. Keep this table in sync with the API contract.

Proposed cleanup
-<<<<<<< HEAD
-| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` |
-| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` |
-=======
-| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm` |
-| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |
->>>>>>> 6c7ce4c2 (feat(recipe): add L40S accelerator support (OKE overlays + fingerprint fix))
+| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` |
+| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/aicr-analyzing-snapshots/SKILL.md around lines 152 - 158,
Remove the merge conflict markers from the Markdown table in the snapshot
analysis skill document and reconcile the branch/base content into valid table
rows. While fixing the table, make sure the supported service list stays aligned
with the service enum in api/aicr/v1/server.yaml by preserving the ocp entry
alongside the other supported services.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/recipe.md`:
- Around line 152-158: Resolve the merge conflict in the criteria table by
removing the conflict markers and merging both valid updates into a single
Markdown row set. Keep the `service` values aligned with the API contract by
preserving `ocp`, and keep the `accelerator` values aligned with the latest
support by including `l40s`; verify the `CriteriaServiceType` and
`CriteriaAcceleratorType` entries remain consistent with the documented schema.

In `@docs/user/api-reference.md`:
- Around line 121-127: Resolve the merge conflict in the API reference table by
removing the conflict markers and choosing the entries that match the API
schema. In the table for the documented accelerator/service fields, keep the
`service` list aligned with `api/aicr/v1/server.yaml` by retaining `ocp`, and
keep the `accelerator` list consistent with the supported values by including
`l40s` from the feature branch. Use the `service` and `accelerator` rows in the
affected table as the anchors when updating the docs.

In `@docs/user/cli-reference.md`:
- Around line 389-395: Resolve the merge conflict in the CLI reference table by
removing the conflict markers and merging both intended updates. Keep
`--service` aligned with `api/aicr/v1/server.yaml` by preserving `ocp`, and
merge the `--accelerator` values so the list includes the new `l40s` entry
alongside the existing GPU types. Verify the final table formatting in
`docs/user/cli-reference.md` remains valid after the conflict is removed.

In `@pkg/recipe/doc.go`:
- Line 29: The godoc in doc.go is inconsistent because the detailed accelerator
list under the Criteria types section still omits CriteriaAcceleratorL40S even
though the summary comment already mentions l40s. Update the comment block
around the CriteriaAcceleratorType / CriteriaAcceleratorL40 bullet list to
include CriteriaAcceleratorL40S alongside CriteriaAcceleratorL40, keeping the
documented accelerator set aligned with the API changes.

In `@tests/uat/oci/run`:
- Around line 125-130: The helm-diff installation step is unpinned, so UAT runs
can change when upstream releases change. Update the install logic in the run
script’s helm-diff plugin block to request a specific helm-diff version instead
of the default latest release, and apply the same version pinning pattern in the
matching helm-diff install blocks in tests/uat/aws/run and tests/uat/gcp/run so
all UAT environments behave consistently.

In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml`:
- Around line 36-38: Update the CUJ1 training assertion fixture to verify the
OKE cert-manager addon is explicitly disabled, not just present in
componentRefs. In assert-recipe.yaml, extend the existing cert-manager check so
the rendered recipe/manifest is validated against the disabled state for
cert-manager, using the cert-manager and componentRefs assertions together to
catch any regression that re-enables the managed addon.

In `@tests/uat/oci/tests/l40s-training-config.yaml`:
- Around line 77-81: The validate agent block only sets tolerations, so the
validation Jobs can still schedule onto the system pool; update
spec.validate.agent to also mirror the GPU node selector used by
spec.snapshot.agent and keep the existing namespace/tolerations in place. Use
the agent stanza under spec.validate as the target and align it with the
nodeSelector settings from spec.snapshot.agent so validator jobs are constrained
to the L40S GPU pool.

---

Outside diff comments:
In @.claude/skills/aicr-analyzing-snapshots/SKILL.md:
- Around line 152-158: Remove the merge conflict markers from the Markdown table
in the snapshot analysis skill document and reconcile the branch/base content
into valid table rows. While fixing the table, make sure the supported service
list stays aligned with the service enum in api/aicr/v1/server.yaml by
preserving the ocp entry alongside the other supported services.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 78c5e234-83dd-4d27-947f-8a1c6c1d0054

📥 Commits

Reviewing files that changed from the base of the PR and between 9be382b and 0769c3a.

📒 Files selected for processing (31)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/from_measurements_test.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/yaml_test.go
  • recipes/overlays/l40s-any.yaml
  • recipes/overlays/l40s-oke-inference.yaml
  • recipes/overlays/l40s-oke-training.yaml
  • recipes/overlays/l40s-oke-ubuntu-inference.yaml
  • recipes/overlays/l40s-oke-ubuntu-training-kubeflow.yaml
  • recipes/overlays/l40s-oke-ubuntu-training.yaml
  • recipes/overlays/oke.yaml
  • tests/uat/oci/cluster-config.yaml
  • tests/uat/oci/run
  • tests/uat/oci/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-deployment.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yaml
  • tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml
  • tests/uat/oci/tests/l40s-training-config.yaml

Comment thread docs/contributor/recipe.md Outdated
Comment thread docs/user/api-reference.md Outdated
Comment thread docs/user/cli-reference.md Outdated
Comment thread pkg/recipe/doc.go
Comment thread tests/uat/oci/run
Comment on lines +125 to +130
echo "::group::Install helm-diff plugin (idempotent)"
if ! helm plugin list 2>/dev/null | awk '{print $1}' | grep -qx diff; then
helm plugin install https://github.com/databus23/helm-diff
else
echo "helm-diff already installed"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect existing repo pins/usages for helm-diff:"
fd -HI 'uat-oci\.ya?ml|\.settings\.yaml|Makefile' . .github tests 2>/dev/null | sort
rg -n "helm-diff|plugin install|HELM_DIFF_VERSION" .github tests .settings.yaml Makefile

Repository: NVIDIA/aicr

Length of output: 1870


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Check repo pins for helm-diff and related tool setup:"
rg -n "helm-diff|HELM_DIFF_VERSION|plugin install .*databus23/helm-diff|helm plugin install" .settings.yaml Makefile tests/uat -g '!vendor/**'

echo
echo "Inspect relevant sections:"
sed -n '1,220p' .settings.yaml
echo
sed -n '1,260p' Makefile
echo
sed -n '110,165p' tests/uat/oci/run
echo
sed -n '130,170p' tests/uat/aws/run
echo
sed -n '130,170p' tests/uat/gcp/run

Repository: NVIDIA/aicr

Length of output: 24716


Pin the helm-diff plugin version in this script. This installs whatever upstream release is current at runtime, so UAT behavior can drift across runs; the same pattern also appears in tests/uat/aws/run and tests/uat/gcp/run.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/oci/run` around lines 125 - 130, The helm-diff installation step is
unpinned, so UAT runs can change when upstream releases change. Update the
install logic in the run script’s helm-diff plugin block to request a specific
helm-diff version instead of the default latest release, and apply the same
version pinning pattern in the matching helm-diff install blocks in
tests/uat/aws/run and tests/uat/gcp/run so all UAT environments behave
consistently.

Source: Coding guidelines

Comment on lines +36 to +38
componentRefs: ## alphabetically sorted
- name: cert-manager
- name: gpu-operator

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Assert the OKE cert-manager disablement explicitly.

This fixture only checks that cert-manager is present. It will not catch a regression that re-enables the managed addon, which is one of the key behaviors introduced in this PR. Please assert the exact rendered disabled state as well.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml` around lines 36 - 38,
Update the CUJ1 training assertion fixture to verify the OKE cert-manager addon
is explicitly disabled, not just present in componentRefs. In
assert-recipe.yaml, extend the existing cert-manager check so the rendered
recipe/manifest is validated against the disabled state for cert-manager, using
the cert-manager and componentRefs assertions together to catch any regression
that re-enables the managed addon.

Comment on lines +77 to +81
agent:
namespace: aicr-validation
tolerations:
- skyhook.nvidia.com=runtime-required:NoSchedule
- nvidia.com/gpu=present:NoSchedule

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Constrain validator jobs to the GPU node pool.

spec.validate.agent only adds tolerations. That still allows the validation Jobs to land on the untainted system pool instead of the L40S nodes, so this UAT can miss the GPU-specific path entirely. Mirror the GPU node selector used under spec.snapshot.agent.

Proposed fix
   validate:
     execution:
       failFast: true
     input:
       recipe: recipe.yaml
       snapshot: snapshot.yaml
     agent:
       namespace: aicr-validation
+      nodeSelector:
+        node.dgxc.nvidia.com/has-gpu: "true"
       tolerations:
         - skyhook.nvidia.com=runtime-required:NoSchedule
         - nvidia.com/gpu=present:NoSchedule
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
agent:
namespace: aicr-validation
tolerations:
- skyhook.nvidia.com=runtime-required:NoSchedule
- nvidia.com/gpu=present:NoSchedule
agent:
namespace: aicr-validation
nodeSelector:
node.dgxc.nvidia.com/has-gpu: "true"
tolerations:
- skyhook.nvidia.com=runtime-required:NoSchedule
- nvidia.com/gpu=present:NoSchedule
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/oci/tests/l40s-training-config.yaml` around lines 77 - 81, The
validate agent block only sets tolerations, so the validation Jobs can still
schedule onto the system pool; update spec.validate.agent to also mirror the GPU
node selector used by spec.snapshot.agent and keep the existing
namespace/tolerations in place. Use the agent stanza under spec.validate as the
target and align it with the nodeSelector settings from spec.snapshot.agent so
validator jobs are constrained to the L40S GPU pool.

@atif1996 atif1996 force-pushed the feat/oke-l40s-support branch from 0769c3a to 287af6a Compare June 28, 2026 00:41
- Add l40s-oke-training and l40s-oke-inference overlays for OKE
- Add `ol` OS enum (Oracle Linux, maps ID=ol from /etc/os-release)
- Require explicit `os: ol` on all non-ubuntu OKE overlays; mirrors
  GKE/COS model — no `os: any` fallback (query must specify OS)
- Disable cert-manager in the OKE base overlay via enabled: false
- Add fingerprint normalization for Oracle Linux (normalizeOSID)
- Update OpenAPI spec, CLI flags, and all doc enum lists with `ol`
- Add OCI OKE L40S training UAT test suite

Signed-off-by: Atif Mahmood <amahmood@nvidia.com>
Signed-off-by: Atif Mahmood <atif1996@users.noreply.github.com>
@atif1996 atif1996 force-pushed the feat/oke-l40s-support branch from 8c4e842 to ee608da Compare June 28, 2026 01:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/recipe/metadata_test.go (1)

2193-2210: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the new L40S/OKE leaves to this coverage table.

This PR adds l40s-oke-training and l40s-oke-inference, but TestNFDTopologyUpdater_OverlayCoverage still doesn't exercise them. A topology-updater regression on the new OKE/L40S inheritance path would currently pass unnoticed.

Suggested additions
+		{"l40s-oke-training", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentTraining, ""}, true},
+		{"l40s-oke-inference", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentInference, ""}, true},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/metadata_test.go` around lines 2193 - 2210,
TestNFDTopologyUpdater_OverlayCoverage is missing the new OKE/L40S cases, so add
coverage entries for the new l40s-oke-training and l40s-oke-inference leaves in
the existing criteria table. Update the table in metadata_test.go alongside the
other real-cluster GPU leaves so the overlay inheritance path for
CriteriaServiceOKE with CriteriaAcceleratorL40S and the training/inference
intents is exercised and regressions in the new leaf coverage are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml`:
- Around line 21-29: Update the CUJ1 Chainsaw UAT to exercise the new Oracle
Linux OKE path instead of the old Ubuntu selector. In the test spec’s
descriptive text and any recipe/bundle generation steps that still reference
`ubuntu`, change them to `ol` and ensure the `generate recipe`/`generate bundle`
flow explicitly passes `--os ol` so it matches the new OKE overlays.
Double-check the CUJ1 scenario steps and any related assertions in
`chainsaw-test.yaml` so the test validates the OKE/L40S/training path with
Oracle Linux throughout.

---

Outside diff comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 2193-2210: TestNFDTopologyUpdater_OverlayCoverage is missing the
new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and
l40s-oke-inference leaves in the existing criteria table. Update the table in
metadata_test.go alongside the other real-cluster GPU leaves so the overlay
inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the
training/inference intents is exercised and regressions in the new leaf coverage
are caught.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 470dc7bb-f5ae-44d8-a673-3e9ab22d8098

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4e842 and ee608da.

📒 Files selected for processing (38)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/cli/snapshot.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/from_measurements.go
  • pkg/fingerprint/from_measurements_test.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • pkg/recipe/oskind/oskind.go
  • pkg/recipe/oskind/oskind_test.go
  • pkg/recipe/yaml_test.go
  • recipes/overlays/a100-oke-training.yaml
  • recipes/overlays/gb200-oke-inference.yaml
  • recipes/overlays/gb200-oke-training.yaml
  • recipes/overlays/l40s-any.yaml
  • recipes/overlays/l40s-oke-inference.yaml
  • recipes/overlays/l40s-oke-training.yaml
  • recipes/overlays/oke-inference.yaml
  • recipes/overlays/oke-training.yaml
  • recipes/overlays/oke.yaml
  • tests/uat/oci/cluster-config.yaml
  • tests/uat/oci/run
  • tests/uat/oci/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-deployment.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yaml
  • tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml
  • tests/uat/oci/tests/l40s-training-config.yaml

Comment on lines +21 to +29
description: |
UAT CUJ1: Training workload on live OKE cluster with L40S GPU nodes.
Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4):
Step 1: Snapshot the live cluster
Step 2: Generate recipe (OKE/L40S/training/ubuntu)
Step 3: Validate deployment against live snapshot
Step 4: Generate bundle with node scheduling
Step 5: Validate bundle structure
Step 6: Multi-phase validation

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Switch this UAT to Oracle Linux.

Lines 25 and 60 still document/use ubuntu, but this PR moves OKE overlays to os: ol and requires explicit --os ol. As written, this test is no longer exercising the new OKE path and can fail once the old selector stops resolving.

Suggested fix
-    Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4):
+    Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4):
       Step 1: Snapshot the live cluster
-      Step 2: Generate recipe (OKE/L40S/training/ubuntu)
+      Step 2: Generate recipe (OKE/L40S/training/ol)
       Step 3: Validate deployment against live snapshot
@@
-                --os ubuntu \
+                --os ol \

Also applies to: 56-61

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml` around lines 21 - 29,
Update the CUJ1 Chainsaw UAT to exercise the new Oracle Linux OKE path instead
of the old Ubuntu selector. In the test spec’s descriptive text and any
recipe/bundle generation steps that still reference `ubuntu`, change them to
`ol` and ensure the `generate recipe`/`generate bundle` flow explicitly passes
`--os ol` so it matches the new OKE overlays. Double-check the CUJ1 scenario
steps and any related assertions in `chainsaw-test.yaml` so the test validates
the OKE/L40S/training path with Oracle Linux throughout.

OKE nodes set spec.providerID to a bare OCID ("ocid1.instance.oc1...")
with no scheme prefix, so the existing "oci://" split never matched and
the raw OCID leaked into the service fingerprint.

Fix in two layers:
- pkg/collector/k8s: parseProvider checks strings.HasPrefix("ocid1.")
  before the "://" split; removes the unreachable "oci" case.
- pkg/fingerprint: normalizeProviderID applied at read time so snapshots
  produced by older agent images are also corrected.

Signed-off-by: Atif Mahmood <amahmood@nvidia.com>
Signed-off-by: Atif Mahmood <atif1996@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/fingerprint/from_measurements.go`:
- Around line 323-333: Add a regression test for the raw-OCID compatibility path
in TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a34e66a6-5302-453d-865f-b0296bd697b9

📥 Commits

Reviewing files that changed from the base of the PR and between ee608da and ff8442b.

📒 Files selected for processing (3)
  • pkg/collector/k8s/node.go
  • pkg/collector/k8s/node_test.go
  • pkg/fingerprint/from_measurements.go

Comment on lines +323 to +333
// normalizeProviderID maps a raw Kubernetes spec.providerID (or an already-
// normalized name stored by the collector) to the service type string used in
// recipe criteria. OKE nodes emit a bare OCID ("ocid1.instance.oc1...") with
// no scheme prefix; other providers use "<scheme>://..." which the collector
// already normalizes, but we handle both here for resilience.
func normalizeProviderID(v string) string {
if strings.HasPrefix(v, "ocid1.") {
return "oke"
}
return v
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a raw-OCID regression case for this compatibility path.

normalizeProviderID was added specifically to recover older measurements that still carry a raw ocid1.* value, but the shown TestFromMeasurements_ServiceDetection table only exercises already-normalized provider names. Please add a raw-OCID case there so this backward-compat branch cannot regress silently. As per coding guidelines, "Write tests for new functionality."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/fingerprint/from_measurements.go` around lines 323 - 333, Add a
regression test for the raw-OCID compatibility path in
TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Snapshot fingerprint reports raw OCI OCID instead of "oke" for OKE clusters feat(recipes): add concrete L40 / L40S service-bound overlays

1 participant