feat(recipe): add L40S accelerator support#1510
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds L40S as a supported GPU accelerator and Oracle Linux ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.claude/skills/aicr-analyzing-snapshots/SKILL.md (1)
152-158: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winResolve the merge conflict before merge.
The conflict markers make this Markdown invalid, and the branch version also drops
ocp, which is still part of the supported service enum inapi/aicr/v1/server.yaml. Keep this table in sync with the API contract.Proposed cleanup
-<<<<<<< HEAD -| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` | -| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` | -======= -| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm` | -| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` | ->>>>>>> 6c7ce4c2 (feat(recipe): add L40S accelerator support (OKE overlays + fingerprint fix)) +| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` | +| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/aicr-analyzing-snapshots/SKILL.md around lines 152 - 158, Remove the merge conflict markers from the Markdown table in the snapshot analysis skill document and reconcile the branch/base content into valid table rows. While fixing the table, make sure the supported service list stays aligned with the service enum in api/aicr/v1/server.yaml by preserving the ocp entry alongside the other supported services.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/recipe.md`:
- Around line 152-158: Resolve the merge conflict in the criteria table by
removing the conflict markers and merging both valid updates into a single
Markdown row set. Keep the `service` values aligned with the API contract by
preserving `ocp`, and keep the `accelerator` values aligned with the latest
support by including `l40s`; verify the `CriteriaServiceType` and
`CriteriaAcceleratorType` entries remain consistent with the documented schema.
In `@docs/user/api-reference.md`:
- Around line 121-127: Resolve the merge conflict in the API reference table by
removing the conflict markers and choosing the entries that match the API
schema. In the table for the documented accelerator/service fields, keep the
`service` list aligned with `api/aicr/v1/server.yaml` by retaining `ocp`, and
keep the `accelerator` list consistent with the supported values by including
`l40s` from the feature branch. Use the `service` and `accelerator` rows in the
affected table as the anchors when updating the docs.
In `@docs/user/cli-reference.md`:
- Around line 389-395: Resolve the merge conflict in the CLI reference table by
removing the conflict markers and merging both intended updates. Keep
`--service` aligned with `api/aicr/v1/server.yaml` by preserving `ocp`, and
merge the `--accelerator` values so the list includes the new `l40s` entry
alongside the existing GPU types. Verify the final table formatting in
`docs/user/cli-reference.md` remains valid after the conflict is removed.
In `@pkg/recipe/doc.go`:
- Line 29: The godoc in doc.go is inconsistent because the detailed accelerator
list under the Criteria types section still omits CriteriaAcceleratorL40S even
though the summary comment already mentions l40s. Update the comment block
around the CriteriaAcceleratorType / CriteriaAcceleratorL40 bullet list to
include CriteriaAcceleratorL40S alongside CriteriaAcceleratorL40, keeping the
documented accelerator set aligned with the API changes.
In `@tests/uat/oci/run`:
- Around line 125-130: The helm-diff installation step is unpinned, so UAT runs
can change when upstream releases change. Update the install logic in the run
script’s helm-diff plugin block to request a specific helm-diff version instead
of the default latest release, and apply the same version pinning pattern in the
matching helm-diff install blocks in tests/uat/aws/run and tests/uat/gcp/run so
all UAT environments behave consistently.
In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml`:
- Around line 36-38: Update the CUJ1 training assertion fixture to verify the
OKE cert-manager addon is explicitly disabled, not just present in
componentRefs. In assert-recipe.yaml, extend the existing cert-manager check so
the rendered recipe/manifest is validated against the disabled state for
cert-manager, using the cert-manager and componentRefs assertions together to
catch any regression that re-enables the managed addon.
In `@tests/uat/oci/tests/l40s-training-config.yaml`:
- Around line 77-81: The validate agent block only sets tolerations, so the
validation Jobs can still schedule onto the system pool; update
spec.validate.agent to also mirror the GPU node selector used by
spec.snapshot.agent and keep the existing namespace/tolerations in place. Use
the agent stanza under spec.validate as the target and align it with the
nodeSelector settings from spec.snapshot.agent so validator jobs are constrained
to the L40S GPU pool.
---
Outside diff comments:
In @.claude/skills/aicr-analyzing-snapshots/SKILL.md:
- Around line 152-158: Remove the merge conflict markers from the Markdown table
in the snapshot analysis skill document and reconcile the branch/base content
into valid table rows. While fixing the table, make sure the supported service
list stays aligned with the service enum in api/aicr/v1/server.yaml by
preserving the ocp entry alongside the other supported services.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 78c5e234-83dd-4d27-947f-8a1c6c1d0054
📒 Files selected for processing (31)
.claude/skills/aicr-analyzing-snapshots/SKILL.md.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/contributor/recipe.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/cli/recipe.gopkg/client/v1/types.gopkg/fingerprint/doc.gopkg/fingerprint/from_measurements_test.gopkg/fingerprint/gpu_sku.gopkg/fingerprint/gpu_sku_test.gopkg/fingerprint/types.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/recipe/yaml_test.gorecipes/overlays/l40s-any.yamlrecipes/overlays/l40s-oke-inference.yamlrecipes/overlays/l40s-oke-training.yamlrecipes/overlays/l40s-oke-ubuntu-inference.yamlrecipes/overlays/l40s-oke-ubuntu-training-kubeflow.yamlrecipes/overlays/l40s-oke-ubuntu-training.yamlrecipes/overlays/oke.yamltests/uat/oci/cluster-config.yamltests/uat/oci/runtests/uat/oci/tests/cuj1-training/assert-recipe.yamltests/uat/oci/tests/cuj1-training/assert-validate-deployment.yamltests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yamltests/uat/oci/tests/cuj1-training/chainsaw-test.yamltests/uat/oci/tests/l40s-training-config.yaml
| echo "::group::Install helm-diff plugin (idempotent)" | ||
| if ! helm plugin list 2>/dev/null | awk '{print $1}' | grep -qx diff; then | ||
| helm plugin install https://github.com/databus23/helm-diff | ||
| else | ||
| echo "helm-diff already installed" | ||
| fi |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Inspect existing repo pins/usages for helm-diff:"
fd -HI 'uat-oci\.ya?ml|\.settings\.yaml|Makefile' . .github tests 2>/dev/null | sort
rg -n "helm-diff|plugin install|HELM_DIFF_VERSION" .github tests .settings.yaml MakefileRepository: NVIDIA/aicr
Length of output: 1870
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Check repo pins for helm-diff and related tool setup:"
rg -n "helm-diff|HELM_DIFF_VERSION|plugin install .*databus23/helm-diff|helm plugin install" .settings.yaml Makefile tests/uat -g '!vendor/**'
echo
echo "Inspect relevant sections:"
sed -n '1,220p' .settings.yaml
echo
sed -n '1,260p' Makefile
echo
sed -n '110,165p' tests/uat/oci/run
echo
sed -n '130,170p' tests/uat/aws/run
echo
sed -n '130,170p' tests/uat/gcp/runRepository: NVIDIA/aicr
Length of output: 24716
Pin the helm-diff plugin version in this script. This installs whatever upstream release is current at runtime, so UAT behavior can drift across runs; the same pattern also appears in tests/uat/aws/run and tests/uat/gcp/run.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/oci/run` around lines 125 - 130, The helm-diff installation step is
unpinned, so UAT runs can change when upstream releases change. Update the
install logic in the run script’s helm-diff plugin block to request a specific
helm-diff version instead of the default latest release, and apply the same
version pinning pattern in the matching helm-diff install blocks in
tests/uat/aws/run and tests/uat/gcp/run so all UAT environments behave
consistently.
Source: Coding guidelines
| componentRefs: ## alphabetically sorted | ||
| - name: cert-manager | ||
| - name: gpu-operator |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Assert the OKE cert-manager disablement explicitly.
This fixture only checks that cert-manager is present. It will not catch a regression that re-enables the managed addon, which is one of the key behaviors introduced in this PR. Please assert the exact rendered disabled state as well.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml` around lines 36 - 38,
Update the CUJ1 training assertion fixture to verify the OKE cert-manager addon
is explicitly disabled, not just present in componentRefs. In
assert-recipe.yaml, extend the existing cert-manager check so the rendered
recipe/manifest is validated against the disabled state for cert-manager, using
the cert-manager and componentRefs assertions together to catch any regression
that re-enables the managed addon.
| agent: | ||
| namespace: aicr-validation | ||
| tolerations: | ||
| - skyhook.nvidia.com=runtime-required:NoSchedule | ||
| - nvidia.com/gpu=present:NoSchedule |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Constrain validator jobs to the GPU node pool.
spec.validate.agent only adds tolerations. That still allows the validation Jobs to land on the untainted system pool instead of the L40S nodes, so this UAT can miss the GPU-specific path entirely. Mirror the GPU node selector used under spec.snapshot.agent.
Proposed fix
validate:
execution:
failFast: true
input:
recipe: recipe.yaml
snapshot: snapshot.yaml
agent:
namespace: aicr-validation
+ nodeSelector:
+ node.dgxc.nvidia.com/has-gpu: "true"
tolerations:
- skyhook.nvidia.com=runtime-required:NoSchedule
- nvidia.com/gpu=present:NoSchedule📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| agent: | |
| namespace: aicr-validation | |
| tolerations: | |
| - skyhook.nvidia.com=runtime-required:NoSchedule | |
| - nvidia.com/gpu=present:NoSchedule | |
| agent: | |
| namespace: aicr-validation | |
| nodeSelector: | |
| node.dgxc.nvidia.com/has-gpu: "true" | |
| tolerations: | |
| - skyhook.nvidia.com=runtime-required:NoSchedule | |
| - nvidia.com/gpu=present:NoSchedule |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/oci/tests/l40s-training-config.yaml` around lines 77 - 81, The
validate agent block only sets tolerations, so the validation Jobs can still
schedule onto the system pool; update spec.validate.agent to also mirror the GPU
node selector used by spec.snapshot.agent and keep the existing
namespace/tolerations in place. Use the agent stanza under spec.validate as the
target and align it with the nodeSelector settings from spec.snapshot.agent so
validator jobs are constrained to the L40S GPU pool.
0769c3a to
287af6a
Compare
- Add l40s-oke-training and l40s-oke-inference overlays for OKE - Add `ol` OS enum (Oracle Linux, maps ID=ol from /etc/os-release) - Require explicit `os: ol` on all non-ubuntu OKE overlays; mirrors GKE/COS model — no `os: any` fallback (query must specify OS) - Disable cert-manager in the OKE base overlay via enabled: false - Add fingerprint normalization for Oracle Linux (normalizeOSID) - Update OpenAPI spec, CLI flags, and all doc enum lists with `ol` - Add OCI OKE L40S training UAT test suite Signed-off-by: Atif Mahmood <amahmood@nvidia.com> Signed-off-by: Atif Mahmood <atif1996@users.noreply.github.com>
8c4e842 to
ee608da
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/recipe/metadata_test.go (1)
2193-2210: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd the new L40S/OKE leaves to this coverage table.
This PR adds
l40s-oke-trainingandl40s-oke-inference, butTestNFDTopologyUpdater_OverlayCoveragestill doesn't exercise them. A topology-updater regression on the new OKE/L40S inheritance path would currently pass unnoticed.Suggested additions
+ {"l40s-oke-training", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentTraining, ""}, true}, + {"l40s-oke-inference", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentInference, ""}, true},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/recipe/metadata_test.go` around lines 2193 - 2210, TestNFDTopologyUpdater_OverlayCoverage is missing the new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and l40s-oke-inference leaves in the existing criteria table. Update the table in metadata_test.go alongside the other real-cluster GPU leaves so the overlay inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the training/inference intents is exercised and regressions in the new leaf coverage are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml`:
- Around line 21-29: Update the CUJ1 Chainsaw UAT to exercise the new Oracle
Linux OKE path instead of the old Ubuntu selector. In the test spec’s
descriptive text and any recipe/bundle generation steps that still reference
`ubuntu`, change them to `ol` and ensure the `generate recipe`/`generate bundle`
flow explicitly passes `--os ol` so it matches the new OKE overlays.
Double-check the CUJ1 scenario steps and any related assertions in
`chainsaw-test.yaml` so the test validates the OKE/L40S/training path with
Oracle Linux throughout.
---
Outside diff comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 2193-2210: TestNFDTopologyUpdater_OverlayCoverage is missing the
new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and
l40s-oke-inference leaves in the existing criteria table. Update the table in
metadata_test.go alongside the other real-cluster GPU leaves so the overlay
inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the
training/inference intents is exercised and regressions in the new leaf coverage
are caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 470dc7bb-f5ae-44d8-a673-3e9ab22d8098
📒 Files selected for processing (38)
.claude/skills/aicr-analyzing-snapshots/SKILL.md.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/contributor/recipe.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/cli/recipe.gopkg/cli/snapshot.gopkg/client/v1/types.gopkg/fingerprint/doc.gopkg/fingerprint/from_measurements.gopkg/fingerprint/from_measurements_test.gopkg/fingerprint/gpu_sku.gopkg/fingerprint/gpu_sku_test.gopkg/fingerprint/types.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/recipe/metadata_test.gopkg/recipe/oskind/oskind.gopkg/recipe/oskind/oskind_test.gopkg/recipe/yaml_test.gorecipes/overlays/a100-oke-training.yamlrecipes/overlays/gb200-oke-inference.yamlrecipes/overlays/gb200-oke-training.yamlrecipes/overlays/l40s-any.yamlrecipes/overlays/l40s-oke-inference.yamlrecipes/overlays/l40s-oke-training.yamlrecipes/overlays/oke-inference.yamlrecipes/overlays/oke-training.yamlrecipes/overlays/oke.yamltests/uat/oci/cluster-config.yamltests/uat/oci/runtests/uat/oci/tests/cuj1-training/assert-recipe.yamltests/uat/oci/tests/cuj1-training/assert-validate-deployment.yamltests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yamltests/uat/oci/tests/cuj1-training/chainsaw-test.yamltests/uat/oci/tests/l40s-training-config.yaml
| description: | | ||
| UAT CUJ1: Training workload on live OKE cluster with L40S GPU nodes. | ||
| Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4): | ||
| Step 1: Snapshot the live cluster | ||
| Step 2: Generate recipe (OKE/L40S/training/ubuntu) | ||
| Step 3: Validate deployment against live snapshot | ||
| Step 4: Generate bundle with node scheduling | ||
| Step 5: Validate bundle structure | ||
| Step 6: Multi-phase validation |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Switch this UAT to Oracle Linux.
Lines 25 and 60 still document/use ubuntu, but this PR moves OKE overlays to os: ol and requires explicit --os ol. As written, this test is no longer exercising the new OKE path and can fail once the old selector stops resolving.
Suggested fix
- Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4):
+ Tests the aicr workflow against a real OKE cluster (BM.GPU.L40S.4):
Step 1: Snapshot the live cluster
- Step 2: Generate recipe (OKE/L40S/training/ubuntu)
+ Step 2: Generate recipe (OKE/L40S/training/ol)
Step 3: Validate deployment against live snapshot
@@
- --os ubuntu \
+ --os ol \Also applies to: 56-61
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml` around lines 21 - 29,
Update the CUJ1 Chainsaw UAT to exercise the new Oracle Linux OKE path instead
of the old Ubuntu selector. In the test spec’s descriptive text and any
recipe/bundle generation steps that still reference `ubuntu`, change them to
`ol` and ensure the `generate recipe`/`generate bundle` flow explicitly passes
`--os ol` so it matches the new OKE overlays. Double-check the CUJ1 scenario
steps and any related assertions in `chainsaw-test.yaml` so the test validates
the OKE/L40S/training path with Oracle Linux throughout.
OKE nodes set spec.providerID to a bare OCID ("ocid1.instance.oc1...")
with no scheme prefix, so the existing "oci://" split never matched and
the raw OCID leaked into the service fingerprint.
Fix in two layers:
- pkg/collector/k8s: parseProvider checks strings.HasPrefix("ocid1.")
before the "://" split; removes the unreachable "oci" case.
- pkg/fingerprint: normalizeProviderID applied at read time so snapshots
produced by older agent images are also corrected.
Signed-off-by: Atif Mahmood <amahmood@nvidia.com>
Signed-off-by: Atif Mahmood <atif1996@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/fingerprint/from_measurements.go`:
- Around line 323-333: Add a regression test for the raw-OCID compatibility path
in TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a34e66a6-5302-453d-865f-b0296bd697b9
📒 Files selected for processing (3)
pkg/collector/k8s/node.gopkg/collector/k8s/node_test.gopkg/fingerprint/from_measurements.go
| // normalizeProviderID maps a raw Kubernetes spec.providerID (or an already- | ||
| // normalized name stored by the collector) to the service type string used in | ||
| // recipe criteria. OKE nodes emit a bare OCID ("ocid1.instance.oc1...") with | ||
| // no scheme prefix; other providers use "<scheme>://..." which the collector | ||
| // already normalizes, but we handle both here for resilience. | ||
| func normalizeProviderID(v string) string { | ||
| if strings.HasPrefix(v, "ocid1.") { | ||
| return "oke" | ||
| } | ||
| return v | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Add a raw-OCID regression case for this compatibility path.
normalizeProviderID was added specifically to recover older measurements that still carry a raw ocid1.* value, but the shown TestFromMeasurements_ServiceDetection table only exercises already-normalized provider names. Please add a raw-OCID case there so this backward-compat branch cannot regress silently. As per coding guidelines, "Write tests for new functionality."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/fingerprint/from_measurements.go` around lines 323 - 333, Add a
regression test for the raw-OCID compatibility path in
TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.
Source: Coding guidelines
Summary
Add L40S as a supported accelerator for OKE, introduce
olas the Oracle Linux OS enum, and require explicit OS selection for all OKE overlays (mirrors the GKE/COS model — noos: anyfallback).Motivation / Context
Two gaps addressed:
26b9 → "l40s"but there were no OKE overlays for it.os:field (implicitany). Oracle's Gen2 GPU OKE image runs Oracle Linux 8.10 with the NVIDIA driver pre-installed at boot (driver.enabled: false). This PR makes the OS assumption explicit: OKE now requires--os ol, the same way GKE requires--os cos. A query without an OS flag returns no match — intentional, not a gap. Operators must confirm which node image they're running before a recipe is issued.Fixes: #1003
Fixes: #1517
Related: N/A
Type of Change
Component(s) Affected
pkg/recipe)pkg/collector,pkg/snapshotter)cmd/aicr,pkg/cli)docs/,examples/)pkg/fingerprint), overlays (recipes/), UAT testsImplementation Notes
olOS enum: Oracle Linux reportsID=olin/etc/os-release; mirrorscos(short OS name). Added tooskind,criteria.go,normalizeOSID(), OpenAPI spec, CLI flags, and all doc enum lists.oke,oke-training,oke-inference,a100-oke-training,gb200-oke-{training,inference},l40s-oke-{training,inference}) now carryos: ol. Ubuntu OKE overlays are unchanged. Noos: anyfallback — same as GKE requiring--os cos.l40s-any.yaml: Cross-cutting deployment-phase floor (4 checks + gpu-operator>= v24.6.0), mirroringa100-any.yaml.oke.yamlviaoverrides.enabled: false— OKE pre-installs it as a managed addon.driver.enabled: false/toolkit.enabled: falseinvalues-oke.yaml— Oracle Linux Gen2 GPU OKE image pre-installs both at the OS level.Testing
UAT test suite added at
tests/uat/oci/for CUJ1 (L40S OKE Oracle Linux training).Risk Assessment
--os olor using snapshot-driven criteria (fingerprinted Oracle Linux nodes) are unaffected. Callers passing no--osfor OKE now get no match instead of the Oracle Linux recipe — this is the intended behavior.Rollout notes: N/A
Checklist
make testwith-race)make lint)git commit -S)