Releases: NVIDIA/aicr
v0.16.0-oke-l40s-support.2
v0.16.0-oke-l40s-support
Changelog
New Features
- ddc4dde: feat(bundler): private-by-default agentgateway inference-gateway exposure (#1450) (@yuanchen8911)
- e9aba0d: feat(bundler): structured deploy.sh output with color and step headers (#1393) (@mohityadav8)
- b3b4eee: feat(ci): enforce single-owner model for /assign and self-only /unassign (@mchmarny)
- 9e6374b: feat(ci): self-serve issue assignment via /assign and /unassign comments (@mchmarny)
- e878161: feat(evidence): --no-sign push and pending/structured-cause verify (#1445) (@mchmarny)
- a758de2: feat(evidence): aicr evidence sign — sign an existing pushed bundle (#1446) (@mchmarny)
- db3ad73: feat(evidence): auto-sign evidence pointers on push (#1454) (@mchmarny)
- 480b306: feat(evidence): fork-based signing workflow + publishing docs (#1451) (@mchmarny)
- 03b14d9: feat(evidence): minimize sensitive content in evidence bundles (#1414) (@njhensley)
- da21e22: feat(measurement): NetworkTopology type + Subtype.Items for cross-repo l8k integration (#1417) (@almaslennikov)
- dc23abf: feat(network): NetworkTopology collector + l8k library integration (#1474) (@almaslennikov)
- ee608da: feat(recipe): add L40S OKE support with Oracle Linux OS enum (@atif1996)
- a1b6e20: feat(recipe): shared coordinate mapping + taxonomy spec (TG6) (#1409) (@mchmarny)
- ffe7522: feat(testgrid): add testgrid-publish CLI tool (TG2) (#1447) (@srao-nv)
- 9a741e6: feat(validate): enforce artifact apiVersion compatibility (#1387) (@mchmarny)
- 68eee88: feat(validate): warn on aicr version skew across inputs (#1386) (@mchmarny)
- d57f7ac: feat(validation): switch inference-perf frontend routing to least-loaded (#1399) (@yuanchen8911)
- 381ba2a: feat(validators): RoCE NET variant for nccl-all-reduce-bw (#1428) (@yuanchen8911)
- f0ddcd0: feat(verify): private Sigstore trust root (aicr verify --trust-root) (#1449) (@lockwobr)
Bug Fixes
- 732a8dd: fix(attestation): retry ambient OIDC token acquisition (#1389) (@mchmarny)
- 24edd51: fix(ci): correct docs-only detection in merge gate (#1381) (@mchmarny)
- eea2a81: fix(cli): validate/recipe UX papercuts (#1383 items 1-7) (#1391) (@mchmarny)
- fd8ba27: fix(evidence): key-constrain all redaction allowlist subtypes (#1419) (@njhensley)
- 9cd3fb7: fix(recipe): fix deployment ordering to honor enabled:false (#1465) (@yuanchen8911)
- 7f7ae67: fix(test): argocd-sync gate: use all-semantics, handle health-gated apps (#1397) (@rsd-darshan)
- a0f2d09: fix(uat): bind Prometheus PVC to cluster default StorageClass (#1455) (@njhensley)
- 026804e: fix(uat): gate install on the deployment validation phase (#1439) (@njhensley)
- ebbd322: fix(uat): gate on Prometheus readiness; retry conformance metric APIs (#1452) (@njhensley)
- 6663eb5: fix(uat): gate validate on cluster convergence (readiness wait + failFast) (#1426) (@njhensley)
- c4f0872: fix(uat): run all validate phases to gate on GPU readiness (#1416) (@njhensley)
- 58721c5: fix(uat): wait for nodewright tuning before validate readiness gate (#1429) (@njhensley)
- 6ac2cfc: fix(validator): drop single-shot ClusterPolicy check from expected-resources (#1495) (@njhensley)
- 410e31a: fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters) (#1408) (@yuanchen8911)
Other Tasks
- bf294ed: Add OpenShift Container Platform (OCP) as a service type (#1380) (@kaponco)
- 3307169: Delete .github/workflows/devtrace.yaml (@mchmarny)
- e927f39: chore (validator): pin Trainer self-install JobSet image to promoted registry (#1440) (@yuanchen8911)
- b391a17: chore(ci): add kaynetu to copy-pr-bot trustees (#1412) (@kaynetu)
- 9c2d362: chore(deps): Update dependency awscli to v1.45.34 (#1415) (@github-actions[bot])
- 6011f49: chore(deps): Update dependency google/go-containerregistry to v0.21.7 (#1469) (@github-actions[bot])
- 44f912b: chore(deps): Update dependency hauler-dev/hauler to v2 (#1482) (@github-actions[bot])
- dbd698a: chore(deps): Update gitea/gitea Docker tag to v1.26.3 (#1444) (@github-actions[bot])
- cafb630: chore(deps): Update k8s.io/utils digest to a95e086 (#1468) (@github-actions[bot])
- fc24c42: chore(deps): Update testing-tools (#1378) (@github-actions[bot])
- 6d147e2: chore(deps): Update testing-tools (#1396) (@github-actions[bot])
- 71aea47: chore(deps): Update testing-tools (#1470) (@github-actions[bot])
- a5583f6: chore(deps): Update testing-tools (#1481) (@github-actions[bot])
- 9eb3ce9: chore(evidence): refresh gb200-eks-ubuntu-training attestation (#1427) (@yuanchen8911)
- f330078: chore(fern): register v0.15.0 (#1370) (@github-actions[bot])
- 09039e8: chore(recipes): bump aws-efa to v0.5.29, align image tag to appVersion (#1418) (@yuanchen8911)
- 872f851: chore: deps: bump actions/cache from 5.0.5 to 6.0.0 (#1442) (@dependabot[bot])
- 822477e: chore: deps: bump actions/checkout from 6.0.3 to 7.0.0 (#1395) (@dependabot[bot])
- 8c4c46f: chore: deps: bump actions/setup-go from 6.4.0 to 6.5.0 (#1441) (@dependabot[bot])
- 3b285b0: chore: deps: bump azure/setup-helm from 5.0.0 to 5.0.1 (#1443) (@dependabot[bot])
- bbed4d4: chore: deps: bump renovatebot/github-action from 46.1.15 to 46.1.16 (#1398) (@dependabot[bot])
- b40188f: chore: deps: bump thingzio/devtrace-action from b35e96e42494859cfbab25d17499efb5cb30a59d to 167f053a160c2ddf5cf1fcd288bbb83e1da28930 (#1377) (@dependabot[bot])
- f2f8ed9: ci(kwok): add argocd-git lane with in-cluster gitea (#1390) (@haarchri)
- eac5eee: ci(supply-chain): add Rekor identity + consistency monitor (#1480) (@lockwobr)
- 4e6824e: ci(uat): refresh cloud creds before teardown to prevent GPU leak (#1456) (@njhensley)
- 9b4a664: docs(adr): ADR-013 migrate artifact API domain to aicr.run (#1492) (@mchmarny)
- 173e81c: docs(cli): explain recipe input modes (snapshot vs criteria) and why (#1411) (@yuanchen8911)
- b2ca362: docs(contributing): single commit-signing rule (-s -S) for all contributors (#1494) (@mchmarny)
- 0c2ebbe: docs(contributor): document exposing scheduling knobs without new flags (#1379) (@mchmarny)
- 9bd2268: docs(evidence): add H100 GKE COS and GB200 EKS Ubuntu training attestations (#1368) (@atif1996)
- 6fc3444: docs(recipes): note b200-any-training retirement (#1053) in wildcard guide (#1475) (@yuanchen8911)
- 596ba97: docs(testgrid): public TestGrid page + Fern nav (TG6) (#1463) (@mchmarny)
- d25244b: feat(evidence-gate): protected/other partition + component-scoped cascade (#1448) (@mchmarny)
- 15c16f3: feat(slinky-slurm): add conformance validator and CUJ demo (#1394) (@kaynetu)
- e19cb68: refactor(uat): stable digest-addressed OCI ref for evidence push (#1479) (@njhensley)
- 12b86df: test(uat): allow NATS ingress on UAT AWS/GCP clusters (#1376) (@mchmarny)
v0.15.0
This release focuses on recipe health scoring, improved deployment validation, improved snapshot/discovery, and extending software supply chain capabilities for enterprise users.
Highlights
Recipe Structural Health - New pkg/health engine computes per-recipe health signals (chart_pinned, constraints_wellformed, declared_coverage) and rolls them up into a recipe-health matrix. aicr recipe list surfaces structural-health columns (with a --no-health opt-out), a tools/health generator and weekly recipe-health-refresh workflow keep the matrix current, and a lint guard now requires healthCheck.assertFile.
Improved Deployment Validation - The chainsaw deployment-phase runner is now an in-process executor rather than a shelled-out binary. aicr validate runs all phases by default with a --fail-fast opt-in, fails closed on evaluator errors, and is nil-safe across health checks.
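The run-all-by-default, fail-fast opt-in, and fail-closed behavior described above can be sketched roughly as follows. This is a minimal Python illustration, not AICR's implementation: the run_phases function, the phase names, and the check callables are all hypothetical.

```python
def run_phases(phases, fail_fast=False):
    """Run (name, check) pairs in order and return {name: passed}.

    Hypothetical sketch: check() returns True on pass. Any exception
    raised by a check is treated as a failure (fail closed).
    """
    results = {}
    for name, check in phases:
        try:
            passed = bool(check())
        except Exception:
            passed = False  # evaluator error counts as a failed phase
        results[name] = passed
        if fail_fast and not passed:
            break  # opt-in: stop at the first failing phase
    return results


def broken_evaluator():
    raise RuntimeError("evaluator crashed")


# Phase names are illustrative only.
phases = [
    ("readiness", lambda: True),
    ("deployment", broken_evaluator),
    ("performance", lambda: True),
]

all_results = run_phases(phases)
fast_results = run_phases(phases, fail_fast=True)
```

Without fail_fast every phase is evaluated and reported; with it, the run stops after the first failure, so later phases do not appear in the result.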
Snapshot/Discovery - The collector now discovers GPU SKUs without nvidia-smi, removing the CUDA base image dependency and matching SKUs on token boundaries instead of substrings.
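To illustrate why token-boundary matching matters, here is a minimal Python sketch. The SKU list, the product-name strings, and both function names are assumptions made for the example; they are not taken from the AICR collector.

```python
import re

# Hypothetical SKU list; shorter names are prefixes of longer ones on purpose.
KNOWN_SKUS = ["A10", "A100", "L4", "L40S", "H100", "H200"]


def match_substring(product_name):
    """Naive approach: return the first SKU found anywhere in the name."""
    upper = product_name.upper()
    for sku in KNOWN_SKUS:
        if sku in upper:
            return sku
    return None


def match_token(product_name):
    """Split the name on non-alphanumeric characters and compare whole tokens."""
    tokens = set(re.split(r"[^A-Z0-9]+", product_name.upper()))
    for sku in KNOWN_SKUS:
        if sku in tokens:
            return sku
    return None
```

With these definitions, match_substring("NVIDIA A100-SXM4-80GB") reports "A10" because "A10" is a substring of "A100", while match_token returns "A100" for the same input.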
Closed Supply Chain - Signing and verification now work end-to-end in air-gapped and enterprise environments. aicr bundle supports KMS-backed signing (--signing-key) and private Sigstore deployments (--fulcio-url, --rekor-url); aicr verify --key validates bundles against a KMS or public key; and aicr evidence publish signs recipe evidence off-network. The recipe catalog itself now ships signed provenance for the V1 closed supply chain, and keyless signing warns before publishing identity to the public transparency log.
New Recipes & Overlays
- A100 training Kubeflow overlay chains for EKS, AKS, GKE COS, and OKE
- GB300 concrete EKS service-bound overlays
- OKE GB200 and AKS H100 Dynamo performance checks
CLI & Bundling
- aicr recipe list subcommand for catalog enumeration
- Gatekeeper added as an optional component
Inference Performance & Validation
- Inference-performance validation enhanced and tuned; gated on all worker services Ready
- nccl-all-reduce-bw gates wired for EKS + H200; GKE NCCL node selector made dynamic
- Bounded absent-resource retries in deployment-phase health checks
Thanks to @atif1996, @cdesiniotis, @dims, @haarchri, @JaydipGabani, @lalitadithya, @lockwobr, @njhensley, @pdmack, @pedjak, @rsd-darshan, @sttts, @xdu31, @yuanchen8911, and @mchmarny.
Changelog
New Features
- cfb0cb0: feat(agentgateway): scope inference-gateway LB to allowed source ranges (#1138) (@yuanchen8911)
- 3b8b1f3: feat(bundle): KMS-backed signing via --signing-key (#407) (#1205) (@lockwobr)
- 5316d3c: feat(bundle): private Sigstore via --fulcio-url and --rekor-url (#1158) (@lockwobr)
- b847f96: feat(bundler): retry sign.Bundle on transient Sigstore failures (#1251) (@mchmarny)
- 3c1f525: feat(bundler): warn on open agentgateway inference-gateway exposure (#1163) (@yuanchen8911)
- cd30fe5: feat(ci): add weekly recipe-health-refresh workflow (#1320) (@njhensley)
- 5f8647d: feat(cli): add --no-health opt-out to recipe list (#1314) (@njhensley)
- f0490fb: feat(cli): add --set-json/--set-file for list and object bundle overrides (#1162) (@yuanchen8911)
- b401339: feat(cli): add structural-health columns to recipe list (#1302) (@njhensley)
- a47a53f: feat(cli): warn before keyless signing publishes identity to public log (#1300) (@njhensley)
- 97934ad: feat(collector): driver-free GPU SKU discovery; remove nvidia-smi + CUDA base (#1352) (@mchmarny)
- eb8728d: feat(coverage): generated CUJ/CLI coverage matrix (RQ3) (#1316) (@mchmarny)
- 1674ea4: feat(evidence): add aicr evidence publish for off-network signing (#1140) (@njhensley)
- e1b0160: feat(health): add tools/health generator and recipe-health matrix (#1304) (@njhensley)
- 8d0da78: feat(health): chart_pinned signal + declared_coverage descriptor (#1293) (@mchmarny)
- fa92b43: feat(health): constraints_wellformed signal (parse-only, hermetic) (#1301) (@njhensley)
- 10b08f1: feat(health): pkg/health core Compute loop, resolves signal, rollup (#1291) (@mchmarny)
- 3e2f823: feat(recipe): add AKS H100 Dynamo perf check (#1232) (@yuanchen8911)
- 8f8bc56: feat(recipe): add OKE GB200 perf check (#1233) (@yuanchen8911)
- ec95e20: feat(recipe): add aicr recipe list subcommand for catalog enumeration (#1208) (@rsd-darshan)
- 57fbed0: feat(recipe): hydrate healthCheck.assertFile + suppression sentinel (#1231) (@mchmarny)
- ae6819c: feat(recipe): lint guard requiring healthCheck.assertFile + allowlist (#1244) (@mchmarny)
- 0bf2267: feat(recipe): signed catalog provenance for V1 closed supply chain (#1216) (@mchmarny)
- 463d6a1: feat(recipes): add A100 AKS training Kubeflow overlay chain (#1295) (@yuanchen8911)
- d8d3070: feat(recipes): add A100 EKS training Kubeflow overlay chain (#1305) (@yuanchen8911)
- fd64dd7: feat(recipes): add A100 GKE COS training Kubeflow overlay chain (#1306) (@yuanchen8911)
- 6eb85ac: feat(recipes): add A100 OKE training Kubeflow overlay chain (#1294) (@yuanchen8911)
- 4b817ce: feat(recipes): add concrete GB300 EKS service-bound overlays (#1319) (@yuanchen8911)
- cad0142: feat(recipes): backfill chainsaw health checks for 5 missing components (#1243) (@mchmarny)
- 81daab3: feat(recipes): deepen 21 chainsaw health checks; close epic #660 (#1245) (@mchmarny)
- 7bb7059: feat(recipes): migrate nvidia-dra-driver-gpu to registry.k8s.io v0.4.0 (#1285) (@mchmarny)
- e848e1f: feat(tests): KMS bundle-signing e2e against MiniStack over TLS (#1298) (@lockwobr)
- 81c3fb0: feat(tests): private-Sigstore bundle-signing e2e via Helm scaffold (#1321) (@lockwobr)
- 25a6cdd: feat(validator): ship chainsaw binary; activate deployment-phase runner (#1235) (@mchmarny)
- 4d5ac90: feat(validators): enhance inference performance validation (#1133) (@yuanchen8911)
- f6cb3cd: feat(validators): replace chainsaw binary with in-process executor (#1252) (@mchmarny)
- e3460fc: feat(verify): KMS/public-key bundle verification (aicr verify --key) (#1238) (@lockwobr)
Bug Fixes
- e3aa6b4: fix(bundler): disable kataSandboxDevicePlugin in gpu-operator values (#1343) (@atif1996)
- 5f51fe8: fix(ci): actually tear down AWS UAT cluster (destroy → apply) (#1213) (@njhensley)
- 484c61a: fix(ci): always upload recipe-evidence report so comment gate works (#1292) (@njhensley)
- eddc075: fix(ci): build patched nvkind with --config-source=file (#1237) (#1258) (@mchmarny)
- 1ef5bdb: fix(ci): resolve fork PRs for recipe-evidence sticky comment (#1297) (@njhensley)
- 7068779: fix(ci): shard Tier 3 KWOK matrix to stay under 256-config cap (#1173) (@njhensley)
- ff8a756: fix(ci): stamp publish with resolved tag instead of releases/latest API (#1136) (@pdmack)
- 85daf65: fix(ci): suppress chainsaw CVEs + apply VEX on release scan (#1366) (@mchmarny)
- 4e8f778: fix(ci): unblock build-attested workflow on missing HOMEBREW_DEPLOY_KEY (#1296) (@lockwobr)
- 4e86524: fix(docs): catch bare tags in MDX safety check (#1170) (@pedjak)
- c606d17: fix(docs): cover contributor docs in MDX check; catch autolinks (#1151) (@mchmarny)
- 915ed66: fix(docs): escape bare < for Fern MDX + harden MDX checker (#1367) (@mchmarny)
- 4b2864a: fix(docs): keep --set-json code span on single line for MDX check (@mchmarny)
- de2d9dd: fix(docs): use MDX comments in recipe-health.md so Fern publish parses (#1365) (@mchmarny)
- 833a2e5: fix(evidence): emit recipe-evidence pointer.yaml at 2-space indent (#1165) (@yuanchen8911)
- af563b2: fix(evidence): pull by digest and auto-tag pushes instead of :v1 (#1168) (@njhensley)
- 410f5a3: fix(fingerprint): match GPU SKUs on token boundaries not substrings (#1350) (@mchmarny)
- 6e95906: fix(health): exempt manifest-only Helm components from chart_pinned (#1303) (@njhensley)
- e32375c: fix(recipes): pin nvidia-dra-driver-gpu to 0.4.1-rc.1 for strict-YAML fix (#1341) (@yuanchen8911)
- 032b707: fix(scan): match VEX PURL to grype's image PURL + surface CVE IDs (@mchmarny)
- 909629b...
v0.14.0
This release focuses on further expansion of the AICR recipe matrix (H200, B200, RTX PRO 6000, BCM, Slinky Slurm, Run:ai), a new Go client for embedding AICR, and significant engine hardening that removes process-global state from the recipe and validator pipelines.
Highlights
In-Process Go Library — New pkg/aicr package exposes an aicr.Client facade for in-process consumers, allowing products and SDKs to drive recipe resolution, bundling, and validation without forking a CLI. The CLI and HTTP server already implement thin adapters over this facade.
aicr mirror — New top-level command for mirroring container images referenced by a recipe to an alternate registry, completing the air-gapped story that began with Helm vendoring in v0.13.0. The command reuses the recipe-bound DataProvider so its manifest reads are identical to what bundle and validate see.
Engine Hardening — The recipe and validator pipelines are now free of process-global state. Builder owns an isolated DataProvider, the criteria registry is per-provider (no more singleton), and the DataProvider interface is context-aware. Embedders can safely run multiple Builder instances concurrently against different sources.
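The isolation idea can be sketched in a few lines. The following is a conceptual Python illustration only: AICR itself is a Go project, and the DataProvider and Builder classes below, along with their fields and the resolve method, are hypothetical stand-ins rather than the pkg/recipe API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class DataProvider:
    """Hypothetical data source that owns its own criteria registry."""
    source: str
    criteria: set = field(default_factory=set)


class Builder:
    """Holds a private provider; no module-level registry is consulted."""

    def __init__(self, provider):
        self._provider = provider

    def resolve(self, accelerator):
        if accelerator not in self._provider.criteria:
            return None
        return f"{self._provider.source}:{accelerator}"


catalog_a = Builder(DataProvider("catalog-a", {"h100", "h200"}))
catalog_b = Builder(DataProvider("catalog-b", {"b200"}))

# Each builder sees only its own registry, so the two can run in parallel
# without one source's criteria leaking into the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda b: b.resolve("h200"), [catalog_a, catalog_b]))
```

Because the registry lives on the provider rather than in a shared singleton, a value known to one source stays unknown to the other.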
New Recipes & Overlays
- H200 promoted to a first-class accelerator type with EKS overlays
- GKE B200 service-bound overlays
- RTX PRO 6000 Blackwell (B40) overlays for EKS
- BCM service type added with overlays and nodewright reapply-on-reboot
- NVIDIA Run:ai platform support
- Slinky Slurm gains a cluster chart with EKS, Kind, and GKE COS H100 leaves
- H100 and generic tuning extended to H200 and RTX PRO 6000 EKS recipes
Validation & Performance
- Strict performance floors scoped to accelerator-bound recipes
- Deployment-phase floor delivered at per-accelerator wildcards
- Per-field union merge for validation phase checks
Other Improvements
- Bundle command supports app name for configurable Argo CD parent
- argocd-helm mixed-component bundles use native OCI source shape
- Recipe-set scheduling paths now correctly override CLI defaults
- Kubeconfig-aware serializer for ConfigMap output
Thanks to @ayuskauskas, @faganihajizada, @fallintoplace, @gat786, @haarchri, @hkii, @lockwobr, @njhensley, @pdmack, @resker, @xdu31, @yuanchen8911, and @mchmarny.
Changelog
New Features
- ae6c948: feat(aicr): top-level Go library facade for in-process consumers (#1072) (@hkii)
- 5c63c10: feat(aicr): wrap facade alias types as facade-owned structs (#1111) (@mchmarny)
- 971024e: feat(bundler): configurable parent Application name (--app-name) (#1036) (@mchmarny)
- 7adc586: feat(bundler): derive DRA chart-version annotation from resolved recipe (#973) (#1033) (@yuanchen8911)
- 9a31cc0: feat(bundler): drop generated undeploy.sh; delegate to helm uninstall (#1095) (@mchmarny)
- 6098765: feat(bundler): route registry/manifest reads through recipe-bound provider (#1016) (@mchmarny)
- ed6b480: feat(ci): allow publish-fern-docs to target a specific or latest tag (@mchmarny)
- 9e1841f: feat(cli): add evidence digest subcommand for recipe canonical hash (#1055) (@njhensley)
- 0e10705: feat(cli): add release-notes drafting skill for Claude Code (@mchmarny)
- 940fb15: feat(fern): adopt global-theme nvidia, remove per-repo theme assets (#995) (@pdmack)
- 095eb8d: feat(mirror): add mirror command (#967) (@haarchri)
- a1fc3ab: feat(mirror): thread recipe-bound DataProvider through manifest reads (#1123) (@mchmarny)
- 735c233: feat(recipe): add bcm service type with overlays (#1060) (@mchmarny)
- 94149a9: feat(recipe): context-aware DataProvider interface (#1121) (@mchmarny)
- c210874: feat(recipe): deliver deployment-phase floor at per-accelerator wildcards (#1001) (@yuanchen8911)
- c148e96: feat(recipe): extensible criteria values via catalog-driven runtime registry (#998) (#999) (@mchmarny)
- 4e6cd9c: feat(recipe): per-Builder DataProvider isolation; deprecate process globals (#1015) (@mchmarny)
- 76647b7: feat(recipe): per-field union merge for validation phase checks (#1103) (@mchmarny)
- cb098fa: feat(recipe): push owner-token guard down to pkg/recipe.RecipeResult (#1113) (@mchmarny)
- ba787a5: feat(recipe): register h200 as first-class accelerator type (#1091) (@yuanchen8911)
- f385af2: feat(recipes): add RTX PRO 6000 Blackwell (B40) overlays for EKS (#1046) (@yuanchen8911)
- bbf8176: feat(recipes): add concrete GKE B200 service-bound overlays (#1053) (@yuanchen8911)
- 8b49397: feat(recipes): add inference-perf to gb200-eks-ubuntu-inference-dynamo (#977) (@yuanchen8911)
- 81c0789: feat(recipes): add nodewright h100 tuning to H200 EKS recipes (#1102) (@yuanchen8911)
- 178bbe4: feat(recipes): add nodewright to bcm with reapply-on-reboot (#1105) (@ayuskauskas)
- 4daf2af: feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS (#1101) (@yuanchen8911)
- cce25f0: feat(recipes): pin nodewright-customizations packages by digest (#1037) (@ayuskauskas)
- 8435e1f: feat(validation): scope strict perf floor to accelerator-bound recipes (#1009) (@yuanchen8911)
- e565ce0: feat(validator): co-locate ai-service-metrics with Prometheus (#1066) (@njhensley)
- e30f18a: feat(validator): parameterize image override env vars (#1028) (@haarchri)
- 686d329: feat(validators): warm up inference-perf benchmark and make it tunable (#1096) (@yuanchen8911)
- 2dfa5e3: feat: add NVIDIA Run:ai as platform-runai (#955) (@resker)
Bug Fixes
- 506507b: fix(bundler): argocd-helm parent App uses native-OCI source shape (#1051) (@yuanchen8911)
- b05fc2e: fix(bundler): argocd-helm repoURL UX — surface key, emit install hint (#1081) (@mchmarny)
- 04e4418: fix(bundler): correct helmfile bundle README for stratified layout (#959) (@lockwobr)
- 51eb4b5: fix(bundler): derive argocd-helm chart name from OCI artifact path (#1032) (@mchmarny)
- d638d00: fix(bundler): fix bugs - assemble full OCI artifact in path-based child apps; quote chart name (#1034) (#1035) (@yuanchen8911)
- ef16e9e: fix(bundler): pin OCI sync for argocd-helm mixed-component -pre/-post children (#1039) (@mchmarny)
- ceec36e: fix(bundler): quote argocd app-of-apps metadata.name (#1011) (#1040) (@yuanchen8911)
- 3eab313: fix(bundler): recipe-set scheduling paths override CLI defaults (#982) (#1082) (@mchmarny)
- a578244: fix(bundler): replace placeholder GitRepository with ArtifactGenerator for Flux OCI (#1017) (@haarchri)
- e83728f: fix(bundler): scope helmfile disableValidation to primary release (#1125) (@mchmarny)
- 91a4552: fix(bundler): tolerate trailing slash on repoURL; quote Chart.yaml version (#1038) (@yuanchen8911)
- eec6349: fix(ci): broaden Renovate postUpgradeTasks allowlist (@mchmarny)
- 31c912a: fix(ci): drop privileged grants from /ok-to-test fork path (#1067) (@mchmarny)
- 37b3127: fix(ci): move fern registry PR step after publish (@mchmarny)
- 620c6f6: fix(ci): open PR instead of pushing fern version registry to main (@mchmarny)
- e423b8f: fix(ci): restore fetch-tags in publish-fern-docs checkout (@mchmarny)
- 734d3a9: fix(ci): retry release-asset curls to absorb transient 502s (@mchmarny)
- 423bc08: fix(ci): skip version registration for latest release and sort by semver (#979) (@pdmack)
- 1c82645: fix(ci): unblock packaging dry-run and /ok-to-test startup failures (#1128) (@mchmarny)
- 3c2aa56: fix(ci): use yq instead of sed for version stamp in publish workflow (#943) (@pdmack)
- 2cd479f: fix(deps): refresh helmfile checksums for v1.5.2 (@mchmarny)
- 45f3f1a: fix(kwok): drop recipe suffix from argocd-helm-oci in-cluster repoURL (#1047) (@yuanchen8911)
- b0bf12d: fix(kwok): make argocd OCI repoURL per-lane (follow-up to #1047) (#1048) (@yuanchen8911)
- 5f72700: fix(kwok): require Succeeded operationState in argocd sync gate (#1062) (@yuanchen8911)
- 216c76c: fix(kwok): single-snapshot verify_pods + settle retry (#1090) (#1092) (@yuanchen8911)
- 047123f: fix(performance): add GKE inference performance validation + cold-s...
v0.13.0
This release focuses on scaling out our recipe matrix, evidence-based recipe validation, additional deployer targets, and a hardened component supply chain.
Highlights
Recipe Evidence — New capability to capture evidence during cluster validation allows users and contributors alike to verify that the recipe actually deployed and delivered the expected performance characteristics without access to the validating cluster. aicr validate now emits a Recipe Evidence v1 bundle, and a new aicr evidence verify command validates that evidence from either a local directory or a signed OCI image. This new capability closes the loop between recipe authorship, deployment, and audit.
New Deployers — The bundler command now supports Helmfile and Flux alongside the existing Argo CD and raw-Helm targets. AICR also adds a URL-portable argocd-helm bundle option so users can apply a single manifest without local chart access. Helm vendoring is also supported for air-gapped environments (option for image mirroring is still coming — see NVIDIA/aicr#743).
Overlays & Components
- Added deployment validation to EKS GB200
- Added Slinky platform support with Slurm operator
- Added Talos Linux support via new os-talos mixin and bundler preManifestFiles
- Updated AKS H100 Dynamo to match working cluster state
- Migrated GB200 kernel-module-params to preManifestFiles
- Fixed AKS H100 RDMA network operator dependency and metrics
Other Improvements
- New doc site is now live at docs.nvidia.com/aicr with per-release versioning
- diff command to help detect configuration drift between recipes and live state
- Unified file-based config across snapshot, recipe, bundle, and validate to enable easier reproducibility
- Reliable cluster identity based on snapshot measurements to enable easier over-time correlation
- storage-class support on bundle command for registry-driven storage-class injection
Supply Chain — New CycloneDX 1.6 BOM generator publishes a per-recipe container image inventory as an in-repo artifact, with strict validation that rejects bare scalar image references missing a tag, digest, or registry host. A growing number of component chart versions now also explicitly digest-pin image references.
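One plausible reading of the bare-scalar rule is that a string counts as an image reference only if it carries at least one qualifier: a tag, a digest, or a registry host. The Python sketch below encodes that reading as an assumption; the is_qualified function and its heuristics are hypothetical and do not reproduce the BOM generator's actual checks.

```python
def is_qualified(ref):
    """Return True if ref has a tag, a sha256 digest, or a registry host.

    Assumed heuristics: a registry host is a first path component that
    contains "." or ":" or equals "localhost"; a tag is a ":" in the
    last path component of the name (digest part excluded).
    """
    name, _, digest = ref.partition("@")
    has_digest = digest.startswith("sha256:")

    first, slash, _rest = name.partition("/")
    has_host = bool(slash) and ("." in first or ":" in first or first == "localhost")

    last = name.rsplit("/", 1)[-1]
    has_tag = ":" in last

    return has_digest or has_host or has_tag
```

Under this reading, "nginx" and "nvidia/cuda" would be rejected as bare scalars, while "nginx:1.27", "nvcr.io/nvidia/cuda", and "busybox@sha256:abc" would be accepted.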
Thanks to @ayuskauskas, @dims, @dtzar, @faganihajizada, @haarchri, @Jont828, @lockwobr, @njhensley, @pdmack, @sanjeevrg89, @xdu31, @yuanchen8911, and @mchmarny.
Changelog
New Features
- (tools) Add install-rc helper for latest RC binary by @mchmarny
- (cli) Add --config support to snapshot command by @mchmarny
- (recipes) Update AKS H100 Dynamo recipe to match working cluster state by @Jont828
- (bom) Add CycloneDX 1.6 image BOM generator by @mchmarny
- (ci) Add self-hosted renovate alongside dependabot by @njhensley
- (recipes) Pin nfd and k8s-ephemeral-storage-metrics chart versions by @mchmarny
- (bom) Publish container image inventory as a doc artifact by @mchmarny
- (bundler) Add --storage-class flag for registry-driven injection by @dtzar
- (recipes) Pin chart versions for NVIDIA-owned components (#748 Phase B) by @mchmarny
- (recipes) Digest-pin explicit image references by @mchmarny
- (cli) Unified --config flag for recipe and bundle by @mchmarny
- (tools) Add s3c supply-chain presence checker by @mchmarny
- (bundler) URL-portable argocd-helm bundle (#664, #665) by @lockwobr
- (docs) Add versioned docs dropdown with CI content pinning by @pdmack
- (tools) Add local Talos cluster + snapshot chainsaw test by @ayuskauskas
- (fingerprint) Cluster identity projection from snapshot measurements by @njhensley
- Add support for helm vendoring by @lockwobr
- (oci) Expose URIScheme constant and Ensure/TrimScheme helpers by @njhensley
- (cli) Add aicr diff for configuration drift detection by @sanjeevrg89
- (config) aicr validate --config support by @njhensley
- (validator) Apply hybrid resource pattern to ValidatorCatalog by @xdu31
- (recipe) Extract Validation as standalone type with hybrid resource pattern by @xdu31
- os-talos mixin + bundler preManifestFiles support by @ayuskauskas
- (flux) Add bundle flux option by @haarchri
- (evidence) Emit Recipe Evidence v1 bundle from aicr validate by @njhensley
- (evidence) aicr evidence verify (directory input) by @njhensley
- (evidence) aicr evidence verify (signed OCI bundles) by @njhensley
- (recipes) Add deployment validation to GB200/EKS recipes by @njhensley
- (bundler) Add helmfile deployer by @lockwobr
- (recipes) Add Slinky slurm-operator as platform-slurm by @faganihajizada
Bug Fixes
- (validator) Accept pre-release tags as release versions by @mchmarny
- (bundler) Synthesize GKE ResourceQuota for critical-priority pods by @mchmarny
- (bundler) Split helmfile bundle into CRD + main sub-helmfiles by @mchmarny
- (bundler) Wire PreManifestFiles through flux deployer with terminal-aware dependsOn by @yuanchen8911
- (bundler) Carry localformat createNamespace into helmfile.yaml by @yuanchen8911
- (ci) Harden Fern docs CI and configure custom domain by @pdmack
- (docs) Replace bare angle-bracket URL that breaks MDX parser by @pdmack
- (recipes) Fully-qualify image refs in component manifests by @mchmarny
- AKS H100 RDMA sets network operator as dependency and fix chart values/metrics by @Jont828
- (recipes) Document aws-efa regional ECR override pattern by @mchmarny
- (bom) Reject bare scalars without tag, digest, or registry host by @mchmarny
- (validators) Bump aiperf-bench to python:3.13 to clear CVEs by @mchmarny
- (recipes) Track nri-device-injector by tag, ignore tcpxo image by @njhensley
- (api) Sync OpenAPI platform enum with Go criteria type by @mchmarny
- (bundler) Suppress kubectl auth prompt in undeploy.sh post-flight by @mchmarny
- (fern) Drop https scheme from instances URL by @pdmack
- (recipes) Migrate GB200 kernel-module-params to preManifestFiles by @mchmarny
- (validator) Write ValidationInput wire shape to ConfigMap by @njhensley
- (validator) Make ExtractResult sidecar-safe by reading 'validator' container explicitly by @xdu31
- (validator) Per-run RBAC names to prevent concurrent-run races by @yuanchen8911
- (evidence) Fix a regression in cncf ai conformance evidence collection by @yuanchen8911
- (ci) Populate frozen version content in preview build and surface fern errors by @pdmack
- (validator) Surface skip reason in CTRF, treat missing constraint as skip by @ayuskauskas
- (bundler) Stratify helmfile bundle by DAG level by @lockwobr
- (recipes) Fix stale kgateway-crds path in slinky-slurm-operator-crds comment by @yuanchen8911
- (recipes) Align overlay network-operator pins to v26.1.1 by @yuanchen8911
Other Tasks
- (demos) Add config-driven GKE CUJ with evidence verify by @mchmarny
- Add top level THIRD_PARTY_NOTICES by @ayuskauskas
- (bom) Wrap auto-generated image inventory with hand-written prose by @mchmarny
- (recipes) Enforce sha256 specifically in digest-pin gate (CodeRabbit follow-up to #778) by @mchmarny
- (adr) Add ADR-006 container image pinning policy by @mchmarny
- (go) .go-version as single source of truth for Go toolchain by @mchmarny
- (renovate) Hand workflow bumps to dependabot, disable dashboard by @njhensley
- Update copyright headers to NVIDIA CORPORATION & AFFILIATES by @ayuskauskas
- Update golang version by @lockwobr
- (design) Add ADR-007 verifiable recipe test evidence by @njhensley
- (tests) Use host aicr binary in snapshot deploy-agent test by @pdmack
- (design) Add ADR-008 KWOK CI deployer matrix ...
v0.12.1
Changelog
New Features
- cf3cd33: feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) (#706) (@lockwobr)
- 8843981: feat(bundler): add headless OIDC paths for bundle --attest (#707) (@lockwobr)
- 6593894: feat(cli): add skill command for AI agent integration (#691) (@yuanchen8911)
- b1e38fe: feat(cli): add snapshot analysis skill for Claude Code (@mchmarny)
- af8def3: feat(collector): add Talos OS support via Kubernetes Node info (#714) (@ayuskauskas)
- 639c53e: feat(recipes): enable NFD Topology Updater on production GPU recipes (#711) (@ArangoGutierrez)
- c1703eb: feat(release): publish THIRD_PARTY_NOTICES.md as a release asset (#722) (@ayuskauskas)
Bug Fixes
- af4df7c: fix(bundler): demote nodewright selector warnings to info severity (#704) (@mchmarny)
- 67ea746: fix(bundler): layer-neutral dynamic declaration errors (#703) (@mchmarny)
- cb6b98c: fix(bundler): preserve inner error codes instead of double-wrapping (#702) (@mchmarny)
- f94c66c: fix(ci): centralize GPU CI runtime pins (#710) (@yuanchen8911)
- 3c9f6ec: fix(ci): eliminate redundant CI workflow executions (@mchmarny)
- 2c99373: fix(ci): move organization-projects permission to workflow level (@mchmarny)
- 2fb1719: fix(ci): only send Slack notification on critical/high vulns (@mchmarny)
- 6900259: fix(ci): remove invalid organization-projects permission key (@mchmarny)
- 4c5c748: fix(ci): remove project board integration from issue report (@mchmarny)
- 7b96afa: fix(ci): trigger H100 GPU tests on shared recipe changes (#717) (@yuanchen8911)
- cd3abd2: fix(ci): use project board priority field instead of labels for issue report (@mchmarny)
- 39c8c29: fix(recipes): correct nvsentinel registry default to OCI source (#725) (@yuanchen8911)
- d66ba76: fix(recipes): drop hook-succeeded from torch-distributed runtime (#719) (@yuanchen8911)
- 604a324: fix(recipes): handle kubeflow-trainer v2.2.0 API changes (#724) (@yuanchen8911)
- 2038255: fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo (#718) (@yuanchen8911)
- 8d0168e: fix(recipes): use NFD chart version 0.18.3 without v prefix (#688) (@yuanchen8911)
- eec81c5: fix(tools): pin golangci-lint installer URL to version tag (@mchmarny)
- 7274cab: fix(verifier): add trust level reason to verify output (#705) (@mchmarny)
- de84b0f: fix: address top-7 code-review findings across packages (#721) (@mchmarny)
- 07d9ab9: fix: post-release code quality and correctness cleanup (@mchmarny)
- 0a04439: fix: update license check (#712) (@lockwobr)
- c56f142: fix: update license check (#713) (@lockwobr)
Other Tasks
- 8ffea23: Refactor and harden H100 GPU CI workflow (#694) (@yuanchen8911)
- eb1a673: chore(deps): bump controller-runtime, apiextensions-apiserver, kube-openapi, semver (@mchmarny)
- a165fae: chore(recipe): bump dynamo-platform from 0.9.x to 1.0.2 and add Grove chart (#459) (@Jont828)
- fa5c02b: chore(recipes): bump 6 components to upstream latest (phase 1) (#715) (@yuanchen8911)
- 14ff3fa: chore(recipes): bump kai-scheduler v0.14.1 and kubeflow-trainer 2.2.0 (#720) (@yuanchen8911)
- e2da266: chore: bump postcss from 8.5.8 to 8.5.10 in /site in the npm_and_yarn group across 1 directory (#672) (@dependabot[bot])
- 0c939ce: chore: dep update (@mchmarny)
- 3cc4e26: chore: deps: bump goreleaser/goreleaser-action from 7.1.0 to 7.2.1 (#692) (@dependabot[bot])
- b266684: chore: update change log (@mchmarny)
- fc0daca: ci: enable CodeRabbit auto-review on draft PRs (#690) (@yuanchen8911)
- 3b7e970: ci: retry grype install on transient github 502s (#701) (@yuanchen8911)
- c7c3154: docs(kwok): add prerequisites and fix copy-paste pitfalls (#709) (@arun-gupta)
- fc2eeca: docs(roadmap): restructure around v1 objectives (#708) (@mchmarny)
- 0a8d6e1: feat(nodewright-customizations): add gb200 eks support (#699) (@ayuskauskas)
v0.12.0
Changelog
New Features
- cbaba36: feat(bundler): add --dynamic flag for install-time values (#515) (#527) (@lockwobr)
- 1e550c7: feat(bundler): enable --attest and --data for argocd-helm (#573) (#627) (@lockwobr)
- 142c0d2: feat(ci): add aggregate merge-gate workflow (#651) (@mchmarny)
- 1caf260: feat(ci): add daily Slack issue status report (@mchmarny)
- ad682ef: feat(ci): add daily image vulnerability scan workflow (@mchmarny)
- 42cfd26: feat(ci): auto-assign issues based on area labels (#513) (@mchmarny)
- 9b09c94: feat(cli): add dynamic shell completion for flag values (#339) (#512) (@lockwobr)
- 1b25135: feat(cli): auto-hydrate RecipeMetadata overlays in validate and bundle (#595) (@njhensley)
- f2aeaf2: feat(evidence): add NIM support to evidence collection and restructure conformance docs (#479) (@yuanchen8911)
- 6137c0b: feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images (#463) (@yuanchen8911)
- 4e158cf: feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check (#640) (@njhensley)
- 7f91140: feat(recipe): add NFD as standalone shared component (#518) (@ArangoGutierrez)
- f47e95f: feat(recipe): add mixin composition for OS and platform fragments (#501) (@yuanchen8911)
- 94fb041: feat(recipe): merge external validator catalog with embedded when provided through DataProvider (#588) (@njhensley)
- a66de21: feat(recipes): add NIM Operator recipe for CNCF AI Conformance (#478) (@yuanchen8911)
- 16d670d: feat(release): add pre-release support (#639) (@mchmarny)
- 306cb9b: feat(validator): add --node-selector and --toleration flags for validation workload scheduling (#444) (@atif1996)
- 1db88e4: feat(validator): add AICR_VALIDATOR_IMAGE_TAG env-var override (#666) (@yuanchen8911)
- 3a86364: feat(validator): add inference performance validation (#641) (@yuanchen8911)
- 6f7b4c1: feat: Add AKS UAT chainsaw tests for training and inference CUJs (#476) (@Jont828)
- 306b785: feat: GB200 EKS NET/NVLS NCCL validation and driver bump (#668) (@njhensley)
- ba20188: feat: add HardwareDetector interface and measurement keys for NFD integration (#482) (@ArangoGutierrez)
- 83c18bc: feat: add component contributor test harness (#508) (@ArangoGutierrez)
- 340452b: feat: add support for Akamai (#517) (@lalitadithya)
- 5a33265: feat: auto-install shell completions via install (#504) (@lockwobr)
- 81cf701: feat: implement NFD-based GPU hardware detection (#494) (@ArangoGutierrez)
- 7283e9c: feat: two-phase GPU collection with hardware detection support (#495) (@ArangoGutierrez)
- 02002ca: feat: wire NFDHardwareDetector into production snapshot pipeline (#502) (@ArangoGutierrez)
Bug Fixes
- 9d57dfb: fix(build): use FullCommit in goreleaser to match CI image tags (#658) (@mchmarny)
- 228e518: fix(bundler): add pre-flight finalizer check to undeploy.sh (#406) (#561) (@lockwobr)
- 64b8759: fix(bundler): allow Helm-style array indexing in --set paths (#643) (@yuanchen8911)
- c12c783: fix(bundler): fix undeploy template pre/post-flight checks (#602) (@yuanchen8911)
- 6f4ec0e: fix(bundler): harden filepath.Join with SafeJoin for path-traversal protection (#578) (@lockwobr)
- 966d775: fix(bundler): resolve ArgoCD RepoURL placeholder in child applications (#520) (@mchmarny)
- ca9d96c: fix(bundler): scope cleanup to bundle components and remove stale skyhook taints (#477) (@yuanchen8911)
- 50825cc: fix(bundler): skip helm commands for manifest-only components in README (@mchmarny)
- 8eac760: fix(ci): add --platform to aiperf-bench E2E docker build (#674) (@xdu31)
- 57e6b8d: fix(ci): add -mod=vendor to snapshot agent build (#534) (@yuanchen8911)
- 0a78409: fix(ci): add MDX safety check for non-self-closing img tags (#620) (@pdmack)
- 9e481d7: fix(ci): add diagnostic logging and multi-assignee support to issue triage (@mchmarny)
- db9f3ab: fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind (#563) (@yuanchen8911)
- e2586f0: fix(ci): auto-label new issues by area and assign owners (#535) (@yuanchen8911)
- 7ad4f96: fix(ci): correct artifact action SHA pins in vuln scan workflow (@mchmarny)
- f5f7387: fix(ci): deduplicate conformance coverage in GPU CI (#577) (@yuanchen8911)
- 4a08c63: fix(ci): enable manual trigger for fern-docs-ci workflow (@mchmarny)
- d30e235: fix(ci): expand GPU test triggers to cover collector, snapshotter, validator, and add run-gpu-tests label (#514) (@xdu31)
- f489db3: fix(ci): fix fern instances URL basepath and surface publish URL in step summary (#568) (@pdmack)
- 42877f6: fix(ci): fix fern preview metadata and add continuous staging publish (#546) (@pdmack)
- d821306: fix(ci): improve GPU test reliability and deploy timeout handling (#539) (@yuanchen8911)
- 4988346: fix(ci): install gke-gcloud-auth-plugin before cluster connect (@mchmarny)
- 7b0dbb1: fix(ci): make issue report counts clickable Slack links (@mchmarny)
- 39e3114: fix(ci): match artifact download pattern to upload names (@mchmarny)
- 1fb1695: fix(ci): move GPU concurrency to test jobs (#581) (@yuanchen8911)
- 945a57d: fix(ci): pin e2e goreleaser and exclude local build artifacts (#580) (@yuanchen8911)
- c334bfc: fix(ci): query GPU snapshot by subtype name instead of index (#509) (@yuanchen8911)
- c294788: fix(ci): remove invalid --base-image flag from ko build (@mchmarny)
- abad89a: fix(ci): replace middle-dot separators with commas in issue report (@mchmarny)
- d988b02: fix(ci): replace push path filters with runtime path gate in GPU workflows (#558) (@yuanchen8911)
- 352b006: fix(ci): safe manifest publishing (#586) (@njhensley)
- 40bb85d: fix(ci): set GKE cluster name at correct config path (@mchmarny)
- d61dbfa: fix(ci): set deployment.destroy as boolean, not string (@mchmarny)
- 9e44eb7: fix(ci): shorten GKE deployment ID to fit SA name limit (@mchmarny)
- 5785d81: fix(ci): skip capacity pre-check for shared GCP reservations (@mchmarny)
- a7c8bf6: fix(ci): surface fern generate errors in preview (#650) (@mchmarny)
- 0eaa16f: fix(ci): use --bare flag for ko build in vuln scan workflow (@mchmarny)
- 67f03f4: fix(ci): use KEY_CONTENT env var for GKE provisioner credentials (@mchmarny)
- 56f20f6: fix(ci): use anchored regex for lychee exclude-path patterns (#547) (@pdmack)
- d117fbd: fix(ci): use config-based destroy for GKE provisioner (@mchmarny)
- 984b244: fix(ci): use correct field name 'subtype' in GPU snapshot validation (#511) (@yuanchen8911)
- bb072b1: fix(ci): use explicit empty mapping for workflow_dispatch (@mchmarny)
- b2f20fa: fix(ci): use search API for first-time contributor detection (#524) (@yuanchen8911)
- 8015fa8: fix(cli): --no-cluster must not deploy the snapshot-capture agent (#604) (@yuanchen8911)
- 173dba5: fix(recipe): handle null override in mergeValues to delete keys (#458) (@Jont828)
- a228540: fix(recipe): scope mixin fallback to affected candidates (#521) (@yuanchen8911)
- 0b98847: fix(recipes): disable Dynamo ssh-keygen on Kind (#670) (@yuanchen8911)
- 08b2cb2: fix(recipes): fix NIM operator validation and demo script issues (#483) (@yuanchen8911)
- 7466275: fix(scan): add pillow and python CVEs to grype ignore list (@mchmarny)
- 97be223: fix(scan): revert aiperf-bench base image to python:3-slim (@mchmarny)
- 5788bca: fix(scan): revert aiperf-bench base image to python:3.12-slim (@mchmarny)
- 7c547ad: fix(scan): suppress all critical/high C...
v0.11.1
Changelog
New Features
- 76d27c7: feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling (#450) (@yuanchen8911)
Bug Fixes
- 0d267c9: fix(api): add b200 accelerator to OpenAPI spec enum (#455) (@nvidiajeff)
- cdc9bf4: fix(cli): replace broken shell completion with full flag+alias support (#454) (@nvidiajeff)
- 692bbf0: fix(validator): templatize EKS NCCL runtime for dynamic EFA and instance type discovery (#447) (@xdu31)
v0.11.0
Changelog
New Features
- 500b561: feat(recipes): add GKE COS inference and Dynamo overlay recipes (#414) (@yuanchen8911)
- 3e46e47: feat(snapshot): add --runtime-class flag for CDI environments (#434) (@atif1996)
- d3fd483: feat(validator): add EKS/GKE cluster autoscaling fallback (#438) (@yuanchen8911)
- 87fd28f: feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays (#415) (@Jont828)
- 0866ef0: feat: add B200 accelerator type support (#437) (@atif1996)
- 46736f8: feat: add query command for hydrated recipe value extraction (#445) (@mchmarny)
Bug Fixes
- 7c377c1: fix(bundler): clean up orphaned KAI and Kubeflow Trainer CRDs on undeploy (#416) (@yuanchen8911)
- 437126c: fix(gke): remove CAP_ prefix from capability names in TCPXO manifests (#428) (@yuanchen8911)
- f2ec6b2: fix(gke): update TCPXO to NRI profile without hostNetwork (#420) (@yuanchen8911)
- 8a65335: fix(validator): add retry for ai-service-metrics Prometheus query (#393) (@yuanchen8911)
- d99235e: fix(validator): remove hostNetwork and privileged from GKE NCCL runtime, use NRI device injection (#427) (@xdu31)
- e15a3c6: fix(validator): source NCCL env from host profile instead of hardcoding (#422) (@xdu31)
- 70efe82: fix: ArgoCD deployer generates valid YAML, add structural validation (#410) (#413) (@lockwobr)
Other Tasks
- 84f3c4c: chore: bump nvsentinel from v0.10.x to v1.1.0 (#423) (@mchmarny)
- 75092d8: chore: deps: bump github.com/in-toto/attestation from 1.1.2 to 1.2.0 (#431) (@dependabot[bot])
- ea19bdf: chore: deps: bump github/codeql-action from 4.32.6 to 4.33.0 (#418) (@dependabot[bot])
- a10d4b3: chore: deps: bump google.golang.org/grpc from 1.79.2 to 1.79.3 (#430) (@dependabot[bot])
- 9e81d69: chore: deps: bump the kubernetes group with 3 updates (#446) (@dependabot[bot])
- f23ade5: chore: ignore movies (@mchmarny)
- d4e818f: ci(kwok): implement tiered testing strategy per ADR-003 (#432) (@mchmarny)
- 9101d29: ci: build and publish validator images on merge to main (#412) (@yuanchen8911)
- ff9c66d: docs(conformance): update CNCF evidence for multi-platform and training (#425) (@yuanchen8911)
- 5d4aa7c: docs(validator): add custom image testing and private registry guide (#417) (@xdu31)