Releases: NVIDIA/aicr

v0.16.0-oke-l40s-support.2

28 Jun 14:48
Immutable release. Only release title and notes can be modified.
ff8442b

Pre-release

Changelog

Bug Fixes

  • ff8442b: fix(fingerprint): detect OKE service from raw OCID providerID (@atif1996)

v0.16.0-oke-l40s-support

28 Jun 01:51
ee608da

Pre-release

Changelog

New Features

Bug Fixes

Other Tasks

v0.15.0

15 Jun 19:54
915ed66

This release focuses on recipe health scoring, improved deployment validation and snapshot/discovery, and extended software supply chain capabilities for enterprise users.

Highlights

Recipe Structural Health - New pkg/health engine computes per-recipe health signals (chart_pinned, constraints_wellformed, declared_coverage) and rolls them up into a recipe-health matrix. aicr recipe list surfaces structural-health columns (with a --no-health opt-out), a tools/health generator and weekly recipe-health-refresh workflow keep the matrix current, and a lint guard now requires healthCheck.assertFile.

Improved Deployment Validation - The chainsaw deployment-phase runner is now an in-process executor rather than a shelled-out binary. aicr validate runs all phases by default with a --fail-fast opt-in, fails closed on evaluator errors, and is nil-safe across health checks.
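
The run-all-by-default, fail-fast, and fail-closed behavior described above can be pictured with a small sketch. This is not AICR's implementation: the function name run_phases, the phase names, and the "pass"/"fail" strings are all hypothetical, chosen only to illustrate the control flow.

```python
from typing import Callable

# Hypothetical phase type: a name plus a callable returning True on pass.
Phase = tuple[str, Callable[[], bool]]


def run_phases(phases: list[Phase], fail_fast: bool = False) -> dict[str, str]:
    """Run every phase in order; stop at the first failure only if fail_fast is set."""
    results: dict[str, str] = {}
    for name, check in phases:
        try:
            status = "pass" if check() else "fail"
        except Exception:
            # Fail closed: an evaluator error is recorded as a failure, never a pass.
            status = "fail"
        results[name] = status
        if fail_fast and status == "fail":
            break
    return results


def broken_evaluator() -> bool:
    raise RuntimeError("evaluator crashed")


# Illustrative phase names only.
phases = [
    ("readiness", lambda: True),
    ("deployment", broken_evaluator),
    ("performance", lambda: True),
]

print(run_phases(phases))                  # all three phases are reported
print(run_phases(phases, fail_fast=True))  # stops after "deployment" fails
```

Without fail_fast, a later phase still runs and is reported even though an earlier one failed, which is the behavior the notes describe as the new default.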

Snapshot/Discovery - The collector now discovers GPU SKUs without nvidia-smi, removing the CUDA base image dependency and matching SKUs on token boundaries instead of substrings.
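
To see why token-boundary matching matters, consider a minimal sketch. The helper names tokens and matches_sku and the device strings are assumptions for illustration; the actual collector logic may differ.

```python
import re


def tokens(name: str) -> list[str]:
    """Split a device name into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def matches_sku(device_name: str, sku: str) -> bool:
    """Return True if sku appears as a contiguous run of whole tokens."""
    hay, needle = tokens(device_name), tokens(sku)
    if not needle:
        return False
    n = len(needle)
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))


# A substring test would find "A10" inside "A100"; token matching does not.
print("A10" in "NVIDIA A100-SXM4-80GB")              # True (substring false positive)
print(matches_sku("NVIDIA A100-SXM4-80GB", "A10"))   # False
print(matches_sku("NVIDIA A100-SXM4-80GB", "A100"))  # True
```

The same reasoning keeps a shorter SKU name from matching a longer one that merely starts with the same characters.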

Closed Supply Chain - Signing and verification now work end-to-end in air-gapped and enterprise environments. aicr bundle supports KMS-backed signing (--signing-key) and private Sigstore deployments (--fulcio-url, --rekor-url); aicr verify --key validates bundles against a KMS or public key; and aicr evidence publish signs recipe evidence off-network. The recipe catalog itself now ships signed provenance for the V1 closed supply chain, and keyless signing warns before publishing identity to the public transparency log.

New Recipes & Overlays

  • A100 training Kubeflow overlay chains for EKS, AKS, GKE COS, and OKE
  • GB300 concrete EKS service-bound overlays
  • OKE GB200 and AKS H100 Dynamo performance checks

CLI & Bundling

  • aicr recipe list subcommand for catalog enumeration
  • Gatekeeper added as an optional component

Inference Performance & Validation

  • Inference-performance validation enhanced and tuned; gated on all worker services Ready
  • nccl-all-reduce-bw gates wired for EKS + H200; GKE NCCL node selector made dynamic
  • Bounded absent-resource retries in deployment-phase health checks
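
As an illustration of the bounded absent-resource retries mentioned in the last item, here is a minimal sketch. The function name, the attempt limit, and the convention that None means "resource not found yet" are assumptions, not AICR's API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def get_with_bounded_retries(
    fetch: Callable[[], Optional[T]],
    max_attempts: int = 5,
    delay_s: float = 0.0,
) -> Optional[T]:
    """Call fetch until it returns a non-None value or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        value = fetch()
        if value is not None:
            return value
        if attempt < max_attempts:
            time.sleep(delay_s)  # wait before the next attempt
    return None  # still absent after max_attempts: give up instead of looping forever


# Simulated resource that only appears on the third lookup.
calls = {"n": 0}


def fake_lookup() -> Optional[str]:
    calls["n"] += 1
    return "worker-0" if calls["n"] >= 3 else None


print(get_with_bounded_retries(fake_lookup))  # worker-0
```

Capping the attempts means a resource that never appears produces a definite result rather than an indefinite wait.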

Thanks to @atif1996, @cdesiniotis, @dims, @haarchri, @JaydipGabani, @lalitadithya, @lockwobr, @njhensley, @pdmack, @pedjak, @rsd-darshan, @sttts, @xdu31, @yuanchen8911, and @mchmarny.

Changelog

New Features

Bug Fixes

v0.14.0

01 Jun 17:37
0479e45

This release focuses on further expanding the AICR recipe matrix (H200, B200, RTX PRO 6000, BCM, Slinky Slurm, Run:ai), a new Go client for embedding AICR, and significant engine hardening that removes process-global state from the recipe and validator pipelines.

Highlights

In-Process Go Library — New pkg/aicr package exposes an aicr.Client facade for in-process consumers, allowing products and SDKs to drive recipe resolution, bundling, and validation without forking a CLI. The CLI and HTTP server already implement thin adapters over this facade.

aicr mirror — New top-level command for mirroring container images referenced by a recipe to an alternate registry, completing the air-gapped story that began with Helm vendoring in v0.13.0. The command reuses the recipe-bound DataProvider so its manifest reads are identical to what bundle and validate see.

Engine Hardening — The recipe and validator pipelines are now free of process-global state. Builder owns an isolated DataProvider, the criteria registry is per-provider (no more singleton), and the DataProvider interface is context-aware. Embedders can safely run multiple Builder instances concurrently against different sources.

New Recipes & Overlays

  • H200 promoted to a first-class accelerator type with EKS overlays
  • GKE B200 service-bound overlays
  • RTX PRO 6000 Blackwell (B40) overlays for EKS
  • BCM service type added with overlays and nodewright reapply-on-reboot
  • NVIDIA Run:ai platform support
  • Slinky Slurm gains a cluster chart with EKS, Kind, and GKE COS H100 leaves
  • H100 and generic tuning extended to H200 and RTX PRO 6000 EKS recipes

Validation & Performance

  • Strict performance floors scoped to accelerator-bound recipes
  • Deployment-phase floor delivered at per-accelerator wildcards
  • Per-field union merge for validation phase checks
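
One plausible reading of "per-field union merge" is that, when a base and an overlay both define a field, the entries are unioned rather than the overlay replacing the base. The sketch below illustrates that reading only; the field names and check names are hypothetical, and the actual merge semantics in AICR may differ.

```python
def union_merge(
    base: dict[str, list[str]],
    overlay: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Merge two field->entries maps, unioning entries per field in first-seen order."""
    merged: dict[str, list[str]] = {}
    fields = list(base) + [f for f in overlay if f not in base]
    for field in fields:
        seen: list[str] = []
        for item in base.get(field, []) + overlay.get(field, []):
            if item not in seen:  # drop duplicates, keep original order
                seen.append(item)
        merged[field] = seen
    return merged


# Hypothetical field and check names.
base = {"checks": ["operator-health"], "constraints": ["k8s>=1.30"]}
overlay = {"checks": ["operator-health", "gpu-ready"]}

print(union_merge(base, overlay))
# {'checks': ['operator-health', 'gpu-ready'], 'constraints': ['k8s>=1.30']}
```

Under this reading, an overlay that sets only one field no longer discards the other fields inherited from the base.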

Other Improvements

  • Bundle command supports app name for configurable Argo CD parent
  • argocd-helm mixed-component bundles use native OCI source shape
  • Recipe-set scheduling paths now correctly override CLI defaults
  • Kubeconfig-aware serializer for ConfigMap output

Thanks to @ayuskauskas, @faganihajizada, @fallintoplace, @gat786, @haarchri, @hkii, @lockwobr, @njhensley, @pdmack, @resker, @xdu31, @yuanchen8911, and @mchmarny.

Changelog

New Features

Bug Fixes

v0.13.0

16 May 00:18
134b23d

This release focuses on scaling out our recipe matrix, evidence-based recipe validation, additional deployer targets, and a hardened component supply chain.

Highlights

Recipe Evidence — New capability to capture evidence during cluster validation allows users and contributors alike to verify that the recipe actually deployed and delivered the expected performance characteristics without access to the validating cluster. aicr validate now emits a Recipe Evidence v1 bundle, and a new aicr evidence verify command validates that evidence from either a local directory or a signed OCI image. This new capability closes the loop between recipe authorship, deployment, and audit.

New Deployers — The bundler command now supports Helmfile and Flux alongside the existing Argo CD and raw-Helm targets. AICR also adds a URL-portable argocd-helm bundle option so users can apply a single manifest without local chart access. Helm vendoring is also supported for air-gapped environments (option for image mirroring is still coming — see NVIDIA/aicr#743).

Overlays & Components

  • Added deployment validation to EKS GB200
  • Added Slinky platform support with Slurm operator
  • Added Talos Linux support via new os-talos mixin and bundler preManifestFiles
  • Updated AKS H100 Dynamo to match working cluster state
  • Migrated GB200 kernel-module-params to preManifestFiles
  • Fixed AKS H100 RDMA network operator dependency and metrics

Other Improvements

  • New doc site is now live at docs.nvidia.com/aicr with per-release versioning
  • diff command to help detect configuration drift between recipes and live state
  • Unified file-based config across snapshot, recipe, bundle, and validate to enable easier reproducibility
  • Reliable cluster identity based on snapshot measurements to enable easier over-time correlation
  • storage-class support on bundle command for registry-driven storage-class injection

Supply Chain — New CycloneDX 1.6 BOM generator publishes a per-recipe container image inventory as an in-repo artifact, with strict validation that rejects bare scalar image references missing a tag, digest, or registry host. A growing number of component chart versions now also explicitly digest-pin image references.
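
For readers unfamiliar with digest pinning, the following sketch shows what a sha256 digest-pin check on an image reference could look like. The function name and the example image path are hypothetical; only the `@sha256:` plus 64 hexadecimal characters form is standard for OCI image digests.

```python
import re

# A sha256 digest pin is "@sha256:" followed by exactly 64 lowercase hex characters.
DIGEST_PIN = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_sha256_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends with a sha256 digest."""
    return DIGEST_PIN.search(image_ref) is not None


digest = "a" * 64  # placeholder digest for illustration only

print(is_sha256_pinned(f"registry.example.com/team/app:1.0@sha256:{digest}"))  # True
print(is_sha256_pinned("registry.example.com/team/app:1.0"))                   # False
print(is_sha256_pinned(f"registry.example.com/team/app@sha512:{digest}"))      # False
```

A tag such as `:1.0` can be moved to a different image later, while a digest identifies one specific image, which is why a pinning gate checks for the digest rather than the tag.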

Thanks to @ayuskauskas, @dims, @dtzar, @faganihajizada, @haarchri, @Jont828, @lockwobr, @njhensley, @pdmack, @sanjeevrg89, @xdu31, @yuanchen8911, and @mchmarny.

Changelog

New Features

  • (tools) Add install-rc helper for latest RC binary by @mchmarny
  • (cli) Add --config support to snapshot command by @mchmarny
  • (recipes) Update AKS H100 Dynamo recipe to match working cluster state by @Jont828
  • (bom) Add CycloneDX 1.6 image BOM generator by @mchmarny
  • (ci) Add self-hosted renovate alongside dependabot by @njhensley
  • (recipes) Pin nfd and k8s-ephemeral-storage-metrics chart versions by @mchmarny
  • (bom) Publish container image inventory as a doc artifact by @mchmarny
  • (bundler) Add --storage-class flag for registry-driven injection by @dtzar
  • (recipes) Pin chart versions for NVIDIA-owned components (#748 Phase B) by @mchmarny
  • (recipes) Digest-pin explicit image references by @mchmarny
  • (cli) Unified --config flag for recipe and bundle by @mchmarny
  • (tools) Add s3c supply-chain presence checker by @mchmarny
  • (bundler) URL-portable argocd-helm bundle (#664, #665) by @lockwobr
  • (docs) Add versioned docs dropdown with CI content pinning by @pdmack
  • (tools) Add local Talos cluster + snapshot chainsaw test by @ayuskauskas
  • (fingerprint) Cluster identity projection from snapshot measurements by @njhensley
  • Add support for helm vendoring by @lockwobr
  • (oci) Expose URIScheme constant and Ensure/TrimScheme helpers by @njhensley
  • (cli) Add aicr diff for configuration drift detection by @sanjeevrg89
  • (config) aicr validate --config support by @njhensley
  • (validator) Apply hybrid resource pattern to ValidatorCatalog by @xdu31
  • (recipe) Extract Validation as standalone type with hybrid resource pattern by @xdu31
  • os-talos mixin + bundler preManifestFiles support by @ayuskauskas
  • (flux) Add bundle flux option by @haarchri
  • (evidence) Emit Recipe Evidence v1 bundle from aicr validate by @njhensley
  • (evidence) aicr evidence verify (directory input) by @njhensley
  • (evidence) aicr evidence verify (signed OCI bundles) by @njhensley
  • (recipes) Add deployment validation to GB200/EKS recipes by @njhensley
  • (bundler) Add helmfile deployer by @lockwobr
  • (recipes) Add Slinky slurm-operator as platform-slurm by @faganihajizada

Bug Fixes

  • (validator) Accept pre-release tags as release versions by @mchmarny
  • (bundler) Synthesize GKE ResourceQuota for critical-priority pods by @mchmarny
  • (bundler) Split helmfile bundle into CRD + main sub-helmfiles by @mchmarny
  • (bundler) Wire PreManifestFiles through flux deployer with terminal-aware dependsOn by @yuanchen8911
  • (bundler) Carry localformat createNamespace into helmfile.yaml by @yuanchen8911
  • (ci) Harden Fern docs CI and configure custom domain by @pdmack
  • (docs) Replace bare angle-bracket URL that breaks MDX parser by @pdmack
  • (recipes) Fully-qualify image refs in component manifests by @mchmarny
  • AKS H100 RDMA: set network operator as a dependency and fix chart values/metrics by @Jont828
  • (recipes) Document aws-efa regional ECR override pattern by @mchmarny
  • (bom) Reject bare scalars without tag, digest, or registry host by @mchmarny
  • (validators) Bump aiperf-bench to python:3.13 to clear CVEs by @mchmarny
  • (recipes) Track nri-device-injector by tag, ignore tcpxo image by @njhensley
  • (api) Sync OpenAPI platform enum with Go criteria type by @mchmarny
  • (bundler) Suppress kubectl auth prompt in undeploy.sh post-flight by @mchmarny
  • (fern) Drop https scheme from instances URL by @pdmack
  • (recipes) Migrate GB200 kernel-module-params to preManifestFiles by @mchmarny
  • (validator) Write ValidationInput wire shape to ConfigMap by @njhensley
  • (validator) Make ExtractResult sidecar-safe by reading 'validator' container explicitly by @xdu31
  • (validator) Per-run RBAC names to prevent concurrent-run races by @yuanchen8911
  • (evidence) Fix a regression in cncf ai conformance evidence collection by @yuanchen8911
  • (ci) Populate frozen version content in preview build and surface fern errors by @pdmack
  • (validator) Surface skip reason in CTRF, treat missing constraint as skip by @ayuskauskas
  • (bundler) Stratify helmfile bundle by DAG level by @lockwobr
  • (recipes) Fix stale kgateway-crds path in slinky-slurm-operator-crds comment by @yuanchen8911
  • (recipes) Align overlay network-operator pins to v26.1.1 by @yuanchen8911

Other Tasks

  • (demos) Add config-driven GKE CUJ with evidence verify by @mchmarny
  • Add top level THIRD_PARTY_NOTICES by @ayuskauskas
  • (bom) Wrap auto-generated image inventory with hand-written prose by @mchmarny
  • (recipes) Enforce sha256 specifically in digest-pin gate (CodeRabbit follow-up to #778) by @mchmarny
  • (adr) Add ADR-006 container image pinning policy by @mchmarny
  • (go) .go-version as single source of truth for Go toolchain by @mchmarny
  • (renovate) Hand workflow bumps to dependabot, disable dashboard by @njhensley
  • Update copyright headers to NVIDIA CORPORATION & AFFILIATES by @ayuskauskas
  • Update golang version by @lockwobr
  • (design) Add ADR-007 verifiable recipe test evidence by @njhensley
  • (tests) Use host aicr binary in snapshot deploy-agent test by @pdmack
  • (design) Add ADR-008 KWOK CI deployer matrix ...

v0.12.1

01 May 17:13
eec81c5

Changelog

New Features

Bug Fixes

Other Tasks

v0.12.0

24 Apr 23:31
7db4275

Changelog

New Features

Bug Fixes

v0.11.1

21 Mar 10:22
bc05c6b

Changelog

New Features

Bug Fixes

Other Tasks

v0.11.0

20 Mar 18:41
15d9554

Changelog

New Features

Bug Fixes

Other Tasks

v0.10.16

16 Mar 18:21
e07c9e6

Changelog

Bug Fixes

Other Tasks

  • 06c7428: refactor(validator): unify GKE NCCL to TrainJob+MPI, match EKS pattern (#403) (@xdu31)