Merge V1.83.0 into Develop by sudheer-quad · Pull Request #5313 · GoogleCloudPlatform/cluster-toolkit

sudheer-quad · 2026-03-05T12:21:32Z

Merge V1.83.0 into Develop

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cluster Toolkit Contribution guidelines


gemini-code-assist · 2026-03-05T12:22:24Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily serves to advance the HPC Toolkit to version 1.83.0, incorporating a series of version bumps across its core components and Terraform modules. Beyond this foundational update, it refines the environment setup for machine learning workloads by upgrading NVIDIA software dependencies and addresses a common issue with container imports in non-interactive shells. Additionally, network configurations in several GKE examples have been adjusted, and Cloud Build test parameters have been updated to enhance reliability and specificity.

Highlights

Version Update: The HPC Toolkit version has been updated to v1.83.0 across the main CLI and numerous Terraform modules, reflecting a new release cycle.
NVIDIA Driver and CUDA Updates: The ml-slurm-g4.yaml example now uses updated NVIDIA driver (590) and CUDA toolkit (12.8) versions, including a fix for conflicting firmware and force-overwrite options during installation.
Enroot Container Import Fix: Shell scripts for importing PyTorch containers now include logic to correctly set XDG_RUNTIME_DIR for non-interactive environments, improving reliability.
Network Configuration Adjustments: Several GKE examples and the Managed Lustre README have updated the prefix_length for private service access from 24 to 22, potentially affecting IP address range allocation.
Cloud Build Test Configuration: Cloud Build test configurations for GKE and HPC Enterprise Slurm have been updated, including explicit zone/region settings and commenting out dynamic zone detection for stability.
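To make the prefix_length change concrete: a smaller prefix length means a larger range, so a /24 holds 256 IPv4 addresses while a /22 holds 1024. The following is a minimal sketch of that arithmetic; the helper name `addresses_for_prefix` is hypothetical and not part of the Toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Toolkit): number of IPv4 addresses
# in a CIDR range with the given prefix length.
addresses_for_prefix() {
	local prefix="$1"
	echo $(( 1 << (32 - prefix) ))
}

echo "/24 -> $(addresses_for_prefix 24) addresses"
echo "/22 -> $(addresses_for_prefix 22) addresses"
```

Going from 24 to 22 therefore quadruples the reserved private service access range, which matters if the surrounding VPC address plan is tight.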

Changelog

cmd/root.go
- Updated the CLI version from v1.82.0 to v1.83.0.
community/modules/compute/gke-nodeset/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/gke-partition/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/htcondor-execute-point/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/mig/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/schedmd-slurm-gcp-v6-nodeset-dynamic/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/schedmd-slurm-gcp-v6-nodeset-tpu/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/schedmd-slurm-gcp-v6-nodeset/versions.tf
- Updated the module version to v1.83.0.
community/modules/compute/schedmd-slurm-gcp-v6-partition/versions.tf
- Updated the module version to v1.83.0.
community/modules/database/slurm-cloudsql-federation/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
community/modules/file-system/nfs-server/versions.tf
- Updated the module version to v1.83.0.
community/modules/files/fsi-montecarlo-on-batch/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
community/modules/internal/slurm-gcp/login/versions.tf
- Updated the module version to v1.83.0.
community/modules/project/service-enablement/versions.tf
- Updated the module version to v1.83.0.
community/modules/pubsub/bigquery-sub/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
community/modules/pubsub/topic/versions.tf
- Updated the module version to v1.83.0.
community/modules/scheduler/htcondor-access-point/versions.tf
- Updated the module version to v1.83.0.
community/modules/scheduler/htcondor-central-manager/versions.tf
- Updated the module version to v1.83.0.
community/modules/scheduler/htcondor-pool-secrets/versions.tf
- Updated the module version to v1.83.0.
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/versions.tf
- Updated the module version to v1.83.0.
community/modules/scheduler/schedmd-slurm-gcp-v6-login/versions.tf
- Updated the module version to v1.83.0.
community/modules/scripts/wait-for-startup/versions.tf
- Updated the module version to v1.83.0.
community/modules/scripts/windows-startup-script/versions.tf
- Updated the module version to v1.83.0.
examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml
- Updated the commented-out prefix_length setting from 24 to 22.
examples/gke-a4/gke-a4.yaml
- Updated the commented-out prefix_length setting from 24 to 22.
examples/gke-a4x/gke-a4x.yaml
- Updated the commented-out prefix_length setting from 24 to 22.
examples/gke-managed-lustre.yaml
- Updated the prefix_length setting from 24 to 22.
examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml
- Updated the commented-out prefix_length setting from 24 to 22.
examples/gke-tpu-v6e/gke-tpu-v6e-advanced.yaml
- Updated the commented-out prefix_length setting from 24 to 22.
examples/machine-learning/a3-megagpu-8g/nccl-tests/import_pytorch_container.sh
- Added logic to set XDG_RUNTIME_DIR for non-interactive shells.
- Updated the commented container name from 21.10-py3 to 24.04-py3.
examples/machine-learning/a3-ultragpu-8g/nccl-tests/import_pytorch_container.sh
- Added logic to set XDG_RUNTIME_DIR for non-interactive shells.
examples/ml-slurm-g4.yaml
- Updated NVIDIA driver packages to version 590 and CUDA toolkit to 12.8.
- Added a step to remove conflicting Ubuntu NVIDIA firmware.
- Added force-overwrite dpkg option for NVIDIA package installation.
modules/compute/gke-node-pool/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/compute/resource-policy/versions.tf
- Updated the module version to v1.83.0.
modules/compute/vm-instance/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/file-system/cloud-storage-bucket/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/file-system/filestore/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/file-system/gke-persistent-volume/versions.tf
- Updated the module version to v1.83.0.
modules/file-system/gke-storage/versions.tf
- Updated the module version to v1.83.0.
modules/file-system/managed-lustre/README.md
- Updated the prefix_length setting from 24 to 22 in the example configuration.
modules/file-system/managed-lustre/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/file-system/netapp-storage-pool/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/file-system/netapp-volume/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/management/kubectl-apply/versions.tf
- Updated the module version to v1.83.0.
modules/monitoring/dashboard/versions.tf
- Updated the module version to v1.83.0.
modules/network/firewall-rules/versions.tf
- Updated the module version to v1.83.0.
modules/network/pre-existing-subnetwork/versions.tf
- Updated the module version to v1.83.0.
modules/network/pre-existing-vpc/versions.tf
- Updated the module version to v1.83.0.
modules/network/private-service-access/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/scheduler/batch-login-node/versions.tf
- Updated the module version to v1.83.0.
modules/scheduler/gke-cluster/versions.tf
- Updated the module version to v1.83.0 for both google and google-beta providers.
modules/scheduler/pre-existing-gke-cluster/versions.tf
- Updated the module version to v1.83.0.
modules/scripts/startup-script/versions.tf
- Updated the module version to v1.83.0.
tools/cloud-build/daily-tests/builds/gke-a2-highgpu-kueue-onspot.yaml
- Added explicit ZONE and PROVISIONING_MODEL variables.
- Commented out the dynamic zone detection script.
tools/cloud-build/daily-tests/tests/hpc-enterprise-slurm.yml
- Updated the default zone from europe-west4-c to us-central1-a.
- Updated the region from europe-west4 to us-central1.
- Updated the gpu_zones list to include us-central1 zones.
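The examples/ml-slurm-g4.yaml entries above describe two steps: remove a conflicting firmware package, then install the driver and CUDA packages with dpkg's force-overwrite option. A dry-run sketch of that sequence might look as follows. It only prints the commands, and every package name in it is an illustrative assumption, not taken from the blueprint.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running them.
# All package names below are hypothetical placeholders, not the ones
# used in examples/ml-slurm-g4.yaml.
DRIVER_PKG="example-nvidia-driver-590"
CUDA_PKG="example-cuda-toolkit-12-8"
FIRMWARE_PKG="example-conflicting-firmware"

plan_install() {
	# 1. Remove the distro firmware package that conflicts with the new driver.
	echo "apt-get remove -y ${FIRMWARE_PKG}"
	# 2. Install driver and CUDA, letting dpkg overwrite files that are
	#    still owned by another package.
	echo "apt-get install -y -o Dpkg::Options::=--force-overwrite ${DRIVER_PKG} ${CUDA_PKG}"
}

plan_install
```

Passing the option through `-o Dpkg::Options::=...` scopes the overwrite behavior to this one apt-get invocation instead of changing dpkg's configuration system-wide.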

Activity

The pull request was created by sudheer-quad with the intention of merging version V1.83.0 into the develop branch.
The changes primarily involve updating version strings across various Terraform modules and the main CLI to reflect the new version.
Specific updates were made to NVIDIA driver and CUDA toolkit versions in an ML example, along with a fix for container imports.
Network configurations and Cloud Build test parameters were also adjusted.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request merges version 1.83.0 into the develop branch, primarily consisting of version bumps across various module files and updates to example configurations, including increasing the prefix_length for private service access and fixes for NVIDIA driver installation. A security concern was identified in the import_pytorch_container.sh scripts, where insecure temporary directory creation in /tmp could lead to potential interference in multi-user environments. It is recommended to use mktemp -d for secure and unique directory creation, aligning with repository guidelines for secure shell scripting practices.

gemini-code-assist · 2026-03-05T12:26:44Z

+	if [ ! -d "$XDG_RUNTIME_DIR" ]; then
+		# Fallback to a guaranteed writable location in /tmp
+		XDG_RUNTIME_DIR="/tmp/enroot-runtime-$(id -u)"
+		export XDG_RUNTIME_DIR
+		mkdir -p "$XDG_RUNTIME_DIR"
+		chmod 700 "$XDG_RUNTIME_DIR"
+	fi


The script creates a temporary directory in /tmp using a predictable name based on the user's UID, which is insecure and vulnerable to symlink attacks in a multi-user environment. An attacker could pre-create this directory with insecure permissions to intercept or interfere with the enroot import process. Additionally, the script lacks set -e, meaning it will continue to execute even if chmod 700 fails. It is recommended to use mktemp -d for secure and unique directory creation.

Suggested change

-	if [ ! -d "$XDG_RUNTIME_DIR" ]; then
-		# Fallback to a guaranteed writable location in /tmp
-		XDG_RUNTIME_DIR="/tmp/enroot-runtime-$(id -u)"
-		export XDG_RUNTIME_DIR
-		mkdir -p "$XDG_RUNTIME_DIR"
-		chmod 700 "$XDG_RUNTIME_DIR"
-	fi
+	if [ ! -d "$XDG_RUNTIME_DIR" ]; then
+		# Fallback to a guaranteed writable location in /tmp
+		XDG_RUNTIME_DIR=$(mktemp -d -t enroot-runtime-$(id -u).XXXXXX)
+		export XDG_RUNTIME_DIR
+	fi

References

In shell scripts, avoid creating temporary directories in /tmp with predictable names and world-writable permissions, especially with sudo, as this is vulnerable to symlink attacks. Prefer using mktemp -d for secure temporary directory creation, or use mkdir with restrictive permissions (e.g., 700) and without sudo.
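As a self-contained illustration of the mktemp -d recommendation, the sketch below wraps the fallback in a function and also handles an unset XDG_RUNTIME_DIR. The function name `ensure_runtime_dir` and the use of TMPDIR are assumptions made for this example, not part of the scripts in the pull request.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch: fall back to a private, uniquely named runtime directory when
# XDG_RUNTIME_DIR is unset or does not point at an existing directory.
# mktemp -d creates the directory with owner-only permissions and exits
# non-zero if it cannot create it.
ensure_runtime_dir() {
	if [ -z "${XDG_RUNTIME_DIR:-}" ] || [ ! -d "${XDG_RUNTIME_DIR}" ]; then
		XDG_RUNTIME_DIR="$(mktemp -d "${TMPDIR:-/tmp}/enroot-runtime-$(id -u).XXXXXX")"
		export XDG_RUNTIME_DIR
	fi
}

ensure_runtime_dir
echo "XDG_RUNTIME_DIR=${XDG_RUNTIME_DIR}"
```

Because the suffix is random, each invocation that takes the fallback gets a fresh directory. If nothing needs to persist between runs, a `trap` that removes the directory on exit may be appropriate.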

gemini-code-assist · 2026-03-05T12:26:44Z

+	if [ ! -d "$XDG_RUNTIME_DIR" ]; then
+		# Fallback to a guaranteed writable location in /tmp
+		XDG_RUNTIME_DIR="/tmp/enroot-runtime-$(id -u)"
+		export XDG_RUNTIME_DIR
+		mkdir -p "$XDG_RUNTIME_DIR"
+		chmod 700 "$XDG_RUNTIME_DIR"
+	fi


The script creates a temporary directory in /tmp using a predictable name based on the user's UID, which is insecure and vulnerable to symlink attacks in a multi-user environment. An attacker could pre-create this directory with insecure permissions to intercept or interfere with the enroot import process. Additionally, the script lacks set -e, meaning it will continue to execute even if chmod 700 fails. It is recommended to use mktemp -d for secure and unique directory creation.

Suggested change

-	if [ ! -d "$XDG_RUNTIME_DIR" ]; then
-		# Fallback to a guaranteed writable location in /tmp
-		XDG_RUNTIME_DIR="/tmp/enroot-runtime-$(id -u)"
-		export XDG_RUNTIME_DIR
-		mkdir -p "$XDG_RUNTIME_DIR"
-		chmod 700 "$XDG_RUNTIME_DIR"
-	fi
+	if [ ! -d "$XDG_RUNTIME_DIR" ]; then
+		# Fallback to a guaranteed writable location in /tmp
+		XDG_RUNTIME_DIR=$(mktemp -d -t enroot-runtime-$(id -u).XXXXXX)
+		export XDG_RUNTIME_DIR
+	fi

References

In shell scripts, avoid creating temporary directories in /tmp with predictable names and world-writable permissions, especially with sudo, as this is vulnerable to symlink attacks. Prefer using mktemp -d for secure temporary directory creation, or use mkdir with restrictive permissions (e.g., 700) and without sudo.
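The reference also allows mkdir with restrictive permissions. If a stable per-user path is preferred over a random one, one possible sketch is to create the directory with mode 700 in a single step and, when the path already exists, accept it only if it is a real directory owned by the current user. The function name `ensure_stable_runtime_dir` is hypothetical, and this is an illustration rather than the change proposed in the review.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of the mkdir alternative: keep a predictable per-user path, but
# refuse to use it if someone else may have planted it.
ensure_stable_runtime_dir() {
	local dir="${1:-/tmp/enroot-runtime-$(id -u)}"
	# mkdir -m 700 (without -p) creates the directory with owner-only
	# permissions and fails if the path already exists.
	if ! mkdir -m 700 "$dir" 2>/dev/null; then
		# Existing path: accept only a non-symlink directory that we own.
		if [ -L "$dir" ] || [ ! -d "$dir" ] || [ ! -O "$dir" ]; then
			echo "refusing to use untrusted path: $dir" >&2
			return 1
		fi
		chmod 700 "$dir" || return 1
	fi
	XDG_RUNTIME_DIR="$dir"
	export XDG_RUNTIME_DIR
}

ensure_stable_runtime_dir
echo "XDG_RUNTIME_DIR=${XDG_RUNTIME_DIR}"
```

The trade-off against mktemp -d is that the path is reusable across invocations, at the cost of having to validate a pre-existing directory before trusting it.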

sudheer-quad and others added 9 commits February 24, 2026 08:54

Increase version to 1.83.0

0708266

Update Toolkit release to v1.83.0 (#5275)

e790113

Hotfix: update G4 NVIDIA drivers for kernel 6.17 compatibility (#5289)

c8fa496

Hotfix: Update the region and zone(s) to overcome Lustre instance creation issue (#5290)

80f0b3b

hardcode zone

cab05f9

Hardcode zone in a2high PR test to fix test failures (#5305)

44196de

Modifying prefix_length for PSA to accomodate sufficient IPs for peering (#5306)

8f655b4

fix: Update a3m and a3u script to resolve slurm nccl test failure (#5308)

8eb5de0

Release candidate : v1.83.0 (#5277)

09301db

sudheer-quad requested review from a team and samskillman as code owners March 5, 2026 12:21

sudheer-quad added the release-chore label (To not include into release notes) Mar 5, 2026

github-actions Bot added the external label (PR from external contributor) Mar 5, 2026

gemini-code-assist Bot reviewed Mar 5, 2026


sudheer-quad closed this Mar 5, 2026

sudheer-quad reopened this Mar 5, 2026

sudheer-quad closed this Mar 6, 2026

sudheer-quad wants to merge 9 commits into develop from main


5 participants
