Merge V1.83.0 into Develop #5313

Closed
sudheer-quad wants to merge 9 commits into develop from main
Conversation

@sudheer-quad

Contributor

Merge V1.83.0 into Develop

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

sudheer-quad requested review from a team and samskillman as code owners on March 5, 2026 12:21
sudheer-quad added the release-chore label (To not include into release notes) on Mar 5, 2026
github-actions Bot added the external label (PR from external contributor) on Mar 5, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request advances the Cluster Toolkit to version 1.83.0 through version bumps across its core components and Terraform modules. It also upgrades the NVIDIA software dependencies used to set up machine learning workloads and fixes a common failure when importing containers from non-interactive shells. In addition, it adjusts network configurations in several GKE examples and updates Cloud Build test parameters to make the tests more reliable and specific.

Highlights

  • Version Update: The Cluster Toolkit version has been updated to v1.83.0 across the main CLI and numerous Terraform modules, reflecting a new release cycle.
  • NVIDIA Driver and CUDA Updates: The ml-slurm-g4.yaml example now uses updated NVIDIA driver (590) and CUDA toolkit (12.8) versions. It also adds a step that removes conflicting Ubuntu NVIDIA firmware and a force-overwrite dpkg option for NVIDIA package installation.
  • Enroot Container Import Fix: Shell scripts for importing PyTorch containers now include logic to correctly set XDG_RUNTIME_DIR for non-interactive environments, improving reliability.
  • Network Configuration Adjustments: Several GKE examples and the Managed Lustre README change the prefix_length for private service access from 24 to 22, which widens the reserved range from 256 to 1,024 IP addresses.
  • Cloud Build Test Configuration: Cloud Build test configurations for GKE and HPC Enterprise Slurm have been updated, including explicit zone/region settings and commenting out dynamic zone detection for stability.
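
The effect of the prefix_length change noted in the Highlights can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the pull request; the function name addresses_for_prefix is hypothetical. It computes how many IPv4 addresses a given prefix length covers.

```shell
#!/bin/sh
# Hypothetical helper: number of IPv4 addresses covered by a CIDR prefix length.
addresses_for_prefix() {
  echo $(( 1 << (32 - $1) ))
}

echo "/24: $(addresses_for_prefix 24) addresses"
echo "/22: $(addresses_for_prefix 22) addresses"
```

A /22 range therefore reserves four times as many addresses as a /24, which matters when the reserved range must not overlap other subnets in the VPC.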
Changelog
  • cmd/root.go
    • Updated the CLI version from v1.82.0 to v1.83.0.
  • community/modules/compute/gke-nodeset/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/gke-partition/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/htcondor-execute-point/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/mig/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/schedmd-slurm-gcp-v6-nodeset-dynamic/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/schedmd-slurm-gcp-v6-nodeset-tpu/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/schedmd-slurm-gcp-v6-nodeset/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/compute/schedmd-slurm-gcp-v6-partition/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/database/slurm-cloudsql-federation/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • community/modules/file-system/nfs-server/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/files/fsi-montecarlo-on-batch/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • community/modules/internal/slurm-gcp/login/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/project/service-enablement/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/pubsub/bigquery-sub/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • community/modules/pubsub/topic/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scheduler/htcondor-access-point/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scheduler/htcondor-central-manager/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scheduler/htcondor-pool-secrets/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scheduler/schedmd-slurm-gcp-v6-controller/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scheduler/schedmd-slurm-gcp-v6-login/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scripts/wait-for-startup/versions.tf
    • Updated the module version to v1.83.0.
  • community/modules/scripts/windows-startup-script/versions.tf
    • Updated the module version to v1.83.0.
  • examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml
    • Updated the commented-out prefix_length setting from 24 to 22.
  • examples/gke-a4/gke-a4.yaml
    • Updated the commented-out prefix_length setting from 24 to 22.
  • examples/gke-a4x/gke-a4x.yaml
    • Updated the commented-out prefix_length setting from 24 to 22.
  • examples/gke-managed-lustre.yaml
    • Updated the prefix_length setting from 24 to 22.
  • examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml
    • Updated the commented-out prefix_length setting from 24 to 22.
  • examples/gke-tpu-v6e/gke-tpu-v6e-advanced.yaml
    • Updated the commented-out prefix_length setting from 24 to 22.
  • examples/machine-learning/a3-megagpu-8g/nccl-tests/import_pytorch_container.sh
    • Added logic to set XDG_RUNTIME_DIR for non-interactive shells.
    • Updated the commented container name from 21.10-py3 to 24.04-py3.
  • examples/machine-learning/a3-ultragpu-8g/nccl-tests/import_pytorch_container.sh
    • Added logic to set XDG_RUNTIME_DIR for non-interactive shells.
  • examples/ml-slurm-g4.yaml
    • Updated NVIDIA driver packages to version 590 and CUDA toolkit to 12.8.
    • Added a step to remove conflicting Ubuntu NVIDIA firmware.
    • Added force-overwrite dpkg option for NVIDIA package installation.
  • modules/compute/gke-node-pool/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/compute/resource-policy/versions.tf
    • Updated the module version to v1.83.0.
  • modules/compute/vm-instance/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/file-system/cloud-storage-bucket/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/file-system/filestore/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/file-system/gke-persistent-volume/versions.tf
    • Updated the module version to v1.83.0.
  • modules/file-system/gke-storage/versions.tf
    • Updated the module version to v1.83.0.
  • modules/file-system/managed-lustre/README.md
    • Updated the prefix_length setting from 24 to 22 in the example configuration.
  • modules/file-system/managed-lustre/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/file-system/netapp-storage-pool/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/file-system/netapp-volume/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/management/kubectl-apply/versions.tf
    • Updated the module version to v1.83.0.
  • modules/monitoring/dashboard/versions.tf
    • Updated the module version to v1.83.0.
  • modules/network/firewall-rules/versions.tf
    • Updated the module version to v1.83.0.
  • modules/network/pre-existing-subnetwork/versions.tf
    • Updated the module version to v1.83.0.
  • modules/network/pre-existing-vpc/versions.tf
    • Updated the module version to v1.83.0.
  • modules/network/private-service-access/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/scheduler/batch-login-node/versions.tf
    • Updated the module version to v1.83.0.
  • modules/scheduler/gke-cluster/versions.tf
    • Updated the module version to v1.83.0 for both google and google-beta providers.
  • modules/scheduler/pre-existing-gke-cluster/versions.tf
    • Updated the module version to v1.83.0.
  • modules/scripts/startup-script/versions.tf
    • Updated the module version to v1.83.0.
  • tools/cloud-build/daily-tests/builds/gke-a2-highgpu-kueue-onspot.yaml
    • Added explicit ZONE and PROVISIONING_MODEL variables.
    • Commented out the dynamic zone detection script.
  • tools/cloud-build/daily-tests/tests/hpc-enterprise-slurm.yml
    • Updated the default zone from europe-west4-c to us-central1-a.
    • Updated the region from europe-west4 to us-central1.
    • Updated the gpu_zones list to include us-central1 zones.
Activity
  • The pull request was created by sudheer-quad with the intention of merging version V1.83.0 into the develop branch.
  • The changes primarily involve updating version strings across various Terraform modules and the main CLI to reflect the new version.
  • Specific updates were made to NVIDIA driver and CUDA toolkit versions in an ML example, along with a fix for container imports.
  • Network configurations and Cloud Build test parameters were also adjusted.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist Bot left a comment

Code Review

This pull request merges version 1.83.0 into the develop branch. It consists mainly of version bumps across module files and updates to example configurations, including changing the prefix_length for private service access from 24 to 22 (a larger reserved range) and fixes for NVIDIA driver installation. A security concern was identified in the import_pytorch_container.sh scripts: the fallback runtime directory is created in /tmp under a predictable name, which could allow interference in multi-user environments. It is recommended to use mktemp -d for secure and unique directory creation, in line with the repository guidelines for secure shell scripting.
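
To make the pre-creation risk concrete, here is a minimal sketch of a check that a script could run before reusing a directory under a predictable name. It is not code from the pull request: the helper name is_private_dir is hypothetical, and the mode check assumes GNU stat (as found on typical Linux images).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a real directory (not a symlink),
# owned by the effective user, with mode 700.
is_private_dir() {
  [ -d "$1" ] || return 1                 # must exist and be a directory
  [ ! -L "$1" ] || return 1               # reject symlinks to directories
  [ -O "$1" ] || return 1                 # must be owned by the effective user
  [ "$(stat -c '%a' "$1")" = "700" ]      # GNU stat: owner-only permissions
}

# Same predictable path as the fallback in the script under review.
candidate="/tmp/enroot-runtime-$(id -u)"
if is_private_dir "$candidate"; then
  echo "safe to reuse: $candidate"
else
  echo "not reusing $candidate: missing, or not a private directory owned by this user"
fi
```

A check like this only detects a directory that someone else prepared in advance; creating a fresh directory with an unpredictable name avoids the problem altogether.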

Comment on lines +27 to +33
if [ ! -d "$XDG_RUNTIME_DIR" ]; then
  # Fallback to a guaranteed writable location in /tmp
  XDG_RUNTIME_DIR="/tmp/enroot-runtime-$(id -u)"
  export XDG_RUNTIME_DIR
  mkdir -p "$XDG_RUNTIME_DIR"
  chmod 700 "$XDG_RUNTIME_DIR"
fi

security-medium medium

The script creates a temporary directory in /tmp using a predictable name based on the user's UID, which is insecure and vulnerable to symlink attacks in a multi-user environment. An attacker could pre-create this directory with insecure permissions to intercept or interfere with the enroot import process. Additionally, the script lacks set -e, meaning it will continue to execute even if chmod 700 fails. It is recommended to use mktemp -d for secure and unique directory creation.

Suggested change
- if [ ! -d "$XDG_RUNTIME_DIR" ]; then
-   # Fallback to a guaranteed writable location in /tmp
-   XDG_RUNTIME_DIR="/tmp/enroot-runtime-$(id -u)"
-   export XDG_RUNTIME_DIR
-   mkdir -p "$XDG_RUNTIME_DIR"
-   chmod 700 "$XDG_RUNTIME_DIR"
- fi
+ if [ ! -d "$XDG_RUNTIME_DIR" ]; then
+   # Fallback to a guaranteed writable location in /tmp
+   XDG_RUNTIME_DIR=$(mktemp -d -t enroot-runtime-$(id -u).XXXXXX)
+   export XDG_RUNTIME_DIR
+ fi
References
  1. In shell scripts, avoid creating temporary directories in /tmp with predictable names and world-writable permissions, especially with sudo, as this is vulnerable to symlink attacks. Prefer using mktemp -d for secure temporary directory creation, or use mkdir with restrictive permissions (e.g., 700) and without sudo.
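
The mktemp -d recommendation can also be wrapped in a small function that reports failure explicitly, which addresses the point about the script continuing after a failed step. This is a sketch under assumptions, not the change that was merged: the function name ensure_runtime_dir is hypothetical, and the template is given as a full path under ${TMPDIR:-/tmp} rather than through the -t option.

```shell
#!/bin/sh
# Hypothetical helper: keep XDG_RUNTIME_DIR if it already names a directory,
# otherwise create a private directory with an unpredictable name.
ensure_runtime_dir() {
  if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ -d "$XDG_RUNTIME_DIR" ]; then
    return 0
  fi
  # mktemp -d creates the directory itself (owner-only permissions under a
  # typical umask) and fails instead of reusing an existing path.
  XDG_RUNTIME_DIR=$(mktemp -d "${TMPDIR:-/tmp}/enroot-runtime.XXXXXX") || return 1
  export XDG_RUNTIME_DIR
}

if ! ensure_runtime_dir; then
  echo "could not create a private runtime directory" >&2
  exit 1
fi
echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"
```

One trade-off to keep in mind: unlike the fixed per-UID path, each run that takes the fallback creates a new directory, so the caller may want to remove it when the import finishes.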
