Update A3ultra blueprint to use pre-built ACI images #5786

Open
ksaishree wants to merge 1 commit into GoogleCloudPlatform:develop from ksaishree:a3ultra

@ksaishree

Contributor

Update a3ultra blueprint to use pre-built ACI images and remove manual image building modules

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@ksaishree ksaishree requested a review from a team as a code owner June 15, 2026 10:47
@github-actions github-actions Bot added the external PR from external contributor label Jun 15, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request simplifies the A3ultra blueprint by transitioning from manual image building to using pre-built ACI images. It also enhances the cluster environment setup by refining Enroot storage paths, improving permission management for local SSDs, and ensuring critical services like NFS are correctly initialized during startup. These changes improve the maintainability and reliability of the A3ultra deployment process.

Highlights

  • Image Management: Updated the A3ultra blueprint to utilize pre-built ACI images, removing the manual image building modules and associated configuration scripts.
  • Environment Configuration: Standardized Enroot configuration paths to use local SSD storage and added scripts to ensure proper permissions for /mnt/localssd.
  • Dependency Updates: Refactored CUDA and DCGM installation processes to use a more direct runfile-based approach and updated the startup configuration.
  • System Reliability: Added an NFS server restart step to ensure exports are active after Lustre mounts and increased the compute startup script timeout to 1800 seconds.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the A3 Ultra Slurm blueprint to use a pre-built ACI image, removing the custom Packer image build steps. It also configures Enroot on local SSDs, ensures proper local SSD permissions, and adds NFS restart logic. A critical review comment points out that downloading and installing the CUDA toolkit via a runfile at boot time is redundant because the pre-built ACI image already includes CUDA 13.0. This redundant step should be removed to prevent slow boot times and potential rate limits during scaling.

@saara-tyagi27

Contributor

/gcbrun

@saara-tyagi27

Contributor

/gcbrun

@sudheer-quad sudheer-quad added the release-improvements Added to release notes under the "Improvements" heading. label Jun 16, 2026
a3u_cluster_size: # supply cluster size
instance_image:
  project: advanced-compute-images
  family: aci-gpu-u2404-slurm-2511-cuda-130-nvidia-580-amd64

@arpit974 arpit974 Jun 17, 2026

Contributor

Since we are using the aci-...-cuda-130-... image, CUDA 13.0 should already be pre-installed. Is there a specific reason we are re-installing it via the runfile in the startup script?

We are reinstalling CUDA again below. Why? Is it needed?

Contributor Author

Yes, this is intentional. We had to remove certain non-distributable libraries from the base CUDA toolkit installation, as well as the DCGM package. We reinstall them here so that the state remains consistent with previous configurations.

Comment thread tools/cloud-build/daily-tests/blueprints/a3ultra-custom-image-blueprint.yaml Outdated
machine_type: n2-standard-80
controller_startup_script: $(controller_startup.startup_script)
enable_external_prolog_epilog: true
compute_startup_scripts_timeout: 1800

Contributor

Why such a high timeout? From what I remember, the default is 300, right?

Contributor Author

The default timeout of 300 seconds (5 minutes) was insufficient for the nodeset startup script, consistently triggering a subprocess.TimeoutExpired error. This caused the VM setup to abort prematurely, leading Slurm to incorrectly flag the node deployment as failed.
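
As a rough illustration of the failure mode described above (not the actual Slurm or Toolkit code), the sketch below shows how Python's subprocess.run raises subprocess.TimeoutExpired when a command outlives its timeout. The helper name run_startup_script and the sleep-based stand-in script are assumptions made for this example only.

```python
import subprocess
import sys


def run_startup_script(cmd, timeout_s):
    """Run cmd and report whether it finished within timeout_s seconds.

    Hypothetical helper for illustration; not the Toolkit's actual code.
    """
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child once the timeout elapses,
        # so whatever setup the script was doing is cut short.
        return False
    return True


# Stand-in for a slow startup script: it just sleeps for 2 seconds.
slow_script = [sys.executable, "-c", "import time; time.sleep(2)"]

print(run_startup_script(slow_script, timeout_s=0.5))  # timeout too short
print(run_startup_script(slow_script, timeout_s=10))   # enough headroom
```

With the short timeout the helper reports failure even though the script itself is healthy, which mirrors why raising compute_startup_scripts_timeout lets a slow but correct startup script complete.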

a3u_cluster_size: # supply cluster size
instance_image:
  project: advanced-compute-images
  family: aci-gpu-u2404-slurm-2511-cuda-130-nvidia-580-amd64

@saara-tyagi27 saara-tyagi27 Jun 17, 2026

Contributor

The integration test for A3u is failing on this PR with a 403 error because our project's service account lacks the roles/compute.imageUser role in the advanced-compute-images project.
Please grant this permission so Terraform can access the required machine images and complete the deployment.

Contributor Author

We are heading into public preview and the image will become public in a few days, so this access issue should be resolved shortly. I will monitor the rollout and confirm once the permissions are active.

@ksaishree ksaishree force-pushed the a3ultra branch 2 times, most recently from c73861f to 98aeafe Compare June 17, 2026 06:06
@SwarnaBharathiMantena

Contributor

Hi @ksaishree, thanks for your contribution. Can you please make the PR title more descriptive?

@ksaishree ksaishree changed the title from "A3ultra" to "Update A3ultra blueprint to use pre-built ACI images" Jun 17, 2026

Labels

  • external: PR from external contributor
  • release-improvements: Added to release notes under the "Improvements" heading.


5 participants