Update A3ultra blueprint to use pre-built ACI images #5786
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request simplifies the A3ultra blueprint by transitioning from manual image building to using pre-built ACI images. It also enhances the cluster environment setup by refining Enroot storage paths, improving permission management for local SSDs, and ensuring critical services like NFS are correctly initialized during startup. These changes improve the maintainability and reliability of the A3ultra deployment process.
Code Review
This pull request updates the A3 Ultra Slurm blueprint to use a pre-built ACI image, removing the custom Packer image build steps. It also configures Enroot on local SSDs, ensures proper local SSD permissions, and adds NFS restart logic. A critical review comment points out that downloading and installing the CUDA toolkit via a runfile at boot time is redundant because the pre-built ACI image already includes CUDA 13.0. This redundant step should be removed to prevent slow boot times and potential rate limits during scaling.
/gcbrun
/gcbrun
a3u_cluster_size: # supply cluster size
instance_image:
  project: advanced-compute-images
  family: aci-gpu-u2404-slurm-2511-cuda-130-nvidia-580-amd64
Since we are using the aci-...-cuda-130-... image, CUDA 13.0 should already be pre-installed. Is there a specific reason we are reinstalling it via the runfile in the startup script below? Is it needed?
Yes, this is intentional. We had to remove certain non-distributable libraries from the base CUDA toolkit installation, as well as the DCGM package. We are reinstalling them here so that the state remains consistent with previous configurations.
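To illustrate the trade-off raised in this thread (reinstalling at boot versus boot time), here is a minimal Python sketch of a guard that reports which expected files are absent, so a reinstall could be skipped on nodes where the toolkit is already complete. The function name `missing_components`, the placeholder file names, and the temporary directory standing in for the CUDA install root are all assumptions for illustration; the PR does not say which libraries were removed or how the startup script checks for them.

```python
import tempfile
from pathlib import Path


def missing_components(install_root, required):
    """Return the entries of `required` (relative paths) absent under install_root."""
    root = Path(install_root)
    return [rel for rel in required if not (root / rel).exists()]


# Hypothetical file names; the real list of removed libraries is not given in the PR.
REQUIRED = ["lib64/placeholder_a.so", "lib64/placeholder_b.so"]

# Demo against a throwaway directory standing in for the CUDA install root.
with tempfile.TemporaryDirectory() as fake_root:
    lib_dir = Path(fake_root) / "lib64"
    lib_dir.mkdir()
    (lib_dir / "placeholder_a.so").touch()
    demo_missing = missing_components(fake_root, REQUIRED)

if demo_missing:
    print("reinstall needed, missing:", demo_missing)
else:
    print("toolkit complete, skipping reinstall")
```

Whether such a guard is appropriate depends on how the removed libraries are packaged; an unconditional reinstall, as in this PR, is simpler and always yields the same state.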
machine_type: n2-standard-80
controller_startup_script: $(controller_startup.startup_script)
enable_external_prolog_epilog: true
compute_startup_scripts_timeout: 1800
Why such a high timeout? From what I remember, the default is 300, right?
The default timeout of 300 seconds (5 minutes) was insufficient for the nodeset startup script, consistently triggering a subprocess.TimeoutExpired error. This caused the VM setup to abort prematurely, leading Slurm to incorrectly flag the node deployment as failed.
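The failure mode described above can be reproduced in isolation. The sketch below assumes the startup script is launched through Python's `subprocess.run` with a `timeout` argument (the thread names only the `subprocess.TimeoutExpired` exception, not the call site); the helper name `run_startup_script` and the two-second sleep standing in for a slow startup script are hypothetical.

```python
import subprocess
import sys


def run_startup_script(cmd, timeout_s):
    """Run cmd and report 'ok', 'failed', or 'timeout'."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not exited within timeout_s seconds.
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


# Stand-in for a slow startup script: sleeps for two seconds.
slow_script = [sys.executable, "-c", "import time; time.sleep(2)"]

print(run_startup_script(slow_script, timeout_s=0.5))  # budget too small
print(run_startup_script(slow_script, timeout_s=10))   # budget large enough
```

The same script succeeds or is aborted depending only on the timeout budget, which is why raising `compute_startup_scripts_timeout` resolves the error without changing the script itself.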
a3u_cluster_size: # supply cluster size
instance_image:
  project: advanced-compute-images
  family: aci-gpu-u2404-slurm-2511-cuda-130-nvidia-580-amd64
The integration test for A3u is failing on this PR with a 403 error because our project's service account lacks the roles/compute.imageUser role in the advanced-compute-images project.
Please grant this permission so Terraform can access the required machine images and complete the deployment.
We are heading into public preview and the image will become public in a few days, so this access issue should be resolved shortly. I will monitor the rollout and confirm once the permissions are active.
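For reference, the grant requested in this thread would typically be a project-level IAM binding on the image project, applied by someone with permission to administer it. The Python sketch below only assembles and prints the corresponding `gcloud projects add-iam-policy-binding` command; the helper name `image_user_binding_cmd` and the service account address are hypothetical placeholders, and nothing here is taken from the PR beyond the project and role names.

```python
import shlex


def image_user_binding_cmd(image_project, service_account):
    """Build the gcloud command that grants roles/compute.imageUser."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", image_project,
        f"--member=serviceAccount:{service_account}",
        "--role=roles/compute.imageUser",
    ]


# Placeholder service account; substitute the account used by the test project.
cmd = image_user_binding_cmd(
    "advanced-compute-images",
    "ci-runner@example-project.iam.gserviceaccount.com",
)
print(shlex.join(cmd))
```

Once the image family is public, as the reply above anticipates, no per-project binding should be needed.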
c73861f to 98aeafe
…move manual image building modules
Hi @ksaishree, thanks for your contribution. Can you please make the PR title more descriptive?
Update a3ultra blueprint to use pre-built ACI images and remove manual image building modules
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.