Feat(pathways): Align GCluster JobSet with XPK production defaults #5849

Draft
SwarnaBharathiMantena wants to merge 7 commits into GoogleCloudPlatform:develop from SwarnaBharathiMantena:swarnabm/pathways-jobset-alignment

Conversation

@SwarnaBharathiMantena commented Jun 25, 2026

Aligns the Kubernetes JobSet manifests generated by GCluster with XPK and GKE Pathways standards to ensure reliable execution of distributed JAX workloads.

  • Injected JAX proxy environment variables (JAX_PLATFORMS, JAX_BACKEND_TARGET, XCLOUD_ENVIRONMENT) into the JAX workload container.
  • Added host-path volume mount for /tmp to enable shared-memory and local socket IPC between the JAX client, Proxy, and Resource Manager.
  • Enabled privileged security context (privileged: true) on the JAX container to allow host network binding and physical memory locking.
  • Added default resource limits (cpu: "24", memory: "100Gi") to the JAX workload container to prevent CPU node starvation.
  • Wrapped the user command in a SIGTERM-propagating bash trap to ensure reliable checkpoint-on-preemption during Spot VM evictions.
  • Stamped exclusive-topology annotations on the worker replicated job to force contiguous scheduling on GKE TPU node pools.
  • Natively injected ALTS bypass environment variables to prevent secure gRPC handshake failures on standard GKE VPC networks.
  • Propagated priorityClassName to the head coordinator pod spec.
  • Updated GKE orchestrator unit tests to assert all new configurations.

🔄 Revisions during Review

  • Optimized SIGTERM Propagation: Refactored the Spot VM preemption trap in the bash wrapper to use a direct, instant wait $PID instead of a 5-second polling loop, ensuring maximum grace period availability.
  • Optimized Image Pulls (XPK Alignment): Removed the hardcoded imagePullPolicy: Always from the workload container to allow Kubernetes to default to IfNotPresent. This prevents unnecessary multi-gigabyte image downloads on container restarts and aligns GCluster's output directly with XPK.
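
For reviewers who want to see the shape of the trap described above, here is a minimal sketch of a Go helper that renders such a wrapper. It is not the PR's actual template: the names sigtermWrapperLines and wrapWithSigtermTrap, the subshell grouping, the PID guard, and the "python3 train.py" command are all assumptions made for illustration. It relies on documented bash behavior: a trapped signal interrupts the wait builtin immediately, so no polling interval is needed.

```go
package main

import (
	"fmt"
	"strings"
)

// sigtermWrapperLines returns the lines of a bash wrapper that runs userCmd in
// the background and forwards SIGTERM to it. Hypothetical helper, not the PR's
// actual template.
func sigtermWrapperLines(userCmd string) []string {
	return []string{
		"_sigterm() {",
		// Guard: the signal may arrive before the child has been started.
		`  if [ -n "$PID" ]; then`,
		`    kill -TERM "$PID" 2>/dev/null`,
		// Block until the child has finished its shutdown work.
		`    wait "$PID"`,
		"  fi",
		"}",
		"trap _sigterm TERM",
		// A subshell keeps a multi-line user command grouped as one background job.
		"(",
		userCmd,
		") &",
		"PID=$!",
		// Direct wait: bash interrupts it as soon as the trapped signal arrives,
		// so no polling interval eats into the grace period.
		`wait "$PID"`,
	}
}

// wrapWithSigtermTrap joins the wrapper lines into a single script string.
func wrapWithSigtermTrap(userCmd string) string {
	return strings.Join(sigtermWrapperLines(userCmd), "\n") + "\n"
}

func main() {
	// "python3 train.py" is a placeholder command.
	fmt.Print(wrapWithSigtermTrap("python3 train.py"))
}
```

The rendered script would be passed to bash -c as the container command. Whether the real template groups the user command in a subshell or guards on $PID is not shown in this description, so treat both as defensive choices of the sketch.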

@SwarnaBharathiMantena added the release-improvements label (Added to release notes under the "Improvements" heading.) on Jun 25, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aligns GCluster's JobSet manifests with XPK and GKE Pathways production defaults. The changes focus on hardening the JAX workload container, improving scheduling efficiency, and ensuring reliable execution and checkpointing in distributed environments.

Highlights

  • Environment Configuration: Injected essential JAX proxy environment variables and ALTS bypass settings to ensure compatibility with GKE Pathways and XPK standards.
  • Resource and Security Hardening: Enabled privileged security context, added default resource limits, and mounted /tmp for shared memory to improve workload stability.
  • Lifecycle Management: Implemented a SIGTERM-propagating bash trap to handle graceful shutdowns during Spot VM evictions and added exclusive-topology annotations for better scheduling.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the GKE Jobset template for Pathways, introducing priority class names, exclusive topology annotations, additional environment variables, and a SIGTERM-propagating trap wrapper for the workload container. Feedback on these changes highlights a critical issue in the SIGTERM trap implementation, which introduces up to a 5-second delay during Spot VM evictions due to a polling loop; refactoring to use wait $PID directly is recommended. Additionally, the reviewer advises against hardcoding imagePullPolicy: Always on several containers to avoid increased startup latency and registry rate-limiting, and notes that the corresponding unit tests should be updated to match the corrected SIGTERM pattern.

Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl
Comment thread pkg/orchestrator/gke/gke_job_orchestrator_test.go Outdated
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl Outdated
@SwarnaBharathiMantena

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the GKE job orchestrator to algorithmically derive the Pathways platform key from GKE accelerator labels and enhances the JobSet template with support for priority classes, resource limits, signal handling, and exclusive topology annotations. The review feedback highlights a bug in the platform key derivation algorithm that fails to fully strip the 'tpu' prefix (e.g., producing 'tpuv6e' instead of 'v6e') and recommends removing the hardcoded imagePullPolicy: Always from the pathways containers to prevent startup delays and registry rate-limiting issues.
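
One caveat on the imagePullPolicy recommendation: dropping the field hands the choice to Kubernetes' defaulting rule, and that rule yields IfNotPresent only when the image is pinned to a digest or to a tag other than latest. The sketch below approximates the documented rule in Go. The function name defaultPullPolicy and the image references are illustrative assumptions, and the parsing is deliberately simplified compared with the real image-name parser.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultPullPolicy approximates the policy Kubernetes assigns when a
// container omits imagePullPolicy. Simplified sketch, not the real parser.
func defaultPullPolicy(image string) string {
	// Split off an optional "@sha256:..." digest.
	name, _, hasDigest := strings.Cut(image, "@")

	// A colon only introduces a tag if it comes after the last slash;
	// otherwise it belongs to a registry port such as "localhost:5000".
	tag := ""
	lastSlash := strings.LastIndex(name, "/")
	if i := strings.LastIndex(name, ":"); i > lastSlash {
		tag = name[i+1:]
	}

	switch {
	case tag == "latest":
		return "Always"
	case tag == "" && !hasDigest:
		// No tag and no digest is treated like :latest.
		return "Always"
	default:
		return "IfNotPresent"
	}
}

func main() {
	// Placeholder image references, not taken from the PR.
	for _, img := range []string{
		"us-docker.pkg.dev/example/jax-workload:v1.2.3",
		"us-docker.pkg.dev/example/jax-workload:latest",
		"us-docker.pkg.dev/example/jax-workload",
	} {
		fmt.Printf("%s -> %s\n", img, defaultPullPolicy(img))
	}
}
```

If the workload image is referenced by a latest tag or with no tag at all, removing the explicit policy would not change pull behavior, so pinning a version tag or digest is what actually avoids the repeated downloads.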

Comment thread pkg/orchestrator/gke/manifest_generator.go
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl Outdated
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl Outdated
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl Outdated
@SwarnaBharathiMantena

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the GKE Job Orchestrator to algorithmically derive the Pathways platform key from GKE accelerator labels and enhances the pathways jobset template with priority class names, environment variables, security contexts, resource limits, and SIGTERM handling. Feedback on these changes highlights two critical issues: first, the platform key derivation logic fails to strip the 'tpu-' prefix, which will cause topology lookup failures; second, specifying high resource limits without explicit requests for the workload container may lead to unschedulable pods on standard coordinator nodes.
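
The scheduling concern follows from how Kubernetes fills in missing requests: when a container sets a limit for a resource but no request, and no admission-time default applies, the limit is copied as the request. A minimal Go sketch of that rule follows. The resources type and effectiveRequests function are hypothetical, and the explicit request values in main are placeholders rather than values from the PR.

```go
package main

import "fmt"

// resources mirrors the requests/limits pair of a container spec, using plain
// strings for quantities to keep the sketch dependency-free.
type resources struct {
	Requests map[string]string
	Limits   map[string]string
}

// effectiveRequests returns the requests the scheduler would see: explicit
// requests win, and any resource with only a limit inherits that limit.
func effectiveRequests(r resources) map[string]string {
	out := make(map[string]string, len(r.Limits))
	for name, limit := range r.Limits {
		out[name] = limit
	}
	for name, req := range r.Requests {
		out[name] = req
	}
	return out
}

func main() {
	limitsOnly := resources{
		Limits: map[string]string{"cpu": "24", "memory": "100Gi"},
	}
	// With limits only, the pod needs a node with 24 CPUs and 100Gi free.
	fmt.Println(effectiveRequests(limitsOnly))

	withRequests := resources{
		// Placeholder request values for illustration.
		Requests: map[string]string{"cpu": "4", "memory": "16Gi"},
		Limits:   map[string]string{"cpu": "24", "memory": "100Gi"},
	}
	fmt.Println(effectiveRequests(withRequests))
}
```

Under this rule, the defaults of cpu: "24" and memory: "100Gi" act as scheduling requirements unless smaller requests are stated explicitly, which is why a coordinator pod could stay pending on a small node pool.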

Comment thread pkg/orchestrator/gke/manifest_generator.go
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl
@Neelabh94

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the GKE orchestrator to algorithmically derive the Pathways platform key from GKE accelerator labels and enhances the Pathways JobSet template with resource limits, environment variables, a shared /tmp volume, and a SIGTERM preemption trap. Feedback highlights two key issues: first, the derived pathwaysPlatform needs to strip the 'tpu' prefix to prevent runtime lookup failures in the Pathways server; second, the _sigterm trap in the JobSet template should verify that $PID is set before executing the kill command to avoid bash runtime errors during early container startup.

Comment thread pkg/orchestrator/gke/manifest_generator.go
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl

@Neelabh94 Neelabh94 left a comment

While this PR addresses most of the gaps, two key issues still need to be resolved before it can be considered complete:

  1. TPU v5p Platform Mapping Bug
    The algorithmic derivation for pathwaysInstanceType in pkg/orchestrator/gke/manifest_generator.go (lines 45–47):

        pathwaysPlatform := strings.ReplaceAll(normalizedLabel, "-podslice", "")
        pathwaysPlatform = strings.ReplaceAll(pathwaysPlatform, "-slice", "")
        pathwaysPlatform = strings.ReplaceAll(pathwaysPlatform, "-", "")

    will map the GKE label tpu-v5p-slice to tpuv5p.

The Issue: JAX/Pathways does not recognize tpuv5p. It expects tpuv5 for TPU v5p slice architectures (as verified in XPK's system_characteristics.py mapping).

Fix: Add a normalization step to convert tpuv5p to tpuv5.

  2. Redundant Volume Mount in pathways-rm Container
    The PR prepends the shared-tmp volume mount at /tmp to workload-container (which is correct), but it fails to remove the redundant /tmp volume mount from the pathways-rm container in pkg/orchestrator/gke/templates/pathways_jobset.tmpl.

The Issue: XPK does not mount /tmp in pathways-rm (only the workload and worker containers require access to the shared temp directory).

Fix: Remove the volumeMounts section from the pathways-rm container block.
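
To make the first fix concrete, here is a minimal sketch of the derivation with the extra normalization step applied. It reuses the three ReplaceAll calls quoted above. The function name derivePathwaysPlatform is hypothetical, and the lowercase-and-trim step is only an assumption about how normalizedLabel is produced, since that code is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// derivePathwaysPlatform maps a GKE accelerator label to a Pathways platform
// key. Hypothetical helper that follows the review comment, not the PR's code.
func derivePathwaysPlatform(acceleratorLabel string) string {
	// Assumption: normalization means lowercasing and trimming whitespace.
	normalized := strings.ToLower(strings.TrimSpace(acceleratorLabel))

	// Same three replacements as quoted from manifest_generator.go.
	platform := strings.ReplaceAll(normalized, "-podslice", "")
	platform = strings.ReplaceAll(platform, "-slice", "")
	platform = strings.ReplaceAll(platform, "-", "")

	// Proposed fix: v5p slices are expected under the key "tpuv5".
	if platform == "tpuv5p" {
		platform = "tpuv5"
	}
	return platform
}

func main() {
	fmt.Println(derivePathwaysPlatform("tpu-v5p-slice"))
	fmt.Println(derivePathwaysPlatform("tpu-v6e-slice"))
}
```

An exact-match check is used rather than a substring replacement so that other labels containing "v5p" as part of a longer key would not be rewritten by accident.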

@SwarnaBharathiMantena

Thanks for suggesting two important fixes here, @Neelabh94. I updated the code.

@SwarnaBharathiMantena

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the GKE Job Orchestrator to dynamically derive Pathways platform keys from GKE accelerator labels, adds support for priority class names, and configures environment variables, resources, and signal trapping in the Pathways JobSet template. The review feedback highlights two key issues: wrapping the user command directly with & PID=$! is fragile for multi-line scripts and should be enclosed in a subshell, and hardcoding high resource limits (CPU and memory) for the workload container may cause the coordinator pod to remain permanently pending on smaller node pools.

Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl Outdated
Comment thread pkg/orchestrator/gke/templates/pathways_jobset.tmpl
@SwarnaBharathiMantena SwarnaBharathiMantena force-pushed the swarnabm/pathways-jobset-alignment branch from 2f6ce97 to b2cb4eb Compare June 25, 2026 20:40
@SwarnaBharathiMantena SwarnaBharathiMantena force-pushed the swarnabm/pathways-jobset-alignment branch from b2cb4eb to 60cbce0 Compare June 25, 2026 20:42