Feat(pathways): Align GCluster JobSet with XPK production defaults#5849
Conversation
Aligns the Kubernetes JobSet manifests generated by GCluster with XPK and GKE Pathways standards to ensure reliable execution of distributed JAX workloads. - Injected JAX proxy environment variables (JAX_PLATFORMS, JAX_BACKEND_TARGET, XCLOUD_ENVIRONMENT) into the JAX workload container. - Added host-path volume mount for /tmp to enable shared-memory and local socket IPC between the JAX client, Proxy, and Resource Manager. - Enabled privileged security context (privileged: true) on the JAX container to allow host network binding and physical memory locking. - Added default resource limits (cpu: "24", memory: "100Gi") to the JAX workload container to prevent CPU node starvation. - Wrapped the user command in a SIGTERM-propagating bash trap to ensure reliable checkpoint-on-preemption during Spot VM evictions. - Stamped exclusive-topology annotations on the worker replicated job to force contiguous scheduling on GKE TPU node pools. - Natively injected ALTS bypass environment variables to prevent secure gRPC handshake failures on standard GKE VPC networks. - Propagated priorityClassName to the head coordinator pod spec. - Updated GKE orchestrator unit tests to assert all new configurations.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request aligns GCluster's JobSet manifests with XPK and GKE Pathways production defaults. The changes focus on hardening the JAX workload container, improving scheduling efficiency, and ensuring reliable execution and checkpointing in distributed environments. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the GKE Jobset template for Pathways, introducing priority class names, exclusive topology annotations, additional environment variables, and a SIGTERM-propagating trap wrapper for the workload container. Feedback on these changes highlights a critical issue in the SIGTERM trap implementation, which introduces up to a 5-second delay during Spot VM evictions due to a polling loop; refactoring to use wait $PID directly is recommended. Additionally, the reviewer advises against hardcoding imagePullPolicy: Always on several containers to avoid increased startup latency and registry rate-limiting, and notes that the corresponding unit tests should be updated to match the corrected SIGTERM pattern.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the GKE job orchestrator to algorithmically derive the Pathways platform key from GKE accelerator labels and enhances the JobSet template with support for priority classes, resource limits, signal handling, and exclusive topology annotations. The review feedback highlights a bug in the platform key derivation algorithm that fails to fully strip the 'tpu' prefix (e.g., producing 'tpuv6e' instead of 'v6e') and recommends removing the hardcoded imagePullPolicy: Always from the pathways containers to prevent startup delays and registry rate-limiting issues.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the GKE Job Orchestrator to algorithmically derive the Pathways platform key from GKE accelerator labels and enhances the pathways jobset template with priority class names, environment variables, security contexts, resource limits, and SIGTERM handling. Feedback on these changes highlights two critical issues: first, the platform key derivation logic fails to strip the 'tpu-' prefix, which will cause topology lookup failures; second, specifying high resource limits without explicit requests for the workload container may lead to unschedulable pods on standard coordinator nodes.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the GKE orchestrator to algorithmically derive the Pathways platform key from GKE accelerator labels and enhances the Pathways JobSet template with resource limits, environment variables, a shared /tmp volume, and a SIGTERM preemption trap. Feedback highlights two key issues: first, the derived pathwaysPlatform needs to strip the 'tpu' prefix to prevent runtime lookup failures in the Pathways server; second, the _sigterm trap in the JobSet template should verify that $PID is set before executing the kill command to avoid bash runtime errors during early container startup.
Neelabh94
left a comment
There was a problem hiding this comment.
While it addresses most of the gaps, there are still two key issues that need to be resolved before it can be considered complete:
- TPU v5p Platform Mapping Bug
The algorithmic derivation for pathwaysInstanceType in pkg/orchestrator/gke/manifest_generator.go (lines 45–47):
pathwaysPlatform := strings.ReplaceAll(normalizedLabel, "-podslice", "")
pathwaysPlatform = strings.ReplaceAll(pathwaysPlatform, "-slice", "")
pathwaysPlatform = strings.ReplaceAll(pathwaysPlatform, "-", "")will map the GKE label tpu-v5p-slice to tpuv5p.
The Issue: JAX/Pathways does not recognize tpuv5p. It expects tpuv5 for TPU v5p slice architectures (as verified in XPK's system_characteristics.py mapping).
Fix: Add a normalization step to convert tpuv5p to tpuv5.
- Redundant Volume Mount in pathways-rm Container
The PR prepends the shared-tmp volume mount at /tmp to workload-container (which is correct), but it fails to remove the redundant /tmp volume mount from the pathways-rm container in pkg/orchestrator/gke/templates/pathways_jobset.tmpl.
The Issue: XPK does not mount /tmp in pathways-rm (only the workload and worker containers require access to the shared temp directory).
Fix: Remove the volumeMounts section from the pathways-rm container block.
… clean redundant rm volume mounts
Thanks for suggesting two important fixes here, @Neelabh94. I updated the code. |
01b3e38 to
2f6ce97
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the GKE Job Orchestrator to dynamically derive Pathways platform keys from GKE accelerator labels, adds support for priority class names, and configures environment variables, resources, and signal trapping in the Pathways JobSet template. The review feedback highlights two key issues: wrapping the user command directly with & PID=$! is fragile for multi-line scripts and should be enclosed in a subshell, and hardcoding high resource limits (CPU and memory) for the workload container may cause the coordinator pod to remain permanently pending on smaller node pools.
2f6ce97 to
b2cb4eb
Compare
b2cb4eb to
60cbce0
Compare
Aligns the Kubernetes JobSet manifests generated by GCluster with XPK and GKE Pathways standards to ensure reliable execution of distributed JAX workloads.
🔄 Revisions during Review
wait $PIDinstead of a 5-second polling loop, ensuring maximum grace period availability.imagePullPolicy: Alwaysfrom the workload container to allow Kubernetes to default toIfNotPresent. This prevents unnecessary multi-gigabyte image downloads on container restarts and aligns GCluster's output directly with XPK.