Monarch k8s#325
Just stumbled on this, pretty cool. If you end up publishing the PR, I think there shouldn't be an entirely new script. The job allocator in the original script should be pluggable based on the scheduler you're using. Maybe some nice abstraction to do here. cc @d4l3k
@amirafzali Agreed. This needs more polish; I just drafted it here until we fix some ongoing issues in Monarch (see meta-pytorch/monarch#3435).
This pull request has been imported. If you are a Meta employee, you can view this in D106979430. (Because this pull request was imported automatically, there will not be any future comments.)
…clean → train_distributed_k8s_pod_restart. Update README.
@d4l3k @tushar00jain CI here is red on a repo-wide nightly dependency issue, not this PR. lint, Docs / build, and unittest all fail at the `pip install --pre torch torchvision torchcomms` step before anything runs:
Summary
Adds Kubernetes support for fault-tolerant distributed training with Monarch + TorchFT + TorchTitan.
- `train_distributed_k8s.py`: K8s orchestration script with `MonarchKubernetes` scheduler, `ReplicaActor` supervision boundary, inner retry loop, and failure injection
- `README.md` with K8s setup instructions (container image, controller pod, RBAC, headless service)
- `utils/failure.py` with configurable failure injection

Training runs end-to-end with 2 replicas × 8 GPUs. FT checkpoint recovery from healthy replicas works.
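To make the "inner retry loop" and "configurable failure injection" concrete, here is a minimal, framework-free sketch of the pattern. It is not the code in this PR: `FailureInjector`, `InjectedFailure`, and `run_replica` are hypothetical names, and a local checkpoint counter stands in for TorchFT's recovery from healthy replicas.

```python
from dataclasses import dataclass, field


class InjectedFailure(RuntimeError):
    """Raised by the injector to simulate a replica crash."""


@dataclass
class FailureInjector:
    # Hypothetical knob: steps at which to fail, each firing at most once.
    # The real utils/failure.py may be configured differently.
    fail_at_steps: set = field(default_factory=set)

    def maybe_fail(self, step: int) -> None:
        if step in self.fail_at_steps:
            self.fail_at_steps.discard(step)
            raise InjectedFailure(f"injected failure at step {step}")


def run_replica(total_steps, injector, max_restarts=3, checkpoint_every=10):
    """Run total_steps steps, resuming from the last checkpoint after a failure.

    Returns (steps_completed, restarts_used). Re-raises once the restart
    budget is exhausted so an outer supervisor can take over.
    """
    checkpoint = 0  # stand-in for recovered training state
    restarts = 0
    while True:
        try:
            for step in range(checkpoint, total_steps):
                injector.maybe_fail(step)
                # ... one training step would run here ...
                if (step + 1) % checkpoint_every == 0:
                    checkpoint = step + 1
            return total_steps, restarts
        except InjectedFailure:
            if restarts >= max_restarts:
                raise
            restarts += 1
```

For example, `run_replica(30, FailureInjector({15}), checkpoint_every=10)` fails once at step 15, resumes from step 10, and finishes with one restart used.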
Known Blockers
- `async with proc_mesh:` cleanup hangs when surviving processes are stuck in C-level calls (e.g., NCCL allreduce). `proc_mesh.stop()` cannot terminate unresponsive processes. Fix in progress by the Monarch team (controller stop path partitioning by rank health state).
- Repeated `__supervise__` callbacks for already-dead actors (~every 10s). Monarch PR #3383 partially addresses this.

Test Plan
- --training-steps 1500)
- `call()` raises on child death (confirmed via minimal repro)