Monarch k8s #325

Open
HosseinKaviani-H wants to merge 62 commits into meta-pytorch:main from HosseinKaviani-H:Monarch_K8s

@HosseinKaviani-H


Summary

Adds Kubernetes support for fault-tolerant distributed training with Monarch + TorchFT + TorchTitan.

  • train_distributed_k8s.py — K8s orchestration script with MonarchKubernetes scheduler, ReplicaActor supervision boundary, inner retry loop, and failure injection
  • Updated README.md with K8s setup instructions (container image, controller pod, RBAC, headless service)
  • Updated utils/failure.py with configurable failure injection

Training runs end-to-end with 2 replicas × 8 GPUs. FT checkpoint recovery from healthy replicas works.
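As a rough illustration of the inner retry loop and failure injection mentioned in the summary, here is a minimal sketch in plain asyncio. It does not use Monarch, TorchFT, or TorchTitan APIs: `InjectedFailure`, `train_replica`, and `supervise_replica` are hypothetical names invented for this example, and the restart simply begins again from step 0 instead of recovering from a healthy replica's checkpoint.

```python
import asyncio


class InjectedFailure(RuntimeError):
    """Hypothetical error raised to simulate a replica crash."""


async def train_replica(replica_id: int, steps: int, fail_at: set[int]) -> int:
    """Stand-in for one replica's training run (not the real trainer)."""
    for step in range(steps):
        if step in fail_at:
            fail_at.discard(step)  # inject each configured failure only once
            raise InjectedFailure(f"replica {replica_id} failed at step {step}")
        await asyncio.sleep(0)  # placeholder for a real training step
    return steps


async def supervise_replica(
    replica_id: int, steps: int, fail_at: set[int], max_retries: int = 3
) -> int:
    """Inner retry loop: restart only this replica and report the restart count."""
    for attempt in range(max_retries + 1):
        try:
            await train_replica(replica_id, steps, fail_at)
            return attempt  # number of restarts that were needed
        except InjectedFailure:
            if attempt == max_retries:
                raise  # retries exhausted, let the outer layer decide
    raise AssertionError("unreachable")


async def run_all() -> list[int]:
    # Replica 0 gets one injected failure; replica 1 runs cleanly.
    return list(
        await asyncio.gather(
            supervise_replica(0, steps=5, fail_at={3}),
            supervise_replica(1, steps=5, fail_at=set()),
        )
    )


if __name__ == "__main__":
    print(asyncio.run(run_all()))
```

The point of the sketch is the boundary: a failure in one replica is caught and retried inside that replica's own supervisor, so the other replica keeps running.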

Known Blockers

  1. Cleanup in async with proc_mesh: hangs when surviving processes are stuck in C-level calls (e.g., NCCL allreduce), because proc_mesh.stop() cannot terminate unresponsive processes. The Monarch team is working on a fix
    (partitioning the controller stop path by rank health state).

  2. Repeated __supervise__ callbacks for already-dead actors (~every 10s). Monarch PR #3383 partially addresses this.
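Until the first blocker is fixed upstream, one possible controller-side guard is to bound the cleanup call with a deadline. The sketch below is only an assumption about how that could look: `FakeProcMesh` is a hypothetical stand-in, not Monarch's proc mesh, and the approach only helps if the real stop call is an awaitable that can be cancelled. It does not free a process that is stuck in a C-level call.

```python
import asyncio


class FakeProcMesh:
    """Hypothetical stand-in for a process mesh whose stop() may never return."""

    def __init__(self, hang: bool) -> None:
        self.hang = hang
        self.stopped = False

    async def stop(self) -> None:
        if self.hang:
            # Simulates waiting forever on a process stuck in a C-level call.
            await asyncio.Event().wait()
        self.stopped = True


async def stop_with_deadline(mesh: FakeProcMesh, timeout_s: float) -> bool:
    """Return True if stop() finished in time, False if the caller must escalate."""
    try:
        await asyncio.wait_for(mesh.stop(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


if __name__ == "__main__":
    print(asyncio.run(stop_with_deadline(FakeProcMesh(hang=True), 0.1)))
```

When the helper returns False, the controller would still need some out-of-band way to reclaim the stuck processes; what that is depends on the deployment and is outside this sketch.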

Test Plan

  • Training completes without failures (--training-steps 1500)
  • call() raises on child death (confirmed via minimal repro)
  • FT checkpoint loading from healthy replica after failure
  • Full recovery cycle with failure injection (blocked by issue 1)

meta-cla Bot added the CLA Signed label May 12, 2026
@amirafzali

Member

Just stumbled on this, pretty cool. If you end up publishing the PR, I think there shouldn't be an entirely new script. The job allocator in the original script should be pluggable based on the scheduler you're using. Maybe some nice abstraction to do here. cc @d4l3k
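One way to picture the pluggable allocator suggested here is a small interface plus a registry keyed by scheduler name. This is a sketch under assumptions, not the structure of the existing script: `JobSpec`, `JobAllocator`, `LocalAllocator`, `KubernetesAllocator`, and `make_allocator` are all hypothetical names, and the allocators return placeholder strings instead of talking to a real scheduler.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical description of what the training script needs."""

    replicas: int
    gpus_per_replica: int


class JobAllocator(Protocol):
    """Interface every scheduler backend would implement."""

    def allocate(self, spec: JobSpec) -> list[str]:
        ...


class LocalAllocator:
    """Placeholder backend standing in for the original script's allocator."""

    def allocate(self, spec: JobSpec) -> list[str]:
        return [f"local-replica-{i}" for i in range(spec.replicas)]


class KubernetesAllocator:
    """Placeholder backend; a real one would create or look up pods."""

    def allocate(self, spec: JobSpec) -> list[str]:
        return [f"k8s-replica-{i}" for i in range(spec.replicas)]


# Registry keyed by a scheduler name, e.g. taken from a command-line flag.
ALLOCATORS: dict[str, Callable[[], JobAllocator]] = {
    "local": LocalAllocator,
    "k8s": KubernetesAllocator,
}


def make_allocator(scheduler: str) -> JobAllocator:
    """Pick the backend by name so the training script stays scheduler-agnostic."""
    try:
        return ALLOCATORS[scheduler]()
    except KeyError:
        raise ValueError(f"unknown scheduler: {scheduler!r}") from None


if __name__ == "__main__":
    print(make_allocator("k8s").allocate(JobSpec(replicas=2, gpus_per_replica=8)))
```

With a shape like this, the K8s path becomes one more registry entry in the original script instead of a separate train_distributed_k8s.py.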

@HosseinKaviani-H

Author

> Just stumbled on this, pretty cool. If you end up publishing the PR, I think there shouldn't be an entirely new script. The job allocator in the original script should be pluggable based on the scheduler you're using. Maybe some nice abstraction to do here. cc @d4l3k

@amirafzali Agreed. This needs more polish; I just drafted it here until we fix some ongoing issues in Monarch (see meta-pytorch/monarch#3435).

@meta-codesync

meta-codesync Bot commented May 31, 2026


This pull request has been imported. If you are a Meta employee, you can view this in D106979430. (Because this pull request was imported automatically, there will not be any future comments.)

HosseinKaviani-H marked this pull request as ready for review June 15, 2026 17:20
@HosseinKaviani-H

Author

@d4l3k @tushar00jain CI here is red because of a repo-wide nightly dependency issue, not this PR. lint, Docs / build, and unittest all fail at the pip install --pre torch torchvision torchcomms step before anything runs:

torchvision 0.27.0.dev20260407 requires torch==2.12.0.dev20260407, but only torch 2.12.0.dev20260408 is in the nightly index --> ResolutionImpossible

Re-running doesn't help (the matching torch nightly has been pruned). This repros on every open PR and on main. Could you take a look / re-run once the nightlies align, or drop torchvision from those workflow installs if it isn't needed?

