Monarch k8s by HosseinKaviani-H · Pull Request #325 · meta-pytorch/torchft

HosseinKaviani-H · 2026-05-12T17:50:32Z

Summary

Adds Kubernetes support for fault-tolerant distributed training with Monarch + TorchFT + TorchTitan.

train_distributed_k8s.py — K8s orchestration script with MonarchKubernetes scheduler, ReplicaActor supervision boundary, inner retry loop, and failure injection
Updated README.md with K8s setup instructions (container image, controller pod, RBAC, headless service)
Updated utils/failure.py with configurable failure injection

Training runs end-to-end with 2 replicas × 8 GPUs. FT checkpoint recovery from healthy replicas works.
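The inner retry loop mentioned in the summary can be pictured roughly as below. This is a minimal sketch, not the code in train_distributed_k8s.py: `supervise_replica`, `run_replica`, `recreate_job`, and `max_job_attempts` are hypothetical names, and the values given to `PROC_ATTEMPTS` and `PROC_ATTEMPT_DELAY` are assumptions (only the names appear in the commit log). The idea it illustrates is retrying on the existing allocation a few times before paying for a new job.

```python
import asyncio
from typing import Awaitable, Callable

PROC_ATTEMPTS = 3         # assumed value; only the name comes from the commit log
PROC_ATTEMPT_DELAY = 0.0  # seconds; assumed, set to 0 so the sketch runs instantly


async def supervise_replica(
    run_replica: Callable[[], Awaitable[None]],   # hypothetical: trains one replica
    recreate_job: Callable[[], Awaitable[None]],  # hypothetical: re-allocates the job
    max_job_attempts: int = 2,
) -> bool:
    """Return True once run_replica completes, False if every attempt fails."""
    for job_attempt in range(max_job_attempts):
        if job_attempt > 0:
            # Only recreate the job after PROC_ATTEMPTS process-level retries failed.
            await recreate_job()
        for _ in range(PROC_ATTEMPTS):
            try:
                await run_replica()
                return True
            except Exception as exc:
                print(f"replica failed: {exc!r}; retrying")
                await asyncio.sleep(PROC_ATTEMPT_DELAY)
    return False
```

With these assumed values, a replica that fails four times in a row triggers exactly one job recreation before the fifth attempt succeeds.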

Known Blockers

async with proc_mesh: cleanup hangs when surviving processes are stuck in C-level calls (e.g., NCCL allreduce). proc_mesh.stop() cannot terminate unresponsive processes. A fix is in progress by the Monarch team (controller stop path partitioning by rank health state).
Repeated __supervise__ callbacks for already-dead actors (~every 10s). Monarch PR #3383 partially addresses this.
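One generic way to keep a hung cleanup from blocking the controller is to bound the graceful stop with a deadline and escalate afterwards. The sketch below uses plain asyncio only; `stop` and `force_kill` are hypothetical stand-ins rather than Monarch APIs, and this is not the fix the Monarch team is working on.

```python
import asyncio
from typing import Awaitable, Callable


async def stop_with_deadline(
    stop: Callable[[], Awaitable[None]],        # hypothetical graceful stop
    force_kill: Callable[[], Awaitable[None]],  # hypothetical hard kill (e.g. delete the pod)
    timeout_s: float,
) -> str:
    """Try a graceful stop; escalate if it does not finish within timeout_s."""
    try:
        await asyncio.wait_for(stop(), timeout=timeout_s)
        return "stopped"
    except asyncio.TimeoutError:
        # The graceful path did not return in time, so fall back to the hard kill.
        await force_kill()
        return "killed"
```

This only bounds the wait on the controller side. It does not unstick a remote process that is blocked inside a C-level call, which is why the escalation step has to act from outside that process.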

Test Plan

Training completes without failures (--training-steps 1500)
call() raises on child death (confirmed via minimal repro)
FT checkpoint loading from healthy replica after failure
Full recovery cycle with failure injection (blocked by issue 1)

amirafzali · 2026-05-15T21:00:49Z

Just stumbled on this, pretty cool. If you end up publishing the PR, I think there shouldn't be an entirely new script. The job allocator in the original script should be pluggable based on the scheduler you're using. Maybe some nice abstraction to do here. cc @d4l3k
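For illustration, a pluggable allocator of the kind suggested here could look something like the sketch below. Every name in it (`JobAllocator`, `SlurmAllocator`, `KubernetesAllocator`, `make_allocator`) is hypothetical and not taken from torchft or Monarch, and the allocators return placeholder strings where a real one would call into its scheduler.

```python
from dataclasses import dataclass
from typing import Protocol


class JobAllocator(Protocol):
    """Hypothetical interface the training script would depend on."""

    def allocate(self, replica_id: int, hosts: int) -> str: ...


@dataclass
class SlurmAllocator:
    partition: str = "default"

    def allocate(self, replica_id: int, hosts: int) -> str:
        # Placeholder: a real implementation would submit a SLURM job here.
        return f"slurm:{self.partition}:replica{replica_id}x{hosts}"


@dataclass
class KubernetesAllocator:
    namespace: str = "default"

    def allocate(self, replica_id: int, hosts: int) -> str:
        # Placeholder: a real implementation would create K8s resources here.
        return f"k8s:{self.namespace}:replica{replica_id}x{hosts}"


_ALLOCATORS = {"slurm": SlurmAllocator, "k8s": KubernetesAllocator}


def make_allocator(scheduler: str) -> JobAllocator:
    """Pick an allocator by name so one script can serve several schedulers."""
    try:
        return _ALLOCATORS[scheduler]()
    except KeyError:
        raise ValueError(f"unknown scheduler: {scheduler!r}") from None
```

The training loop would then take a `JobAllocator` and never branch on the scheduler type itself.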

HosseinKaviani-H · 2026-05-15T21:14:53Z

@amirafzali Agreed. This needs more polish; I just drafted it here until we fix some ongoing issues in Monarch (see meta-pytorch/monarch#3435).

meta-codesync · 2026-05-31T07:56:50Z

This pull request has been imported. If you are a Meta employee, you can view this in D106979430. (Because this pull request was imported automatically, there will not be any future comments.)

HosseinKaviani-H · 2026-06-15T17:41:46Z

@d4l3k @tushar00jain CI here is red because of a repo-wide nightly dependency issue, not this PR. lint, Docs / build, and unittest all fail at the pip install --pre torch torchvision torchcomms step before anything runs:
torchvision 0.27.0.dev20260407 requires torch==2.12.0.dev20260407, but only torch 2.12.0.dev20260408 is in the nightly index --> ResolutionImpossible
Re-running doesn't help (the matching torch nightly has been pruned). This repros on every open PR and on main. Could you take a look / re-run once the nightlies align, or drop torchvision from those workflow installs if it isn't needed?
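The conflict comes down to an exact-version pin whose target is no longer in the index. A small sketch of that check, using the versions quoted in the comment above; `pin_is_satisfiable` is a hypothetical helper that uses plain string comparison rather than pip's resolver.

```python
def pin_is_satisfiable(required_pin: str, available: list[str]) -> bool:
    """True if an exact '==' pin such as 'torch==X' matches an available version."""
    _name, _, version = required_pin.partition("==")
    return version in available


# torchvision 0.27.0.dev20260407 pins torch==2.12.0.dev20260407,
# but the nightly index only offers the 20260408 build.
print(pin_is_satisfiable("torch==2.12.0.dev20260407", ["2.12.0.dev20260408"]))  # False
```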

Hossein Kavianihamedani added 2 commits May 12, 2026 10:37

Add Monarch K8s fault-tolerant training with TorchFT

c872c42

Add updated README and failure injection utils

ff686bc

meta-cla Bot added the CLA Signed label May 12, 2026

Hossein Kavianihamedani added 25 commits May 18, 2026 19:26

Remove async with proc_mesh to avoid stop() hang on broken broadcast tree

12c3c7f

Putting Async back for test

4bed860

Update to pod_template_spec for Monarch operator v0.2.0

8536529

Revert pod_template_spec, keep pod_spec for operator 0.1.2

f1c3155

Update to pod_template_spec for Monarch operator v0.2.0-second try

8803406

Revert to pod_spec for 0.5.0rc1

5ef924c

Add simple two-replica connectivity test

fa61d01

Add worker debug script

9fb76d5

Update test to use call() for multi-GPU

199e315

Add clean K8s script without supervise for recovery testing

b3f0d46

Use only SEGFAULT/KILL_PROC failures, increase rest_time to 600s

1b18cb3

Don't recreate K8s job on every failure, only every PROC_ATTEMPTS

4880eb4

Skip proc_mesh.stop in teardown

6e22fa4

Add minimal __supervise__ to prevent root actor kill-all

a4fb400

Remove async with proc_mesh to avoid stop() hang on failure

45a9306

Increase spawn delay to 70s to wait for orphan cleanup

49f80f7

Exact SLURM pattern: no supervise, async with

5af016e

Reduce NCCL timeout to 30s, add supervise, delay respawn 35s

8c9ee90

Add a monkey patch

49836eb

NCCL timeout

c77b029

Add minimal K8s FT training script mirroring SLURM pattern

c57e446

Minimal train script - implement no clean up

c51f8e7

No supervise

e953084

Add repro script for stale TCP session on HostMesh reuse

3fc6086

Add repro script for stale TCP session on HostMesh reuse - change

e736330

Hossein Kavianihamedani added 9 commits May 20, 2026 20:52

Keep pods alive after test_two_replicas — don't delete CRDs

8dc7ce5

Repro: match training script's exact scheduler/ReplicaActor code path

a4e0a7c

Add excalidraw diagram explaining broadcast tree gap and recovery blocker

8e4f9fd

Fix excalidraw font families — use fontFamily 2 (supported)

00a5676

Fix excalidraw text box sizes — increase heights and widths for full visibility

7f53667

Fix excalidraw: set autoResize false so text renders as editable text

7cb1ca3

Add broadcast tree gap doc as markdown

cf456a4

Update repro script: use pod_template with V1PodTemplateSpec for operator 0.2.0

caecf1e

Revert to pod_template=pod_spec (CRD expects pod spec fields directly)

45a3ff9

Hossein Kavianihamedani added 17 commits May 31, 2026 01:08

Use V1PodTemplateSpec(spec=pod_spec) for operator 0.2.0 with updated CRD

263a7d0

Update launcher to use hossein-ft-v12 image

154c7e8

Update train_k8s_minimal to use pod_template for operator 0.2.0

1ce9a42

PROC_ATTEMPT_DELAY=65, add recovery timing logs

4b76e0a

Side-by-side comparison: minimal (HostMesh reuse) vs clean (pod restart)

2c66575

Set mesh_orphan_timeout=10s, PROC_ATTEMPT_DELAY=15 — target ~17s recovery

6618aa2

Add baseline script: no __supervise__, full restart on failure, for comparison

88445d2

Baseline: no-Monarch FT training via torchrun + K8s Jobs for comparison

5aa99b0

Add advanced FT on K8s doc: orphan pattern, results, architecture

37004bf

Add RBAC for K8s Jobs — needed for baseline script

f3f9887

Fix baseline: write train script to file, then torchrun it

57dad94

Fix baseline: fully self-contained inline train script, no imports from host

cb85047

Add --hosts-per-replica support for multi-node replicas

711b7aa

Clean up repo: keep only core training scripts and utils

5471d2d

Rename scripts: train_k8s_minimal → train_distributed_k8s, train_k8s_clean → train_distributed_k8s_pod_restart. Update README.

4b1063b

Remove pod restart script from repo, keep locally

1b5859a

Address PR review: make failure.py scheduler-agnostic, fix README config/output docs

23376f2

HosseinKaviani-H marked this pull request as ready for review June 15, 2026 17:20

Re-trigger CI

2e50d84

HosseinKaviani-H wants to merge 62 commits into meta-pytorch:main from HosseinKaviani-H:Monarch_K8s

2 participants
