Description:
TraceML currently captures compute, memory, and input pipeline timing, but it does not yet measure total distributed communication time. Add a new step-level sampler that estimates the total communication cost for each training step, starting from the first collective before the split / communication phase and aggregating all communication activity during the step.
This should be implemented with lightweight hooks or patches similar to the existing PyTorch instrumentation paths. The feature must be best-effort: if patch installation fails, user training code must continue to run without interruption. Users should also be able to disable communication timing completely when they do not want the overhead.
Acceptance Criteria:
- A new sampler records step-level communication time
- Communication timing is aggregated per step and per rank
- Instrumentation is best-effort and never breaks user training if patching fails
- Users can disable communication timing entirely
- No rendering/UI changes are required in this issue
Description:
TraceML currently captures compute, memory, and input pipeline timing, but it does not yet measure total distributed communication time. Add a new step-level sampler that estimates the total communication cost for each training step, starting from the first collective before the split / communication phase and aggregating all communication activity during the step.
This should be implemented with lightweight hooks or patches similar to the existing PyTorch instrumentation paths. The feature must be best-effort: if patch installation fails, user training code must continue to run without interruption. Users should also be able to disable communication timing completely when they do not want the overhead.
Acceptance Criteria: