Title: Add step-level total communication time sampler #86

@abhinavsriva

Description:

TraceML currently captures compute, memory, and input-pipeline timing, but it does not yet measure the time spent in distributed communication. Add a new step-level sampler that estimates the total communication cost of each training step, starting from the first collective before the split / communication phase and aggregating all communication activity that occurs during the step.

This should be implemented with lightweight hooks or patches, similar to the existing PyTorch instrumentation paths. The feature must be best-effort: if patch installation fails, user training code must continue to run uninterrupted. Users must also be able to disable communication timing completely when they do not want the overhead.
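
One possible shape for the best-effort patching described above, as a rough sketch rather than TraceML's actual code. The names `install_comm_patch`, `TRACEML_DISABLE_COMM_TIMING`, and the `fake_dist` stand-in are all hypothetical; in practice the target would presumably be a collective such as `torch.distributed.all_reduce`.

```python
import functools
import os
import time
import types

# Hypothetical opt-out switch; the issue does not name the real setting.
COMM_TIMING_ENV = "TRACEML_DISABLE_COMM_TIMING"


def install_comm_patch(namespace, name, record):
    """Wrap namespace.<name> so each call reports its host-side duration.

    Returns True if the patch was installed, False otherwise. Never raises.
    """
    if os.environ.get(COMM_TIMING_ENV) == "1":
        return False  # user opted out: leave the original untouched
    try:
        original = getattr(namespace, name)

        @functools.wraps(original)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return original(*args, **kwargs)
            finally:
                try:
                    record(name, time.perf_counter() - start)
                except Exception:
                    pass  # a failing recorder must not affect training

        setattr(namespace, name, timed)
        return True
    except Exception:
        return False  # best-effort: patching failed, training continues


# Stand-in for a communication library (assumption, for illustration only).
fake_dist = types.SimpleNamespace(all_reduce=lambda x: x)
samples = []
installed = install_comm_patch(
    fake_dist, "all_reduce", lambda n, dt: samples.append((n, dt))
)
result = fake_dist.all_reduce(3)  # behaves as before, but is now timed
missing = install_comm_patch(fake_dist, "no_such_op", lambda n, dt: None)
```

Note that this measures host-side wall-clock time only. Collectives that run asynchronously on the GPU would likely need CUDA events or an explicit synchronization point to capture their real duration.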

Acceptance Criteria:

  • A new sampler records step-level communication time
  • Communication timing is aggregated per step and per rank
  • Instrumentation is best-effort and never breaks user training if patching fails
  • Users can disable communication timing entirely
  • No rendering/UI changes are required in this issue
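
The per-step, per-rank aggregation in the criteria above could look roughly like the sketch below. `CommStepSampler`, its methods, and the row layout are hypothetical and only illustrate the accumulate-then-flush idea; they are not an existing TraceML interface.

```python
class CommStepSampler:
    """Accumulates communication time for the current step on one rank."""

    def __init__(self, rank):
        self.rank = rank
        self.step = 0
        self._comm_seconds = 0.0

    def record(self, name, seconds):
        # Called by the timing hooks once per communication call.
        self._comm_seconds += seconds

    def end_step(self):
        # Emit one row for the finished step, then reset for the next one.
        row = {
            "rank": self.rank,
            "step": self.step,
            "comm_time_s": self._comm_seconds,
        }
        self.step += 1
        self._comm_seconds = 0.0
        return row


sampler = CommStepSampler(rank=0)
sampler.record("all_reduce", 0.25)
sampler.record("all_gather", 0.5)
first = sampler.end_step()   # step 0: both calls summed
second = sampler.end_step()  # step 1: no communication recorded
```

Keeping a single running total per step keeps the hook path cheap; a per-collective breakdown could be added later without changing the row emitted at the step boundary.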

Labels: enhancement (New feature or request)
