Move TraceML runtime startup into user code via traceml.init(...) #84

@abhinavsriva

Description:

Update TraceML so the runtime is started from the user’s Python code instead of by traceml run. Today, traceml run starts the runtime and then executes the user script via runpy. In the new model, users launch their training script directly with python or torchrun and call traceml.init(...) inside their code.

In this design, traceml.init(...) becomes responsible for starting the TraceML runtime threads and connecting to the aggregator over TCP using a configurable host and port. If the aggregator is not available, initialization should retry for a short bounded period and then fail with a clear message telling the user to start traceml serve or disable tracing.
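
The bounded-retry behavior described above could look roughly like the sketch below. It is an illustration only, not TraceML's actual implementation: the function name `connect_with_retry`, the exception `AggregatorUnavailableError`, and the timeout defaults are all assumptions made for this example. It uses only the standard library.

```python
import socket
import time


class AggregatorUnavailableError(RuntimeError):
    """Hypothetical error: the aggregator was not reachable within the retry budget."""


def connect_with_retry(host, port, total_timeout_s=5.0, retry_interval_s=0.5):
    """Retry a TCP connect until total_timeout_s elapses, then fail clearly."""
    deadline = time.monotonic() + total_timeout_s
    last_error = None
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Cap each attempt so one slow connect cannot exceed the overall budget.
            return socket.create_connection((host, port), timeout=min(remaining, 1.0))
        except OSError as exc:
            last_error = exc
            # Sleep between attempts, but never past the deadline.
            time.sleep(min(retry_interval_s, max(0.0, deadline - time.monotonic())))
    raise AggregatorUnavailableError(
        f"Could not reach the TraceML aggregator at {host}:{port} after "
        f"{total_timeout_s:.1f}s ({last_error}). Start it with `traceml serve` "
        "or disable tracing."
    )
```

Using a monotonic deadline rather than a fixed attempt count keeps the worst-case startup delay predictable, which matters when many torchrun ranks initialize at once.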

This issue is about moving runtime ownership into the training process itself. The change must preserve the existing disabled (no-op) path and must not hang indefinitely when the aggregator is missing. traceml run remains available for compatibility, but the preferred path becomes direct execution of user code with an explicit traceml.init(...) call.

Acceptance Criteria:

  • traceml.init(...) starts the TraceML runtime from inside user code
  • traceml.init(...) connects to a running aggregator over TCP
  • TCP host/port are configurable
  • If the aggregator is missing, initialization fails clearly after bounded retries
  • Disabled tracing remains a no-op
  • traceml run remains available for backward compatibility
  • No UI/rendering changes in this issue

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
