Description:
Update TraceML so the runtime is started from the user’s Python code instead of being launched by traceml run. Today, traceml run starts the runtime and then executes the user script via runpy. In the new model, users launch their training script directly with python or torchrun and call traceml.init(...) inside their code.
In this design, traceml.init(...) becomes responsible for starting the TraceML runtime threads and connecting to the aggregator over TCP using a configurable host and port. If the aggregator is not available, initialization should retry for a short bounded period and then fail with a clear message telling the user to start traceml serve or disable tracing.
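The bounded-retry behavior described above could look roughly like the sketch below. It is not the TraceML implementation: the function name connect_with_retries, the exception AggregatorUnavailableError, and the default attempt count, delay, and timeout are all assumptions chosen for illustration.

```python
import socket
import time


class AggregatorUnavailableError(RuntimeError):
    """Hypothetical error raised when the aggregator cannot be reached."""


def connect_with_retries(host, port, attempts=5, delay_s=0.5, timeout_s=1.0):
    """Try to open a TCP connection a bounded number of times, then give up.

    All defaults here are illustrative, not values taken from TraceML.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # create_connection applies timeout_s to the connect call itself,
            # so a single attempt cannot block indefinitely.
            return socket.create_connection((host, port), timeout=timeout_s)
        except OSError as exc:
            last_error = exc
            # Sleep between attempts, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    raise AggregatorUnavailableError(
        f"Could not reach the TraceML aggregator at {host}:{port} after "
        f"{attempts} attempts ({last_error}). Start it with `traceml serve` "
        "or disable tracing."
    )
```

With these defaults the worst case is bounded at roughly attempts * timeout_s + (attempts - 1) * delay_s seconds, so initialization either returns a connected socket or fails with an actionable message.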
This issue moves runtime ownership into the training process itself. It must preserve the existing no-op path when tracing is disabled and must not hang forever when the aggregator is missing. traceml run remains available for compatibility, but the preferred path becomes direct execution of user code with an explicit traceml.init(...) call.
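One way to keep the disabled path a true no-op is to resolve configuration before any thread or socket is created and return early when tracing is off. The sketch below illustrates that idea only. The names resolve_config and RuntimeConfig, the environment variables TRACEML_DISABLED, TRACEML_HOST, and TRACEML_PORT, and the default port are hypothetical and are not part of the current TraceML API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical container for the aggregator endpoint."""
    host: str
    port: int


def resolve_config(
    enabled: bool = True,
    host: Optional[str] = None,
    port: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[RuntimeConfig]:
    """Return the endpoint to connect to, or None when tracing is disabled.

    Explicit arguments win over environment variables, which win over
    defaults. The variable names and default port are placeholders.
    """
    env = os.environ if env is None else env
    # Disabled path: do no further work, so the caller starts nothing.
    if not enabled or env.get("TRACEML_DISABLED") == "1":
        return None
    resolved_host = host or env.get("TRACEML_HOST", "127.0.0.1")
    resolved_port = int(port if port is not None else env.get("TRACEML_PORT", "29765"))
    return RuntimeConfig(resolved_host, resolved_port)
```

Under this sketch, an init function would call resolve_config first and return immediately on None, which keeps disabled runs free of runtime threads and network access.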
Acceptance Criteria:
- traceml.init(...) starts the TraceML runtime from inside user code
- traceml.init(...) connects to a running aggregator over TCP
- TCP host/port are configurable
- If the aggregator is missing, initialization fails clearly after bounded retries
- Disabled tracing remains a no-op
- traceml run remains available for backward compatibility
- No UI/rendering changes in this issue