Description:
Add a dedicated aggregator service command, traceml serve, so the aggregator can run independently of training. Also extend traceml run, traceml watch, and traceml deep with a --serve flag that defaults to False.
When --serve=False, TraceML should keep the current behavior and launch the embedded aggregator as it does today. When --serve=True, TraceML should skip the embedded aggregator and instead connect to an already running standalone aggregator at the address given by serve_ip and serve_port, both of which should have sensible defaults.
The goal is to separate the long-running aggregator lifecycle from training execution while preserving backward compatibility. The standalone path should reuse the existing TCP transport, and the rank-0 fallback should remain available for simple runs.
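To make the intended flag semantics concrete, here is a minimal sketch of how the CLI surface described above might be wired with argparse. It is an illustration under assumptions, not TraceML's actual parser: the helper names (build_parser, aggregator_mode), the --serve-ip / --serve-port spellings, the default values, and the use of a plain --serve switch (rather than --serve=True) are all hypothetical, and the real commands take additional arguments that are omitted here.

```python
import argparse

# Hypothetical defaults: the issue only asks for "sensible" values.
DEFAULT_SERVE_IP = "127.0.0.1"
DEFAULT_SERVE_PORT = 29765


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a `serve` command and a --serve switch."""
    parser = argparse.ArgumentParser(prog="traceml")
    sub = parser.add_subparsers(dest="command", required=True)

    # Standalone aggregator: `traceml serve`.
    serve = sub.add_parser("serve")
    serve.add_argument("--serve-ip", default=DEFAULT_SERVE_IP)
    serve.add_argument("--serve-port", type=int, default=DEFAULT_SERVE_PORT)

    # Training-side commands gain --serve, which defaults to False.
    for name in ("run", "watch", "deep"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--serve", action="store_true", default=False)
        cmd.add_argument("--serve-ip", default=DEFAULT_SERVE_IP)
        cmd.add_argument("--serve-port", type=int, default=DEFAULT_SERVE_PORT)

    return parser


def aggregator_mode(args: argparse.Namespace) -> str:
    """Decide which aggregator path a parsed command line selects."""
    if args.command == "serve":
        return "standalone"  # this process is the aggregator
    if args.serve:
        return "external"    # connect to serve_ip:serve_port, no embedded start
    return "embedded"        # current behavior: launch the embedded aggregator


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(aggregator_mode(args))
```

Keeping the mode decision in one small function means the embedded path stays the default and the existing behavior is untouched unless --serve is passed explicitly.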
Acceptance Criteria:
- traceml serve starts the aggregator as a standalone process
- traceml run, traceml watch, and traceml deep support --serve
- --serve=False keeps current embedded aggregator behavior
- --serve=True skips embedded aggregator startup and connects via serve_ip and serve_port
- Default serve_ip and serve_port values are provided
- Rank-0 fallback remains available for compatibility
- No changes to rendering behavior in this issue