Standalone aggregator service + launcher support #83

@abhinavsriva

Description:

Add a dedicated aggregator service command, traceml serve, so the aggregator can run independently of training. Also extend traceml run, traceml watch, and traceml deep with a --serve flag that defaults to False.

When --serve=False, TraceML should keep the current behavior and launch the embedded aggregator as it does today. When --serve=True, TraceML should not launch the embedded aggregator and instead connect to an existing standalone aggregator using serve_ip and serve_port, with sensible default values.
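For discussion, here is a rough sketch of how the CLI surface could look with argparse. Everything beyond the names in this issue is an assumption: the `--serve-ip` / `--serve-port` flag spellings, the default address and port, and the `build_parser` / `parse_bool` helpers are hypothetical, and the existing arguments of `run`, `watch`, and `deep` are left out. It does not reflect how TraceML's CLI is actually wired today.

```python
import argparse

# Hypothetical defaults: the issue only asks that sensible defaults exist.
DEFAULT_SERVE_IP = "127.0.0.1"
DEFAULT_SERVE_PORT = 29765


def parse_bool(value: str) -> bool:
    """Accept --serve=True / --serve=False style values."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def add_endpoint_args(cmd: argparse.ArgumentParser) -> None:
    """Attach the (hypothetically named) aggregator endpoint flags."""
    cmd.add_argument("--serve-ip", dest="serve_ip", default=DEFAULT_SERVE_IP)
    cmd.add_argument(
        "--serve-port", dest="serve_port", type=int, default=DEFAULT_SERVE_PORT
    )


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="traceml")
    sub = parser.add_subparsers(dest="command", required=True)

    # Standalone aggregator: binds to the given endpoint.
    add_endpoint_args(sub.add_parser("serve"))

    # Launchers: --serve defaults to False, which keeps the embedded aggregator.
    for name in ("run", "watch", "deep"):
        cmd = sub.add_parser(name)
        cmd.add_argument(
            "--serve", type=parse_bool, nargs="?", const=True, default=False
        )
        add_endpoint_args(cmd)
    return parser
```

With this shape, `traceml run` leaves `serve` as False, and `traceml run --serve=True --serve-port 9000` opts in to the standalone aggregator. A bare `--serve` is treated as True via `const=True`; note that with `nargs="?"` a positional argument placed directly after a bare `--serve` would be consumed as its value, so the `--serve=True` form is safer.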

The goal is to separate the long-running aggregator lifecycle from training execution while preserving backward compatibility. The standalone service should use the existing TCP transport, and the rank-0 fallback should remain available for simple runs.

Acceptance Criteria:

  • traceml serve starts the aggregator as a standalone process
  • traceml run, traceml watch, and traceml deep support --serve
  • --serve=False keeps current embedded aggregator behavior
  • --serve=True skips embedded aggregator startup and connects via serve_ip and serve_port
  • Default serve_ip and serve_port values are provided
  • Rank-0 fallback remains available for compatibility
  • No changes to rendering behavior in this issue
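The launcher-side branching implied by the criteria above could be sketched roughly as follows. The names `AggregatorPlan`, `plan_aggregator`, and `is_reachable` are hypothetical and only illustrate the decision; the reachability probe is an optional extra that is not requested in this issue, and it uses a plain TCP connect rather than anything specific to TraceML's transport.

```python
import socket
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AggregatorPlan:
    """What the launcher should do about the aggregator for this run."""

    launch_embedded: bool
    endpoint: Optional[Tuple[str, int]]


def plan_aggregator(serve: bool, serve_ip: str, serve_port: int) -> AggregatorPlan:
    # --serve=True: skip the embedded aggregator and target the standalone one.
    if serve:
        return AggregatorPlan(launch_embedded=False, endpoint=(serve_ip, serve_port))
    # --serve=False: keep the current behavior (embedded aggregator).
    return AggregatorPlan(launch_embedded=True, endpoint=None)


def is_reachable(endpoint: Tuple[str, int], timeout: float = 1.0) -> bool:
    """Best-effort check that something accepts TCP connections at endpoint."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False
```

A launcher could call `plan_aggregator(...)` once at startup and, when `launch_embedded` is False, use `is_reachable(plan.endpoint)` to fail fast with a clear message if `traceml serve` is not running.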

Labels: enhancement