[FEATURE] metal-agent: honor InferenceService.spec.runtime per-CR (multi-runtime on one agent) #525

@Defilan

Feature Description

Make metal-agent honor InferenceService.spec.runtime per-CR
instead of pinning all served models to its global --runtime flag.
A single metal-agent should be able to host a llama-server GGUF and
an mlx-server MLX model concurrently, selected by the ISvc CR.

Problem Statement

Today metal-agent takes a single --runtime flag and that's it:

  • cmd/metal-agent/main.go:217 sets cfg.Runtime from the flag,
    defaulting to llama-server.
  • The dispatch switches at pkg/agent/agent.go:294 and :544 key off
    a.config.Runtime (the agent-global value), never isvc.Spec.Runtime.
  • isvc.Spec.Runtime is read at pkg/agent/agent.go:1314 but is only
    propagated forward for telemetry / status; the executor selection
    ignores it.

Operational consequence: if you want to serve GGUF (Carnice, phi-4-mini,
Qwen3 GGUFs) AND MLX-format models (Qwen3.6-35B MLX, future MLX
Phi-class) from the same Apple Silicon node, you need two metal-agents
on different ports, or you flip the launchd unit's --runtime flag
each time you switch model families and bounce the agent.

Surfaced 2026-05-24 during Foreman V3 demo prep: the M5 Max was
configured --runtime mlx-server for the Swift mlx-server dogfood;
that prevented the same agent from serving the locked Carnice GGUF
coder model V3 calls for. Workaround tonight: flip M5 Max back to
--runtime llama-server, accept that mlx-server work pauses.

Proposed Solution

  1. Runtime dispatch becomes per-ISvc. agent.go:294/544 switch
    on isvc.Spec.Runtime instead of a.config.Runtime. The agent
    maintains a registry of runtime executors keyed by runtime name.
  2. --runtime flag stays for back-compat as the default: when
    isvc.Spec.Runtime == "", fall back to cfg.Runtime. Existing CRs
    with no spec.runtime keep working.
  3. Capability advertisement extends to multi-runtime: the
    metal-agent advertises runtimes: [llama-server, mlx-server, ...]
    in its FleetNode capability instead of a single runtime: <one>
    label. The scheduler / operator can pre-filter ISvc -> node
    matches by runtime support.
  4. Per-runtime binary flags stay agent-global (--mlx-server-bin,
    --llama-server-bin, etc.). The CR picks the runtime; the agent
    knows where the binaries live.
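
A minimal sketch of points 1 to 3, to make the intended dispatch concrete. Everything except resolveRuntime, cfg.Runtime and isvc.Spec.Runtime is hypothetical: the Executor interface, binExecutor, executorFor, Runtimes() and the stand-in InferenceService / Config types are invented for illustration and do not mirror the real pkg/agent code.

```go
package main

import (
	"fmt"
	"sort"
)

// Stand-ins for the real CR and agent config; only the fields this
// issue touches are modeled (assumption, not the actual types).
type InferenceServiceSpec struct{ Runtime string }

type InferenceService struct {
	Name string
	Spec InferenceServiceSpec
}

type Config struct{ Runtime string }

// Executor is a hypothetical per-runtime launcher interface.
type Executor interface {
	Start(isvc InferenceService) error
}

// binExecutor is a toy executor that only reports what it would run.
type binExecutor struct{ bin string }

func (e binExecutor) Start(isvc InferenceService) error {
	fmt.Printf("starting %s via %s\n", isvc.Name, e.bin)
	return nil
}

type Agent struct {
	config    Config
	executors map[string]Executor // registry keyed by runtime name
}

// resolveRuntime prefers the per-CR value and falls back to the
// agent-global --runtime default when spec.runtime is empty.
func (a *Agent) resolveRuntime(isvc InferenceService) string {
	if isvc.Spec.Runtime != "" {
		return isvc.Spec.Runtime
	}
	return a.config.Runtime
}

// executorFor looks up the executor for the resolved runtime and
// returns an error when this agent does not support it.
func (a *Agent) executorFor(isvc InferenceService) (Executor, error) {
	name := a.resolveRuntime(isvc)
	exec, ok := a.executors[name]
	if !ok {
		return nil, fmt.Errorf("runtime %q not supported by this agent (have %v)", name, a.Runtimes())
	}
	return exec, nil
}

// Runtimes returns the sorted runtime names the agent could advertise.
func (a *Agent) Runtimes() []string {
	names := make([]string, 0, len(a.executors))
	for n := range a.executors {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	a := &Agent{
		config: Config{Runtime: "llama-server"},
		executors: map[string]Executor{
			"llama-server": binExecutor{bin: "/usr/local/bin/llama-server"},
			"mlx-server":   binExecutor{bin: "/usr/local/bin/mlx-server"},
		},
	}
	services := []InferenceService{
		{Name: "carnice-gguf"}, // empty spec.runtime, uses the default
		{Name: "qwen-mlx", Spec: InferenceServiceSpec{Runtime: "mlx-server"}},
		{Name: "future", Spec: InferenceServiceSpec{Runtime: "vllm"}},
	}
	for _, isvc := range services {
		exec, err := a.executorFor(isvc)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		_ = exec.Start(isvc)
	}
	fmt.Println("advertised runtimes:", a.Runtimes())
}
```

In this shape an unknown runtime surfaces as an explicit error rather than silently falling back to the default, which seems like the safer behavior once the scheduler pre-filters on advertised runtimes.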

Concrete change shape

  • pkg/agent/agent.go: new Runtimes field on the per-ISvc executor
    selection path; helper resolveRuntime(isvc) string does the
    defaulting.
  • cmd/metal-agent/main.go: keep current flags; new optional flag
    --runtimes llama-server,mlx-server enables multi-runtime mode (or
    derive from which binary paths are populated).
  • pkg/agent/agent.go:1314: stop being telemetry-only; feed
    isvc.Spec.Runtime into the executor lookup.
  • Tests: extend agent_test.go table to cover empty spec.runtime
    (fall back to cfg.Runtime) and per-ISvc override.
  • Doc / CHANGELOG mention.
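
One possible shape for the flag handling in cmd/metal-agent/main.go, as a sketch only. --runtimes is the proposed (not yet existing) flag; parseRuntimes and enabledRuntimes are hypothetical helpers, and the precedence shown (explicit list, then populated binary paths, then the single --runtime default) is an assumption about how the two options in the bullet above could be combined.

```go
package main

import (
	"flag"
	"fmt"
	"sort"
	"strings"
)

// parseRuntimes splits a comma-separated --runtimes value, trimming
// whitespace and dropping empty entries and duplicates, keeping order.
func parseRuntimes(csv string) []string {
	seen := map[string]bool{}
	var out []string
	for _, part := range strings.Split(csv, ",") {
		name := strings.TrimSpace(part)
		if name == "" || seen[name] {
			continue
		}
		seen[name] = true
		out = append(out, name)
	}
	return out
}

// enabledRuntimes picks the runtime set: an explicit --runtimes list
// wins; otherwise runtimes whose binary path is populated; otherwise
// just the single --runtime default (today's behavior).
func enabledRuntimes(runtimesFlag, defaultRuntime string, bins map[string]string) []string {
	if list := parseRuntimes(runtimesFlag); len(list) > 0 {
		return list
	}
	var derived []string
	for name, path := range bins {
		if path != "" {
			derived = append(derived, name)
		}
	}
	sort.Strings(derived) // map iteration order is random; sort for stable output
	if len(derived) > 0 {
		return derived
	}
	return []string{defaultRuntime}
}

func main() {
	fs := flag.NewFlagSet("metal-agent", flag.ContinueOnError)
	runtime := fs.String("runtime", "llama-server", "default runtime when spec.runtime is empty")
	runtimes := fs.String("runtimes", "", "comma-separated runtimes to enable (proposed flag)")
	llamaBin := fs.String("llama-server-bin", "", "path to the llama-server binary")
	mlxBin := fs.String("mlx-server-bin", "", "path to the mlx-server binary")

	// Example invocation; a real main would pass os.Args[1:].
	if err := fs.Parse([]string{"--runtimes", "llama-server, mlx-server"}); err != nil {
		fmt.Println("flag error:", err)
		return
	}
	bins := map[string]string{"llama-server": *llamaBin, "mlx-server": *mlxBin}
	fmt.Println(enabledRuntimes(*runtimes, *runtime, bins))
}
```

Whether the --runtime default should be implicitly added to an explicit --runtimes list that omits it is left open here; the sketch does not add it.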

Out of scope for this issue

  • New runtimes (vllm, sglang, etc.). This is purely about respecting
    the existing spec.runtime field at dispatch time.
  • Per-ISvc binary path overrides (e.g. one ISvc using a custom
    mlx-server build). v0.2 problem.

Additional Context

Priority

  • Medium - Nice to have

(Bumps to High once we want a single Apple Silicon node serving
GGUF + MLX simultaneously, which is the natural next step for the
Foreman fleet story.)

Willingness to Contribute

  • Yes, I can submit a PR

Metadata

Assignees

No one assigned

Labels

enhancement: New feature or request