Feature Description
Make metal-agent honor InferenceService.spec.runtime per-CR
instead of pinning all served models to its global --runtime flag.
A single metal-agent should be able to host a llama-server GGUF and
an mlx-server MLX model concurrently, selected by the ISvc CR.
Problem Statement
Today metal-agent takes a single --runtime flag and that's it:
- cmd/metal-agent/main.go:217 sets cfg.Runtime from the flag,
defaulting to llama-server.
- The dispatch switches at pkg/agent/agent.go:294 and :544 key off
a.config.Runtime (the agent-global value), never isvc.Spec.Runtime.
- isvc.Spec.Runtime is read at pkg/agent/agent.go:1314 but is only
propagated forward for telemetry / status; the executor selection
ignores it.
Operational consequence: if you want to serve GGUF (Carnice, phi-4-mini,
Qwen3 GGUFs) AND MLX-format models (Qwen3.6-35B MLX, future MLX
Phi-class) from the same Apple Silicon node, you need two metal-agents
on different ports, or you flip the launchd unit's --runtime flag
each time you switch model families and bounce the agent.
Surfaced 2026-05-24 during Foreman V3 demo prep: the M5 Max was
configured --runtime mlx-server for the Swift mlx-server dogfood;
that prevented the same agent from serving the locked Carnice GGUF
coder model V3 calls for. Workaround tonight: flip M5 Max back to
--runtime llama-server, accept that mlx-server work pauses.
Proposed Solution
- Runtime dispatch becomes per-ISvc. agent.go:294/544 switch
on isvc.Spec.Runtime instead of a.config.Runtime. The agent
maintains a registry of runtime executors keyed by runtime name.
- --runtime flag stays for back-compat as the default: when
isvc.Spec.Runtime == "", fall back to cfg.Runtime. Existing CRs
with no spec.runtime keep working.
- Capability advertisement extends to multi-runtime: the metal-agent
advertises runtimes: [llama-server, mlx-server, ...] in its
FleetNode capability instead of a single runtime: <one> label.
The scheduler / operator can pre-filter ISvc -> node matches by
runtime support.
- Per-runtime binary flags stay agent-global (--mlx-server-bin,
--llama-server-bin, etc.). The CR picks the runtime; the agent
knows where the binaries live.
Concrete change shape
- pkg/agent/agent.go: new Runtimes field on the per-ISvc executor
selection path; helper resolveRuntime(isvc) string does the
defaulting.
- cmd/metal-agent/main.go: keep current flags; new optional flag
--runtimes llama-server,mlx-server enables multi-runtime mode (or
derive from which binary paths are populated).
- pkg/agent/agent.go:1314: stop being telemetry-only; feed
isvc.Spec.Runtime into the executor lookup.
- Tests: extend agent_test.go table to cover empty spec.runtime
(fall back to cfg.Runtime) and per-ISvc override.
- Doc / CHANGELOG mention.
Out of scope for this issue
- New runtimes (vllm, sglang, etc.). This is purely about respecting
the existing spec.runtime field at dispatch time.
- Per-ISvc binary path overrides (e.g. one ISvc using a custom
mlx-server build). v0.2 problem.
Additional Context
about a single metal-agent gracefully hosting heterogeneous workloads.
but does not scale to a multi-engineer fleet where each node runs
one agent serving multiple model families.
Priority
(Bumps to High once we want a single Apple Silicon node serving
GGUF + MLX simultaneously, which is the natural next step for the
Foreman fleet story.)
Willingness to Contribute