Feature Description
Add a new runtime: llamacpp-router option on InferenceService.spec
that lets a single llama-server Pod host multiple models via
llama.cpp's built-in router mode, swapping which model is resident in
VRAM on demand. Today an InferenceService is 1:1 with a Model and 1:1
with a Pod; this would let one InferenceService advertise N models
behind one Service endpoint, with the upstream llama-server handling
the load / unload / dispatch cycle.
This is purely additive. The existing single-model runtime path stays
unchanged. The ModelRouter CRD also stays unchanged: it would now front
a mix of single-model and multi-model InferenceServices transparently.
Problem Statement
As a cluster operator with a multi-site / edge fleet (NVIDIA L4 nodes
at manufacturing sites, Mac mini, similar low-VRAM hardware), I want
to host several distinct models on the same node without paying the
"one Pod per model" cost in GPU memory, so that I can offer my users
choice of model on hardware that physically can't hold them all in
VRAM simultaneously.
Concrete scenarios:
- An L4 site needs a planner model during the day, a code-completion
model in dev hours, a summarization model at end-of-day. Three Pods
fighting for one 24 GB L4 is wasteful. One Pod with router mode is
the right shape.
- A 24 GB Mac mini cannot hold two ~13B models in unified memory at
the same time but can swap between them in ~5-15s.
- Dev clusters where someone wants to A/B between two quantizations
of the same base model without doubling the Pod footprint.
Proposed Solution
Add spec.runtime: llamacpp-router as an enum value on InferenceService,
with spec.models[] accepting a list of Model refs (in addition to
the existing spec.modelRef for the single-model case). Mutual
exclusion: setting runtime: llamacpp-router requires models[]
and forbids modelRef; the default runtime value (or any other
value) requires modelRef and forbids models[].
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
name: edge-multi-model
spec:
runtime: llamacpp-router
models:
- name: phi4-mini
- name: qwen3-7b-coder
- name: gemma3-4b-summarizer
resources:
requests:
nvidia.com/gpu: 1
# one Service, one endpoint; clients pick the model via the OAI
# `model` field on /v1/chat/completions just like today.
Under the hood, the controller renders a llama-server Deployment with:
llama-server \
--models-dir /models \
--host 0.0.0.0 --port 8080 \
--metrics
(or --models-preset from a templated ConfigMap; see the upstream
docs link below). Each referenced Model's GGUF is fetched by the
existing download Job machinery into the shared /models PVC; the
metal-agent / GPU-Operator path stays unchanged otherwise.
status.endpoint continues to point at /v1/chat/completions. A
new status.advertisedModels []string field surfaces the model
names a client can pick from. The three-probe pattern (startup /
liveness / readiness) maps onto llama-server's existing health
endpoint; readiness should reflect "at least one model loaded".
Alternatives Considered
- Status quo: deploy one InferenceService per model. Works on
beefy nodes; doesn't fit edge / Mac mini hardware tiers.
- Replace ModelRouter with llama.cpp router mode: rejected.
ModelRouter solves a different problem (horizontal scale-out across
Pods, multi-node, LiteLLM cloud fallback, per-model HPA). Router
mode is single-Pod, single-node, sequential. They are complementary,
not substitutional.
- Build swap logic inside the metal-agent: feasible but reinvents
what upstream now ships. We'd own a worse copy.
- Use Ollama: rejected on the same dependency-cone grounds that
motivated the current llama.cpp / mlx-server choice.
Additional Context
Constraints worth flagging in the implementation issue:
- Cold-swap latency is ~5-30s on the first request after a model
change. The InferenceService docs should call this out so users
pick the right shape (parallel residency vs swap on demand).
- Metal path: this is
llama.cpp specific. The Metal runtime
is mlx-server. A symmetric "router mode for MLX" would be a
separate issue against the mlx-server project; the InferenceService
CRD shape proposed here is runtime-agnostic and would accept an
mlx-router value later without a schema change.
- Per-model HPA disappears for router-mode InferenceServices.
The HPA scales the whole Pod, not per model. Document this so
users don't expect the M3 autoscaling tutorial's pattern to apply.
Priority
Willingness to Contribute
Feature Description
Add a new
runtime: llamacpp-routeroption onInferenceService.specthat lets a single
llama-serverPod host multiple models viallama.cpp's built-in router mode, swapping which model is resident in
VRAM on demand. Today an InferenceService is 1:1 with a Model and 1:1
with a Pod; this would let one InferenceService advertise N models
behind one Service endpoint, with the upstream
llama-serverhandlingthe load / unload / dispatch cycle.
This is purely additive. The existing single-model runtime path stays
unchanged. The ModelRouter CRD also stays unchanged: it would now front
a mix of single-model and multi-model InferenceServices transparently.
Problem Statement
As a cluster operator with a multi-site / edge fleet (NVIDIA L4 nodes
at manufacturing sites, Mac mini, similar low-VRAM hardware), I want
to host several distinct models on the same node without paying the
"one Pod per model" cost in GPU memory, so that I can offer my users
choice of model on hardware that physically can't hold them all in
VRAM simultaneously.
Concrete scenarios:
model in dev hours, a summarization model at end-of-day. Three Pods
fighting for one 24 GB L4 is wasteful. One Pod with router mode is
the right shape.
the same time but can swap between them in ~5-15s.
of the same base model without doubling the Pod footprint.
Proposed Solution
Add
spec.runtime: llamacpp-routeras an enum value on InferenceService,with
spec.models[]accepting a list of Model refs (in addition tothe existing
spec.modelReffor the single-model case). Mutualexclusion: setting
runtime: llamacpp-routerrequiresmodels[]and forbids
modelRef; the defaultruntimevalue (or any othervalue) requires
modelRefand forbidsmodels[].Under the hood, the controller renders a llama-server Deployment with:
(or
--models-presetfrom a templated ConfigMap; see the upstreamdocs link below). Each referenced Model's GGUF is fetched by the
existing download Job machinery into the shared
/modelsPVC; themetal-agent / GPU-Operator path stays unchanged otherwise.
status.endpointcontinues to point at/v1/chat/completions. Anew
status.advertisedModels []stringfield surfaces the modelnames a client can pick from. The three-probe pattern (startup /
liveness / readiness) maps onto llama-server's existing health
endpoint; readiness should reflect "at least one model loaded".
Alternatives Considered
beefy nodes; doesn't fit edge / Mac mini hardware tiers.
ModelRouter solves a different problem (horizontal scale-out across
Pods, multi-node, LiteLLM cloud fallback, per-model HPA). Router
mode is single-Pod, single-node, sequential. They are complementary,
not substitutional.
what upstream now ships. We'd own a worse copy.
motivated the current llama.cpp / mlx-server choice.
Additional Context
ggml-org/llama.cppover the last fewmonths. Flag set:
--models-preset,--models-dir. Multi-processworker architecture; one model resident in VRAM per worker at a
time.
Constraints worth flagging in the implementation issue:
change. The InferenceService docs should call this out so users
pick the right shape (parallel residency vs swap on demand).
llama.cppspecific. The Metal runtimeis
mlx-server. A symmetric "router mode for MLX" would be aseparate issue against the mlx-server project; the InferenceService
CRD shape proposed here is runtime-agnostic and would accept an
mlx-routervalue later without a schema change.The HPA scales the whole Pod, not per model. Document this so
users don't expect the M3 autoscaling tutorial's pattern to apply.
Priority
Willingness to Contribute