Skip to content

[FEATURE] llamacpp-router runtime for multi-model InferenceService #516

@Defilan

Description

@Defilan

Feature Description

Add a new runtime: llamacpp-router option on InferenceService.spec
that lets a single llama-server Pod host multiple models via
llama.cpp's built-in router mode, swapping which model is resident in
VRAM on demand. Today an InferenceService is 1:1 with a Model and 1:1
with a Pod; this would let one InferenceService advertise N models
behind one Service endpoint, with the upstream llama-server handling
the load / unload / dispatch cycle.

This is purely additive. The existing single-model runtime path stays
unchanged. The ModelRouter CRD also stays unchanged: it would now front
a mix of single-model and multi-model InferenceServices transparently.

Problem Statement

As a cluster operator with a multi-site / edge fleet (NVIDIA L4 nodes
at manufacturing sites, Mac mini, similar low-VRAM hardware), I want
to host several distinct models on the same node without paying the
"one Pod per model" cost in GPU memory, so that I can offer my users
choice of model on hardware that physically can't hold them all in
VRAM simultaneously.

Concrete scenarios:

  • An L4 site needs a planner model during the day, a code-completion
    model in dev hours, a summarization model at end-of-day. Three Pods
    fighting for one 24 GB L4 is wasteful. One Pod with router mode is
    the right shape.
  • A 24 GB Mac mini cannot hold two ~13B models in unified memory at
    the same time but can swap between them in ~5-15s.
  • Dev clusters where someone wants to A/B between two quantizations
    of the same base model without doubling the Pod footprint.

Proposed Solution

Add spec.runtime: llamacpp-router as an enum value on InferenceService,
with spec.models[] accepting a list of Model refs (in addition to
the existing spec.modelRef for the single-model case). Mutual
exclusion: setting runtime: llamacpp-router requires models[]
and forbids modelRef; the default runtime value (or any other
value) requires modelRef and forbids models[].

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: edge-multi-model
spec:
  runtime: llamacpp-router
  models:
    - name: phi4-mini
    - name: qwen3-7b-coder
    - name: gemma3-4b-summarizer
  resources:
    requests:
      nvidia.com/gpu: 1
  # one Service, one endpoint; clients pick the model via the OAI
  # `model` field on /v1/chat/completions just like today.

Under the hood, the controller renders a llama-server Deployment with:

llama-server \
  --models-dir /models \
  --host 0.0.0.0 --port 8080 \
  --metrics

(or --models-preset from a templated ConfigMap; see the upstream
docs link below). Each referenced Model's GGUF is fetched by the
existing download Job machinery into the shared /models PVC; the
metal-agent / GPU-Operator path stays unchanged otherwise.

status.endpoint continues to point at /v1/chat/completions. A
new status.advertisedModels []string field surfaces the model
names a client can pick from. The three-probe pattern (startup /
liveness / readiness) maps onto llama-server's existing health
endpoint; readiness should reflect "at least one model loaded".

Alternatives Considered

  • Status quo: deploy one InferenceService per model. Works on
    beefy nodes; doesn't fit edge / Mac mini hardware tiers.
  • Replace ModelRouter with llama.cpp router mode: rejected.
    ModelRouter solves a different problem (horizontal scale-out across
    Pods, multi-node, LiteLLM cloud fallback, per-model HPA). Router
    mode is single-Pod, single-node, sequential. They are complementary,
    not substitutional.
  • Build swap logic inside the metal-agent: feasible but reinvents
    what upstream now ships. We'd own a worse copy.
  • Use Ollama: rejected on the same dependency-cone grounds that
    motivated the current llama.cpp / mlx-server choice.

Additional Context

Constraints worth flagging in the implementation issue:

  • Cold-swap latency is ~5-30s on the first request after a model
    change. The InferenceService docs should call this out so users
    pick the right shape (parallel residency vs swap on demand).
  • Metal path: this is llama.cpp specific. The Metal runtime
    is mlx-server. A symmetric "router mode for MLX" would be a
    separate issue against the mlx-server project; the InferenceService
    CRD shape proposed here is runtime-agnostic and would accept an
    mlx-router value later without a schema change.
  • Per-model HPA disappears for router-mode InferenceServices.
    The HPA scales the whole Pod, not per model. Document this so
    users don't expect the M3 autoscaling tutorial's pattern to apply.

Priority

  • Medium - Nice to have

Willingness to Contribute

  • Yes, I can submit a PR

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/foremanForeman: the agentic fleet orchestrator add-onenhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions