[FEATURE] llamacpp-router runtime for multi-model InferenceService

## Feature Description

Add a new `runtime: llamacpp-router` option on `InferenceService.spec`
that lets a single `llama-server` Pod host **multiple** models via
llama.cpp's built-in router mode, swapping which model is resident in
VRAM on demand. Today an InferenceService is 1:1 with a Model and 1:1
with a Pod; this would let one InferenceService advertise N models
behind one Service endpoint, with the upstream `llama-server` handling
the load / unload / dispatch cycle.

This is purely additive. The existing single-model runtime path stays
unchanged. The ModelRouter CRD also stays unchanged: it would now front
a mix of single-model and multi-model InferenceServices transparently.

## Problem Statement

As a cluster operator with a multi-site / edge fleet (NVIDIA L4 nodes
at manufacturing sites, Mac mini, similar low-VRAM hardware), I want
to host several distinct models on the same node without paying the
"one Pod per model" cost in GPU memory, so that I can offer my users
choice of model on hardware that physically can't hold them all in
VRAM simultaneously.

Concrete scenarios:

- An L4 site needs a planner model during the day, a code-completion
  model in dev hours, a summarization model at end-of-day. Three Pods
  fighting for one 24 GB L4 is wasteful. One Pod with router mode is
  the right shape.
- A 24 GB Mac mini cannot hold two ~13B models in unified memory at
  the same time but can swap between them in ~5-15s.
- Dev clusters where someone wants to A/B between two quantizations
  of the same base model without doubling the Pod footprint.

## Proposed Solution

Add `spec.runtime: llamacpp-router` as an enum value on InferenceService,
with `spec.models[]` accepting a list of Model refs (in addition to
the existing `spec.modelRef` for the single-model case). Mutual
exclusion: setting `runtime: llamacpp-router` requires `models[]`
and forbids `modelRef`; the default `runtime` value (or any other
value) requires `modelRef` and forbids `models[]`.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: edge-multi-model
spec:
  runtime: llamacpp-router
  models:
    - name: phi4-mini
    - name: qwen3-7b-coder
    - name: gemma3-4b-summarizer
  resources:
    requests:
      nvidia.com/gpu: 1
  # one Service, one endpoint; clients pick the model via the OAI
  # `model` field on /v1/chat/completions just like today.
```

Under the hood, the controller renders a llama-server Deployment with:

```sh
llama-server \
  --models-dir /models \
  --host 0.0.0.0 --port 8080 \
  --metrics
```

(or `--models-preset` from a templated ConfigMap; see the upstream
docs link below). Each referenced Model's GGUF is fetched by the
existing download Job machinery into the shared `/models` PVC; the
metal-agent / GPU-Operator path stays unchanged otherwise.

`status.endpoint` continues to point at `/v1/chat/completions`. A
new `status.advertisedModels []string` field surfaces the model
names a client can pick from. The three-probe pattern (startup /
liveness / readiness) maps onto llama-server's existing health
endpoint; readiness should reflect "at least one model loaded".

## Alternatives Considered

- **Status quo**: deploy one InferenceService per model. Works on
  beefy nodes; doesn't fit edge / Mac mini hardware tiers.
- **Replace ModelRouter with llama.cpp router mode**: rejected.
  ModelRouter solves a different problem (horizontal scale-out across
  Pods, multi-node, LiteLLM cloud fallback, per-model HPA). Router
  mode is single-Pod, single-node, sequential. They are complementary,
  not substitutional.
- **Build swap logic inside the metal-agent**: feasible but reinvents
  what upstream now ships. We'd own a worse copy.
- **Use Ollama**: rejected on the same dependency-cone grounds that
  motivated the current llama.cpp / mlx-server choice.

## Additional Context

- Upstream feature landed in `ggml-org/llama.cpp` over the last few
  months. Flag set: `--models-preset`, `--models-dir`. Multi-process
  worker architecture; one model resident in VRAM per worker at a
  time.
- Hugging Face blog: <https://huggingface.co/blog/ggml-org/model-management-in-llamacpp>
- Official server README: <https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md>
- Third-party walkthrough: <https://www.glukhov.org/llm-hosting/llama-cpp/llama-server-router-mode/>
- Multi-model hosting post: <https://soypetetech.substack.com/p/from-one-model-to-many-hosting-multiple>

Constraints worth flagging in the implementation issue:

- **Cold-swap latency** is ~5-30s on the first request after a model
  change. The InferenceService docs should call this out so users
  pick the right shape (parallel residency vs swap on demand).
- **Metal path**: this is `llama.cpp` specific. The Metal runtime
  is `mlx-server`. A symmetric "router mode for MLX" would be a
  separate issue against the mlx-server project; the InferenceService
  CRD shape proposed here is runtime-agnostic and would accept an
  `mlx-router` value later without a schema change.
- **Per-model HPA disappears** for router-mode InferenceServices.
  The HPA scales the whole Pod, not per model. Document this so
  users don't expect the M3 autoscaling tutorial's pattern to apply.

## Priority

- [x] Medium - Nice to have

## Willingness to Contribute

- [x] Yes, I can submit a PR


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] llamacpp-router runtime for multi-model InferenceService #516

Feature Description

Problem Statement

Proposed Solution

Alternatives Considered

Additional Context

Priority

Willingness to Contribute

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[FEATURE] llamacpp-router runtime for multi-model InferenceService #516

Description

Feature Description

Problem Statement

Proposed Solution

Alternatives Considered

Additional Context

Priority

Willingness to Contribute

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions