[FEATURE] InferenceService.spec.speculativeDecoding for llama.cpp MTP / draft-model flags #502

## Feature Description

Expose llama.cpp's speculative decoding flags (in particular Multi-Token
Prediction / MTP) as first-class fields on `InferenceService.spec` so users
can turn on speculative decoding through Kubernetes config rather than
passing custom flags to `llama-server` out-of-band.

## Problem Statement

llama.cpp mainline shipped MTP support in mid-May 2026. On compatible models
(Qwen 3.6 family, DeepSeek V4 Flash, MiniMax with MTP heads, others), two
`llama-server` flags produce a 50-80% decode throughput gain:

```
--spec-type draft-mtp --spec-draft-n-max 2
```

These flags are not exposed on `InferenceService.spec` today. Operators have
three options, all bad:

1. Edit `metal-agent` flags to pass extra args (applies to all
   InferenceServices, not per-IS).
2. Fork `llama-server` with the flags baked in (loses flexibility).
3. Skip the optimization (forgoes the 50-80% decode throughput gain).

## Proposed Solution

Add an optional `speculativeDecoding` block to `InferenceService.spec`. The
metal-agent / runtime executor reads it and appends the corresponding
`llama-server` flags when launching the runtime.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen36-35b-carnice-mtp
spec:
  modelRef: qwen36-35b-carnice-mtp
  runtime: llamacpp
  speculativeDecoding:
    # Self-speculative decoding via the model's own MTP heads. Requires
    # a model that ships MTP heads (e.g. the Carnice APEX-MTP family).
    type: mtp                 # one of: mtp | draft | disabled (default: disabled)
    nDraftMax: 2              # forwarded to --spec-draft-n-max; range [1,8]
    # Future, when separate draft-model support lands:
    # draftModelRef: { name: qwen36-3b-draft }   # used only when type: draft
```

The metal-agent translates this into:

```
--spec-type draft-mtp --spec-draft-n-max 2
```
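A minimal sketch of that translation, written in Python for brevity. The function name `speculative_args` and the dict-shaped spec are assumptions for illustration; only the `mtp` to `draft-mtp` mapping and the `--spec-draft-n-max` flag come from this issue. `type: draft` is deliberately left unmapped because separate draft-model support is listed as future work.

```python
from typing import List, Optional

# Hypothetical mapping from the proposed `type` values to --spec-type values.
# Only `mtp` -> `draft-mtp` is taken from this issue.
SPEC_TYPE_FLAG = {"mtp": "draft-mtp"}


def speculative_args(block: Optional[dict]) -> List[str]:
    """Return the extra llama-server args for a speculativeDecoding block."""
    if not block or block.get("type", "disabled") == "disabled":
        return []  # omitted or disabled: launch args stay unchanged
    spec_type = block["type"]
    if spec_type not in SPEC_TYPE_FLAG:
        raise ValueError(f"unsupported speculativeDecoding.type: {spec_type!r}")
    args = ["--spec-type", SPEC_TYPE_FLAG[spec_type]]
    if "nDraftMax" in block:
        args += ["--spec-draft-n-max", str(block["nDraftMax"])]
    return args
```

Returning an empty list for an omitted or disabled block keeps the launch command unchanged for existing InferenceServices.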

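For runtimes that don't support speculative decoding (`mlx-server`,
`vllm-swift` in their current state), the operator should **reject the spec
at admission** with a clear error rather than silently ignoring the field.
In practice that means a validating webhook or a CRD validation rule, since
the reconcile loop only runs after the object has been stored. Silently
ignoring the field is a correctness footgun: the user believes speculative
decoding is on when it is not.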
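The runtime check could look roughly like the following Python sketch. The function name `check_runtime` and the set `SPECULATIVE_RUNTIMES` are hypothetical; the runtime names are the ones used in this issue, and the assumption is that only `llamacpp` honors the field today.

```python
from typing import List

# Assumption: only the llamacpp runtime honors speculativeDecoding today.
SPECULATIVE_RUNTIMES = frozenset({"llamacpp"})


def check_runtime(spec: dict) -> List[str]:
    """Return admission errors for an InferenceService spec (empty = admit)."""
    block = spec.get("speculativeDecoding") or {}
    if block.get("type", "disabled") == "disabled":
        return []  # nothing requested, so any runtime is acceptable
    runtime = spec.get("runtime")
    if runtime not in SPECULATIVE_RUNTIMES:
        return [
            f"spec.speculativeDecoding: runtime {runtime!r} "
            "does not support speculative decoding"
        ]
    return []
```

Returning an explicit error list, rather than dropping the block, is what surfaces the mismatch to the user at apply time.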

## Validation

- `type: mtp` requires the model to advertise MTP support. Detect via either:
  1. GGUF metadata reports MTP heads (preferred); reconciler checks
     `Model.status.gguf.hasMTP` (new field on existing GGUF status).
  2. Annotations or labels claim MTP support. Weaker, because nothing
     verifies the claim against the weights.
- `nDraftMax` constrained to `[1, 8]`.
- `type: disabled` or the whole field omitted preserves current behavior
  byte-identically.
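Taken together, the rules above could be checked along these lines. This is a Python sketch under stated assumptions: `validate_speculative` is a hypothetical name, and `model_has_mtp` stands in for the proposed `Model.status.gguf.hasMTP` field.

```python
from typing import List, Optional

ALLOWED_TYPES = ("mtp", "draft", "disabled")


def validate_speculative(block: Optional[dict], model_has_mtp: bool) -> List[str]:
    """Return validation errors for a speculativeDecoding block (empty = valid)."""
    if block is None:
        return []  # field omitted: current behavior, nothing to validate
    errors = []
    spec_type = block.get("type", "disabled")
    if spec_type not in ALLOWED_TYPES:
        errors.append(
            f"type must be one of {', '.join(ALLOWED_TYPES)}; got {spec_type!r}"
        )
    n = block.get("nDraftMax")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if n is not None and (
        isinstance(n, bool) or not isinstance(n, int) or not 1 <= n <= 8
    ):
        errors.append(f"nDraftMax must be an integer in [1, 8]; got {n!r}")
    if spec_type == "mtp" and not model_has_mtp:
        errors.append(
            "type: mtp requires a model whose GGUF metadata reports MTP heads"
        )
    return errors
```

The function collects every violation instead of stopping at the first, so a single rejected apply can report all problems at once.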

## Alternatives Considered

- **Free-form `spec.extraArgs`** for arbitrary `llama-server` flags. Quick
  win, but no validation, no per-runtime translation, no GGUF capability
  check. Rejected: footgun.
- **Annotation-based** (`annotations[llmkube.io/spec-type]=draft-mtp`).
  Avoids CRD evolution but escapes typed validation. Same footgun.
- **Always-on when MTP detected.** Convenient but opaque; benchmarking
  against the baseline requires disabling it explicitly. Could be a
  default-on policy after the field stabilizes.

## Additional Context

- llama.cpp mainline MTP support is the substrate; the community write-ups
  in mid-May 2026 documented `--spec-type draft-mtp --spec-draft-n-max 2`
  as the activation flags.
- First user-visible win: the
  [`mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF`](https://hf.co/mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF)
  model family ships APEX-MTP heads embedded in the weights for
  self-speculative decoding.
- The Foreman v0.1 tracking epic (#500) targets Carnice as the coder
  Agent's model; for the coder pipeline step, this enhancement is the
  difference between baseline decode speed and roughly 70% faster decode.
- Likely a good first issue: self-contained CRD field + executor branch,
  no controller architecture changes.

## Priority

- [x] **Medium** - Nice to have (enables a meaningful perf win; not blocking)

