[FEATURE] InferenceService.spec.speculativeDecoding for llama.cpp MTP / draft-model flags #502

@Defilan

Feature Description

Expose llama.cpp's speculative decoding flags (in particular Multi-Token
Prediction / MTP) as first-class fields on InferenceService.spec so users
can turn on speculative decoding through Kubernetes config rather than
passing custom flags to llama-server out-of-band.

Problem Statement

llama.cpp mainline shipped MTP support in mid-May 2026. On compatible models
(Qwen 3.6 family, DeepSeek V4 Flash, MiniMax with MTP heads, others), two
llama-server flags produce a 50-80% decode throughput gain:

--spec-type draft-mtp --spec-draft-n-max 2

These flags are not exposed on InferenceService.spec today. Operators have
three options, all bad:

  1. Edit metal-agent flags to pass extra args (applies to all
    InferenceServices, not per-IS).
  2. Fork llama-server with the flags baked in (loses flexibility).
  3. Skip the optimization (forgoes the 50-80% decode throughput gain).

Proposed Solution

Add an optional speculativeDecoding block to InferenceService.spec. The
metal-agent / runtime executor reads it and appends the corresponding
llama-server flags when launching the runtime.

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen36-35b-carnice-mtp
spec:
  modelRef: qwen36-35b-carnice-mtp
  runtime: llamacpp
  speculativeDecoding:
    # Self-speculative decoding via the model's own MTP heads. Requires
    # a model that ships MTP heads (e.g. the Carnice APEX-MTP family).
    type: mtp                 # one of: mtp | draft | disabled (default: disabled)
    nDraftMax: 2              # forwarded to --spec-draft-n-max; range [1,8]
    # Future, when separate draft-model support lands:
    # draftModelRef: { name: qwen36-3b-draft }   # used only when type: draft

The metal-agent translates this into:

--spec-type draft-mtp --spec-draft-n-max 2

For runtimes that don't support speculative decoding (mlx-server and
vllm-swift in their current state), the reconciler should reject the
spec at admission with a clear error rather than silently ignoring the
field. Silently ignoring it is a correctness footgun.
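
A minimal Python sketch of the translation and the reject-don't-ignore rule described above. The real executor may be written in another language and is not shown in this issue; the function name, the error class, and the assumption that only the llamacpp runtime supports these flags are illustrative. Only type: mtp is mapped, since draft is listed as future work.

```python
from typing import Any

# Assumption: llamacpp is the only runtime that understands these flags today.
SPEC_DECODING_RUNTIMES = {"llamacpp"}

# Proposed spec.speculativeDecoding.type value -> llama-server --spec-type value.
# "draft" is deliberately absent: separate draft-model support is future work.
SPEC_TYPE_FLAG = {"mtp": "draft-mtp"}


class UnsupportedSpecError(ValueError):
    """Raised instead of silently dropping the speculativeDecoding block."""


def speculative_decoding_args(spec: dict[str, Any]) -> list[str]:
    """Return the extra llama-server args for an InferenceService spec (hypothetical helper)."""
    block = spec.get("speculativeDecoding") or {}
    spec_type = block.get("type", "disabled")

    # Omitted or disabled: add nothing, so the launch command is unchanged.
    if spec_type == "disabled":
        return []

    runtime = spec.get("runtime")
    if runtime not in SPEC_DECODING_RUNTIMES:
        raise UnsupportedSpecError(
            f"runtime {runtime!r} does not support speculativeDecoding"
        )
    if spec_type not in SPEC_TYPE_FLAG:
        raise UnsupportedSpecError(
            f"speculativeDecoding.type {spec_type!r} is not supported yet"
        )

    args = ["--spec-type", SPEC_TYPE_FLAG[spec_type]]
    if "nDraftMax" in block:
        args += ["--spec-draft-n-max", str(block["nDraftMax"])]
    return args
```

Called with the example spec above (runtime: llamacpp, type: mtp, nDraftMax: 2), this returns the four tokens of the flag line shown earlier; with runtime: mlx-server and type: mtp it raises instead of returning an empty list.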

Validation

  • type: mtp requires the model to advertise MTP support. Detect via either:
    1. GGUF metadata reports MTP heads (preferred); reconciler checks
      Model.status.gguf.hasMTP (new field on existing GGUF status).
    2. Annotations or labels on the Model claim MTP support. Weaker,
      since nothing verifies the claim.
  • nDraftMax constrained to [1, 8].
  • type: disabled or the whole field omitted preserves current behavior
    byte-identically.
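
The validation rules above could be checked roughly as follows. This is a sketch under stated assumptions, not the project's validator: the dataclass, the function name, and the error strings are hypothetical, and model_has_mtp stands in for the proposed Model.status.gguf.hasMTP field. It also assumes that a disabled or omitted block skips all further checks.

```python
from dataclasses import dataclass
from typing import Optional

N_DRAFT_MIN, N_DRAFT_MAX = 1, 8          # range stated in the proposal
VALID_TYPES = ("mtp", "draft", "disabled")


@dataclass
class SpeculativeDecoding:
    """Hypothetical in-memory form of spec.speculativeDecoding."""
    type: str = "disabled"
    n_draft_max: Optional[int] = None


def validate(sd: Optional[SpeculativeDecoding], model_has_mtp: bool) -> list[str]:
    """Return validation errors; an empty list means the spec is accepted."""
    # Omitted or disabled: nothing to validate, current behavior is preserved.
    if sd is None or sd.type == "disabled":
        return []

    errors: list[str] = []
    if sd.type not in VALID_TYPES:
        errors.append(f"type must be one of {', '.join(VALID_TYPES)}")
    # model_has_mtp mirrors the proposed Model.status.gguf.hasMTP field.
    if sd.type == "mtp" and not model_has_mtp:
        errors.append("type: mtp requires a model whose GGUF metadata reports MTP heads")
    if sd.n_draft_max is not None and not (N_DRAFT_MIN <= sd.n_draft_max <= N_DRAFT_MAX):
        errors.append(f"nDraftMax must be in [{N_DRAFT_MIN}, {N_DRAFT_MAX}]")
    return errors
```

Returning every error at once, rather than failing on the first, lets the rejection message list all the problems with a spec in one pass.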

Alternatives Considered

  • Free-form spec.extraArgs for arbitrary llama-server flags. Quick
    win, but no validation, no per-runtime translation, no GGUF capability
    check. Rejected: footgun.
  • Annotation-based (annotations[llmkube.io/spec-type]=draft-mtp).
    Avoids CRD evolution but escapes typed validation. Same footgun.
  • Always-on when MTP detected. Convenient but opaque; benchmarking
    against the baseline requires disabling it explicitly. Could be a
    default-on policy after the field stabilizes.

Additional Context

  • llama.cpp mainline MTP support is the substrate; the community write-ups
    in mid-May 2026 documented --spec-type draft-mtp --spec-draft-n-max 2
    as the activation flags.
  • First user-visible win: the
    mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF
    model family ships APEX-MTP heads embedded in the weights for
    self-speculative decoding.
  • The Foreman v0.1 tracking epic (#500, "[FEATURE] Foreman v0.1: agentic
    fleet orchestrator (epic)") targets Carnice as the coder Agent's model;
    this enhancement is the difference between baseline decode speed and
    roughly +70% for the coder pipeline step.
  • Likely a good first issue: self-contained CRD field + executor branch,
    no controller architecture changes.

Priority

  • Medium - Nice to have (enables a meaningful perf win; not blocking)

Metadata

Assignees

No one assigned

    Labels

    area/performance (Performance optimization and benchmarking)
    enhancement (New feature or request)
