Feature Description
Expose llama.cpp's speculative decoding flags (in particular Multi-Token
Prediction / MTP) as first-class fields on InferenceService.spec so users
can turn on speculative decoding through Kubernetes config rather than
passing custom flags to llama-server out-of-band.
Problem Statement
llama.cpp mainline shipped MTP support in mid-May 2026. On compatible models
(Qwen 3.6 family, DeepSeek V4 Flash, MiniMax with MTP heads, others), two llama-server flags produce a 50-80% decode throughput gain:
--spec-type draft-mtp --spec-draft-n-max 2
These flags are not exposed on InferenceService.spec today. Operators have
three options, all bad:
- Edit metal-agent flags to pass extra args (applies to all InferenceServices, not per-IS).
- Fork llama-server with the flags baked in (loses flexibility).
- Skip the optimization (forgoes the ~70% decode throughput gain).
Proposed Solution
Add an optional speculativeDecoding block to InferenceService.spec. The
metal-agent / runtime executor reads it and appends the corresponding llama-server flags when launching the runtime.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen36-35b-carnice-mtp
spec:
  modelRef: qwen36-35b-carnice-mtp
  runtime: llamacpp
  speculativeDecoding:
    # Self-speculative decoding via the model's own MTP heads. Requires
    # a model that ships MTP heads (e.g. the Carnice APEX-MTP family).
    type: mtp      # one of: mtp | draft | disabled (default: disabled)
    nDraftMax: 2   # forwarded to --spec-draft-n-max; range [1,8]
    # Future, when separate draft-model support lands:
    # draftModelRef: { name: qwen36-3b-draft }  # used only when type: draft
The metal-agent translates this into:
--spec-type draft-mtp --spec-draft-n-max 2
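The translation step can be sketched as a small pure function. This is a minimal illustration in Python, not the metal-agent's actual code: the helper name `speculative_flags` and the dict-shaped input are assumptions made for the example. Only the `mtp` to `draft-mtp` mapping and the `--spec-draft-n-max` flag come from this proposal; `type: draft` is deliberately left unmapped because its flags are not specified here.

```python
from typing import Any, Mapping, Optional

# Mapping from the proposed speculativeDecoding.type to the llama-server
# --spec-type value. Only "mtp" is defined by this proposal.
SPEC_TYPE_FLAG = {"mtp": "draft-mtp"}


def speculative_flags(spec_decoding: Optional[Mapping[str, Any]]) -> list:
    """Return extra llama-server args for a speculativeDecoding block.

    Hypothetical helper: the name and input shape are illustrative only.
    """
    # An omitted block or type: disabled adds no flags, so the launch
    # command stays identical to current behavior.
    if not spec_decoding or spec_decoding.get("type", "disabled") == "disabled":
        return []

    kind = spec_decoding["type"]
    if kind not in SPEC_TYPE_FLAG:
        # Covers type: draft until separate draft-model support lands.
        raise ValueError(f"unsupported speculativeDecoding.type: {kind!r}")

    flags = ["--spec-type", SPEC_TYPE_FLAG[kind]]
    if "nDraftMax" in spec_decoding:
        flags += ["--spec-draft-n-max", str(spec_decoding["nDraftMax"])]
    return flags
```

Keeping this a pure function of the spec block would make the flag mapping easy to unit test separately from process launching.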
For runtimes that don't support speculative decoding (mlx-server, vllm-swift in their current state), the reconciler should reject the
spec at admission with a clear error, rather than silently ignoring it. Silently
ignoring the field is a correctness footgun.
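A rough sketch of the runtime check, assuming the admission path has the `runtime` string and the parsed `speculativeDecoding` block in hand. The function name, the `SPECULATIVE_RUNTIMES` set, and the error wording are hypothetical, and the exact `spec.runtime` values for mlx-server and vllm-swift are assumed to match the names used above.

```python
from typing import Any, Mapping, Optional

# Runtimes assumed to support speculative decoding. Per this proposal only
# llamacpp qualifies; mlx-server and vllm-swift currently do not.
SPECULATIVE_RUNTIMES = {"llamacpp"}


def check_runtime_supports_speculation(
    runtime: str, spec_decoding: Optional[Mapping[str, Any]]
) -> None:
    """Raise ValueError if speculation is requested on an unsupported runtime."""
    enabled = bool(spec_decoding) and spec_decoding.get("type", "disabled") != "disabled"
    if enabled and runtime not in SPECULATIVE_RUNTIMES:
        # Fail loudly instead of dropping the field on the floor.
        raise ValueError(
            f"runtime {runtime!r} does not support speculativeDecoding "
            f"(type: {spec_decoding['type']}); remove the block or set type: disabled"
        )
```

Because a disabled or omitted block passes on every runtime, existing manifests would be unaffected by the check.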
Validation
- type: mtp requires the model to advertise MTP support. Detect via either:
  - GGUF metadata reports MTP heads (preferred); the reconciler checks Model.status.gguf.hasMTP (a new field on the existing GGUF status).
  - Annotations / labels claiming MTP support (weaker).
- nDraftMax is constrained to [1, 8].
- type: disabled, or omitting the whole field, preserves current behavior byte-identically.
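The rules above could be collected into a single validation pass along these lines. This is a sketch under stated assumptions: `validate_speculative_decoding` is a hypothetical name, and `model_has_mtp` stands in for the proposed `Model.status.gguf.hasMTP` field, which does not exist yet.

```python
from typing import Any, Mapping, Optional

VALID_TYPES = {"mtp", "draft", "disabled"}


def validate_speculative_decoding(
    spec_decoding: Optional[Mapping[str, Any]], model_has_mtp: bool
) -> list:
    """Return a list of human-readable validation errors (empty means valid)."""
    errors = []
    if not spec_decoding:
        return errors  # field omitted: current behavior, nothing to check

    kind = spec_decoding.get("type", "disabled")
    if kind not in VALID_TYPES:
        return [f"type must be one of {sorted(VALID_TYPES)}, got {kind!r}"]
    if kind == "disabled":
        return errors

    n = spec_decoding.get("nDraftMax")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if n is not None and (
        not isinstance(n, int) or isinstance(n, bool) or not 1 <= n <= 8
    ):
        errors.append("nDraftMax must be an integer in [1, 8]")

    if kind == "mtp" and not model_has_mtp:
        errors.append("type: mtp requires Model.status.gguf.hasMTP to be true")
    return errors
```

Returning every error at once, rather than stopping at the first, lets the operator fix a manifest in one round trip.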
Alternatives Considered
- Free-form spec.extraArgs for arbitrary llama-server flags. Quick win, but no validation, no per-runtime translation, no GGUF capability check. Rejected: footgun.
- Annotation-based (annotations[llmkube.io/spec-type]=draft-mtp). Avoids CRD evolution but escapes typed validation. Same footgun.
- Always-on when MTP is detected. Convenient but opaque; benchmarking against the baseline requires disabling it explicitly. Could become a default-on policy after the field stabilizes.
Additional Context
llama.cpp mainline MTP support is the substrate; the community write-ups
in mid-May 2026 documented --spec-type draft-mtp --spec-draft-n-max 2
as the activation flags.
The mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF model family ships APEX-MTP heads embedded in the weights for
self-speculative decoding.
Agent's model; this enhancement is the difference between baseline and
~+70% decode speed for the coder pipeline step.
no controller architecture changes.
Priority