Granite-Switch Architecture #25107
Open
barvhaim wants to merge 16 commits into
New "granite-switch" architecture: a dense, all-attention Granite-4.1
model with N embedded LoRA adapters selected per-token by control tokens.
- gguf-py schema (arch, KV keys, stacked LoRA tensor names) + writer helpers
- conversion/granite.py: GraniteSwitchModel converter (stacks N adapters +
zero base slot into per-projection A/B tensors; emits switch metadata)
- C++ arch registration (llama-arch.{h,cpp}, llama-model.{h,cpp})
- src/models/granite_switch.cpp: load + per-token switched-LoRA graph via
ggml_mul_mat_id over stacked tensors; sticky per-token index + control-token
substitution in llm_graph_input_switch::set_input
- llm_graph_input_switch in src/models/models.h
Runs end-to-end on CPU: convert 3b checkpoint (842 tensors, stacked dim 13)
and generate on both base and control-token paths. Sticky switch state is
single-sequence (POC); full multi-sequence machinery is a follow-up.
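
The stacked-tensor idea above can be illustrated with a minimal pure-Python sketch. It is not the converter or the ggml graph: the helper names (`stack_adapters`, `switched_lora`), the toy shapes, and the single-adapter data are all assumptions made for illustration. It only shows why a zero slot at index 0 makes "no adapter" a special case of the same per-token lookup that `ggml_mul_mat_id` performs over the stacked A/B tensors.

```python
# Sketch only: pure-Python stand-in for the per-token switched-LoRA delta.
# Shapes and names are illustrative, not the converter's actual tensor layout.

def matvec(m, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def stack_adapters(adapters, rank, d_in, d_out):
    """Prepend an all-zero base slot so slot 0 leaves the base output unchanged."""
    zero_a = [[0.0] * d_in for _ in range(rank)]
    zero_b = [[0.0] * rank for _ in range(d_out)]
    stacked_a = [zero_a] + [a for a, _ in adapters]
    stacked_b = [zero_b] + [b for _, b in adapters]
    return stacked_a, stacked_b

def switched_lora(x_tokens, slots, base_w, stacked_a, stacked_b):
    """For each token, add B[slot] @ (A[slot] @ x) to the base projection."""
    out = []
    for x, slot in zip(x_tokens, slots):
        base = matvec(base_w, x)
        delta = matvec(stacked_b[slot], matvec(stacked_a[slot], x))
        out.append([b + d for b, d in zip(base, delta)])
    return out

# Toy data (hypothetical): d_in = d_out = 2, rank = 1, one adapter.
base_w = [[1.0, 0.0], [0.0, 1.0]]
adapter = ([[1.0, 1.0]], [[2.0], [0.0]])  # A: rank x d_in, B: d_out x rank
stacked_a, stacked_b = stack_adapters([adapter], rank=1, d_in=2, d_out=2)

# Same input twice: slot 0 (base) and slot 1 (the adapter).
tokens = [[1.0, 2.0], [1.0, 2.0]]
result = switched_lora(tokens, [0, 1], base_w, stacked_a, stacked_b)
```

In this toy case the first token comes back as the plain base projection, while the second picks up the adapter's low-rank delta, even though both go through the same code path.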
Self-contained script to build llama.cpp on Apple Silicon (Metal),
convert the composed 3b checkpoint, and run the crisp mid-sequence
adapter-switch demos verified on Vela:
- answerability: <|answerability|> mid-seq -> "unanswerable"
- query_rewrite: <|query_rewrite|> mid-seq -> {"rewritten_question": ...}
Each demo runs the same prompt twice, differing only by a control token
placed before the assistant turn, so the per-token switch is visible.
The composed model ships a chat template, so llama-completion auto-enables interactive conversation mode and halts at a `>` prompt after generating, stalling the script. -no-cnv disables conversation mode: generate once from the raw prompt and exit (also prints special tokens, making the switch visible).
…ntion
The POC computed the per-token adapter index on the CPU and carried it
across ubatches in ONE global `mutable int32_t poc_sticky_index`, reset
only when a ubatch contained sequence position 0. That global had two
problems:
1. Concurrency: with multiple sequences in a batch it was last-writer-
wins — one sequence's adapter leaked into the others.
2. Multi-turn: an interactive `ollama run` chat continues one KV cache,
so turn 2 never saw position 0 and the index never reset — the
adapter stayed stuck on across turns.
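
Both failure modes can be reproduced with a small stand-alone sketch. The class below is a hypothetical Python model of the POC behaviour as described here, not the removed C++ code; the slot number 3, the token strings, and the choice to assign the control token itself to the adapter are assumptions.

```python
# Sketch of the POC behaviour described above; names are hypothetical.
BASE = 0

class GlobalStickyPoc:
    """One index shared by every sequence, reset only at position 0."""

    def __init__(self, control_slots):
        self.control_slots = control_slots  # control token -> adapter slot
        self.sticky = BASE

    def set_input(self, ubatch):
        """ubatch: list of (seq_id, pos, token). Returns per-token adapter index."""
        if any(pos == 0 for _seq, pos, _tok in ubatch):
            self.sticky = BASE
        out = []
        for _seq, _pos, tok in ubatch:
            if tok in self.control_slots:
                self.sticky = self.control_slots[tok]
            out.append(self.sticky)
        return out

poc = GlobalStickyPoc({"<|answerability|>": 3})

# Problem 1: sequence 1 never saw a control token but inherits slot 3.
leak = poc.set_input([
    (0, 0, "hi"), (0, 1, "<|answerability|>"),
    (1, 0, "hello"), (1, 1, "world"),
])

# Problem 2: a later turn on the same KV cache has no position 0, so no reset.
turn2 = poc.set_input([(0, 2, "next"), (0, 3, "turn")])
```

Here `leak` shows sequence 1 picking up sequence 0's adapter, and `turn2` shows the index surviving into the next turn.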
Port the vLLM/HF backend mechanism faithfully: a single-head causal
"router" attention recovers the adapter index in-graph. Per token, only
dim 0 carries signal — Q[0]=1, K[0]=+gain for a control token / -gain
otherwise, V[0]=adapter slot / 0 — and the causal softmax over the single
visible control token recovers that adapter's slot (readback =
clamp(round(V[0]), 0, n_adapters)). gain=15 matches config.py and is
F16-safe (no F32 cache).
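
The readback can be checked numerically with a few lines of Python. This is a sketch under stated assumptions, not the graph code: it models only dim 0, ignores any attention scaling factor, and uses a hypothetical slot number (3) and adapter count (12). The gain of 15 and the clamp/round readback are taken from the description above.

```python
import math

GAIN = 15.0  # value stated in the commit message

def router_readback(tokens, control_slots, n_adapters):
    """Return the adapter slot recovered at each position by causal attention."""
    keys = [GAIN if t in control_slots else -GAIN for t in tokens]
    values = [float(control_slots.get(t, 0)) for t in tokens]
    slots = []
    for i in range(len(tokens)):
        # Q[0] = 1, so the score against position j <= i is just K_j[0].
        scores = keys[: i + 1]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax numerators
        mixed = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        slots.append(min(max(round(mixed), 0), n_adapters))
    return slots

tokens = ["a", "b", "<|answerability|>", "c", "d"]
slots = router_readback(tokens, {"<|answerability|>": 3}, n_adapters=12)
```

Before the control token every visible value is 0, so the readback is the base slot. From the control token onward its score exceeds the others by 2 × gain = 30, so the softmax weight on the other positions is on the order of e^-30 each and rounding recovers the slot exactly.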
The router's K/V live in the model KV cache at an extra layer
R == hparams.router_layer (== n_layer). We bump n_layer_all to n_real+1
so the cache allocator gives the router its own per-sequence slot, and
set n_layer_nextn=1 so n_layer() stays n_real — the decoder loop and
tensor loading are untouched and never reference layer R. The router K is
exempted from the k-shift RoPE loop (its dim-0 value is a literal
magnitude, not a rotation).
Because the selection now lives in the per-sequence KV cache, CONCURRENT
requests are isolated for free (problem 1 fixed; verified by
scratch/concurrent_switch_test.cpp). set_input becomes stateless pure
per-token maps; the global is gone.
Single-switch contract / known limitation, identical to vLLM & HF: the
gain is flat (no recency), so within one sequence there is no mechanism to
revert to base mid-sequence — once an adapter fires it stays on until that
sequence ends (problem 2 is therefore NOT fixed by a faithful copy; vLLM/HF
avoid it only because each served request is a fresh sequence). A client
continuing one KV cache across turns must start a fresh sequence per turn,
or opt into a recency-biased router (a deliberate divergence, not done
here). Documented in granite_switch.cpp and asserted by
scratch/multiturn_leak_test.cpp.
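
The contract can be seen in a short Python sketch that compares continuing one sequence with starting a fresh one per turn. It reuses the dim-0 router model (flat gain, no recency); the function name, token strings, and slot number 2 are assumptions for illustration.

```python
import math

def readback(tokens, control_slots, gain=15.0):
    """Slot recovered at the LAST position of one sequence (flat gain, no recency)."""
    weights = [math.exp(gain if t in control_slots else -gain) for t in tokens]
    values = [control_slots.get(t, 0) for t in tokens]
    return round(sum(w * v for w, v in zip(weights, values)) / sum(weights))

control = {"<|query_rewrite|>": 2}
turn1 = ["q1", "<|query_rewrite|>", "a1"]
turn2 = ["q2", "a2"]

# One KV cache continued across turns: the control token from turn 1 is still visible.
continued = readback(turn1 + turn2, control)

# Fresh sequence per turn: nothing from turn 1 is in the cache.
fresh = readback(turn2, control)
```

In this sketch `continued` still resolves to the adapter slot while `fresh` falls back to the base slot, which is the behaviour a client has to plan around.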
Verified (CPU): both demos unchanged (answerability -> "unanswerable",
query_rewrite -> rewritten query); concurrent two-sequence isolation
passes; multi-turn carry-over matches the vLLM/HF contract.
Remove the local-only development artifacts that should not ship in the
upstream PR:
- granite-switch-mac-demo.sh (local Metal build + demo driver)
- scratch/concurrent_switch_test.cpp
- scratch/multiturn_leak_test.cpp
Also drop the now-dangling reference to the scratch tests from the
granite_switch.cpp header comment. Leaves only the core architecture
support (conversion, gguf constants, llama-arch/model/kv-cache, and the
granite_switch graph).
Hi @barvhaim, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
I had already added the "AI usage disclosure" section earlier.
State the actual constraint: mul_mat_id needs n_expert_used == 1, and since the GGUF carries expert_count = 0 the generic loader's n_expert == 0 => n_expert_used == 0 assertion has already passed by the time load_arch_hparams runs, so it is forced to 1 here.
The router carving reuses n_layer_nextn, normally the MTP/next-token count. Clarify in the comment that it is borrowed here purely as the trailing-layers lever and that there is no MTP head, to spare readers the double-take.
Overview
Adaptation of the Granite-Switch model (https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) for llama.cpp.
Today, Granite-Switch supports the HF and vLLM backends.
Granite-Switch OSS project: https://github.com/generative-computing/granite-switch
Model converted with this code: https://huggingface.co/barha/granite-switch-4.1-3b-preview-GGUF/blob/main/granite-switch-4.1-3b-preview-f16.gguf
The model is dense, not MoE. This value is only needed because
`ggml_mul_mat_id` uses `n_expert_used` as the number of selected matrix IDs per token. For Granite Switch this is always one selected adapter slot per token.
Requirements
P.S. Maintainers, I'm looking forward to your comments on how to proceed with getting this architecture supported; I opened this PR mainly to gather feedback. Thanks!