
Granite-Switch Architecture #25107

Open
barvhaim wants to merge 16 commits into ggml-org:master from barvhaim:feature/granite-switch

@barvhaim barvhaim commented Jun 28, 2026

Overview

Adaptation of the Granite-Switch model (https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) for llama.cpp.
Today Granite-Switch supports the HF and vLLM backends.
Granite-Switch OSS project: https://github.com/generative-computing/granite-switch
Model converted with this code: https://huggingface.co/barha/granite-switch-4.1-3b-preview-GGUF/blob/main/granite-switch-4.1-3b-preview-f16.gguf

The model is dense, not MoE. n_expert_used is set to 1 only because ggml_mul_mat_id uses n_expert_used as the number of selected matrix IDs per token. For Granite-Switch this is always one selected adapter slot per token.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for parity with existing backends and for documentation, assisted by the Claude Code harness

P.S. Maintainers, I'm looking forward to your comments on how to progress toward getting this architecture supported; I opened this PR mainly to get feedback. Thanks!

barvhaim and others added 14 commits June 18, 2026 12:43
New "granite-switch" architecture: a dense, all-attention Granite-4.1
model with N embedded LoRA adapters selected per-token by control tokens.

- gguf-py schema (arch, KV keys, stacked LoRA tensor names) + writer helpers
- conversion/granite.py: GraniteSwitchModel converter (stacks N adapters +
  zero base slot into per-projection A/B tensors; emits switch metadata)
- C++ arch registration (llama-arch.{h,cpp}, llama-model.{h,cpp})
- src/models/granite_switch.cpp: load + per-token switched-LoRA graph via
  ggml_mul_mat_id over stacked tensors; sticky per-token index + control-token
  substitution in llm_graph_input_switch::set_input
- llm_graph_input_switch in src/models/models.h
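
To make the stacked-tensor idea concrete, here is a minimal pure-Python sketch of a per-token switched LoRA projection. It is not the code in this PR: the function names, the toy shapes, and the use of nested lists in place of ggml tensors are all assumptions for illustration. Slot 0 plays the role of the all-zero base slot, and the per-token slot index stands in for the single matrix ID that ggml_mul_mat_id selects.

```python
# Illustrative only: nested lists stand in for ggml tensors, names are hypothetical.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def switched_lora(x_tokens, slots, W, A_stack, B_stack):
    """Compute y_t = W x_t + B[slot_t] (A[slot_t] x_t) with one slot per token."""
    out = []
    for x, s in zip(x_tokens, slots):
        base = matvec(W, x)               # dense base projection
        low = matvec(A_stack[s], x)       # down-projection picked by slot id
        delta = matvec(B_stack[s], low)   # up-projection picked by the same id
        out.append([b + d for b, d in zip(base, delta)])
    return out


W = [[1.0, 0.0], [0.0, 1.0]]
# Stacked as [n_slots][rank][d_in] and [n_slots][d_out][rank].
A_stack = [[[0.0, 0.0]], [[1.0, 1.0]]]          # slot 0 is the zero base slot
B_stack = [[[0.0], [0.0]], [[0.5], [-0.5]]]     # slot 1 is a toy rank-1 adapter

tokens = [[1.0, 2.0], [1.0, 2.0]]
result = switched_lora(tokens, [0, 1], W, A_stack, B_stack)
print(result)  # same input, different slot: the second row carries the adapter delta
```

Because the base slot is all zeros, a token routed to slot 0 gets exactly the dense projection, so no branch is needed to distinguish base tokens from adapter tokens.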

Runs end-to-end on CPU: convert 3b checkpoint (842 tensors, stacked dim 13)
and generate on both base and control-token paths. Sticky switch state is
single-sequence (POC); full multi-sequence machinery is a follow-up.

Self-contained script to build llama.cpp on Apple Silicon (Metal),
convert the composed 3b checkpoint, and run the crisp mid-sequence
adapter-switch demos verified on Vela:
  - answerability: <|answerability|> mid-seq -> "unanswerable"
  - query_rewrite: <|query_rewrite|> mid-seq -> {"rewritten_question": ...}
Each demo runs the same prompt twice, differing only by a control token
placed before the assistant turn, so the per-token switch is visible.

The composed model ships a chat template, so llama-completion auto-enables
interactive conversation mode and halts at a `>` prompt after generating,
stalling the script. -no-cnv disables conversation mode: generate once from
the raw prompt and exit (also prints special tokens, making the switch visible).
…ntion

The POC computed the per-token adapter index on the CPU and carried it
across ubatches in ONE global `mutable int32_t poc_sticky_index`, reset
only when a ubatch contained sequence position 0. That global had two
problems:

  1. Concurrency: with multiple sequences in a batch it was last-writer-
     wins — one sequence's adapter leaked into the others.
  2. Multi-turn: an interactive `ollama run` chat continues one KV cache,
     so turn 2 never saw position 0 and the index never reset — the
     adapter stayed stuck on across turns.
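
Both failure modes can be reproduced with a toy model of a single process-wide index. This is a hypothetical reconstruction based only on the description above (the class, its method, and the token tuples are invented for illustration), not the removed POC code.

```python
class GlobalSticky:
    """Toy stand-in for one global sticky adapter index (0 = base slot)."""

    def __init__(self):
        self.index = 0

    def process_ubatch(self, tokens):
        """tokens: list of (seq_id, pos, control_slot_or_None)."""
        # The index is reset only when the ubatch contains position 0.
        if any(pos == 0 for _, pos, _ in tokens):
            self.index = 0
        per_token = []
        for seq_id, _, slot in tokens:
            if slot is not None:
                self.index = slot          # last writer wins
            per_token.append((seq_id, self.index))
        return per_token


# Problem 1: sequence 0 fires adapter 3, sequence 1 inherits it.
leak = GlobalSticky().process_ubatch([(0, 5, 3), (1, 5, None)])

# Problem 2: turn 2 continues the same KV cache, never sees position 0.
state = GlobalSticky()
turn1 = state.process_ubatch([(0, 0, None), (0, 1, 2)])
turn2 = state.process_ubatch([(0, 2, None)])

print(leak, turn1, turn2)
```

In the first case the ordinary token of sequence 1 is routed to adapter 3; in the second, the turn-2 token is still routed to adapter 2 because nothing reset the index.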

Port the vLLM/HF backend mechanism faithfully: a single-head causal
"router" attention recovers the adapter index in-graph. Per token, only
dim 0 carries signal — Q[0]=1, K[0]=+gain for a control token / -gain
otherwise, V[0]=adapter slot / 0 — and the causal softmax over the single
visible control token recovers that adapter's slot (readback =
clamp(round(V[0]), 0, n_adapters)). gain=15 matches config.py and is
F16-safe (no F32 cache).
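
The readback arithmetic can be checked with a few lines of plain Python. This is a sketch under stated assumptions, not the in-graph implementation: the function name is hypothetical, only dim 0 is modelled, and no 1/sqrt(d) score scaling is applied (the commit message does not say whether the real router scales its scores).

```python
import math

GAIN = 15.0  # value quoted in the commit message


def router_readback(control_slots, n_adapters, gain=GAIN):
    """control_slots[t] is the adapter slot of a control token, or None otherwise."""
    out = []
    for t in range(len(control_slots)):
        visible = control_slots[: t + 1]   # causal mask: positions 0..t
        # Q[0] = 1, so each score is just K[0]: +gain for control, -gain otherwise.
        scores = [gain if s is not None else -gain for s in visible]
        values = [float(s) if s is not None else 0.0 for s in visible]
        m = max(scores)                    # subtract the max for a stable softmax
        weights = [math.exp(sc - m) for sc in scores]
        v0 = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        out.append(min(max(round(v0), 0), n_adapters))
    return out


slots = router_readback([None, None, 3, None, None], n_adapters=12)
print(slots)
```

With gain 15 the control token outweighs each ordinary token by a factor of e^30, so the weighted V[0] is within rounding distance of the adapter slot; before any control token is visible, every value is 0 and the readback is the base slot.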

The router's K/V live in the model KV cache at an extra layer
R == hparams.router_layer (== n_layer). We bump n_layer_all to n_real+1
so the cache allocator gives the router its own per-sequence slot, and
set n_layer_nextn=1 so n_layer() stays n_real — the decoder loop and
tensor loading are untouched and never reference layer R. The router K is
exempted from the k-shift RoPE loop (its dim-0 value is a literal
magnitude, not a rotation).
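
The layer bookkeeping can be summarised as follows. The field names mirror the ones mentioned above, but the relationship n_layer() = n_layer_all - n_layer_nextn is inferred from this commit message rather than taken from the llama.cpp sources, and the helper functions and the layer count of 40 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HParams:
    n_layer_all: int     # layers the KV-cache allocator sees
    n_layer_nextn: int   # trailing layers excluded from the decoder loop

    def n_layer(self):
        # Assumed relationship, inferred from the commit message.
        return self.n_layer_all - self.n_layer_nextn


def carve_router_layer(n_real):
    """Reserve one extra trailing cache layer for the router."""
    hp = HParams(n_layer_all=n_real + 1, n_layer_nextn=1)
    return hp, hp.n_layer()   # router layer R == n_layer


def k_shift_layers(hp, router_layer):
    """Layers whose cached K gets the RoPE shift; the router layer is skipped."""
    return [il for il in range(hp.n_layer_all) if il != router_layer]


hp, R = carve_router_layer(n_real=40)   # 40 is an arbitrary example value
decoder_layers = list(range(hp.n_layer()))
print(R, len(decoder_layers), len(k_shift_layers(hp, R)))
```

The decoder loop iterates over layers 0 to R-1 and never touches R, while the cache still allocates R+1 layers, which is what gives the router its per-sequence slot.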

Because the selection now lives in the per-sequence KV cache, CONCURRENT
requests are isolated for free (problem 1 fixed; verified by
scratch/concurrent_switch_test.cpp). set_input becomes stateless pure
per-token maps; the global is gone.

Single-switch contract / known limitation, identical to vLLM & HF: the
gain is flat (no recency), so within one sequence there is no mechanism to
revert to base mid-sequence — once an adapter fires it stays on until that
sequence ends (problem 2 is therefore NOT fixed by a faithful copy; vLLM/HF
avoid it only because each served request is a fresh sequence). A client
continuing one KV cache across turns must start a fresh sequence per turn,
or opt into a recency-biased router (a deliberate divergence, not done
here). Documented in granite_switch.cpp and asserted by
scratch/multiturn_leak_test.cpp.
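
The contract can be illustrated by treating the router's cached (K[0], V[0]) pairs as a per-sequence list. This is a simplified sketch, not the removed scratch test: the helper name is hypothetical, the clamp is omitted, and scores are unscaled by assumption.

```python
import math

GAIN = 15.0


def append_token(cache, slot=None):
    """Append one token's router (K[0], V[0]) to a sequence cache and read back."""
    cache.append((GAIN, float(slot)) if slot is not None else (-GAIN, 0.0))
    m = max(k for k, _ in cache)
    weights = [math.exp(k - m) for k, _ in cache]
    v0 = sum(w * v for w, (_, v) in zip(weights, cache)) / sum(weights)
    return round(v0)


# Turn 1 fires adapter 2; turn 2 reuses the same cache and cannot revert to base.
shared = []
turn1 = [append_token(shared, s) for s in (None, 2, None)]
turn2_same_cache = [append_token(shared) for _ in range(3)]

# A fresh sequence has its own cache and starts on the base slot.
fresh = []
turn2_fresh = [append_token(fresh) for _ in range(3)]

print(turn1, turn2_same_cache, turn2_fresh)
```

Ordinary tokens contribute K[0] = -gain and V[0] = 0, so they can never outweigh an earlier control token in the same cache; only a separate cache, meaning a fresh sequence, reads back the base slot again.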

Verified (CPU): both demos unchanged (answerability -> "unanswerable",
query_rewrite -> rewritten query); concurrent two-sequence isolation
passes; multi-turn carry-over matches the vLLM/HF contract.

Remove the local-only development artifacts that should not ship in the
upstream PR:
  - granite-switch-mac-demo.sh (local Metal build + demo driver)
  - scratch/concurrent_switch_test.cpp
  - scratch/multiturn_leak_test.cpp

Also drop the now-dangling reference to the scratch tests from the
granite_switch.cpp header comment. Leaves only the core architecture
support (conversion, gguf constants, llama-arch/model/kv-cache, and the
granite_switch graph).
@barvhaim barvhaim requested review from CISC and ggerganov as code owners June 28, 2026 16:02
@github-actions github-actions bot added the model and conversion labels Jun 28, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 28, 2026

Hi @barvhaim, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@barvhaim

Author

I had already added the "AI usage disclosure" item to the description before this check ran.

@barvhaim barvhaim marked this pull request as draft June 28, 2026 18:04
@pwilkin pwilkin marked this pull request as ready for review June 28, 2026 18:14
@pwilkin pwilkin marked this pull request as draft June 28, 2026 18:14
barvhaim added 2 commits June 28, 2026 21:19
State the actual constraint: mul_mat_id needs n_expert_used == 1, and
since the GGUF carries expert_count = 0 the generic loader's
n_expert == 0 => n_expert_used == 0 assertion has already passed by the
time load_arch_hparams runs, so it is forced to 1 here.

The router carving reuses n_layer_nextn, normally the MTP/next-token
count. Clarify in the comment that it is borrowed here purely as the
trailing-layers lever and that there is no MTP head, to spare readers
the double-take.
@barvhaim barvhaim marked this pull request as ready for review June 28, 2026 18:42