Granite-Switch Architecture by barvhaim · Pull Request #25107 · ggml-org/llama.cpp

barvhaim · 2026-06-28T16:02:44Z

Overview

Adaptation of the Granite-Switch model (https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) for llama.cpp.
Granite-Switch currently supports the HF and vLLM backends.
Granite-Switch OSS project: https://github.com/generative-computing/granite-switch
Model converted with this code: https://huggingface.co/barha/granite-switch-4.1-3b-preview-GGUF/blob/main/granite-switch-4.1-3b-preview-f16.gguf

The model is dense, not MoE. n_expert_used is set to 1 only because ggml_mul_mat_id uses it as the number of selected matrix IDs per token. For Granite Switch this is always one selected adapter slot per token.
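To make the "one selected matrix ID per token" idea concrete, here is a minimal pure-Python sketch. It is not the ggml implementation: `matvec` and `mul_mat_id_single` are hypothetical helper names, and the tiny matrices are made-up illustration data. It only shows that each token picks exactly one matrix out of a stacked set by its index.

```python
def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def mul_mat_id_single(stacked, xs, ids):
    """Apply exactly one matrix per token (hypothetical helper).

    stacked: list of matrices, one per slot
    xs:      list of token vectors
    ids:     one slot index per token (the "n_expert_used == 1" case)
    """
    if len(xs) != len(ids):
        raise ValueError("need exactly one slot id per token")
    return [matvec(stacked[i], x) for x, i in zip(xs, ids)]


# Illustration data (assumed): slot 0 is an all-zero matrix, slots 1 and 2 are arbitrary.
stacked = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ids = [0, 1, 2]

out = mul_mat_id_single(stacked, tokens, ids)
print(out)
```

Each token is multiplied by a single matrix, so no top-k routing or weighting across several matrices is involved, which is why a dense model can reuse the mechanism.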

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, for parity with existing backends and documentation, assisted by the Claude Code harness

P.S. Maintainers, I'm looking forward to your comments on how to move this architecture toward being supported; I opened this PR mainly to get feedback. Thanks!

New "granite-switch" architecture: a dense, all-attention Granite-4.1 model with N embedded LoRA adapters selected per-token by control tokens.

- gguf-py schema (arch, KV keys, stacked LoRA tensor names) + writer helpers
- conversion/granite.py: GraniteSwitchModel converter (stacks N adapters + zero base slot into per-projection A/B tensors; emits switch metadata)
- C++ arch registration (llama-arch.{h,cpp}, llama-model.{h,cpp})
- src/models/granite_switch.cpp: load + per-token switched-LoRA graph via ggml_mul_mat_id over stacked tensors; sticky per-token index + control-token substitution in llm_graph_input_switch::set_input
- llm_graph_input_switch in src/models/models.h

Runs end-to-end on CPU: convert 3b checkpoint (842 tensors, stacked dim 13) and generate on both base and control-token paths. Sticky switch state is single-sequence (POC); full multi-sequence machinery is a follow-up.
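The stacking step and the switched-LoRA forward pass can be sketched as follows. This is an illustration under assumptions, not the converter's code: the function names are hypothetical, the zero base slot is assumed to sit at index 0, all adapters are assumed to share one shape, and any LoRA scaling factor is omitted.

```python
def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def stack_with_base_slot(adapter_mats):
    """Stack per-adapter matrices and prepend an all-zero base slot.

    Assumption: every adapter matrix has the same shape, and the base
    slot is placed at index 0.
    """
    rows, cols = len(adapter_mats[0]), len(adapter_mats[0][0])
    for m in adapter_mats:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all adapter matrices must share one shape")
    zero = [[0.0] * cols for _ in range(rows)]
    return [zero] + [[row[:] for row in m] for m in adapter_mats]


def switched_lora(w, a_stack, b_stack, x, slot):
    """y = W x + B[slot] (A[slot] x); slot 0 adds nothing."""
    base = matvec(w, x)
    delta = matvec(b_stack[slot], matvec(a_stack[slot], x))
    return [b + d for b, d in zip(base, delta)]


# Illustration data (assumed): one rank-1 adapter on a 2x2 projection.
w = [[1.0, 0.0], [0.0, 1.0]]
a_stack = stack_with_base_slot([[[1.0, 1.0]]])    # A: rank x n_in
b_stack = stack_with_base_slot([[[1.0], [0.0]]])  # B: n_out x rank

x = [1.0, 2.0]
y_base = switched_lora(w, a_stack, b_stack, x, slot=0)
y_adapter = switched_lora(w, a_stack, b_stack, x, slot=1)
print(y_base, y_adapter)
```

Because the base slot is all zeros, a token routed to it goes through the same graph as an adapter token and simply receives a zero LoRA delta, so no branching is needed.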

Self-contained script to build llama.cpp on Apple Silicon (Metal), convert the composed 3b checkpoint, and run the crisp mid-sequence adapter-switch demos verified on Vela:

- answerability: <|answerability|> mid-seq -> "unanswerable"
- query_rewrite: <|query_rewrite|> mid-seq -> {"rewritten_question": ...}

Each demo runs the same prompt twice, differing only by a control token placed before the assistant turn, so the per-token switch is visible.

The composed model ships a chat template, so llama-completion auto-enables interactive conversation mode and halts at a `>` prompt after generating, stalling the script. -no-cnv disables conversation mode: generate once from the raw prompt and exit (also prints special tokens, making the switch visible).

…ntion

The POC computed the per-token adapter index on the CPU and carried it across ubatches in ONE global `mutable int32_t poc_sticky_index`, reset only when a ubatch contained sequence position 0. That global had two problems:

1. Concurrency: with multiple sequences in a batch it was last-writer-wins, so one sequence's adapter leaked into the others.
2. Multi-turn: an interactive `ollama run` chat continues one KV cache, so turn 2 never saw position 0 and the index never reset; the adapter stayed stuck on across turns.

Port the vLLM/HF backend mechanism faithfully: a single-head causal "router" attention recovers the adapter index in-graph. Per token, only dim 0 carries signal: Q[0]=1, K[0]=+gain for a control token / -gain otherwise, V[0]=adapter slot / 0. The causal softmax over the single visible control token recovers that adapter's slot (readback = clamp(round(V[0]), 0, n_adapters)). gain=15 matches config.py and is F16-safe (no F32 cache).

The router's K/V live in the model KV cache at an extra layer R == hparams.router_layer (== n_layer). We bump n_layer_all to n_real+1 so the cache allocator gives the router its own per-sequence slot, and set n_layer_nextn=1 so n_layer() stays n_real; the decoder loop and tensor loading are untouched and never reference layer R. The router K is exempted from the k-shift RoPE loop (its dim-0 value is a literal magnitude, not a rotation).

Because the selection now lives in the per-sequence KV cache, CONCURRENT requests are isolated for free (problem 1 fixed; verified by scratch/concurrent_switch_test.cpp). set_input becomes stateless pure per-token maps; the global is gone.
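The router readback described above can be sketched numerically. This is a rough model under stated assumptions, not the graph built in granite_switch.cpp: `router_readback` is a hypothetical name, the attention score is taken as the plain product Q[0]*K[0] with no extra scaling, and a control token is assumed to attend to itself.

```python
import math

GAIN = 15.0  # value quoted in the commit message


def router_readback(control_slots, n_adapters, gain=GAIN):
    """Recover a per-token adapter slot with a one-dimensional causal softmax.

    control_slots[t] is the adapter slot (> 0) if token t is a control
    token, otherwise 0. Assumptions: score = Q[0] * K[0] with Q[0] = 1,
    no additional attention scaling, and each token sees itself.
    """
    result = []
    for t in range(len(control_slots)):
        visible = control_slots[: t + 1]                  # causal mask
        scores = [gain if s > 0 else -gain for s in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]     # stable softmax
        v = sum(w * s for w, s in zip(weights, visible)) / sum(weights)
        result.append(min(max(round(v), 0), n_adapters))  # clamp(round(v), 0, n)
    return result


# Illustration: a control token for slot 3 appears at position 2.
slots = router_readback([0, 0, 3, 0, 0], n_adapters=12)
print(slots)
```

Before the control token every visible value is 0, so the readback is 0 (base). From the control token onward its weight exceeds each other token's by a factor of about e^30, so the weighted average rounds to that token's slot.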
Single-switch contract / known limitation, identical to vLLM & HF: the gain is flat (no recency), so within one sequence there is no mechanism to revert to base mid-sequence. Once an adapter fires it stays on until that sequence ends (problem 2 is therefore NOT fixed by a faithful copy; vLLM/HF avoid it only because each served request is a fresh sequence). A client continuing one KV cache across turns must start a fresh sequence per turn, or opt into a recency-biased router (a deliberate divergence, not done here). Documented in granite_switch.cpp and asserted by scratch/multiturn_leak_test.cpp.

Verified (CPU): both demos unchanged (answerability -> "unanswerable", query_rewrite -> rewritten query); concurrent two-sequence isolation passes; multi-turn carry-over matches the vLLM/HF contract.
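The carry-over behaviour can be illustrated with the same kind of toy model. This is only a sketch under assumptions (hypothetical `readback` helper, unscaled scores, made-up token layouts); it compares continuing one sequence across two turns with starting a fresh sequence for turn 2.

```python
import math


def readback(slots, n_adapters, gain=15.0):
    """Toy flat-gain router: slots[t] > 0 marks a control token for that slot."""
    out = []
    for t in range(len(slots)):
        visible = slots[: t + 1]  # causal: token t sees positions 0..t
        w = [math.exp(gain if s > 0 else -gain) for s in visible]
        v = sum(wi * s for wi, s in zip(w, visible)) / sum(w)
        out.append(min(max(round(v), 0), n_adapters))
    return out


turn1 = [0, 2, 0, 0]  # control token for slot 2 at position 1 (assumed layout)
turn2 = [0, 0, 0]     # second turn carries no control token

# Continuing the same sequence: turn-2 tokens still see the old control token.
continued = readback(turn1 + turn2, n_adapters=12)[len(turn1):]

# Starting a fresh sequence for turn 2: nothing to attend to, so base is used.
fresh = readback(turn2, n_adapters=12)

print(continued, fresh)
```

In this toy model the adapter stays selected for every turn-2 token when the sequence is continued, and only the fresh sequence returns to the base slot, which mirrors the contract stated above.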

Remove the local-only development artifacts that should not ship in the upstream PR:

- granite-switch-mac-demo.sh (local Metal build + demo driver)
- scratch/concurrent_switch_test.cpp
- scratch/multiturn_leak_test.cpp

Also drop the now-dangling reference to the scratch tests from the granite_switch.cpp header comment. Leaves only the core architecture support (conversion, gguf constants, llama-arch/model/kv-cache, and the granite_switch graph).

ggml-gh-bot · 2026-06-28T16:07:01Z

Hi @barvhaim, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

barvhaim · 2026-06-28T17:16:14Z

I had already added the "AI usage disclosure" to the PR description.

State the actual constraint: mul_mat_id needs n_expert_used == 1, and since the GGUF carries expert_count = 0 the generic loader's n_expert == 0 => n_expert_used == 0 assertion has already passed by the time load_arch_hparams runs, so it is forced to 1 here.

barvhaim and others added 14 commits June 18, 2026 12:43

Merge branch 'ggml-org:master' into feature/granite-switch

0575799

Merge branch 'ggml-org:master' into feature/granite-switch

7501b12

granite-switch: trim comments to match native llama.cpp style

6e864e5

granite-switch: trim conversion comments to match native style

c81aa83

granite-switch: drop unused adapter_ranks metadata

e681408

granite-switch: rename arch to graniteswitch and drop obid alias

4202cef

granite-switch: fix non-ASCII comments and document router gain assumption

9f58f29

granite-switch: drop section comments from constants.py to match native style

941b057

granite-switch: add functional tensor block comments matching Granite…

91fcd09

…4 Vision style

barvhaim requested review from CISC and ggerganov as code owners June 28, 2026 16:02

github-actions Bot added the model and conversion labels Jun 28, 2026

barvhaim marked this pull request as draft June 28, 2026 18:04

pwilkin marked this pull request as ready for review June 28, 2026 18:14

pwilkin marked this pull request as draft June 28, 2026 18:14

barvhaim added 2 commits June 28, 2026 21:19

granite-switch: note n_layer_nextn reuse has no MTP

111d333

The router carving reuses n_layer_nextn, normally the MTP/next-token count. Clarify in the comment that it is borrowed here purely as the trailing-layers lever and that there is no MTP head, to spare readers the double-take.

barvhaim marked this pull request as ready for review June 28, 2026 18:42

Granite-Switch Architecture #25107
barvhaim wants to merge 16 commits into ggml-org:master from barvhaim:feature/granite-switch
