Disaggregated Architecture for LLM Serving #401
I paused work on this change because I realised that making it work consistently across all frontends is difficult without first implementing support for a reverse distributed topology (coordinator-last). I filed this as issue #428 and will work on that first.
Force-pushed from ce28d0b to ef0a4bd.
This PR builds on the changes made in #430 (Add reverse distributed topology with coordinator-owned output suffix).
Force-pushed from ef0a4bd to 6f690f0, then from 6f690f0 to 6399c23.
Hi @lobanov, the reason I'm slow to check PRs and issues is not lack of interest. Right now I'm focusing on GLM 5.2 support and have no bandwidth to understand this PR or others. When I have time, I casually cherry-pick from the PRs and issues that seem most interesting. This is why there are so few replies from me. Thanks.
Thanks for the prompt response, @antirez! No rush on my side. When you have time, I would appreciate any feedback.
Summary
This PR implements feature request #304 by adding support for distributed prefill with generation continuing on the output-owning worker. This is conceptually similar to Kimi's Mooncake architecture in how it treats the KV cache as a first-class distributed object, but scaled down to a small number of nodes.
The goal of this change is to make distributed generation materially more performant for long prompts and interactive use when hardware allows it. It reduces repeated cross-node coordination during decode, keeps generation close to the output head, and fails closed if the worker state or route is no longer valid.
After the prompt is prefilled across the distributed route, the worker owning the early layers can hand off the active KV state to the coordinator (reverse topology), which then continues decoding fully locally instead of routing every next-token step back through the full chain.
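The handoff described above can be sketched as a small state machine. This is only an illustration of the idea, not the code in this PR: the `Coordinator`, `KVHandoff`, and `HandoffError` names, the `route_id` and `n_tokens` fields, and the specific validity checks are all assumptions made for the example. It shows the coordinator staying on the distributed path unless local decode was requested, and refusing to decode (failing closed) when the received KV state does not match the route or the prompt.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Mode(Enum):
    DISTRIBUTED = auto()  # every decode step crosses the full route
    LOCAL = auto()        # coordinator decodes alone after the handoff


@dataclass
class KVHandoff:
    """Hypothetical handoff message sent by the early-layer worker."""
    route_id: int   # route the prompt was prefilled on (assumed field)
    n_tokens: int   # prompt positions covered by the payload (assumed field)
    payload: bytes  # serialized KV state for the worker's layers


class HandoffError(RuntimeError):
    """Raised instead of decoding with state that cannot be trusted."""


class Coordinator:
    def __init__(self, route_id: int, local_decode: bool):
        self.route_id = route_id
        self.local_decode = local_decode  # mirrors the --local-decode option
        self.mode = Mode.DISTRIBUTED
        self.kv: Optional[bytes] = None

    def after_prefill(self, handoff: KVHandoff, prompt_len: int) -> Mode:
        if not self.local_decode:
            return self.mode  # option not set: keep the distributed path
        # Fail closed: any mismatch aborts rather than decoding on bad state.
        if handoff.route_id != self.route_id:
            raise HandoffError("route changed since prefill")
        if handoff.n_tokens != prompt_len or not handoff.payload:
            raise HandoffError("KV state does not cover the prompt")
        self.kv = handoff.payload
        self.mode = Mode.LOCAL
        return self.mode
```

In this sketch a failed check leaves the coordinator in `Mode.DISTRIBUTED` and surfaces an error, so a stale route is never silently used for generation.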
User-Facing Behavior
When the coordinator owns the later layers and the output head, it can now be started with --local-decode. When a compatible distributed route is available, the coordinator retrieves the worker's KV cache and switches to local generation automatically.
What’s Included
--local-decode
Validation
Tested on:
Model quant used:
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix
Checks run:
make clean
make -j4
make cpu
make test
./ds4_test --server --dist-cli-parse --local-payload-stream --local-decode-push --local-decode-capability-reject
./ds4 --metal -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --prompt-file README.md --nothink --temp 0 -n 8 -c 32768
Distributed smoke validation:
CUDA -> Metal with output worker --local-decode: passed
Metal -> CUDA with output worker --local-decode: passed
Representative results:
README.md: prefill 393.95 t/s, generation 34.55 t/s
CUDA -> Metal: prefill 581.77 t/s, KV handoff 112289000 bytes in 0.473 s, generation 29.90 t/s
Metal -> CUDA: prefill 595.76 t/s, KV handoff 112289000 bytes in 0.297 s, generation 15.76 t/s
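To put the handoff numbers in context, the short calculation below derives the effective transfer rate and expresses the handoff time as an equivalent number of local decode steps. It uses only the figures reported above; the `handoff_stats` helper is a name made up for this example, and MB is taken to mean 10^6 bytes.

```python
def handoff_stats(n_bytes: int, seconds: float, gen_tps: float) -> tuple[float, float]:
    """Return (transfer rate in MB/s, handoff time in decode-step equivalents)."""
    mb_per_s = n_bytes / seconds / 1e6  # MB = 10^6 bytes
    token_equiv = seconds * gen_tps     # tokens that could be generated in that time
    return mb_per_s, token_equiv


# Figures copied from the representative results above.
runs = {
    "CUDA -> Metal": (112289000, 0.473, 29.90),
    "Metal -> CUDA": (112289000, 0.297, 15.76),
}

for name, (n_bytes, seconds, gen_tps) in runs.items():
    mb_per_s, token_equiv = handoff_stats(n_bytes, seconds, gen_tps)
    print(f"{name}: {mb_per_s:.1f} MB/s, handoff ~ {token_equiv:.1f} decode steps")
```

On these figures the one-time handoff costs about as much wall-clock time as generating roughly 14 tokens (CUDA -> Metal) or 5 tokens (Metal -> CUDA) locally.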