Disaggregated Architecture for LLM Serving #401

Open
lobanov wants to merge 4 commits into antirez:main from lobanov:local-gen-with-dist-prefill

lobanov commented Jun 12, 2026

Summary
This PR implements feature request #304 by adding support for distributed prefill, with generation continuing on the node that owns the output head. This is conceptually similar to Kimi's Mooncake architecture in how it treats the KV cache as a first-class distributed object, but scaled down to a small number of nodes.

The goal of this change is to make distributed generation materially more performant for long prompts and interactive use when hardware allows it. It reduces repeated cross-node coordination during decode, keeps generation close to the output head, and fails closed if the worker state or route is no longer valid.

After the prompt is prefilled across the distributed route, the worker that owns the early layers can hand off its active KV state to the coordinator (reverse topology), which then continues decoding fully locally instead of routing every next-token step back through the full chain.

User-Facing Behavior
When the coordinator owns the later layers and the output head, it can now be started with --local-decode. When a compatible distributed route is available, the coordinator retrieves the worker's KV cache and switches to local generation automatically.
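
As an illustration of what "compatible route" could mean, here is a minimal Python sketch. It is not the PR's code: the `Shard` type, its field names, and the specific conditions (contiguous layer ranges, coordinator last, coordinator owning the output head) are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical description of one hop in a distributed route."""
    node: str
    first_layer: int           # inclusive
    last_layer: int            # inclusive
    owns_output: bool = False  # True if this node holds the output head


def can_switch_to_local_decode(route, coordinator, local_decode_flag):
    """Return True only if every assumed precondition holds.

    Any failed check means generation stays on the distributed path.
    """
    if not local_decode_flag or not route:
        return False
    expected_first = 0
    for shard in route:
        # Layer ranges must be non-empty and contiguous, starting at layer 0.
        if shard.first_layer != expected_first:
            return False
        if shard.last_layer < shard.first_layer:
            return False
        expected_first = shard.last_layer + 1
    # Reverse topology: the coordinator is last and owns the output head.
    last = route[-1]
    return last.node == coordinator and last.owns_output
```

For example, a two-node route with a worker on layers 0 to 20 and the coordinator on layers 21 to 42 plus the output head would pass, while the same route with the coordinator first would not.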

What’s Included

  • Distributed prefill followed by coordinator-local decode on the final output-owning coordinator
  • KV shard handoff from worker to coordinator before local generation begins
  • Recovery behavior for disconnects, stale sessions, and route/state mismatches
  • CLI and documentation updates for --local-decode
  • Minimal regression coverage for CLI validation, payload transfer, and distributed handoff behavior
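
The fail-closed recovery behavior listed above can be sketched as follows. This is a hedged Python illustration, not the actual implementation: `KVHandoff`, `HandoffRejected`, and the session, epoch, and size fields are hypothetical names standing in for whatever state the real handoff protocol carries.

```python
from dataclasses import dataclass


class HandoffRejected(Exception):
    """Raised when a KV handoff must not be used for local decode."""


@dataclass(frozen=True)
class KVHandoff:
    session_id: int      # hypothetical: identifies the prefill session
    route_epoch: int     # hypothetical: bumped whenever the route changes
    n_tokens: int        # prompt tokens covered by the KV shard
    declared_bytes: int  # size announced by the worker before streaming
    payload: bytes       # bytes actually received


def accept_handoff(handoff, session_id, route_epoch, n_prompt_tokens):
    """Return the payload only if every check passes; otherwise raise.

    Failing closed means a doubtful KV state is never decoded from.
    """
    if handoff.session_id != session_id:
        raise HandoffRejected("stale session")
    if handoff.route_epoch != route_epoch:
        raise HandoffRejected("route changed since prefill")
    if handoff.n_tokens != n_prompt_tokens:
        raise HandoffRejected("KV state does not cover the prompt")
    if len(handoff.payload) != handoff.declared_bytes:
        raise HandoffRejected("truncated transfer, possible disconnect")
    return handoff.payload
```

In this sketch the caller treats `HandoffRejected` as a signal to abandon local decode for that request rather than attempting to repair the state.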

Validation
Tested on:

  • Metal: Apple M5 Max, macOS 25.4.0
  • CUDA: DGX Spark, NVIDIA GB10

Model quant used:

  • DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix

Checks run:

make clean
make -j4
make cpu
make test
./ds4_test --server --dist-cli-parse --local-payload-stream --local-decode-push --local-decode-capability-reject
./ds4 --metal -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --prompt-file README.md --nothink --temp 0 -n 8 -c 32768

Distributed smoke validation:

  • CUDA -> Metal with output worker --local-decode: passed
  • Metal -> CUDA with output worker --local-decode: passed

Representative results:

  • Local Metal smoke on README.md: prefill 393.95 t/s, generation 34.55 t/s
  • Distributed CUDA -> Metal: prefill 581.77 t/s, KV handoff 112289000 bytes in 0.473 s, generation 29.90 t/s
  • Distributed Metal -> CUDA: prefill 595.76 t/s, KV handoff 112289000 bytes in 0.297 s, generation 15.76 t/s
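
To put the handoff numbers in context, the short Python calculation below derives the effective transfer rate and expresses the one-time handoff delay as the number of tokens that could have been generated in the same time. Only the byte count, durations, and generation speeds come from the results above; the helper names are illustrative.

```python
KV_BYTES = 112_289_000  # KV handoff size reported in both runs


def handoff_rate_mb_s(nbytes, seconds):
    """Effective transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6


def handoff_cost_in_tokens(seconds, gen_tokens_per_s):
    """One-time handoff delay expressed in tokens of generation time."""
    return seconds * gen_tokens_per_s


runs = {
    "CUDA -> Metal": (0.473, 29.90),
    "Metal -> CUDA": (0.297, 15.76),
}
for name, (seconds, gen_tps) in runs.items():
    rate = handoff_rate_mb_s(KV_BYTES, seconds)
    cost = handoff_cost_in_tokens(seconds, gen_tps)
    print(f"{name}: {rate:.1f} MB/s, about {cost:.1f} tokens of delay")
# CUDA -> Metal: 237.4 MB/s, about 14.1 tokens of delay
# Metal -> CUDA: 378.1 MB/s, about 4.7 tokens of delay
```

In both runs the handoff costs well under half a second, which is paid once per request before local generation starts.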

@lobanov lobanov marked this pull request as draft June 12, 2026 17:15

lobanov commented Jun 16, 2026

Author

I paused work on this change because I realised that making it work consistently across all frontends is difficult without first implementing support for reverse distributed topology (coordinator-last). I filed that as issue #428 and will work on it first.

@lobanov lobanov force-pushed the local-gen-with-dist-prefill branch 2 times, most recently from ce28d0b to ef0a4bd Compare June 17, 2026 16:41

lobanov commented Jun 17, 2026

Author

This PR builds on the changes made in #430 (Add reverse distributed topology with coordinator-owned output suffix).

@lobanov lobanov changed the title Local gen with dist prefill Disaggregated Architecture for LLM Serving Jun 26, 2026
@lobanov lobanov force-pushed the local-gen-with-dist-prefill branch from ef0a4bd to 6f690f0 Compare June 27, 2026 14:17
@lobanov lobanov force-pushed the local-gen-with-dist-prefill branch from 6f690f0 to 6399c23 Compare June 27, 2026 15:03
@lobanov lobanov marked this pull request as ready for review June 27, 2026 15:03

lobanov commented Jun 27, 2026

Author

This is now ready to merge. It is rebased on the updated main and also includes all commits from #430, since it builds on that change.

Hey @antirez, when you get a moment, could you please indicate whether this is something you want to have in the project, and whether the implementation matches your expectations?

antirez commented Jun 27, 2026

Owner

Hi @lobanov, lack of interest is not the reason I am slow to check PRs / Issues. Right now I'm focusing on GLM 5.2 support and I have no bandwidth to understand this PR or others. When I have time, I casually cherry-pick from the PRs / Issues that seem most interesting. This is why there are very few replies from me. Thanks.

lobanov commented Jun 27, 2026

Author

Thanks for the prompt response, @antirez! No rush on my side. When you have time, I would appreciate any feedback.
