Disaggregated Architecture for LLM Serving by lobanov · Pull Request #401 · antirez/ds4

lobanov · 2026-06-12T17:01:13Z

Summary
This PR implements feature request #304 by adding support for distributed prefill with generation continuing on the output-owning worker. This is conceptually similar to Kimi's Mooncake architecture in how it treats the KV cache as a first-class distributed object, but scaled down to a small number of nodes.

The goal of this change is to make distributed generation materially more performant for long prompts and interactive use when hardware allows it. It reduces repeated cross-node coordination during decode, keeps generation close to the output head, and fails closed if the worker state or route is no longer valid.

After the prompt is prefilled across the distributed route, the worker that owns the early layers can hand off its active KV state to the coordinator (reverse topology). The coordinator then continues decoding fully locally instead of routing every next-token step back through the full chain.

User-Facing Behavior
When the coordinator owns the later layers and the output head, it can now be started with --local-decode. When a compatible distributed route is available, the coordinator retrieves the worker's KV cache and switches to local generation automatically.

What’s Included

Distributed prefill followed by coordinator-local decode on the final output-owning coordinator
KV shard handoff from worker to coordinator before local generation begins
Recovery behavior for disconnects, stale sessions, and route/state mismatches
CLI and documentation updates for --local-decode
Minimal regression coverage for CLI validation, payload transfer, and distributed handoff behavior
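
The handoff and its fail-closed behavior described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: every name here (DecodeMode, KVShard, HandoffError, choose_decode_mode, accept_handoff) is hypothetical, and the specific checks are assumptions about what "stale session" and "route/state mismatch" could mean in practice.

```python
from dataclasses import dataclass
from enum import Enum


class DecodeMode(Enum):
    DISTRIBUTED = "distributed"  # every token step travels the full route
    LOCAL = "local"              # coordinator decodes alone after KV handoff


class HandoffError(RuntimeError):
    """Raised to fail closed: abort rather than decode on invalid state."""


@dataclass
class KVShard:
    # Hypothetical description of the KV payload announced by the worker.
    session_id: int
    n_tokens: int
    n_bytes: int


def choose_decode_mode(local_decode_flag: bool, route_compatible: bool) -> DecodeMode:
    # Local decode is only attempted when it was requested AND the route allows it.
    if local_decode_flag and route_compatible:
        return DecodeMode.LOCAL
    return DecodeMode.DISTRIBUTED


def accept_handoff(shard: KVShard, expected_session: int,
                   prefilled_tokens: int, received_bytes: int) -> bool:
    # Each check rejects the handoff instead of continuing with doubtful state.
    if shard.session_id != expected_session:
        raise HandoffError("stale session")
    if shard.n_tokens != prefilled_tokens:
        raise HandoffError("route/state mismatch")
    if received_bytes != shard.n_bytes:
        raise HandoffError("incomplete transfer, possible disconnect")
    return True


if __name__ == "__main__":
    shard = KVShard(session_id=7, n_tokens=4096, n_bytes=112289000)
    print(choose_decode_mode(True, True))
    print(accept_handoff(shard, 7, 4096, 112289000))
```

The design point the sketch tries to capture is that a failed validation raises an error rather than silently falling back to decoding with a KV state that may not match the prompt.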

Validation
Tested on:

Metal: Apple M5 Max, macOS 25.4.0
CUDA: DGX Spark, NVIDIA GB10

Model quant used:

DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix

Checks run:

make clean
make -j4
make cpu
make test
./ds4_test --server --dist-cli-parse --local-payload-stream --local-decode-push --local-decode-capability-reject
./ds4 --metal -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --prompt-file README.md --nothink --temp 0 -n 8 -c 32768

Distributed smoke validation:

CUDA -> Metal with output worker --local-decode: passed
Metal -> CUDA with output worker --local-decode: passed

Representative results:

Local Metal smoke on README.md: prefill 393.95 t/s, generation 34.55 t/s
Distributed CUDA -> Metal: prefill 581.77 t/s, KV handoff 112289000 bytes in 0.473 s, generation 29.90 t/s
Distributed Metal -> CUDA: prefill 595.76 t/s, KV handoff 112289000 bytes in 0.297 s, generation 15.76 t/s
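
To put the KV handoff figures above in perspective, the short calculation below converts them into an effective transfer rate and into the number of decode tokens that could have been generated during the handoff. It is a back-of-the-envelope sketch: the function names are made up for illustration, MB means 10^6 bytes, and the token-equivalent figure assumes the reported generation rate is constant.

```python
def handoff_mb_per_s(n_bytes: int, seconds: float) -> float:
    """Effective transfer rate of the KV handoff in MB/s (1 MB = 10**6 bytes)."""
    return n_bytes / seconds / 1e6


def handoff_cost_in_tokens(seconds: float, gen_tokens_per_s: float) -> float:
    """Handoff time expressed as the decode tokens it displaces."""
    return seconds * gen_tokens_per_s


if __name__ == "__main__":
    # Figures copied from the representative results above.
    runs = [
        ("CUDA -> Metal", 112289000, 0.473, 29.90),
        ("Metal -> CUDA", 112289000, 0.297, 15.76),
    ]
    for name, n_bytes, secs, gen_rate in runs:
        rate = handoff_mb_per_s(n_bytes, secs)
        cost = handoff_cost_in_tokens(secs, gen_rate)
        print(f"{name}: {rate:.1f} MB/s, handoff ~ {cost:.1f} tokens of decode time")
```

With the numbers reported here, the handoff runs at roughly 237 MB/s (CUDA -> Metal) and 378 MB/s (Metal -> CUDA), a one-time cost paid before local generation begins.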

lobanov · 2026-06-16T15:19:47Z

I paused work on this change because I realised that making it work consistently across all frontends is difficult without first implementing support for reverse distributed topology (coordinator-last). I filed that as issue #428 and will work on it first.

lobanov · 2026-06-17T16:42:39Z

This PR builds on changes made in #430 (Add reverse distributed topology with coordinator-owned output suffix)

lobanov · 2026-06-27T15:11:59Z

This is ready to merge now, rebased on updated main and also includes all commits from #430 as it builds on that change.

Hey @antirez, when you get a moment, could you please indicate whether this is something you want to have in the project, and whether the implementation matches your expectations?

antirez · 2026-06-27T15:16:44Z

Hi @lobanov, lack of interest is not the reason I delay checking PRs / Issues. Right now I'm focusing on GLM 5.2 support and I have no bandwidth to understand this PR or others. When I have time, I casually cherry-pick from the PRs / Issues that seem most interesting. This is why there are very few replies from me. Thanks.

lobanov · 2026-06-27T15:37:55Z

Thanks for the prompt response @antirez! No rush for me. When you have time, I would appreciate any feedback.

lobanov marked this pull request as draft June 12, 2026 17:15

lobanov force-pushed the local-gen-with-dist-prefill branch 2 times, most recently from ce28d0b to ef0a4bd Compare June 17, 2026 16:41

lobanov changed the title ~~Local gen with dist prefill~~ Disaggregated Architecture for LLM Serving Jun 26, 2026

lobanov force-pushed the local-gen-with-dist-prefill branch from ef0a4bd to 6f690f0 Compare June 27, 2026 14:17

lobanov added 4 commits June 27, 2026 15:45

Add reverse distributed topology with coordinator-owned output suffix

57ae89a

Add tests for topology planner

bb40b54

improve topology error reporting

dc8eb1d

Allow local decode after distributed prefill

6399c23

lobanov force-pushed the local-gen-with-dist-prefill branch from 6f690f0 to 6399c23 Compare June 27, 2026 15:03

lobanov marked this pull request as ready for review June 27, 2026 15:03

lobanov mentioned this pull request Jun 27, 2026

Support distributed prefill with local decode #304

Open

lobanov wants to merge 4 commits into antirez:main from lobanov:local-gen-with-dist-prefill

2 participants
