vulkan: add allreduce function with cross-device CPU proxy and fix Tensor Parallel crash [EXPERIMENTAL]#25051
pwilkin wants to merge 13 commits into
…cache views When a per-device allocation exceeds the backend's max buffer size (e.g. a large KV cache), ggml-alloc returns a multi_buffer wrapping several real buffers. Compute-graph views inherited that multi_buffer as their backend buffer, so a backend that casts tensor->buffer->context to its own buffer-context type (the Vulkan backend does, e.g. in ggml_vk_tensors_overlap) dereferenced garbage and crashed with -sm tensor (issue ggml-org#22197). A view aliases its source's storage, so it must reference the source's real sub-buffer: set t_ij->buffer = t_ij->view_src->buffer. This is the correct ggml invariant and a no-op in the single-buffer case. Assisted-by: Claude Opus 4.8
Implements the backend-agnostic comm hook (ggml_backend_comm_init / _allreduce_tensor / _free, discovered by the meta backend via get_proc_address) for the Vulkan backend, so tensor-parallel inference no longer falls back to the meta backend's CPU-barriered butterfly AllReduce. Consumer GPUs have no P2P here, so the reduce stages through host memory, but everything is ordered on the GPU via exported timeline semaphores (no CPU barriers between layers). Each slice is split into chunks: the dedicated transfer queue streams this device's slice out to shared host memory while the compute queue pulls each peer chunk back as soon as it lands, so the two PCIe directions overlap (full-duplex). Partials are cast to F16 before the host transfer to halve the bytes on the bandwidth-bound link and added straight into the fp32 result via the mixed-type add pipeline. Large prefill activations use this pipeline; small (decode) tensors take a single-shot path where the fixed per-call overhead dominates. Roughly 2.5-3x the butterfly fallback; at long context it overtakes -sm layer and is competitive with CUDA/NCCL on prefill. GGML_VK_COMM_OFF disables the custom comm (falls back to butterfly); GGML_VK_COMM_FP32 forces fp32 staging. Assisted-by: Claude Opus 4.8
…duce The GPU-side cross-device ordering imports each peer's OPAQUE_FD timeline semaphore, but OPAQUE_FD payloads are driver-private, so the import only works when all devices share a driver (e.g. two NVIDIA GPUs). On mixed drivers or vendors it is out of spec. Add a portable fallback: a helper thread polls each peer's progress/upload timeline and host-signals a local timeline that the consumer's download is parked on (core timeline semaphores plus host signal/wait, no imported handle). Both the chunked pipeline (prefill) and the single-shot (decode) paths are bridged, so proxy mode no longer drops decode to the meta-backend butterfly. A capability gate (vkGetPhysicalDeviceExternalSemaphoreProperties plus a driverUUID match) selects the proxy deterministically on unsupported configs; GGML_VK_COMM_PROXY forces it, and the import try/catch stays as a safety net. Measured within ~4% of the native-import path on decode and on par for prefill, with byte-identical output. Assisted-by: Claude Opus 4.8
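The proxy fallback described in this commit message can be illustrated with a small CPU-only toy. This is a sketch under stated assumptions, not the PR's code: `Timeline`, `Bridge`, and `run_bridge` are hypothetical names, and the timeline semaphore is modelled as a plain atomic counter. In real Vulkan the host-side wait and signal would go through the core timeline-semaphore entry points (`vkWaitSemaphores` / `vkSignalSemaphore`); how the PR actually polls is not shown here.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Toy stand-in for a timeline semaphore: a monotonically increasing counter.
// Assumption: this models only the counter semantics, not the Vulkan object.
struct Timeline {
    std::atomic<std::uint64_t> value{0};

    // Publish a new counter value (release, so earlier writes become visible).
    void signal(std::uint64_t v) { value.store(v, std::memory_order_release); }

    // Spin until the counter has reached at least `v`.
    void wait(std::uint64_t v) const {
        while (value.load(std::memory_order_acquire) < v) {
            std::this_thread::yield();
        }
    }
};

// One bridge request: once `peer` reaches `peer_value`, host-signal `local`
// with `local_value` so that work parked on `local` can proceed.
struct Bridge {
    const Timeline * peer;
    std::uint64_t    peer_value;
    Timeline *       local;
    std::uint64_t    local_value;
};

// Body of the helper thread for a single bridge (hypothetical name).
void run_bridge(const Bridge & b) {
    b.peer->wait(b.peer_value);
    b.local->signal(b.local_value);
}
```

The consumer never touches the peer's semaphore directly; it only waits on its own local timeline, which is why no imported handle is needed in this mode.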
|
Did a few more tests with an A16 and a 4090, comparing -sm layer against -sm tensor on two setups: 4090 + A16 and 2xA16. As you can see, the 4090 is held back by the A16 and the boost from tensor parallel there is really small, but on 2xA16 the TG boost is almost double while the PP loss is negligible (almost 90% of the original). |
|
AMD Radeon RX 9070 XT & AMD Radeon AI PRO R9700
|
AMD RX7900XTX (2, 4, 8 GPUs)
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: Found 8 Vulkan devices:
And ROCm for comparison:
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 49120 MiB):
ggml_cuda_init: found 4 ROCm devices (Total VRAM: 98240 MiB):
ggml_cuda_init: found 8 ROCm devices (Total VRAM: 196480 MiB):
|
Your way of dealing with the crash seems to be similar to the one proposed here (also AI generated, lol) that basically uses the view src to get around the multi buffer issue. I explain this a bit more in #22197 but I think this only works if the tensor is a view tensor. If it's not a view tensor then we don't have any way of knowing which buffer in the multi buffer the tensor is stored in.
Yup, valid point. Didn't surface in the small models I've tested, but probably will in any bigger one. Paging @0cc4m here: do you have any qualms about just adding a tensor to sub-buffer map in multi_buffer?
|
OK, bit of oddness on my R9700s compared with the 7900XTX above - prefill takes a significant hit compared with Also, even more weirdness - when run as a server, the first request goes through at 36t/s, but the second falls off to 2t/s: |
Synthetic benchmark (pp4096 / tg128, ~4K context)
Without this PR:
With this PR applied:
At short context the PR delivers
Real-world big prompt benchmark (88K prompt, 164K context)
I asked the LLM to summarize a book whose content was sent in the prompt.
Without this PR:
With this PR applied:
The tensor-mode segfault (fixed by this PR)
Without this PR, followed by
The tensor decode regression at long context
With the PR applied, tensor-mode tg drops from 43.6 (short ctx) to 9 (long ctx), 4.8× worse. Row/layer stay at 37. This appears to be the
For reference, |
This comment was marked as off-topic.
2x Intel Arc Pro B60 (Battlemage, BMG G21)
Observations
|
|
@marksverdhei it's a prototype, don't worry :) when it's been decently tested and cleaned up I'll deslopify it :) |
Assisted-by: Claude Opus 4.8
|
So the damn clanker decided to only implement the proper allreduce for 2 GPUs, because why bother ;) sorry for all the people with 4+ who posted their results, could you please retest with the new commit? (I've tested up to 8 GPUs now for correctness) |
…ackend Assisted-by: Claude Opus 4.8
|
Sharing some benchmark results with 2/4/5x Radeon PRO W7900s:
## Vulkan PR TP (2 GPU):
## Vulkan PR TP (4 GPU):
## Vulkan PR TP (5 GPU):
# For comparison with ROCm:
build: 050ee92 (9821)
## ROCm TP RCCL (4 GPU):
## ROCm TP RCCL (5 GPU):
|
|
Could someone else check multi-turn conversations? As noted in my comment above, I'm finding that the first prompt works as expected, but any follow up tanks down to 2t/s. |
Thank you! At least I know I'm not holding it wrong ;) (EDIT: ...or we both are) |
|
I've also encountered some gibberish in llama-server with TP on 5 GPUs: above on commit e578ca2 |
|
It's faster now on TP=4 and TP=8 (RX7900XTX). Shorter table this time:
And for a multiturn conversation, I don't see any significant degradation. 2nd message: 3rd message: |
|
Looking into the multiGPU degradation, as for the slowdown, can one of you possibly capture a perf profile? |
Assisted-by: Claude Opus 4.8
The ring AllReduce was gated to the native peer-import path (!comm->proxy), so on RADV/cross-vendor setups (which use the CPU-proxy bridge because OPAQUE_FD timeline export is unsupported) GGML_VK_COMM_RING had no effect. Mirror the pipeline's proxy pattern in the ring: the per-step recv wait on the previous neighbour's transfer timeline, and the cross-round WAR wait on the next neighbour's compute timeline, are routed through the device's pxy semaphore and a bridge enqueued for the helper thread. Reserve nsteps+1 pxy values per round (nsteps recv bridges + 1 WAR bridge). Validated byte-identical: native ring == proxy ring == butterfly (660b3d04a269) on 4x A16 forced-proxy. fp32 staging only; F16 is a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ApKCQ32VLqUW4Kus6tUvBL
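For readers unfamiliar with the ring pattern this commit refers to, here is a CPU-only simulation of the textbook ring AllReduce (reduce-scatter followed by all-gather) on plain vectors. It is a sketch, not the PR's implementation: `ring_allreduce` is a hypothetical name, it models only the data movement (no Vulkan queues, semaphores, proxy bridges, or F16 staging), and its step count is that of the textbook algorithm, which may not match the `nsteps` used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only simulation of a textbook ring AllReduce (sum). "Device" r owns data[r].
// Precondition: all vectors have the same length, divisible by the device count.
void ring_allreduce(std::vector<std::vector<float>> & data) {
    const std::size_t n = data.size();
    if (n < 2) return;
    assert(data[0].size() % n == 0);
    const std::size_t chunk = data[0].size() / n;

    // Phase 1: reduce-scatter. In step s, device r sends chunk (r - s) mod n to
    // device r + 1, which adds it into its own copy of that chunk.
    for (std::size_t s = 0; s + 1 < n; ++s) {
        // Snapshot so every device sends its pre-step values, as if in parallel.
        const std::vector<std::vector<float>> snap = data;
        for (std::size_t r = 0; r < n; ++r) {
            const std::size_t dst = (r + 1) % n;
            const std::size_t c   = (r + n - s) % n;
            for (std::size_t k = 0; k < chunk; ++k) {
                data[dst][c * chunk + k] += snap[r][c * chunk + k];
            }
        }
    }

    // Phase 2: all-gather. Device r now holds the fully reduced chunk (r + 1) mod n
    // and forwards finished chunks around the ring, overwriting stale copies.
    for (std::size_t s = 0; s + 1 < n; ++s) {
        const std::vector<std::vector<float>> snap = data;
        for (std::size_t r = 0; r < n; ++r) {
            const std::size_t dst = (r + 1) % n;
            const std::size_t c   = (r + 1 + n - s) % n;
            for (std::size_t k = 0; k < chunk; ++k) {
                data[dst][c * chunk + k] = snap[r][c * chunk + k];
            }
        }
    }
}
```

With n devices this runs 2(n - 1) steps, and in each step every device sends one chunk of 1/n of the tensor, so per-device traffic is 2(n - 1)/n times the tensor size, which stays below twice the tensor size however many devices join the ring.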
Sure - not done that before, though. Totally revealing my lack of knowledge here, but can you point me to documentation for doing that? |
Sure: https://perfwiki.github.io/main/tutorial/ It's actually pretty easy: you run |
…VK_COMM_D2D) Opt-in path that replaces host-memory staging in the tensor-parallel AllReduce with direct peer reads of another GPU's VRAM over PCIe P2P: each device's partial lives in an exportable device buffer (DMA_BUF), peers import it, and the existing comm reads host_buf[k][i] -- now peer VRAM -- unchanged. Targets the O(n^2)-host-bandwidth scaling collapse; pairs with the O(n) ring. Ordering still uses native/CPU-proxy semaphores (proxy auto-selects on RADV), so this is the "D2D data + proxy semaphores" combo. Enables VK_KHR_external_memory_fd + VK_EXT_external_memory_dma_buf. DMA_BUF is the cross-device handle type (OPAQUE_FD memory is spec-locked to one physical device); this matches the amdgpu PCIe-P2P dma-buf mechanism RADV/ROCm use. Status: the fast path is AMD-targeted and UNVALIDATED -- NVIDIA's Vulkan driver rejects cross-device fd import (vkGetMemoryFdPropertiesKHR -> memoryTypeBits=0; confirmed against the Vulkan spec's same-deviceUUID rule and NVIDIA's own statements), and NVIDIA has no fd-import or device-group P2P for unlinked GPUs. So on NVIDIA it logs once and gracefully falls back to host staging -- verified byte-identical and at host speed (pp2048 692 vs 695) on 4x A16, not the slow butterfly. An AMD multi-GPU rig is needed to validate the actual P2P fast path (test recipe accompanies this work, incl. how to prove real VRAM P2P vs a silent amdgpu GTT fallback). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ApKCQ32VLqUW4Kus6tUvBL
The ring transported fp32 chunks; the pipeline already halves host/peer traffic by staging F16. Bring the ring to parity: keep tensors[i] as the fp32 accumulator (matching the pipeline's precision) but cast each chunk to F16 for transport. The cast is folded into the recv step (the just-reduced chunk is the next step's send), so it costs only one extra pre-cast prog value rather than a doubled scheme; per-step up16 slots avoid a send-buffer WAR. GGML_VK_COMM_FP32 still forces fp32. Verified byte-identical greedy output vs fp32 ring / pipeline / butterfly on 4x A16, and no regression (ring-f16 ~= ring-fp32 ~= pipeline on A16 and 4090). The bandwidth win only shows when comm is exposed (comm-bound hosts); on this NVIDIA box the ring's transfer/compute overlap hides the comm, so F16 is neutral here -- same comm-hidden reason the other comm micro-opts are neutral on NVIDIA. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ApKCQ32VLqUW4Kus6tUvBL
Remove all comment lines and trailing comments from the tensor-parallel comm implementation and its device-extension hooks. No behaviour change (verified byte-identical greedy output across ring/proxy/D2D/butterfly after the strip). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ApKCQ32VLqUW4Kus6tUvBL
|
All right - a prototype D2D implementation for AMD cards has landed, could anyone with a multi-AMD GPU setup here give it a try? EDIT: for the D2D you have to run with |
Previous result: After: Results largely within noise. This is on x8/x8 PCIe 4.0 connections, mind. Would you expect to see a performance difference? |
|
@digitalscream sorry, clever me forgot to mention you have to run with |
Results seem to be significantly worse with GGML_VK_COMM_D2D=1 enabled:
GGML_VK_COMM_D2D Disabled:
GGML_VK_COMM_D2D Enabled:
|
|
Same here, except worse - with D2D, power draw drops from 250-260W previously to 130W during prefill and 75W during decode, and performance falls by 90% across the board. |
|
Interesting... would really love those perf dumps, I'm getting to the limit of what I can do without data from the actual hardware :) |
Would it help if you had direct access to my AI box? I can't be of much use regarding code, but if contributing time on there will help, I'm happy to do it. |
Here's the perf dump with command |
VkImportMemoryFdInfoKHR imfi{};
imfi.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR;
imfi.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT;
imfi.fd = fd;
Sadly, this usage violates https://docs.vulkan.org/refpages/latest/refpages/source/VkImportMemoryFdInfoKHR.html#VUID-VkImportMemoryFdInfoKHR-fd-00668: "The memory from which fd was exported must have been created on the same underlying physical device as device." It also works on Intel devices (with the right kernel config options), but it's not valid Vulkan. AFAIK (and I'm not a Vulkan expert by any means) the only correct way to do this in Vulkan is via the device-groups API, which is at least not implemented in Mesa and possibly not elsewhere either.
Now, the fact that this works on AMD devices and Intel devices with the right kernel config maybe means we should be pushing for a vendor extension for this. It does feel kinda absurd that you have to implement device groups just to do D2D, but this is wayyy outside of my area, maybe @jeffbolznv has thoughts. There are some minor things around memory accounting that would also need to be addressed, but nothing major, see the discussion on my previous attempt to fix those in Mesa (which was primarily driven by a desire to do exactly this in Vulkan in llama.cpp...): https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41733#note_3484600
Yeah, was considering that option as well, the Vulkan ecosystem sure is a giant mess with all the supported drivers and hardware ;)
If you'd be willing to provide such access then of course I'd be grateful, my contact details are in my profile. |
- Remove the GGML_VK_COMM_D2D peer-buffer path entirely (helpers, VK_KHR_external_memory_fd / VK_EXT_external_memory_dma_buf detection+enable, comm fields, init validation, ensure() branch). On AMD it regresses badly: the dmabuf import lands in GTT (PCIe P2P not established on stock kernels), so peer reads stall the GPU (pp/tg down 40-90%, power collapses). Not viable without box access to validate true VRAM P2P; revisit later.
- Make the O(n) ring the default large-tensor AllReduce. The old all-to-all pipeline is now opt-in via GGML_VK_COMM_PIPELINE (was: ring opt-in via GGML_VK_COMM_RING).
- Remove the GGML_VK_COMM_OFF toggle (forced the meta-backend butterfly on Vulkan); the custom comm is always better. The generic butterfly fallback in the meta backend stays -- it is the shared fallback for CUDA/SYCL and for Vulkan configs without a usable custom comm (e.g. MoltenVK).
Verified byte-identical greedy across ring / pipeline / proxy / fp32 on 4x A16.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ApKCQ32VLqUW4Kus6tUvBL
Sure - check your email :) |
I didn't test it but I think your method of using the address should work, that's much better than storing a list of tensors somewhere. By the way it might be worth just submitting this crash fix separately to get tensor parallel working and then work on the performance stuff later. |
With the ring as the default there is no reason to keep the slower paths:
- Remove ggml_backend_vk_comm_allreduce_pipeline (the O(n^2) all-to-all) and GGML_VK_COMM_PIPELINE. The ring is now the unconditional large-tensor path; the comm->ring flag and pipe_round are gone, and pipeline_ok (the "has two queues" gate the ring needs) is renamed ring_ok.
- Remove fp32 staging and GGML_VK_COMM_FP32. The ring always stages F16 (fp32 accumulator preserved); its fp32 branch and use_f16 are removed.
Net ~-260 lines. Verified byte-identical greedy output (ring / proxy) and clean build on 4x A16. The decode single-shot and the meta-backend butterfly fallback are untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ApKCQ32VLqUW4Kus6tUvBL
if (t_ij->view_src->buffer != nullptr) {
    t_ij->buffer = t_ij->view_src->buffer;
}
I don't think it's necessary to do this now that we have ggml_backend_multi_buffer_get_buffer.
// Resolve it from the tensor's data address -- works for any tensor, not just views.
if (t_ij->buffer != nullptr && t_ij->data != nullptr
        && ggml_backend_buffer_is_multi_buffer(t_ij->buffer)) {
    ggml_backend_buffer_t sub = ggml_backend_multi_buffer_get_buffer(t_ij->buffer, t_ij->data);
Wait a moment, t_ij->data is copied from the view src earlier if this is a view tensor. If it's not a view tensor, is the original data pointer going to be correct?
Yeah, the idea here is that since the multi-buffer's data pointer is a contiguous memory region, but divided into the individual buffers, we just run through all the buffers and see which buffer's offset the base address corresponds to.
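The address-based lookup described in this reply can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual ggml code: `SubBuffer`, `MultiBuffer`, and `find_sub_buffer` are hypothetical names, and the sketch assumes each sub-buffer covers a known `[base, base + size)` address range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real ggml types (assumption, illustration only).
struct SubBuffer {
    std::uintptr_t base;  // start address of this sub-buffer's range
    std::size_t    size;  // size of the range in bytes
};

struct MultiBuffer {
    std::vector<SubBuffer> buffers;
};

// Return the index of the sub-buffer whose [base, base + size) range contains
// `addr`, or -1 if no sub-buffer contains it.
int find_sub_buffer(const MultiBuffer & mb, std::uintptr_t addr) {
    for (std::size_t i = 0; i < mb.buffers.size(); ++i) {
        const SubBuffer & b = mb.buffers[i];
        // Check the lower bound first so the subtraction cannot wrap around.
        if (addr >= b.base && addr - b.base < b.size) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Because the lookup depends only on the tensor's data address, it works for view and non-view tensors alike, provided the data pointer itself is valid.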
|
Hum, FWIW, in rebasing my existing tensor-par branch I noted that a patch my clanker wrote to do vulkan graph caching was a 25% improvement on top of tensor-par, so there's plenty of room to grow with nontrivial patches later, at least compared to the CUDA stuff that does graph caching. |
|
@TheBlueMatt I did run correctness checks, but not on a 122B model, so it's possible that the F16 pipeline is causing overflows. Can you reproduce this on smaller models as well? |
|
The same command generates sensible output for |

Overview
I've heard from @0cc4m that Vulkan maintainers really like large, LLM assisted PRs, so here's one that should make them happy 😁
This fixes crashes in the pipeline and introduces a Vulkan F16 AllReduce.
Additional information
Benchmark results on my box - would love someone with non-NVidia hardware to test this:
Vulkan tensor-parallel benchmark v2 (rebased+cleaned build) — RTX 3080 + RTX 5060 Ti
pp512 / tg128 (tok/s).
llama-bench -fa 1, default ubatch. 19 LLMs, 2-16GB. Each split-mode in its own run.
Cases: Vk-layer / CUDA-layer / Vk-tensor crashfix (butterfly, GGML_VK_COMM_OFF) / Vk-tensor F16 (this work) / CUDA-tensor (NCCL).
Depth 0
Depth 4096
Depth 40000
Requirements