Draft
128 commits
edb1a11
feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)
mudler Jun 19, 2026
c6698dd
feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU)
mudler Jun 19, 2026
5a5d3df
feat(paged): Phase 2 core - attention over paged KV matches reference
mudler Jun 19, 2026
ddace5f
feat(paged): paged-bench - measure capacity & prefix-sharing wins
mudler Jun 19, 2026
3ed3279
docs(paged): status + integration map for in-model Gate 0
mudler Jun 19, 2026
bbc84a9
feat(paged): Gate 0 in-model - token-identical generation with paged …
mudler Jun 19, 2026
7aa61d4
docs(paged): DGX Blackwell gap analysis + lever plan (living doc)
mudler Jun 19, 2026
aba0bfd
feat(backend): auto-default physical batch to 2048 on Blackwell GPUs
mudler Jun 19, 2026
9f16a90
docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped
mudler Jun 19, 2026
1449b80
docs(paged): Lever-3 + paged-attention implementation plans + upstrea…
mudler Jun 19, 2026
b142146
docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled)
mudler Jun 19, 2026
62f0ae1
docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 i…
mudler Jun 19, 2026
ba3fa5a
build(paged): stacking patch-series scaffolding for llama.cpp paged a…
mudler Jun 19, 2026
ce48cc0
patch(paged) 0001: vendor PagedKVManager into llama.cpp src
mudler Jun 19, 2026
61ff738
patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-ident…
mudler Jun 19, 2026
c4b4f3a
docs(paged): series status 0001/0002 done+verified; honest parity note
mudler Jun 19, 2026
145e45b
docs(paged): exact executable plan for 0003 gather-read
mudler Jun 19, 2026
48fbb93
docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild,…
mudler Jun 19, 2026
2a500c3
bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps
mudler Jun 19, 2026
cb28ded
bench(paged): decode profile overturns 'engine-addressable' - decode …
mudler Jun 19, 2026
b7b2e82
kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (…
mudler Jun 19, 2026
37cbc08
bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (…
mudler Jun 20, 2026
ce60737
kernel(doc): dense scope resolved - two FP4 kernels (dense first, the…
mudler Jun 20, 2026
19742ae
bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - eve…
mudler Jun 20, 2026
d2651c8
bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the head…
mudler Jun 20, 2026
f5e9cae
kernel: reframed Blackwell kernel-gap map (research + profiles)
mudler Jun 20, 2026
14e3da2
kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned…
mudler Jun 20, 2026
122df1c
analysis: vLLM throughput gap decomposed - spec-dec is the per-user l…
mudler Jun 20, 2026
76cc0b6
docs(paged): phased plan to make llama.cpp a viable vLLM alternative
mudler Jun 20, 2026
13e6ee8
kernel: validate cuBLAS dead-end (sm_80 fallback) + W4A16 Marlin impl…
mudler Jun 20, 2026
dae2679
kernel(P0): parity harness established + baseline (test-backend-ops 1…
mudler Jun 20, 2026
d291e15
kernel(P0): record precise op-level baseline (q4_K n=512 = 47 TFLOPS,…
mudler Jun 20, 2026
718b31d
kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)
mudler Jun 20, 2026
9a71e81
kernel: written subagent dispatch briefs for P3/P4/P5
mudler Jun 20, 2026
4de0c3b
feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel
mudler Jun 20, 2026
9973fa9
feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10)
mudler Jun 20, 2026
2f648dc
feat(w4a16): conflict-free skew-pad ldmatrix + BM128/8w tile (q4_K +2…
mudler Jun 21, 2026
2b79083
feat(w4a16): grow tile to BN128/16w (q4_K +17%, pp512 148->178)
mudler Jun 21, 2026
fc589b3
analysis: vLLM GB10 advantage is the SCHEDULER, not the kernel (pivot)
mudler Jun 21, 2026
07985ba
analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at n…
mudler Jun 21, 2026
fdb7f56
docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple
mudler Jun 21, 2026
92e93df
analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever
mudler Jun 21, 2026
d6c91b7
analysis: finalize PR #22569 paged-KV eval (full detail + compute-bou…
mudler Jun 21, 2026
40ee9cd
docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10
mudler Jun 21, 2026
1887385
analysis: MXFP4-dense fails quality check (~27% worse PPL than Q4_K) …
mudler Jun 21, 2026
037ad82
docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend)
mudler Jun 21, 2026
aaf7b41
test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10
mudler Jun 21, 2026
6e0b910
analysis: decode gap is GPU/kernel-bound, NOT host overhead (corrects…
mudler Jun 21, 2026
faeb5b4
analysis: NVFP4 closes the decode gap too (547->619, ~93% of vLLM)
mudler Jun 21, 2026
0337505
docs(paged): measure paged KV at high concurrency (LLAMA_MAX_SEQ=2048…
mudler Jun 21, 2026
931793a
feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen…
mudler Jun 21, 2026
84d59e6
docs(paged): additive "hook, don't edit" layout for the patch series
mudler Jun 22, 2026
d9d846e
feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, …
mudler Jun 22, 2026
37e0e1e
paged-attn 0003: lift gather-read to multi-stream
mudler Jun 22, 2026
4968cd8
paged-attn 0004: on-demand KV block allocation
mudler Jun 22, 2026
04e3d04
build(llama-cpp): isolate paged patches in patches/paged/ behind LLAM…
mudler Jun 22, 2026
667a21c
feat(llama-cpp): expose paged KV cache as a per-server option (patch …
mudler Jun 22, 2026
67c6208
feat(llama-cpp/paged): cross-request prefix caching patch 0006
mudler Jun 22, 2026
ecffd4b
feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)
mudler Jun 22, 2026
d1ba327
docs(paged): record GPU correctness + CUDA backend-build verification
mudler Jun 22, 2026
9537726
fix(llama-cpp/paged): stop double-applying the paged patches in prepa…
mudler Jun 22, 2026
0dd45f0
docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results
mudler Jun 22, 2026
f347f7c
docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix c…
mudler Jun 22, 2026
52f0f7b
docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+…
mudler Jun 22, 2026
80e0c1a
feat(paged): wire cross-request prefix share into llama-server (patch…
mudler Jun 22, 2026
4dcbcfc
docs(paged): decode-step gap study vs vLLM on GB10
mudler Jun 22, 2026
ee13a94
paged: in-kernel decode read patch 0009 (kill the gather regression)
mudler Jun 22, 2026
2c5adda
feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)
mudler Jun 22, 2026
e983919
feat(paged): route GQA-grouped tile kernel by default for paged decod…
mudler Jun 22, 2026
ba6bd94
feat(paged): assert mask-pad invariant for the paged tile route (patc…
mudler Jun 23, 2026
4bc2b4a
feat(paged): add patch 0013 decoupled per-step prefill-token budget
mudler Jun 23, 2026
dd6a442
feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill…
mudler Jun 23, 2026
a3abd60
docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)
mudler Jun 23, 2026
8925c00
docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10
mudler Jun 23, 2026
010067d
feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap
mudler Jun 23, 2026
acb22a6
feat(paged): mirror MoE token-tile density-aware auto-select (patch 0…
mudler Jun 23, 2026
ee78ae4
docs(paged): Qwen3.6 NVFP4 h2h bench doc - MoE llama.cpp table
mudler Jun 23, 2026
2975a74
docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM,…
mudler Jun 23, 2026
c8b1f16
docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget s…
mudler Jun 23, 2026
c7075fb
docs(paged): MoE 35B-A3B NVFP4 fair re-run with max_prefill_tokens bu…
mudler Jun 23, 2026
362eea9
docs(paged): fair re-run verdict - synthesize NVFP4 llama vs vLLM sco…
mudler Jun 23, 2026
ed17fc8
docs(paged): scope token-granular continuous-batch scheduler for llam…
mudler Jun 23, 2026
5a38dd3
docs(paged): adversarial review of the continuous-batch scheduler scope
mudler Jun 23, 2026
fccbb40
docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp
mudler Jun 24, 2026
24ce7d0
feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 001…
mudler Jun 24, 2026
f7500df
docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget
mudler Jun 24, 2026
e4c6317
docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.…
mudler Jun 24, 2026
ea634ee
docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go…
mudler Jun 24, 2026
c1d7f33
docs(paged): enrich track-B scope with code-level FP4-GEMM inefficien…
mudler Jun 24, 2026
7434d64
docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + p…
mudler Jun 24, 2026
39e16cc
docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go
mudler Jun 24, 2026
40f019e
docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate re…
mudler Jun 24, 2026
da67fd8
docs(paged): A.2 CUDA-graph decode lever measurement and gap diagnosis
mudler Jun 24, 2026
2dd5d68
docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaN…
mudler Jun 24, 2026
34cadb6
docs(paged): A.2 final synthesis - CUDA-graph decode verdict
mudler Jun 24, 2026
5ce2f1d
feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (pat…
mudler Jun 24, 2026
6f0792c
feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)
mudler Jun 24, 2026
ee13fd1
docs(paged): profile-both-engines post-SSM ground-truth decode decomp…
mudler Jun 25, 2026
c0e0ed3
docs(paged): synthesize decode-parity exploration - the o_proj MMVQ l…
mudler Jun 25, 2026
b895f4d
feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)
mudler Jun 25, 2026
e597a8a
docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs ll…
mudler Jun 25, 2026
2b57997
docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered a…
mudler Jun 25, 2026
a723852
docs(paged): decisive node-level decode timeline gap - bubbles refuted
mudler Jun 25, 2026
5825b07
docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan…
mudler Jun 25, 2026
fd4332e
docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, buil…
mudler Jun 25, 2026
2a8103c
docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv f…
mudler Jun 25, 2026
1785573
docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL g…
mudler Jun 25, 2026
5cec1a6
docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correc…
mudler Jun 25, 2026
8f8777e
feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)
mudler Jun 25, 2026
3c1ed67
feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune…
mudler Jun 25, 2026
02cbae5
feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)
mudler Jun 25, 2026
64766ec
Merge branch 'master' into worktree-feat+paged-attention
mudler Jun 25, 2026
634c0e5
docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling a…
mudler Jun 25, 2026
24833f0
docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own…
mudler Jun 26, 2026
7c45447
docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail
mudler Jun 26, 2026
aaaa90a
bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs …
mudler Jun 26, 2026
ae0042f
docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode…
mudler Jun 26, 2026
7dd3431
docs(paged): promote TTFT/prefill + paged-pool burst-degradation bug …
mudler Jun 26, 2026
00f9265
docs(paged): correct vLLM recurrent-state precision (f32, not bf16)
mudler Jun 26, 2026
001d833
docs(paged): f16/bf16 glue probe - dense decode residual ceiling
mudler Jun 26, 2026
89e62fc
docs(paged): finalize f16 glue probe - cost analysis + build verdict
mudler Jun 26, 2026
b061e4a
docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick…
mudler Jun 26, 2026
125d10a
feat(paged): paged-pool burst-reclaim (truncate + defrag + slot relea…
mudler Jun 26, 2026
167768c
feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery
mudler Jun 26, 2026
30a2b59
Merge branch 'master' into worktree-feat+paged-attention (llama.cpp p…
mudler Jun 26, 2026
ec7c1b1
feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 pat…
mudler Jun 26, 2026
4d3fecd
docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B …
mudler Jun 26, 2026
6bfca14
docs(paged): speedup-hunt C section + final RANK + PLAN synthesis
mudler Jun 26, 2026
39 changes: 39 additions & 0 deletions .docker/llama-cpp-localai-paged-compile.sh
@@ -0,0 +1,39 @@
#!/usr/bin/env bash
# Shared compile logic for backend/Dockerfile.llama-cpp-localai-paged.
# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.

set -euxo pipefail

export CCACHE_DIR=/root/.ccache
ccache --max-size=5G || true
ccache -z || true

export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"

if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
    CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
    export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
    echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
    rm -rf /LocalAI/backend/cpp/llama-cpp-localai-paged-*-build
fi

cd /LocalAI/backend/cpp/llama-cpp-localai-paged

if [ -z "${BUILD_TYPE:-}" ]; then
    # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
    # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
    if [ "${TARGETARCH}" = "arm64" ]; then
        apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
        export CC=gcc-14 CXX=g++-14
    fi
    make llama-cpp-localai-paged-cpu-all
else
    # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
    # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
    # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
    make llama-cpp-localai-paged-fallback
fi
make llama-cpp-localai-paged-grpc
make llama-cpp-localai-paged-rpc-server

ccache -s || true
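
The `CUDA_ARCH_ESC` line above relies on bash pattern substitution to put a backslash in front of every semicolon in `CUDA_DOCKER_ARCH`, presumably so the architecture list survives being re-expanded inside `CMAKE_ARGS`. A minimal sketch of that one expansion in isolation; the helper name `escape_cuda_arch` and the sample architecture list are assumptions for illustration, not part of the PR:

```shell
#!/usr/bin/env bash
# Sketch only: escape_cuda_arch is a hypothetical helper, not part of the PR.
set -euo pipefail

escape_cuda_arch() {
    local arch="$1"
    # Replace every ';' with '\;', the same expansion the compile script
    # applies to CUDA_DOCKER_ARCH.
    printf '%s\n' "${arch//;/\\;}"
}

# The sample value is an assumption; any semicolon-separated list works.
escape_cuda_arch '80;90;120'
```

An empty or unset `CUDA_DOCKER_ARCH` never reaches this expansion in the script, because the surrounding `if` guards on `-n`.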
163 changes: 163 additions & 0 deletions .github/backend-matrix.yml
@@ -4881,6 +4881,169 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Each
  # row mirrors the corresponding llama-cpp row with backend/dockerfile/tag-suffix
  # swapped; builder-base-image is left UNCHANGED so these reuse the same
  # base-grpc-* prebuilt bases (same gRPC + same toolchain), needing no new
  # base-images.yml variant.
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-arm64'
    base-image: "ubuntu:24.04"
    runs-on: 'ubuntu-24.04-arm'
    ubuntu-version: '2404'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-rocm-hipblas-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-rocm-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f32-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-intel-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f16-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-intel-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-arm64'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-arm64-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-l4t-cuda-12-arm64'
    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
    runs-on: 'ubuntu-24.04-arm'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2204'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-vulkan-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-vulkan-arm64'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
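
Since every new row differs from its stock llama-cpp counterpart only in `backend`, `dockerfile` and `tag-suffix`, a quick way to review the set is to list the tag suffixes that belong to one backend. The following is a rough sketch under stated assumptions: `list_tag_suffixes` is a hypothetical helper, the line-oriented parsing only works for flat `key: value` rows like the ones above (it is not a general YAML parser), and the stock `llama-cpp` suffix in the sample data is assumed rather than taken from the diff.

```shell
#!/usr/bin/env bash
# Sketch only: list_tag_suffixes is a hypothetical helper, not part of the PR.
set -euo pipefail

# Print the tag-suffix of every matrix row whose backend equals $1.
# $2 is a file laid out like the rows above: one "key: value" per line,
# each row starting with "- ".
list_tag_suffixes() {
    local backend="$1" file="$2"
    awk -v want="$backend" -v sq="'" '
        function emit_row() {
            if (name == want && suffix != "") print suffix
            name = ""; suffix = ""
        }
        {
            line = $0
            if (line ~ /^[ \t]*-[ \t]/) emit_row()   # a new row begins
            sub(/^[ \t]*(-[ \t]+)?/, "", line)       # drop indent and list marker
            key = line; sub(/:.*/, "", key)
            val = line; sub(/^[^:]*:[ \t]*/, "", val)
            gsub("[\"" sq "]", "", val)              # strip both quote styles
            if (key == "tag-suffix") suffix = val
            if (key == "backend") name = val
        }
        END { emit_row() }
    ' "$file"
}

# Sample data: the first row is an assumed stock llama-cpp entry.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
include:
  - build-type: 'cublas'
    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
    backend: "llama-cpp"
  - build-type: 'cublas'
    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp-localai-paged'
    backend: "llama-cpp-localai-paged"
EOF
list_tag_suffixes llama-cpp-localai-paged "$sample"
rm -f "$sample"
```

The row is only emitted once the next row starts (or at end of file) because `tag-suffix` appears before `backend` within each entry.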

# Darwin matrix (consumed by backend-jobs-darwin).
includeDarwin:
9 changes: 9 additions & 0 deletions .gitignore
@@ -9,6 +9,15 @@ prepare-sources
/backend/cpp/llama-cpp/llama.cpp
/backend/cpp/llama-*
!backend/cpp/llama-cpp
# llama-cpp-localai-paged is a tracked source dir (a thin wrapper Makefile over
# backend/cpp/llama-cpp). Re-include it like llama-cpp above; its sibling
# *-build dirs are still ignored by the /backend/cpp/llama-* rule, and its
# in-dir build artifacts (binaries, package output, collected ggml .so set) are
# re-ignored just below.
!backend/cpp/llama-cpp-localai-paged
/backend/cpp/llama-cpp-localai-paged/llama-cpp-localai-paged-*
/backend/cpp/llama-cpp-localai-paged/package
/backend/cpp/llama-cpp-localai-paged/ggml-shared-libs
/backends
/backend-images
/result.yaml
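
The ignore/re-include/re-ignore layering described in the comment above can be checked with `git check-ignore` in a scratch repository. This is a sketch under stated assumptions: it needs `git` on the `PATH`, the scratch repository and the `ignored` helper are hypothetical, and the file names under the source dir and the sibling `-fallback-build` dir are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: scratch repo and file names are assumptions, not part of the PR.
set -euo pipefail

demo_repo="$(mktemp -d)"
git init -q "$demo_repo"

# The llama-* rules from the hunk above, in the same order.
cat > "$demo_repo/.gitignore" <<'EOF'
/backend/cpp/llama-*
!backend/cpp/llama-cpp
!backend/cpp/llama-cpp-localai-paged
/backend/cpp/llama-cpp-localai-paged/llama-cpp-localai-paged-*
/backend/cpp/llama-cpp-localai-paged/package
/backend/cpp/llama-cpp-localai-paged/ggml-shared-libs
EOF

mkdir -p "$demo_repo/backend/cpp/llama-cpp-localai-paged/package" \
         "$demo_repo/backend/cpp/llama-cpp-localai-paged-fallback-build"
touch "$demo_repo/backend/cpp/llama-cpp-localai-paged/Makefile" \
      "$demo_repo/backend/cpp/llama-cpp-localai-paged/package/out.tar" \
      "$demo_repo/backend/cpp/llama-cpp-localai-paged-fallback-build/CMakeCache.txt"

# Succeeds (exit 0) when git would ignore the given repo-relative path.
ignored() { git -C "$demo_repo" check-ignore -q "$1"; }

for p in backend/cpp/llama-cpp-localai-paged/Makefile \
         backend/cpp/llama-cpp-localai-paged/package/out.tar \
         backend/cpp/llama-cpp-localai-paged-fallback-build/CMakeCache.txt; do
    if ignored "$p"; then echo "ignored: $p"; else echo "tracked: $p"; fi
done
```

Order matters here: the negation has to come after the broad `/backend/cpp/llama-*` rule, and the narrower in-dir rules have to come after the negation, since the last matching pattern wins.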
18 changes: 16 additions & 2 deletions Makefile
@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin backends/llama-cpp-localai-paged

GOCMD=go
GOTEST=$(GOCMD) test
@@ -664,6 +664,15 @@ test-extra-backend-llama-cpp: docker-build-llama-cpp
test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
	BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend

## llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Same
## GGUF surface as stock llama-cpp (the paged engine is runtime-gated by the
## LLAMA_KV_PAGED env the grpc-server option hooks set), so the standard
## llama-cpp capability set is what we exercise here.
test-extra-backend-llama-cpp-localai-paged: docker-build-llama-cpp-localai-paged
	BACKEND_IMAGE=local-ai-backend:llama-cpp-localai-paged \
	BACKEND_TEST_CAPS=health,load,predict,stream,logprobs,logit_bias \
	$(MAKE) test-extra-backend
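
The target above passes the capability set as a single comma-separated `BACKEND_TEST_CAPS` string. How the test harness consumes it is not shown in this diff; as a purely illustrative sketch, a membership check over such a list could look like the following, where `has_cap` is a hypothetical helper and `embeddings` is just an example of a capability that is not in the list.

```shell
#!/usr/bin/env bash
# Sketch only: has_cap is a hypothetical helper; the real harness may differ.
set -euo pipefail

# Succeeds when capability $2 appears in the comma-separated list $1.
has_cap() {
    local caps="$1" want="$2"
    # Wrapping both sides in commas avoids partial matches such as
    # "load" matching inside a longer capability name.
    case ",${caps}," in
        *",${want},"*) return 0 ;;
        *) return 1 ;;
    esac
}

caps='health,load,predict,stream,logprobs,logit_bias'
for c in predict logit_bias embeddings; do
    if has_cap "$caps" "$c"; then echo "$c: exercised"; else echo "$c: skipped"; fi
done
```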

## turboquant: exercises the llama.cpp-fork backend with the fork's
## *TurboQuant-specific* KV-cache types (turbo3 for both K and V). turbo3
## is what makes this backend distinct from stock llama-cpp — picking q8_0
@@ -1174,6 +1183,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# llama-cpp-localai-paged = stock llama.cpp grpc-server + the LocalAI paged-attention
# patch series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin
# wrapper Makefile (same upstream pin as stock llama-cpp; no fork, no patch-grpc-server).
BACKEND_LLAMA_CPP_LOCALAI_PAGED = llama-cpp-localai-paged|llama-cpp-localai-paged|.|false|false
# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
# Single-model; hardware-only validation lives at tests/e2e-backends/
# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
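
The `BACKEND_LLAMA_CPP_LOCALAI_PAGED` definition in this hunk is a pipe-delimited record that `generate-docker-build-target` consumes. The macro body is not part of this diff, so the field meanings below are an assumption: the first three fields are read as backend name, Dockerfile suffix and build context, and the two trailing booleans are left uninterpreted. `parse_backend_def` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: field meanings are assumed; parse_backend_def is hypothetical.
set -euo pipefail

parse_backend_def() {
    local def="$1" name dockerfile context flag4 flag5
    # Split the record on '|' into its five fields.
    IFS='|' read -r name dockerfile context flag4 flag5 <<< "$def"
    printf 'image=local-ai-backend:%s dockerfile=backend/Dockerfile.%s context=%s\n' \
        "$name" "$dockerfile" "$context"
}

parse_backend_def 'llama-cpp-localai-paged|llama-cpp-localai-paged|.|false|false'
```

The image name and Dockerfile path it prints match the `local-ai-backend:llama-cpp-localai-paged` tag and `backend/Dockerfile.llama-cpp-localai-paged` file that appear elsewhere in this PR.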
@@ -1275,6 +1288,7 @@ endef
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_LOCALAI_PAGED)))
$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
$(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
@@ -1338,7 +1352,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))
docker-save-%: backend-images
	docker save local-ai-backend:$* -o backend-images/$*.tar

docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-llama-cpp-localai-paged docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter

########################################################
### Mock Backend for E2E Tests