Skip to content

feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition)#10441

Open
localai-bot wants to merge 27 commits into
masterfrom
feat/recon-cpp-backends
Open

feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition)#10441
localai-bot wants to merge 27 commits into
masterfrom
feat/recon-cpp-backends

Conversation

@localai-bot

@localai-bot localai-bot commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds two new ggml/C++ backends that replace the Python insightface (face) and speaker-recognition (voice) biometric backends with standalone, dependency-free engines, integrated via purego (cgo-less):

  • voice-detect (mudler/voice-detect.cpp): speaker recognition + voice analysis. Encoders ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++; Kaldi-compatible FBank; wav2vec2 emotion and age/gender analyze heads; f16/q8_0 GGUF. Backend RPCs: VoiceEmbed / VoiceVerify / VoiceAnalyze (identify is a Go-side registry over embeddings).
  • face-detect (mudler/face-detect.cpp): face recognition. SCRFD (det_10g/2.5g/500m) + ArcFace (buffalo_l/m/s) + MobileFaceNet + YuNet + SFace; genderage; MiniFASNet anti-spoof; bit-exact libjpeg-turbo decode; f16/q8_0 GGUF. Backend RPCs: Embeddings / Detect / FaceVerify / FaceAnalyze.

Both engines are gated component-by-component against the PyTorch/ONNX reference (speaker embeddings cosine 1.0, max abs diff ~1e-6; face detection within 1px, recognition cosine 1.0). The Go side (proto, /v1/{face,voice}/* routes, registries) is unchanged.

What's in this PR

  • backend/go/voice-detect/ and backend/go/face-detect/: purego backends (static-linked ggml, ldd-clean; face also static-vendors libjpeg-turbo with hidden symbols).
  • backend/index.yaml: meta-backend entries (capabilities maps, per-arch images) mirroring ced, covering the CUDA 12/13 and nvidia-l4t arm64 variants.
  • backend/Dockerfile.golang + the backend Makefiles: enable the engines' opt-in cuDNN implicit-GEMM conv path on the arm64 + CUDA 13 variant (GB10 / Jetson), where libcudnn already ships; OFF everywhere else (no new dependency, verified via make -p).
  • gallery/index.yaml: 12 models (6 voice, 6 face), each pointing at a published HF GGUF with a real sha256.
  • docs/content/features/{face,voice}-recognition.md, core/config/backend_capabilities.go, .github/backend-matrix.yml, .github/workflows/bump_deps.yaml.

Status

  • Engines are parity-verified and performance-tuned. On CPU they sit at the honest onnxruntime/MLAS ceiling (on par end to end; the win is roughly 2x to 13x lower peak RAM, no Python, one self-contained binary, breadth). On GPU they reach reference parity via a persistent CUDA-graph fix and cuDNN conv (SCRFD detect 14.8 -> 6.4 ms, ~parity with torch-cuDNN).
  • Both engine repos carry release-ready READMEs + demos.
  • All 12 GGUFs are published to mudler/voice-detect-gguf / mudler/face-detect-gguf; every gallery sha256 is filled (no TODO markers remain).

Pre-merge checklist

  • Make mudler/voice-detect.cpp and mudler/face-detect.cpp public. The backend Makefiles clone them over HTTPS at a pinned commit, so the backend-build CI jobs fail until the repos are public. This is the only remaining blocker for green CI.
  • Publish the converted GGUFs and fill every sha256 (done, 12 on HF).
  • docs/content/features/{face,voice}-recognition.md coverage (done).

Licensing

  • insightface buffalo packs: non-commercial (research-only). YuNet + SFace: Apache-2.0 (the commercial-friendly alternative).
  • wav2vec2 emotion / age-gender models: non-commercial.

Notes

  • The cuDNN conv path is opt-in (VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN, default OFF), enabled only for the arm64 + CUDA 13 backend image, which already bundles libcudnn. No other variant gains a dependency.
  • Known: face end-to-end cosine for two crowded-fixture faces sits at 0.99984 to 0.99989 (just under 0.9999) from ggml-vs-onnxruntime SCRFD conv accumulation (~0.05px landmark drift, within the <=1px detection spec), far inside the verify threshold.

mudler added 6 commits June 22, 2026 00:00
Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.

The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.

Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
  verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
  threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
  document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
  -> dominant_gender with the remaining float scores as the gender map; emotion
  label/scores -> dominant_emotion/emotion).

The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):

- backend/index.yaml: add the &voicedetect meta-backend (capabilities
  platform map, no top-level uri) plus the full set of concrete per-arch
  image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
  -development variants). Referential integrity audited - every alias
  target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
  ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
  wav2vec2 age/gender/emotion analyze model. The engine architecture is
  read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
  not yet published: each files: entry points at the intended
  mudler/voice-detect-gguf location with a TODO to fill sha256 after
  upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
  VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
  backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
  speaker_recognition), mirroring speaker-recognition.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:

- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
  -> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
  (class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
  verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
  age + gender ("M"/"F" normalized to "Man"/"Woman").

The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.

Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Project style forbids em/en dashes. Replace the three U+2014 chars
introduced by the voice-detect gallery/index wiring with `-`/`:`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):

- backend/index.yaml: add the &facedetect meta-backend (capabilities
  platform map, no top-level uri to avoid the meta-backend gotcha) plus
  the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
  metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
  22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
  face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
  and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
  commercial-friendly alternative). The detector/embedder architecture is
  read from GGUF metadata (facedetect.arch) at load; only the real
  verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
  artifacts are not yet published: each files: entry points at the
  intended mudler/face-detect-gguf location with a TODO to fill sha256
  after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
  backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
  face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
  FACEDETECT_VERSION (pin 636a1963).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…cases

Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.

Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
func (f *FaceDetect) embedPath(path string) ([]float32, error) {
var vec uintptr
var dim int32
rc := CppEmbedPath(f.ctxPtr, path, unsafe.Pointer(&vec), unsafe.Pointer(&dim))
func (f *FaceDetect) embedPath(path string) ([]float32, error) {
var vec uintptr
var dim int32
rc := CppEmbedPath(f.ctxPtr, path, unsafe.Pointer(&vec), unsafe.Pointer(&dim))
// uintptr->Pointer conversion trips vet's unsafeptr check, which can't tell
// a C heap pointer from Go-managed memory; safe here, the GC neither tracks
// nor moves this buffer and we copy immediately.
src := unsafe.Slice((*float32)(unsafe.Pointer(vec)), int(dim)) //nolint:govet // C-owned malloc'd vector, copied out before free
// uintptr->Pointer conversion trips vet's unsafeptr check, which can't tell
// a C heap pointer from Go-managed memory; safe here, the GC neither tracks
// nor moves this buffer and we copy immediately.
src := unsafe.Slice((*float32)(unsafe.Pointer(vec)), int(dim)) //nolint:govet // C-owned malloc'd vector, copied out before free
var distance float32
var verified int32
rc := CppVerifyPaths(f.ctxPtr, path1, path2, threshold, antiSpoof,
unsafe.Pointer(&distance), unsafe.Pointer(&verified))
// uintptr->Pointer conversion trips vet's unsafeptr check, which can't tell
// a C heap pointer from Go-managed memory; safe here, the GC neither tracks
// nor moves this buffer and we copy immediately.
src := unsafe.Slice((*float32)(unsafe.Pointer(vec)), int(dim)) //nolint:govet // C-owned malloc'd vector, copied out before free
var distance float32
var verified int32
rc := CppVerifyPaths(v.ctxPtr, req.Audio1, req.Audio2, threshold,
unsafe.Pointer(&distance), unsafe.Pointer(&verified))
var distance float32
var verified int32
rc := CppVerifyPaths(v.ctxPtr, req.Audio1, req.Audio2, threshold,
unsafe.Pointer(&distance), unsafe.Pointer(&verified))
if cptr == 0 {
return ""
}
p := unsafe.Pointer(cptr) //nolint:govet // C-owned malloc'd buffer, not Go-GC memory (see doc above)
for *(*byte)(unsafe.Add(p, n)) != 0 {
n++
}
return string(unsafe.Slice((*byte)(p), n))
mudler added 20 commits June 22, 2026 08:43
Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
  analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Fill in the published HuggingFace GGUF uris and verified sha256 for the
9 recon gallery entries (voice-detect-* and face-detect-*), and remove
the TODO publish markers. Correct the eres2net, campplus, and
emotion-wav2vec2 uris to the actual published filenames.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
… voice model

Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now
embedded and re-uploaded under the same filenames/uris) and note the
FaceVerify anti_spoof request flag in each description. Add a new
voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion
model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add gallery entries for two newly-published insightface face packs on
the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small
ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k
R100, 512-d). Both are non-commercial research-only.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…kends

LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context,
wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp
-> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace
BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1T on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…complete)

voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel
(2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch
(ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd
output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect
~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable
single binary (function-multiversioned, no global -mavx512f).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1T / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…e 2d2d5f0)

face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd
microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1T, SCRFD detect to 0.966 of
MLAS @1T, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512
winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker
1.90x @1T. Parity cosine=1.0 throughout; portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…d07, face be22d67)

Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s,
within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback,
fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%)
+ ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace
recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).
Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1T /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…ace d80092b)

voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run
the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0);
ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.
face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x
MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd
(0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…9b1804, face 47fdab6)

Measured-gap-driven conv kernels: small-spatial (fill the register tile when
output width <= tile width) + small-IC stem + strided-1x1/downsample recovery.
ArcFace recognizer 0.57 -> 0.70x MLAS @1T (the closest conv model), WeSpeaker
0.65 -> 0.79x @1T. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever
was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not
read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs
an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…che (voice 45d2e6b, face 0a4799a)

GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context
cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn
-> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x),
CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free
keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit.
Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
… face 6107a24)

Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN /
FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled).
On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN
parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity),
WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no
spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA
backend build must enable the flag AND bundle libcudnn - deferred until a
cuDNN-bundled GPU image; flag stays OFF here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN
implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN
(default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity
(SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10
(arm64, CUDA 13, sm_121a).

Enable it for the CUDA build, but only where cuDNN actually ships: the
arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN,
so flipping it on globally for BUILD_TYPE=cublas would be a link failure.
The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the
matrix/Docker build, uname -m fallback for local builds).

backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13
in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so
the build-time link resolves.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…ecd)

Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the
blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path
(CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml
mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t,
2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs
golden), so registered voices + verify/identify thresholds are unaffected. The
prior default-OFF rested on a stale comment whose 23pct regression only held on
the non-shipping GGML_NATIVE=ON build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@localai-bot localai-bot marked this pull request as ready for review June 24, 2026 21:58
…Latest News

Add a Latest News entry for the new from-scratch C++/ggml biometric backends
(voice-detect.cpp + face-detect.cpp) that replace the Python insightface and
speaker-recognition backends: no Python/onnxruntime at inference, self-contained
GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp /
locate-anything.cpp native-backend news entries. Refs PR #10441.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants