Summary
cuda-entrypoint.sh (introduced in #831) uses nvidia-smi | awk '/CUDA Version/ {print $3}' to detect the host CUDA version. NVIDIA driver 6xx-series rearranged the nvidia-smi header line and renamed CUDA Version: to CUDA UMD Version:. The pattern /CUDA Version/ needs the two words to be adjacent, so it no longer matches the new header and the command substitution returns an empty string instead of a version number (see the reproducer output below). Parsing degrades to DRIVER_INT=0, the script wrongly classifies the host as pre-12.9.1, and it prepends /usr/local/cuda/compat (cuda-compat-12-9, libcuda.so.575.57.08) to LD_LIBRARY_PATH. On a CUDA 13 / driver 6xx host that compat shim cannot bind the new KMD ABI, so cuDeviceGetCount returns CUDA_ERROR_NO_DEVICE and candle/cudarc silently falls back to CPU.
Net effect: TEI runs on CPU on every driver-6xx host, with only a WARN line distinguishing it from a healthy GPU run.
Affected image tags
120-1.9.3, 120-latest, 120-sha-* (most recent main HEAD builds). All carry an identical unmodified cuda-entrypoint.sh. Verified via docker pull + docker inspect on each tag — same env, same entrypoint, same bug.
Reproducer (driver 610.43.02, CUDA UMD 13.3 on WSL2 Docker Desktop)
Host driver:
$ nvidia-smi
| NVIDIA-SMI 610.43.02 | KMD Version: 610.47 | CUDA UMD Version: 13.3 |
Container GPU access works:
$ docker run --rm --gpus all --entrypoint nvidia-smi \
ghcr.io/huggingface/text-embeddings-inference:120-1.9.3
# → reports both GPUs healthy
But the binary falls back to CPU:
$ docker run --rm --gpus all ghcr.io/huggingface/text-embeddings-inference:120-1.9.3 \
--model-id BAAI/bge-reranker-v2-m3 2>&1 | grep -E "model on|CUDA"
WARN Could not find a compatible CUDA device on host: CUDA is not available.
Caused by: DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected")
INFO Starting Bert model on Cpu
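Because the fallback is visible only in the log, a deployment check could classify the startup output. The following is a minimal sketch that assumes the log phrases quoted above ("model on Cpu", "model on Cuda"); tei_backend_from_log is a hypothetical helper name, not part of TEI, and other TEI versions may word these lines differently.

```shell
# Hypothetical helper: read TEI startup output on stdin and print
# "cuda", "cpu", or "unknown" depending on which backend line appears.
# The matched phrases are taken from the log lines quoted in this report.
tei_backend_from_log() {
    awk '
        /model on Cuda/ { backend = "cuda" }
        /model on Cpu/  { backend = "cpu" }
        END { print (backend == "" ? "unknown" : backend) }
    '
}
```

A typical use would be to pipe docker logs <container> 2>&1 into tei_backend_from_log and fail the deployment when the result is not cuda.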
Reproducing the parse failure directly inside the container:
$ docker exec <container> bash -c '
DRIVER_CUDA=$(nvidia-smi 2>/dev/null | awk "/CUDA Version/ {print \$3; exit}")
echo "DRIVER_CUDA=$DRIVER_CUDA"
'
DRIVER_CUDA=
(Empty — should have been 13.3.)
Root cause line
cuda-entrypoint.sh:13:
DRIVER_CUDA=$(nvidia-smi 2>/dev/null | awk '/CUDA Version/ {print $3; exit}')
/CUDA Version/ matches only the contiguous text "CUDA Version". In the new header, "UMD" sits between the two words, so no line matches, awk prints nothing, and DRIVER_CUDA is empty (as the reproducer above shows). IFS='.' read MAJ MIN PATCH then yields empty fields, which the integer conversion silently coerces to 0. DRIVER_INT=0 < TARGET_INT=120901, so the LD_LIBRARY_PATH=/usr/local/cuda/compat:... branch fires on exactly the hosts PR #831 was designed to exclude.
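The parse can be checked without a GPU by feeding the header line from the reproducer to the same awk program. This is only a sketch: NEW_HEADER is the single header line quoted earlier in this report (real nvidia-smi output has more lines and different spacing), and parse_cuda_legacy is a hypothetical wrapper name.

```shell
# Header line copied from the reproducer in this report (driver 610.43.02).
NEW_HEADER='| NVIDIA-SMI 610.43.02 | KMD Version: 610.47 | CUDA UMD Version: 13.3 |'

# Same awk program as cuda-entrypoint.sh:13, reading header text from stdin.
parse_cuda_legacy() {
    awk '/CUDA Version/ {print $3; exit}'
}

DRIVER_CUDA=$(printf '%s\n' "$NEW_HEADER" | parse_cuda_legacy)
echo "DRIVER_CUDA=[$DRIVER_CUDA]"
```

On this input the script prints DRIVER_CUDA=[], the same empty value the in-container reproducer reports.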
Proposed fix — hybrid: machine-readable accessor + legacy fallback
# Primary: machine-readable interface (stable since R340 / ~2014, documented in nvidia-smi(1))
DRIVER_VER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader,nounits 2>/dev/null | head -n1)
DRIVER_MAJ="${DRIVER_VER%%.*}"
# Derive max-supported CUDA major from R-branch floor (NVIDIA CUDA Toolkit Release Notes, Table 2).
# 11.x ≥ R450, 12.x ≥ R525, 13.x ≥ R580.
DRIVER_CUDA=""
case "$DRIVER_MAJ" in
    ''|*[!0-9]*) DRIVER_CUDA="" ;;
    *)
        if   [ "$DRIVER_MAJ" -ge 580 ]; then DRIVER_CUDA="13.0"
        elif [ "$DRIVER_MAJ" -ge 525 ]; then DRIVER_CUDA="12.9"
        elif [ "$DRIVER_MAJ" -ge 450 ]; then DRIVER_CUDA="11.8"
        fi
        ;;
esac
# Fallback: legacy header parse for hosts where --query-gpu fails (very rare).
# Tolerates BOTH old "CUDA Version:" and new "CUDA UMD Version:" spellings.
if [ -z "$DRIVER_CUDA" ]; then
    DRIVER_CUDA=$(nvidia-smi 2>/dev/null \
        | grep -Eo 'CUDA( UMD)? Version:[[:space:]]*[0-9]+\.[0-9]+' \
        | head -n1 \
        | grep -Eo '[0-9]+\.[0-9]+' \
        | head -n1)
fi
The downstream IFS='.' read MAJ MIN PATCH / DRIVER_INT / TARGET_INT test stays unchanged: the case statement returns dotted MAJ.MIN strings ("13.0", "12.9", "11.8"), so the existing arithmetic is fed valid input. PATCH is left empty and coerces to 0 cleanly.
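For readers who do not have the script open, the comparison can be sketched as follows. This is a reconstruction from the description in this report, not the script's actual code: it assumes the encoding MAJ*10000 + MIN*100 + PATCH (which is consistent with TARGET_INT=120901 for 12.9.1), uses awk instead of the script's bash arithmetic so that it runs under any POSIX shell, and the names version_to_int and needs_compat are hypothetical.

```shell
# Hypothetical helper: encode "MAJ.MIN.PATCH" as MAJ*10000 + MIN*100 + PATCH.
# awk treats missing or non-numeric fields as 0, which mirrors the silent
# coercion to 0 described in this report.
version_to_int() {
    printf '%s\n' "$1" | awk -F. '{ printf "%d\n", $1 * 10000 + $2 * 100 + $3 }'
}

# Threshold from the report: hosts below CUDA 12.9.1 get the compat libraries.
TARGET_INT=$(version_to_int "12.9.1")

# Hypothetical predicate: succeeds when the compat path would be prepended.
needs_compat() {
    [ "$(version_to_int "$1")" -lt "$TARGET_INT" ]
}
```

Under this encoding an empty DRIVER_CUDA becomes 0 and takes the compat branch, while "13.0" becomes 130000 and skips it. Note that "12.9" becomes 120900, one below the target, so under the proposed mapping every 12.x host takes the compat branch.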
Why this shape
- --query-gpu=driver_version is NVIDIA's stable machine-readable accessor (R340 / 2014+, documented in nvidia-smi(1)). The text header line is for humans and has been rearranged at least once (5xx→6xx). Parsing it is the pattern the nvidia-smi manual warns against.
- The branch-floor mapping comes from NVIDIA's CUDA Toolkit Release Notes Table 2 (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/): no per-release table maintenance, just one arm per future CUDA major.
- The regex fallback uses grep -Eo chains rather than GAWK match($0, /…/, m) to avoid a gawk dependency (grep -o is not strictly POSIX, but GNU, BSD, and BusyBox grep all support it), and tolerates both header spellings as a belt-and-suspenders safety net.
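Both halves of the proposed fix can be exercised without a GPU by wrapping them in functions that take their input as an argument or on stdin. This is a sketch for testing only: cuda_floor_for_driver and cuda_from_header are hypothetical names, and the sample inputs in the usage note are illustrative strings rather than captured nvidia-smi output.

```shell
# Hypothetical wrapper around the proposed case statement: maps a driver
# version string (e.g. "610.43.02") to the CUDA floor used by the fix.
# Prints an empty line for non-numeric majors or drivers older than R450.
cuda_floor_for_driver() {
    maj="${1%%.*}"
    case "$maj" in
        ''|*[!0-9]*) echo "" ;;
        *)
            if   [ "$maj" -ge 580 ]; then echo "13.0"
            elif [ "$maj" -ge 525 ]; then echo "12.9"
            elif [ "$maj" -ge 450 ]; then echo "11.8"
            else echo ""
            fi
            ;;
    esac
}

# The fallback pipeline from the proposed fix, reading header text on stdin.
cuda_from_header() {
    grep -Eo 'CUDA( UMD)? Version:[[:space:]]*[0-9]+\.[0-9]+' \
        | head -n1 \
        | grep -Eo '[0-9]+\.[0-9]+' \
        | head -n1
}
```

For example, cuda_floor_for_driver 610.43.02 prints 13.0, and piping either "CUDA Version: 12.4" or "CUDA UMD Version: 13.3" into cuda_from_header prints the bare version number.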
Edge cases worth thinking about
- CUDA 14 (future): add one arm to the case statement. The branch-floor pattern means the only ongoing maintenance is one line per new CUDA major.
- nvidia-smi absent: the existing script already tolerates a missing nvidia-smi (everything degrades to DRIVER_INT=0); the proposed change preserves that. An opt-in TEI_DRIVER_CUDA env override would be a nice future addition for users bringing their own libcuda but is out of scope here.
- WSL2: nvidia-smi inside the container reports the Windows host driver version; the floor-lookup is still correct (this is how I observed and reproduced the bug).
- MIG / vGPU: --query-gpu=driver_version returns the host driver fine; no special handling.
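If the TEI_DRIVER_CUDA override were ever added, it might look roughly like this. This is purely illustrative: the variable does not exist in the current image, resolve_driver_cuda is a hypothetical name, and the sanity check on the override is deliberately loose.

```shell
# Hypothetical: prefer an explicit TEI_DRIVER_CUDA override over detection.
# $1 is whatever the detection logic produced (may be empty).
resolve_driver_cuda() {
    detected="$1"
    case "${TEI_DRIVER_CUDA:-}" in
        # Loose check: starts with a digit and contains ".<digit>", e.g. "13.0".
        [0-9]*.[0-9]*) echo "$TEI_DRIVER_CUDA" ;;
        # Unset, empty, or malformed override: fall back to the detected value.
        *) echo "$detected" ;;
    esac
}
```

Ignoring a malformed override rather than failing keeps the current behavior when the variable is set by mistake; a stricter variant could log a warning instead.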
Workaround for users hitting this right now
While the fix lands, users can bypass cuda-entrypoint.sh entirely by overriding the entrypoint in their compose / docker run invocation:
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:120-1.9.3
    entrypoint: ["/usr/local/bin/text-embeddings-router"]
    command: --model-id <your-model> --hostname 0.0.0.0 [...]
This skips the compat-library decision and runs the binary with the image's default LD_LIBRARY_PATH=/usr/local/cuda/lib64, which is correct for CUDA 13.x hosts (WSL2 GPU stub at /usr/lib/wsl/drivers/.../libcuda.so.1.1 dispatches to the host driver). Validated locally on driver 610.47 — Starting FlashBert model on Cuda(...) + warm /embed 14–26 ms, /rerank 27 ms.
References
nvidia-smi(1)documents--query-gpu=driver_versionas a stable machine accessor