Skip to content

cuda-entrypoint.sh awk pattern fails on NVIDIA driver 6xx (CUDA UMD Version: header rename) → CUDA_ERROR_NO_DEVICE → CPU fallback #870

@randomnimbus

Description

@randomnimbus

Summary

cuda-entrypoint.sh (introduced in #831) uses nvidia-smi | awk '/CUDA Version/ {print $3}' to detect the host CUDA version. NVIDIA driver 6xx-series rearranged the nvidia-smi header line and renamed CUDA Version:CUDA UMD Version:. The unanchored regex matches the new header as a substring, but $3 then yields the literal string "Version:" instead of a version number. Parsing degrades to DRIVER_INT=0, the script wrongly classifies the host as pre-12.9.1, and prepends /usr/local/cuda/compat (cuda-compat-12-9, libcuda.so.575.57.08) to LD_LIBRARY_PATH. On a CUDA 13 / driver 6xx host that compat shim cannot bind the new KMD ABI, so cuDeviceGetCount returns CUDA_ERROR_NO_DEVICE and candle/cudarc silently falls back to CPU.

Net effect: TEI runs on CPU on every driver-6xx host, with only a WARN line distinguishing it from a healthy GPU run.

Affected image tags

120-1.9.3, 120-latest, 120-sha-* (most recent main HEAD builds). All carry an identical unmodified cuda-entrypoint.sh. Verified via docker pull + docker inspect on each tag — same env, same entrypoint, same bug.

Reproducer (driver 610.43.02, CUDA UMD 13.3 on WSL2 Docker Desktop)

Host driver:

$ nvidia-smi
| NVIDIA-SMI 610.43.02              | KMD Version: 610.47        | CUDA UMD Version: 13.3        |

Container GPU access works:

$ docker run --rm --gpus all --entrypoint nvidia-smi \
    ghcr.io/huggingface/text-embeddings-inference:120-1.9.3
# → reports both GPUs healthy

But the binary falls back to CPU:

$ docker run --rm --gpus all ghcr.io/huggingface/text-embeddings-inference:120-1.9.3 \
    --model-id BAAI/bge-reranker-v2-m3 2>&1 | grep -E "model on|CUDA"
WARN Could not find a compatible CUDA device on host: CUDA is not available.
     Caused by: DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected")
INFO Starting Bert model on Cpu

Reproducing the parse failure directly inside the container:

$ docker exec <container> bash -c '
    DRIVER_CUDA=$(nvidia-smi 2>/dev/null | awk "/CUDA Version/ {print \$3; exit}")
    echo "DRIVER_CUDA=$DRIVER_CUDA"
'
DRIVER_CUDA=

(Empty — should have been 13.3.)

Root cause line

cuda-entrypoint.sh:13:

DRIVER_CUDA=$(nvidia-smi 2>/dev/null | awk '/CUDA Version/ {print $3; exit}')

/CUDA Version/ is unanchored and matches the new header's CUDA UMD Version substring, but $3 in the new whitespace layout is the literal "Version:" — not a number. IFS='.' read MAJ MIN PATCH then yields MAJ="Version:" and 10#${MAJ} arithmetic silently coerces to 0. DRIVER_INT=0 < TARGET_INT=120901 → the LD_LIBRARY_PATH=/usr/local/cuda/compat:... branch fires on exactly the hosts PR #831 was designed to exclude.

Proposed fix — hybrid: machine-readable accessor + legacy fallback

# Primary: machine-readable interface (stable since R340 / ~2014, documented in nvidia-smi(1))
DRIVER_VER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader,nounits 2>/dev/null | head -n1)
DRIVER_MAJ="${DRIVER_VER%%.*}"

# Derive max-supported CUDA major from R-branch floor (NVIDIA CUDA Toolkit Release Notes, Table 2).
# 11.x ≥ R450, 12.x ≥ R525, 13.x ≥ R580.
DRIVER_CUDA=""
case "$DRIVER_MAJ" in
  ''|*[!0-9]*) DRIVER_CUDA="" ;;
  *)
    if   [ "$DRIVER_MAJ" -ge 580 ]; then DRIVER_CUDA="13.0"
    elif [ "$DRIVER_MAJ" -ge 525 ]; then DRIVER_CUDA="12.9"
    elif [ "$DRIVER_MAJ" -ge 450 ]; then DRIVER_CUDA="11.8"
    fi
    ;;
esac

# Fallback: legacy header parse for hosts where --query-gpu fails (very rare).
# Tolerates BOTH old "CUDA Version:" and new "CUDA UMD Version:" spellings.
if [ -z "$DRIVER_CUDA" ]; then
  DRIVER_CUDA=$(nvidia-smi 2>/dev/null \
    | grep -Eo 'CUDA( UMD)? Version:[[:space:]]*[0-9]+\.[0-9]+' \
    | head -n1 \
    | grep -Eo '[0-9]+\.[0-9]+' \
    | head -n1)
fi

Downstream IFS='.' read MAJ MIN PATCH / DRIVER_INT / TARGET_INT test stays unchanged — the case-statement returns dotted majors ("13.0", "12.9", "11.8") so the existing arithmetic is fed valid input. PATCH defaults to empty → coerces to 0 cleanly.

Why this shape

  1. --query-gpu=driver_version is NVIDIA's contractually-stable machine accessor (R340 / 2014+, documented in nvidia-smi(1)). The text header line is for humans and has been rearranged at least once (5xx→6xx). Relying on it is the canonical anti-pattern NVIDIA explicitly warns against in the nvidia-smi manual.
  2. The branch-floor mapping comes from NVIDIA's CUDA Toolkit Release Notes Table 2 (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/) — no per-release table maintenance, just one arm per future CUDA major.
  3. The regex fallback uses grep -Eo chains rather than GAWK match($0, /…/, m) to stay POSIX-portable, and tolerates both header spellings as a belt-and-suspenders safety net.

Edge cases worth thinking about

  • CUDA 14 (future): add one arm to the case statement. The branch-floor pattern means the only ongoing maintenance is one line per new CUDA major.
  • nvidia-smi absent: the existing script already tolerates a missing nvidia-smi (everything degrades to DRIVER_INT=0); the proposed change preserves that. An opt-in TEI_DRIVER_CUDA env override would be a nice future addition for users bringing their own libcuda but is out of scope here.
  • WSL2: nvidia-smi inside the container reports the Windows host driver version; the floor-lookup is still correct (this is how I observed and reproduced the bug).
  • MIG / vGPU: --query-gpu=driver_version returns the host driver fine; no special handling.

Workaround for users hitting this right now

While the fix lands, users can bypass cuda-entrypoint.sh entirely by overriding the entrypoint in their compose / docker run invocation:

services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:120-1.9.3
    entrypoint: ["/usr/local/bin/text-embeddings-router"]
    command: --model-id <your-model> --hostname 0.0.0.0 [...]

This skips the compat-library decision and runs the binary with the image's default LD_LIBRARY_PATH=/usr/local/cuda/lib64, which is correct for CUDA 13.x hosts (WSL2 GPU stub at /usr/lib/wsl/drivers/.../libcuda.so.1.1 dispatches to the host driver). Validated locally on driver 610.47 — Starting FlashBert model on Cuda(...) + warm /embed 14–26 ms, /rerank 27 ms.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions