cuda-entrypoint.sh awk pattern fails on NVIDIA driver 6xx (CUDA UMD Version: header rename) → CUDA_ERROR_NO_DEVICE → CPU fallback

## Summary

`cuda-entrypoint.sh` (introduced in #831) uses `nvidia-smi | awk '/CUDA Version/ {print $3}'` to detect the host CUDA version. NVIDIA driver 6xx-series rearranged the `nvidia-smi` header line and renamed `CUDA Version:` → `CUDA UMD Version:`. The unanchored regex matches the new header as a substring, but `$3` then yields the literal string `"Version:"` instead of a version number. Parsing degrades to `DRIVER_INT=0`, the script wrongly classifies the host as pre-12.9.1, and prepends `/usr/local/cuda/compat` (cuda-compat-12-9, `libcuda.so.575.57.08`) to `LD_LIBRARY_PATH`. On a CUDA 13 / driver 6xx host that compat shim cannot bind the new KMD ABI, so `cuDeviceGetCount` returns `CUDA_ERROR_NO_DEVICE` and candle/cudarc silently falls back to CPU.

Net effect: TEI runs on CPU on every driver-6xx host, with only a `WARN` line distinguishing it from a healthy GPU run.

## Affected image tags

`120-1.9.3`, `120-latest`, `120-sha-*` (most recent main HEAD builds). All carry an identical unmodified `cuda-entrypoint.sh`. Verified via `docker pull` + `docker inspect` on each tag — same env, same entrypoint, same bug.

## Reproducer (driver 610.43.02, CUDA UMD 13.3 on WSL2 Docker Desktop)

Host driver:

```
$ nvidia-smi
| NVIDIA-SMI 610.43.02              | KMD Version: 610.47        | CUDA UMD Version: 13.3        |
```

Container GPU access works:

```
$ docker run --rm --gpus all --entrypoint nvidia-smi \
    ghcr.io/huggingface/text-embeddings-inference:120-1.9.3
# → reports both GPUs healthy
```

But the binary falls back to CPU:

```
$ docker run --rm --gpus all ghcr.io/huggingface/text-embeddings-inference:120-1.9.3 \
    --model-id BAAI/bge-reranker-v2-m3 2>&1 | grep -E "model on|CUDA"
WARN Could not find a compatible CUDA device on host: CUDA is not available.
     Caused by: DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected")
INFO Starting Bert model on Cpu
```

Reproducing the parse failure directly inside the container:

```
$ docker exec <container> bash -c '
    DRIVER_CUDA=$(nvidia-smi 2>/dev/null | awk "/CUDA Version/ {print \$3; exit}")
    echo "DRIVER_CUDA=$DRIVER_CUDA"
'
DRIVER_CUDA=
```

(Empty — should have been `13.3`.)

## Root cause line

[`cuda-entrypoint.sh:13`](https://github.com/huggingface/text-embeddings-inference/blob/main/cuda-entrypoint.sh#L13):

```bash
DRIVER_CUDA=$(nvidia-smi 2>/dev/null | awk '/CUDA Version/ {print $3; exit}')
```

`/CUDA Version/` is unanchored and matches the new header's `CUDA UMD Version` substring, but `$3` in the new whitespace layout is the literal `"Version:"` — not a number. `IFS='.' read MAJ MIN PATCH` then yields `MAJ="Version:"` and `10#${MAJ}` arithmetic silently coerces to 0. `DRIVER_INT=0 < TARGET_INT=120901` → the `LD_LIBRARY_PATH=/usr/local/cuda/compat:...` branch fires on exactly the hosts PR #831 was designed to exclude.

## Proposed fix — hybrid: machine-readable accessor + legacy fallback

```bash
# Primary: machine-readable interface (stable since R340 / ~2014, documented in nvidia-smi(1))
DRIVER_VER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader,nounits 2>/dev/null | head -n1)
DRIVER_MAJ="${DRIVER_VER%%.*}"

# Derive max-supported CUDA major from R-branch floor (NVIDIA CUDA Toolkit Release Notes, Table 2).
# 11.x ≥ R450, 12.x ≥ R525, 13.x ≥ R580.
DRIVER_CUDA=""
case "$DRIVER_MAJ" in
  ''|*[!0-9]*) DRIVER_CUDA="" ;;
  *)
    if   [ "$DRIVER_MAJ" -ge 580 ]; then DRIVER_CUDA="13.0"
    elif [ "$DRIVER_MAJ" -ge 525 ]; then DRIVER_CUDA="12.9"
    elif [ "$DRIVER_MAJ" -ge 450 ]; then DRIVER_CUDA="11.8"
    fi
    ;;
esac

# Fallback: legacy header parse for hosts where --query-gpu fails (very rare).
# Tolerates BOTH old "CUDA Version:" and new "CUDA UMD Version:" spellings.
if [ -z "$DRIVER_CUDA" ]; then
  DRIVER_CUDA=$(nvidia-smi 2>/dev/null \
    | grep -Eo 'CUDA( UMD)? Version:[[:space:]]*[0-9]+\.[0-9]+' \
    | head -n1 \
    | grep -Eo '[0-9]+\.[0-9]+' \
    | head -n1)
fi
```

Downstream `IFS='.' read MAJ MIN PATCH` / `DRIVER_INT` / `TARGET_INT` test stays unchanged — the case-statement returns dotted majors (`"13.0"`, `"12.9"`, `"11.8"`) so the existing arithmetic is fed valid input. `PATCH` defaults to empty → coerces to 0 cleanly.

### Why this shape

1. **`--query-gpu=driver_version` is NVIDIA's contractually-stable machine accessor** (R340 / 2014+, documented in `nvidia-smi(1)`). The text header line is for humans and has been rearranged at least once (5xx→6xx). Relying on it is the canonical anti-pattern NVIDIA explicitly warns against in the `nvidia-smi` manual.
2. **The branch-floor mapping comes from NVIDIA's CUDA Toolkit Release Notes Table 2** (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/) — no per-release table maintenance, just one arm per future CUDA major.
3. **The regex fallback** uses `grep -Eo` chains rather than GAWK `match($0, /…/, m)` to stay POSIX-portable, and tolerates both header spellings as a belt-and-suspenders safety net.

## Edge cases worth thinking about

- **CUDA 14 (future):** add one arm to the case statement. The branch-floor pattern means the only ongoing maintenance is one line per new CUDA major.
- **`nvidia-smi` absent:** the existing script already tolerates a missing `nvidia-smi` (everything degrades to `DRIVER_INT=0`); the proposed change preserves that. An opt-in `TEI_DRIVER_CUDA` env override would be a nice future addition for users bringing their own libcuda but is out of scope here.
- **WSL2:** `nvidia-smi` inside the container reports the Windows host driver version; the floor-lookup is still correct (this is how I observed and reproduced the bug).
- **MIG / vGPU:** `--query-gpu=driver_version` returns the host driver fine; no special handling.

## Workaround for users hitting this right now

While the fix lands, users can bypass `cuda-entrypoint.sh` entirely by overriding the entrypoint in their compose / `docker run` invocation:

```yaml
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:120-1.9.3
    entrypoint: ["/usr/local/bin/text-embeddings-router"]
    command: --model-id <your-model> --hostname 0.0.0.0 [...]
```

This skips the compat-library decision and runs the binary with the image's default `LD_LIBRARY_PATH=/usr/local/cuda/lib64`, which is correct for CUDA 13.x hosts (WSL2 GPU stub at `/usr/lib/wsl/drivers/.../libcuda.so.1.1` dispatches to the host driver). Validated locally on driver 610.47 — `Starting FlashBert model on Cuda(...)` + warm `/embed` 14–26 ms, `/rerank` 27 ms.

## References

- Source of the buggy entrypoint: #831 ("Fix support for containers w/ CUDA 13.0+", merged 2026-02-17)
- NVIDIA CUDA Toolkit Release Notes Table 2 (driver branch ↔ CUDA major): https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/
- `nvidia-smi(1)` documents `--query-gpu=driver_version` as a stable machine accessor


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda-entrypoint.sh awk pattern fails on NVIDIA driver 6xx (CUDA UMD Version: header rename) → CUDA_ERROR_NO_DEVICE → CPU fallback #870

Summary

Affected image tags

Reproducer (driver 610.43.02, CUDA UMD 13.3 on WSL2 Docker Desktop)

Root cause line

Proposed fix — hybrid: machine-readable accessor + legacy fallback

Why this shape

Edge cases worth thinking about

Workaround for users hitting this right now

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

cuda-entrypoint.sh awk pattern fails on NVIDIA driver 6xx (CUDA UMD Version: header rename) → CUDA_ERROR_NO_DEVICE → CPU fallback #870

Description

Summary

Affected image tags

Reproducer (driver 610.43.02, CUDA UMD 13.3 on WSL2 Docker Desktop)

Root cause line

Proposed fix — hybrid: machine-readable accessor + legacy fallback

Why this shape

Edge cases worth thinking about

Workaround for users hitting this right now

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions