System Info
Docker image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.9
Start command: --tokenization-workers=16 --dtype float16 --auto-truncate --max-client-batch-size 128
Host OS: Ubuntu
Information
Tasks
Reproduction
I've been running Qwen3-Embedding-8B with `--dtype float16` and noticed that any input that starts with the token "import" (token ID 474), such as "import", "importance", or "important", returns an all-NaN embedding vector.
# Reproduction
curl <TEI_URL>/embed \
-X POST \
-H "Content-Type: application/json" \
-d '{"inputs": "importance"}'
# Returns: [[NaN, NaN, NaN, ...]]
Expected behavior
While investigating, I found other people reporting exactly the same issue: https://huggingface.co/Qwen/Qwen3-Embedding-8B/discussions/21. Prepending a space to the input does seem to mitigate it.
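To compare the failing inputs against their space-prefixed variants, something like the following could be used. It is only a sketch: `embed`, `all_nan`, and `compare` are hypothetical helper names, `base_url` stands in for `<TEI_URL>`, and it assumes the `/embed` response is a JSON list of vectors as in the curl example above. Treating `null` entries like NaN is also an assumption, in case the server serializes NaN that way.

```python
import json
import math
import urllib.request


def embed(base_url, text):
    """POST a single input to <base_url>/embed and return the parsed JSON body."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/embed",
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Python's json module accepts the non-standard NaN literal by default.
        return json.loads(resp.read().decode("utf-8"))


def all_nan(embeddings):
    """True if there is at least one vector and every value is NaN (or null)."""
    values = [v for row in embeddings for v in row]
    return bool(values) and all(v is None or math.isnan(v) for v in values)


def compare(base_url, words=("import", "importance", "important")):
    """Print whether each word, with and without a leading space, yields all NaN."""
    for word in words:
        for text in (word, " " + word):
            print(repr(text), "all NaN:", all_nan(embed(base_url, text)))
```

Calling `compare("http://localhost:8080")` (the address is a placeholder) would print one line per variant.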
I checked out tag v1.9.2, traced through the model layer by layer, and found that the NaN originates from an FP16 overflow in the MLP layers. Here's the chain of events:
- RMSNorm normalizes hidden states to ~1.0 — perfectly safe in F16
- Attention runs fine on these normalized values
- The MLP's down_proj output, however, reaches values around ~2.95 million for this token
- F16 can only represent values up to ~65504, so this overflows to Inf
- The residual add (Inf + finite) stays Inf
- The next layer's RMSNorm receives Inf and produces NaN
- NaN propagates through every remaining layer
The overflow first appears at layer 2 and corrupts the entire output from that point on.
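The chain above can be reproduced in isolation with NumPy. This is a toy sketch, not the TEI code: the vector size, the position of the large value, the all-ones weight matrix, and the `rms_norm` helper (which accumulates in float32, an assumption about how the real kernel behaves) are all made up for illustration. Only the ~2.95 million magnitude comes from the trace.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # largest finite float16 value


def rms_norm(x, eps=1e-6):
    """Toy RMSNorm: accumulate in float32, return float16 (no learned scale)."""
    x32 = x.astype(np.float32)
    inv_rms = 1.0 / np.sqrt(np.mean(x32 * x32) + eps)
    return (x32 * inv_rms).astype(np.float16)


with np.errstate(all="ignore"):  # silence overflow/invalid warnings
    hidden = np.ones(8, dtype=np.float16)      # residual stream, safe values

    mlp_out = np.zeros(8, dtype=np.float32)
    mlp_out[3] = 2.95e6                        # magnitude seen at down_proj
    mlp_out_f16 = mlp_out.astype(np.float16)   # exceeds FP16_MAX, becomes inf

    residual = hidden + mlp_out_f16            # inf + finite stays inf

    # mean(x^2) is inf, so 1/rms is 0 and inf * 0 gives NaN at index 3.
    normed = rms_norm(residual)

    # Any following projection mixes that NaN into every output element.
    weights = np.ones((8, 8), dtype=np.float16)
    next_hidden = weights @ normed

print(FP16_MAX, mlp_out_f16[3], residual[3], normed[3], next_hidden)
```

A single overflowing element is enough: after the normalization only that element is NaN, but the next matrix multiply spreads it across the whole vector.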
I'm not sure whether this is right, but my hypothesis is that Qwen3-Embedding-8B was trained in BF16, so the MLP weights learned during training produce activations that exceed the range F16 can represent, which leads to this overflow.
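As a quick sanity check on the range argument, the largest finite values of the two formats can be computed from their bit layouts (FP16: 5 exponent bits, 10 mantissa bits; BF16: 8 exponent bits, 7 mantissa bits). This only shows that a value of the observed magnitude fits in BF16 and not in FP16; it does not confirm how the model was trained.

```python
# Largest finite value = (2 - 2^-mantissa_bits) * 2^max_exponent
FP16_MAX = (2 - 2**-10) * 2.0**15    # 65504.0
BF16_MAX = (2 - 2**-7) * 2.0**127    # roughly 3.39e38

observed = 2.95e6  # approximate down_proj output from the trace

fits_fp16 = observed <= FP16_MAX
fits_bf16 = observed <= BF16_MAX

print(f"FP16 max: {FP16_MAX:.0f}, fits: {fits_fp16}")
print(f"BF16 max: {BF16_MAX:.3e}, fits: {fits_bf16}")
```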