Latest News 🔥
- [2026/04] Version 0.19.0 is now available, built on vLLM 0.19.0 and fully compatible with Intel® Gaudi® v1.24.0 with PyTorch 2.10.
  This release upgrades the platform to Intel® Gaudi® Software v1.24.0 with PyTorch 2.10. It introduces Qwen 3.5 model support, Mamba prefix caching for hybrid models, MxFP4 weight dequantization, LMCache integration, and a custom depthwise conv1d TPC kernel for MambaMixer2. Performance improvements include torch.compile-compatible online defragmentation, improved warmup time, and optimized hybrid KV cache visibility.
- [2026/04] Version 0.17.1 is now available, built on vLLM 0.17.1 and fully compatible with Intel® Gaudi® v1.23.0.
  This patch release backports critical fixes and improvements including MxFP4 weight loading, Granite 4.0-h calibration, prefix caching for HPUMambaMixer2, OOM crash fixes, and SDL secure error handling improvements.
- [2026/03] Version 0.16.0 is now available, built on vLLM 0.16.0 and fully compatible with Intel® Gaudi® v1.23.0.
  This release introduces validated support and critical stability fixes for Qwen3-VL models leveraging HPUMMEncoderAttention. Performance and stability were improved through backported Mamba architecture optimizations, Docker and UBI infrastructure enhancements, and a forced CPU loading mechanism for INC quantization to prevent OOM errors.

The vLLM Hardware Plugin for Intel® Gaudi® integrates Intel® Gaudi® AI accelerators with vLLM to optimize large language model inference. It follows the [RFC]: Hardware pluggable and [RFC]: Enhancing vLLM Plugin Architecture principles, providing a modular interface for Intel® Gaudi® hardware. For more information, see the Plugin System document.
- Set up your execution environment. Additionally, to achieve the best performance on HPU, follow the methods outlined in the Optimizing Training Platform Guide.
- Get the last verified vLLM commit. While vLLM Hardware Plugin for Intel® Gaudi® follows the latest vLLM commits, upstream API updates may introduce compatibility issues. The saved commit has been thoroughly validated.

      git clone https://github.com/vllm-project/vllm-gaudi
      cd vllm-gaudi
      export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
      cd ..
- Install vLLM using `pip` or build it from source:

      # Build vLLM from source for empty platform, reusing existing torch installation
      git clone https://github.com/vllm-project/vllm
      cd vllm
      git checkout $VLLM_COMMIT_HASH
      pip install -r <(sed '/^torch/d' requirements/build/cuda.txt)
      VLLM_TARGET_DEVICE=empty pip install --no-build-isolation -e .
      cd ..
- Install vLLM Hardware Plugin for Intel® Gaudi® from source:

      cd vllm-gaudi
      pip install -e .
      cd ..
- Install torchaudio (required by some upstream vLLM models such as QWEN3_5). Use the CPU wheel with `--no-deps` to avoid pulling a conflicting CUDA torch:

      TORCH_VERSION=$(python3 -c "import re, torch; print(re.match(r'(\d+\.\d+\.\d+)', torch.__version__).group(1))")
      pip install --no-deps torchaudio==$TORCH_VERSION --extra-index-url https://download.pytorch.org/whl/cpu

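The version pin in the torchaudio step relies on a small regular expression. As a minimal sketch of what that one-liner computes, the snippet below extracts the leading MAJOR.MINOR.PATCH component from a version string and drops any suffix. The function name `base_torch_version` and the sample version strings are illustrative assumptions, not part of the plugin or of PyTorch.

```python
import re


def base_torch_version(version: str) -> str:
    """Return the leading MAJOR.MINOR.PATCH part of a version string."""
    # Same pattern as the shell one-liner: three dot-separated integers
    # anchored at the start of the string.
    match = re.match(r"(\d+\.\d+\.\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    return match.group(1)


# Hypothetical version strings, for illustration only.
print(base_torch_version("2.10.0"))      # 2.10.0
print(base_torch_version("2.10.0+cpu"))  # 2.10.0
```

Stripping the suffix matters because a local build tag after `+` would not match any published torchaudio release, so `pip` could not resolve the pin.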
To see all the available installation methods, such as NIXL, see the Installation guide.
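Because the `git show` command in the steps above discards errors with `2>/dev/null`, `VLLM_COMMIT_HASH` can end up empty if the branch or file cannot be read, and `git checkout` would then run without a target. A small pre-build check along the following lines may help catch that. It is a sketch that assumes the stable-commit file holds a hexadecimal Git commit ID (abbreviated or full); `looks_like_commit_hash` is a hypothetical helper name.

```python
import os
import re

# Assumption: the value is a hexadecimal Git commit ID, 7 to 40 characters long.
_COMMIT_RE = re.compile(r"[0-9a-fA-F]{7,40}")


def looks_like_commit_hash(value: str) -> bool:
    """Return True if value, ignoring surrounding whitespace, looks like a commit ID."""
    return _COMMIT_RE.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    # Read the variable exported by the "last verified vLLM commit" step.
    commit = os.environ.get("VLLM_COMMIT_HASH", "")
    if looks_like_commit_hash(commit):
        print(f"Using vLLM commit {commit.strip()}")
    else:
        print("VLLM_COMMIT_HASH is empty or malformed; re-run the git show step.")
```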
On HPU, multi-card serving uses vLLM's mp backend (the Python multiprocessing distributed executor) by default whenever world_size > 1, that is, whenever TP * PP * DP > 1, where TP, PP, and DP are the tensor-, pipeline-, and data-parallel sizes. When world_size == 1, vLLM uses the in-process uni backend.
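The default described above reduces to a product and a comparison. The following sketch models it; it is not vLLM's actual selection code, and the function names `world_size` and `default_backend` are hypothetical.

```python
def world_size(tp: int = 1, pp: int = 1, dp: int = 1) -> int:
    """Product of the tensor-, pipeline-, and data-parallel sizes."""
    for name, value in (("tp", tp), ("pp", pp), ("dp", dp)):
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    return tp * pp * dp


def default_backend(tp: int = 1, pp: int = 1, dp: int = 1) -> str:
    """Model of the default only: 'mp' for multi-card, 'uni' for single-card."""
    return "mp" if world_size(tp, pp, dp) > 1 else "uni"


print(default_backend())              # uni
print(default_backend(tp=8))          # mp
print(default_backend(tp=2, pp=2))    # mp
```

The sketch covers only the default; an explicitly chosen backend such as ray or external_launcher is outside its scope.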
The worker start method is controlled by VLLM_WORKER_MULTIPROC_METHOD, with fork or spawn as the available options. Upstream vLLM defaults to fork; however, on HPU, the platform layer automatically overrides it to spawn because forking after HPU driver initialization leaves driver state in child processes and can cause hangs on exit. A warning is logged when the override is applied. To opt out, set VLLM_WORKER_MULTIPROC_METHOD=fork explicitly, although this is not recommended. The uni, external_launcher, and ray backends do not start workers via Python multiprocessing, so the value has no practical effect for them.
For more information, see docs/configuration/env_variables.md.
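The start-method rules above can be summarized as a small decision function. This is a sketch of the behavior as described in this section, not the platform layer's actual implementation; `resolve_start_method` is a hypothetical name, and the environment is passed in as a plain mapping so the logic can be exercised without touching `os.environ`.

```python
from typing import Mapping, Optional

ENV_VAR = "VLLM_WORKER_MULTIPROC_METHOD"


def resolve_start_method(env: Mapping[str, str], backend: str = "mp") -> Optional[str]:
    """Return the worker start method on HPU, or None when it does not apply."""
    # uni, external_launcher, and ray do not start workers through
    # Python multiprocessing, so the variable has no practical effect.
    if backend != "mp":
        return None
    explicit = env.get(ENV_VAR)
    if explicit is None:
        # Unset: the upstream default (fork) is overridden to spawn on HPU.
        return "spawn"
    if explicit not in ("fork", "spawn"):
        raise ValueError(f"{ENV_VAR} must be 'fork' or 'spawn', got {explicit!r}")
    # An explicit value is honored, including the discouraged 'fork'.
    return explicit


print(resolve_start_method({}))                 # spawn
print(resolve_start_method({ENV_VAR: "fork"}))  # fork
print(resolve_start_method({}, backend="ray"))  # None
```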
We welcome and value any contributions and collaborations.
- For technical questions and feature requests, please use GitHub Issues.

