GitHub - vllm-project/vllm-gaudi: Community maintained hardware plugin for vLLM on Intel Gaudi

vLLM Hardware Plugin for Intel® Gaudi®

| Documentation | Intel® Gaudi® Documentation | Optimizing Training Platform Guide |

Latest News 🔥

[2026/04] Version 0.19.0 is now available, built on vLLM 0.19.0 and fully compatible with Intel® Gaudi® v1.24.0 with PyTorch 2.10.

This release upgrades the platform to Intel® Gaudi® Software v1.24.0 with PyTorch 2.10. It introduces Qwen 3.5 model support, Mamba prefix caching for hybrid models, MxFP4 weight dequantization, LMCache integration, and a custom depthwise conv1d TPC kernel for MambaMixer2. Performance improvements include torch.compile-compatible online defragmentation, improved warmup time, and optimized hybrid KV cache visibility.

[2026/04] Version 0.17.1 is now available, built on vLLM 0.17.1 and fully compatible with Intel® Gaudi® v1.23.0.

This patch release backports critical fixes and improvements, including MxFP4 weight loading, Granite 4.0-h calibration, prefix caching for HPUMambaMixer2, OOM crash fixes, and SDL secure error handling improvements.

[2026/03] Version 0.16.0 is now available, built on vLLM 0.16.0 and fully compatible with Intel® Gaudi® v1.23.0.

This release introduces validated support and critical stability fixes for Qwen3-VL models leveraging HPUMMEncoderAttention. Performance and stability were improved through backported Mamba architecture optimizations, Docker and UBI infrastructure enhancements, and a forced CPU loading mechanism for INC quantization to prevent OOM errors.

About

The vLLM Hardware Plugin for Intel® Gaudi® integrates Intel® Gaudi® AI accelerators with vLLM to optimize large language model inference. It follows the [RFC]: Hardware pluggable and [RFC]: Enhancing vLLM Plugin Architecture principles, providing a modular interface for Intel® Gaudi® hardware. For more information, see the Plugin System document.

Getting Started

Set up your execution environment. Additionally, to achieve the best performance on HPU, follow the methods outlined in the Optimizing Training Platform Guide.
Get the last verified vLLM commit. While vLLM Hardware Plugin for Intel® Gaudi® follows the latest vLLM commits, upstream API updates may introduce compatibility issues. The saved commit has been thoroughly validated.
```
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
cd ..
```
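Because the `git show` call above redirects errors to `/dev/null`, a missing branch or file leaves `VLLM_COMMIT_HASH` empty instead of failing loudly. The following is a minimal sketch of a sanity check, assuming the stored value is a lowercase hexadecimal Git hash of 7 to 40 characters. The helper name `looks_like_commit_hash` is hypothetical and not part of the plugin.

```python
import os
import re

# Assumption: the stable-commit file holds an abbreviated (7+) or full (40)
# lowercase hexadecimal Git hash. This helper is illustrative only.
_HASH_RE = re.compile(r"[0-9a-f]{7,40}")


def looks_like_commit_hash(value: str) -> bool:
    """Return True if value, ignoring surrounding whitespace, looks like a Git hash."""
    return _HASH_RE.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    commit = os.environ.get("VLLM_COMMIT_HASH", "")
    if looks_like_commit_hash(commit):
        print(f"Using vLLM commit {commit.strip()}")
    else:
        print("VLLM_COMMIT_HASH is empty or malformed; re-run the git show step")
```

Running a check like this before `git checkout $VLLM_COMMIT_HASH` avoids checking out an unintended revision when the variable is empty.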

Install vLLM using pip or build it from source:

```
# Build vLLM from source for empty platform, reusing existing torch installation
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout $VLLM_COMMIT_HASH
pip install -r <(sed '/^torch/d' requirements/build/cuda.txt)
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation -e .
cd ..
```
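The `sed '/^torch/d'` filter in the build commands deletes every requirement line that begins with `torch`, so pip does not replace the existing PyTorch installation. As a rough illustration in Python (the sample requirement lines are invented for this example and are not copied from vLLM's requirements file):

```python
def drop_torch_requirements(lines):
    """Mimic sed '/^torch/d': drop every line that begins with 'torch'."""
    return [line for line in lines if not line.startswith("torch")]


# Illustrative input only; not the real contents of the requirements file.
sample = ["cmake>=3.26", "torch==2.10.0", "torchvision", "ninja", "  torch-extra"]
print(drop_torch_requirements(sample))
# prints ['cmake>=3.26', 'ninja', '  torch-extra']
```

Note that the pattern is anchored at the start of the line, so it also removes packages such as `torchvision` but leaves indented lines untouched.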

Install vLLM Hardware Plugin for Intel® Gaudi® from source:
```
cd vllm-gaudi
pip install -e .
cd ..
```
Install torchaudio (required by some upstream vLLM models such as QWEN3_5). Use the CPU wheel with --no-deps to avoid pulling a conflicting CUDA torch:
```
TORCH_VERSION=$(python3 -c "import re, torch; print(re.match(r'(\d+\.\d+\.\d+)', torch.__version__).group(1))")
pip install --no-deps torchaudio==$TORCH_VERSION --extra-index-url https://download.pytorch.org/whl/cpu
```
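The `re.match` call in the command above keeps only the `major.minor.patch` part of `torch.__version__` and drops any build suffix, so the result can serve as a `torchaudio` version pin. A small sketch of the same expression applied to sample strings (the version strings are illustrative and not taken from a real Gaudi build):

```python
import re


def base_version(version: str) -> str:
    """Extract the leading 'major.minor.patch' part of a version string."""
    match = re.match(r"(\d+\.\d+\.\d+)", version)
    if match is None:
        raise ValueError(f"no major.minor.patch prefix in {version!r}")
    return match.group(1)


print(base_version("2.10.0+hpu.example"))  # prints 2.10.0
print(base_version("2.10.0a0+git1234"))    # prints 2.10.0
```

Unlike the one-liner, this version raises a clear error when the string has no three-part prefix, rather than failing with an `AttributeError` on `None`.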
To see all the available installation methods, such as NIXL, see the Installation guide.

Distributed Executor Backend and Worker Start Method

On HPU, multi-card serving defaults to vLLM's `mp` distributed executor backend, which is based on Python multiprocessing, whenever `world_size > 1`, that is, whenever `TP * PP * DP > 1`. When `world_size == 1`, vLLM uses the in-process `uni` backend.
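The default selection rule can be written down in a few lines. This is only a model of the behavior described above, not vLLM's implementation; the function names `world_size` and `default_backend` are hypothetical.

```python
def world_size(tp: int = 1, pp: int = 1, dp: int = 1) -> int:
    """World size as the product of tensor, pipeline, and data parallel sizes."""
    return tp * pp * dp


def default_backend(tp: int = 1, pp: int = 1, dp: int = 1) -> str:
    """Model of the default described above: 'mp' for multi-card, 'uni' otherwise."""
    return "mp" if world_size(tp, pp, dp) > 1 else "uni"


print(default_backend(tp=1))        # prints uni
print(default_backend(tp=8))        # prints mp
print(default_backend(tp=2, pp=2))  # prints mp
```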

The worker start method is controlled by `VLLM_WORKER_MULTIPROC_METHOD`, which accepts `fork` or `spawn`. Upstream vLLM defaults to `fork`. On HPU, however, the platform layer automatically overrides it to `spawn`, because forking after HPU driver initialization leaves driver state in child processes and can cause hangs on exit. A warning is logged when the override is applied. To opt out, set `VLLM_WORKER_MULTIPROC_METHOD=fork` explicitly, although this is not recommended. The `uni`, `external_launcher`, and `ray` backends do not start workers via Python multiprocessing, so the value has no practical effect for them.
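The resolution order described above can be summarized as a small decision function. This is a sketch of the documented behavior under stated assumptions, not the plugin's actual code: the function name `effective_start_method` is hypothetical, and any value other than `fork` or `spawn` is treated as unset.

```python
from typing import Mapping, Optional


def effective_start_method(backend: str, env: Mapping[str, str]) -> Optional[str]:
    """Model which worker start method applies on HPU for a given backend.

    Returns None for backends that do not start workers via Python
    multiprocessing (uni, external_launcher, ray), where the variable
    has no practical effect.
    """
    if backend != "mp":
        return None
    explicit = env.get("VLLM_WORKER_MULTIPROC_METHOD")
    if explicit in ("fork", "spawn"):
        return explicit  # an explicit setting is honored, including fork
    return "spawn"  # HPU platform layer overrides the upstream fork default


print(effective_start_method("mp", {}))                                        # prints spawn
print(effective_start_method("mp", {"VLLM_WORKER_MULTIPROC_METHOD": "fork"}))  # prints fork
print(effective_start_method("ray", {}))                                       # prints None
```

In practice the variable is set in the shell or in `os.environ` before the engine starts, so that the setting is visible when workers are launched.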

For more information, see docs/configuration/env_variables.md.

Contributing

We welcome and value any contributions and collaborations.

Contact Us

For technical questions and feature requests, please use GitHub Issues.

