This guide walks through deploying an LLM on Intel GPUs with LLMKube.

Prerequisites:
- Kubernetes cluster with Intel GPU device plugin installed
- At least one node advertising Intel GPU resources
- LLMKube operator installed
- kubectl configured for your cluster
Check allocatable Intel GPU resources on nodes:
kubectl get nodes -o custom-columns=NAME:.metadata.name,I915:.status.allocatable."gpu\.intel\.com/i915",XE:.status.allocatable."gpu\.intel\.com/xe"

Expected result: on at least one node, the I915 or XE column (gpu.intel.com/i915 or gpu.intel.com/xe) shows a value rather than <none>.
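The column check above can be scripted. The following is a minimal sketch, not part of LLMKube: the helper name `pick_intel_gpu_resource` is hypothetical, and it assumes the three-column layout (NAME, I915, XE) produced by the command above. It prefers gpu.intel.com/i915 when both keys are present, because that is the controller default.

```shell
# Hypothetical helper: reads the NAME/I915/XE table on stdin and prints the
# Intel GPU resource key to use, or fails if no node advertises one.
pick_intel_gpu_resource() {
  awk '
    NR > 1 {
      # Non-numeric cells such as <none> evaluate to 0 in awk arithmetic.
      if ($2 + 0 > 0) i915 = 1
      if ($3 + 0 > 0) xe = 1
    }
    END {
      if (i915)    { print "gpu.intel.com/i915" }
      else if (xe) { print "gpu.intel.com/xe" }
      else         { print "no Intel GPU resources found" > "/dev/stderr"; exit 1 }
    }'
}

# Usage: pipe the node table from the command above into the helper.
#   kubectl get nodes -o custom-columns=... | pick_intel_gpu_resource
```

If the helper prints gpu.intel.com/xe, the controller needs the override described in the next step.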
By default, the controller uses gpu.intel.com/i915 for Intel workloads.
If your cluster uses gpu.intel.com/xe, configure the controller:
kubectl -n llmkube-system set env deployment/llmkube-controller-manager LLMKUBE_INTEL_GPU_RESOURCE=gpu.intel.com/xe
kubectl -n llmkube-system rollout status deployment/llmkube-controller-manager

Deploy a model with the llmkube CLI:
llmkube deploy phi-4-mini \
  --gpu \
  --gpu-count 1 \
  --accelerator intel

Notes:
- --accelerator intel sets the model's hardware accelerator to Intel.
- The default Intel runtime image is ghcr.io/ggml-org/llama.cpp:server-intel.

To deploy with manifests instead, save the following as intel-phi4.yaml:
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: intel-phi4
  namespace: default
spec:
  source: https://huggingface.co/microsoft/Phi-4-mini-instruct-gguf/resolve/main/Phi-4-mini-instruct-q4_0.gguf
  format: gguf
  hardware:
    accelerator: intel
    gpu:
      enabled: true
      count: 1
      vendor: intel
      layers: -1
  resources:
    cpu: "2"
    memory: "4Gi"
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: intel-phi4
  namespace: default
spec:
  modelRef: intel-phi4
  replicas: 1
  image: ghcr.io/ggml-org/llama.cpp:server-intel
  resources:
    gpu: 1
    cpu: "2"
    memory: "4Gi"
  podSecurityContext:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0

Apply:
kubectl apply -f intel-phi4.yaml

Watch resource state:
kubectl get model intel-phi4 -w
kubectl get inferenceservice intel-phi4 -w
kubectl get pods -l app=intel-phi4 -o wide

Check deployment resource limits and toleration key:
kubectl get deploy intel-phi4 -o jsonpath='{.spec.template.spec.containers[0].resources.limits}{"\n"}{.spec.template.spec.tolerations}{"\n"}'

Expected result: the output includes the Intel GPU resource key (for example, gpu.intel.com/i915).
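
The limits check can also be automated instead of read by eye. This is a small sketch under assumptions: `has_intel_gpu_limit` is a hypothetical helper name, and it only looks for one of the two resource keys named in this guide anywhere in the text it is given.

```shell
# Hypothetical helper: succeeds if the text on stdin mentions an Intel GPU
# resource key (gpu.intel.com/i915 or gpu.intel.com/xe).
has_intel_gpu_limit() {
  grep -Eq 'gpu\.intel\.com/(i915|xe)'
}

# Usage against the Deployment created for the InferenceService:
#   kubectl get deploy intel-phi4 \
#     -o jsonpath='{.spec.template.spec.containers[0].resources.limits}' \
#     | has_intel_gpu_limit && echo "Intel GPU limit present"
```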
Confirm SYCL backend and GPU offload in logs:
POD=$(kubectl get pod -l app=intel-phi4 -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$POD" -c llama-server --tail=200

Expected log markers:
- loaded SYCL backend
- using device SYCL0
- offloaded ... layers to GPU
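
To check for those markers without scanning the log manually, a sketch along these lines may help. The function name `check_sycl_offload` is hypothetical, and the patterns are simply the three markers listed above, with the elided layer count matched by a wildcard.

```shell
# Hypothetical helper: reads llama-server logs on stdin and succeeds only if
# every expected marker appears at least once.
check_sycl_offload() {
  logs=$(cat)
  for marker in 'loaded SYCL backend' 'using device SYCL0' 'offloaded .* layers to GPU'; do
    if ! printf '%s\n' "$logs" | grep -q -- "$marker"; then
      echo "missing log marker: $marker" >&2
      return 1
    fi
  done
}

# Usage, reusing $POD from the step above:
#   kubectl logs "$POD" -c llama-server --tail=200 | check_sycl_offload
```

A missing marker is reported on stderr, which points at the stage (backend load, device selection, or layer offload) to investigate first.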
Port-forward and run a request:
kubectl port-forward svc/intel-phi4 8080:8080

In another terminal:
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }' | jq '.timings.prompt_per_second, .timings.predicted_per_second'

- prompt_per_second is prompt token throughput, in tokens per second.
- predicted_per_second is generation throughput, in tokens per second.

Use the Intel runtime image:
ghcr.io/ggml-org/llama.cpp:server-intel
If pulling private images from GHCR, configure imagePullSecrets on the controller or workload namespace.
If model-downloader fails with permission denied errors, set podSecurityContext on the InferenceService:
podSecurityContext:
  runAsUser: 0
  runAsGroup: 0
  fsGroup: 0

For GPU scheduling problems:
- Verify that the node's allocatable resource key matches the controller configuration.
- If nodes expose gpu.intel.com/xe, set LLMKUBE_INTEL_GPU_RESOURCE=gpu.intel.com/xe.
- Check pod events:
  kubectl describe pod -l app=intel-phi4
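
To see which resource key the controller is currently configured with, the following sketch reads the environment variable back from the controller Deployment. The helper name `controller_intel_resource` is hypothetical. The namespace, Deployment name, and variable name are the ones used earlier in this guide, and the fallback assumes an unset variable means the default gpu.intel.com/i915.

```shell
# Hypothetical helper: prints the Intel GPU resource key the controller is
# configured with, falling back to the default when the variable is unset.
controller_intel_resource() {
  key=$(kubectl -n llmkube-system get deployment llmkube-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="LLMKUBE_INTEL_GPU_RESOURCE")].value}')
  echo "${key:-gpu.intel.com/i915}"
}

# Usage: compare the printed key with the non-empty column from the node check.
#   controller_intel_resource
```

If the printed key differs from the one the nodes advertise, the workload requests a resource that no node can satisfy, and the pod events typically report it as unschedulable.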