This guide covers going from a freshly provisioned GPU host to a working
production deployment of the gateway plus any supported open-source LLM
backend. Read BACKENDS.md for per-backend specifics.

Prerequisites:

- Ubuntu 22.04+ or equivalent (WSL2 works for dev)
- Docker Engine 24+ and Docker Compose v2
- NVIDIA drivers + container toolkit for GPU profiles (skip for CPU-only via Ollama / llama.cpp)

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi

Fix the GPU runtime first if the second command fails.
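The prerequisite checks above can be bundled into a small preflight script. What follows is a minimal sketch, assuming a POSIX shell and a Docker release that supports `docker version --format` and `docker compose version --short`. The function names `major_version` and `preflight` are illustrative and not part of this repository.

```shell
#!/bin/sh
# Preflight sketch. Function names are illustrative, not part of the repo.

# major_version VERSION: print the leading integer of a version string,
# tolerating an optional "v" prefix (for example "v2.24.5" gives 2).
major_version() {
  v="${1#v}"
  printf '%s\n' "${v%%.*}"
}

# preflight: check Docker Engine 24+, Compose v2, and GPU access.
preflight() {
  engine="$(docker version --format '{{.Server.Version}}')" || return 1
  [ "$(major_version "$engine")" -ge 24 ] || {
    echo "Docker Engine 24+ required, found $engine" >&2
    return 1
  }

  compose="$(docker compose version --short)" || return 1
  [ "$(major_version "$compose")" -ge 2 ] || {
    echo "Docker Compose v2 required, found $compose" >&2
    return 1
  }

  # GPU profiles only: drop these two checks for CPU-only backends.
  nvidia-smi >/dev/null || {
    echo "host driver check failed" >&2
    return 1
  }
  docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi >/dev/null || {
    echo "container GPU runtime check failed" >&2
    return 1
  }

  echo "preflight ok"
}
```

Source the file and call `preflight`; a non-zero exit status means one of the prerequisites still needs attention.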
git clone <YOUR_REPO_URL> selfhosted-chat-api
cd selfhosted-chat-api
make env-vllm # or env-ollama / env-tgi / env-sglang / env-llamacpp / env-localai / env-external
$EDITOR .env # set API_KEYS, MODEL_NAME, any backend knobs

Minimum production checklist in .env:

- `API_KEYS=` set to one or more long random tokens
- `API_HOST=127.0.0.1` (unless you front the API container with Nginx on the same host, keep it bound to loopback)
- `BACKEND_KIND` matches the profile you start
- `CORS_ORIGINS` restricted to your frontends
- `RATE_LIMIT_ENABLED=true` unless you front with a real limiter
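The checklist above lends itself to an automated check before each deploy. The sketch below assumes that `.env` holds plain, unquoted `KEY=value` lines and that a bare `*` in `CORS_ORIGINS` means "allow every origin"; both are assumptions about this project, and `env_value` and `check_env` are hypothetical helper names.

```shell
#!/bin/sh
# .env checklist sketch. Helper names and the "*" wildcard rule are assumptions.

# env_value FILE KEY: print the value of the last KEY=... line (empty if absent).
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# check_env [FILE]: report every checklist item that is not satisfied.
check_env() {
  file="${1:-.env}"
  status=0

  [ -n "$(env_value "$file" API_KEYS)" ] || {
    echo "API_KEYS is empty"
    status=1
  }
  [ "$(env_value "$file" API_HOST)" = "127.0.0.1" ] || {
    echo "API_HOST is not 127.0.0.1"
    status=1
  }
  [ -n "$(env_value "$file" BACKEND_KIND)" ] || {
    echo "BACKEND_KIND is not set"
    status=1
  }
  case "$(env_value "$file" CORS_ORIGINS)" in
    '' | '*')
      echo "CORS_ORIGINS is unset or a wildcard"
      status=1
      ;;
  esac
  [ "$(env_value "$file" RATE_LIMIT_ENABLED)" = "true" ] || {
    echo "RATE_LIMIT_ENABLED is not true (acceptable only behind an external limiter)"
    status=1
  }

  return "$status"
}
```

Run `check_env .env` before starting the stack. For the tokens themselves, `openssl rand -hex 32` prints 64 random hexadecimal characters, which is one reasonable way to produce a long random value.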
make up BACKEND=vllm # or ollama / tgi / sglang / llamacpp / localai / none

Under the hood this runs `docker compose --profile <backend> up -d --build`.
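The backend may need some time to load its model after the containers start, so a health check issued right away can fail. A retry loop is one way to handle that. The sketch below assumes that `make -s health` prints the health JSON with a `"backend_ok"` field; `wait_for_backend` and the `HEALTH_CMD` override are hypothetical names introduced here.

```shell
#!/bin/sh
# Readiness wait sketch. wait_for_backend and HEALTH_CMD are hypothetical names.

# wait_for_backend [ATTEMPTS] [DELAY_SECONDS]
# Re-run the health command until its output contains "backend_ok": true.
wait_for_backend() {
  attempts="${1:-30}"
  delay="${2:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # HEALTH_CMD is left unquoted on purpose so it splits into command words.
    out="$(${HEALTH_CMD:-make -s health} 2>/dev/null || true)"
    if printf '%s\n' "$out" | grep -Eq '"backend_ok"[[:space:]]*:[[:space:]]*true'; then
      echo "backend ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "backend not ready after $attempts attempt(s)" >&2
  return 1
}
```

For example, `wait_for_backend 60 5` waits up to roughly five minutes before giving up with a non-zero exit status.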
make health
# {"status": "ok", "backend_kind": "vllm", "backend": "http://vllm:8001/v1", "backend_ok": true}
curl http://127.0.0.1:8000/v1/models -H "Authorization: Bearer $API_KEY"

Smoke-test an inference call:

curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "'"$MODEL_NAME"'",
"messages": [{"role": "user", "content": "Reply with exactly: deployment works"}],
"temperature": 0
}'

For anything reachable beyond the host:

- Use `deploy/nginx/selfhosted-chat-api.conf`.
- Terminate TLS at Nginx (Let's Encrypt, ACME, mTLS; your pick).
- Keep `API_HOST=127.0.0.1` so the API never listens on a public interface.
- Restrict ingress with a cloud firewall / security group.
- Scrape `/metrics` with Prometheus (no auth by default; restrict at the proxy if metrics should be private).
- Tail JSON logs from `docker compose logs -f api` into your log platform.
- Wire `/livez` and `/readyz` into your orchestrator's liveness/readiness probes if you deploy under Kubernetes.
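One way to confirm that the API really is bound to loopback only is to inspect the listening sockets on the host. The sketch below assumes a Linux host with `ss` from iproute2 and uses port 8000 because the curl examples in this guide target it; `non_loopback_listeners` is a hypothetical helper name.

```shell
#!/bin/sh
# Listener check sketch. non_loopback_listeners is a hypothetical helper.

# non_loopback_listeners PORT
# Read `ss -ltn` output on stdin and print every listening address on PORT
# that is bound to something other than 127.0.0.1 or [::1].
non_loopback_listeners() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      addr = $4                        # local address:port column
      n = split(addr, parts, ":")
      if (parts[n] != port) next       # a different port
      host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
      if (host != "127.0.0.1" && host != "[::1]") print addr
    }'
}
```

Running `ss -ltn | non_loopback_listeners 8000` should print nothing on a correctly configured host; any line of output is an address on which the API port is exposed beyond loopback.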
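
To upgrade, pull the latest code, rebuild the images, and restart the stack:

git pull
docker compose --profile <backend> build --pull
docker compose --profile <backend> up -d

To inspect recent logs from the API and the backend:

docker compose logs --tail=200 api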
docker compose logs --tail=200 <backend>
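A rollback needs a known-good commit to return to, so it can help to record the current commit just before `git pull`. The sketch below is one way to do that under stated assumptions: the marker file name `.last-known-good` and the helpers `record_good_commit` and `rollback_commands` are hypothetical and not part of this repository.

```shell
#!/bin/sh
# Upgrade bookkeeping sketch. File and function names are hypothetical.

# record_good_commit [FILE]: save the current commit hash before upgrading.
record_good_commit() {
  git rev-parse HEAD > "${1:-.last-known-good}"
}

# rollback_commands BACKEND [FILE]
# Print the rollback commands for the recorded commit so they can be
# reviewed before being run.
rollback_commands() {
  backend="$1"
  commit="$(cat "${2:-.last-known-good}")"
  echo "docker compose --profile $backend down"
  echo "git checkout $commit"
  echo "docker compose --profile $backend up -d --build"
}
```

Call `record_good_commit` while the running deployment is healthy, then upgrade. If the new version misbehaves, `rollback_commands vllm` prints the three commands to run, and piping them to `sh` executes them.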

To roll back, stop the stack, check out the last known-good commit, and rebuild:

docker compose --profile <backend> down
git checkout <last-known-good-commit>
docker compose --profile <backend> up -d --build

See MODELS.md for a GPU-class cheat sheet and curated model catalog. A 24 GB
card serves a 7B–8B model at FP16 comfortably; 12B–14B is possible with care;
for 30B+ on the same class of card, drop to quantised GGUF.
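The sizing guidance above can be sanity-checked with simple arithmetic on weight memory: parameters times bytes per parameter. The sketch below is a rough estimate only. It counts 1 GB as 10^9 bytes, covers weights but not the KV cache, activations, or runtime overhead, and treats quantised formats as a nominal bit width, which real GGUF quantisations usually exceed slightly. `weights_gb` is a hypothetical helper.

```shell
#!/bin/sh
# Weight-memory estimate sketch. weights_gb is a hypothetical helper.

# weights_gb PARAMS_IN_BILLIONS BITS_PER_PARAM
# Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 8 16    # 8B at FP16          -> 16.0
weights_gb 14 16   # 14B at FP16         -> 28.0
weights_gb 14 8    # 14B at 8-bit        -> 14.0
weights_gb 30 4    # 30B at nominal 4-bit -> 15.0
```

On a 24 GB card, 16 GB of FP16 weights leaves headroom for the KV cache, whereas 28 GB does not fit at all. That suggests the 12B–14B range needs 8-bit or lower precision on such a card; this reading is an inference from the arithmetic, not something MODELS.md is known to state.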

For air-gapped deployments:

- Mirror the `vllm/vllm-openai`, `ollama/ollama`, etc. images into your private registry and replace the image references in `docker-compose.yml`.
- Pre-populate `./data/hf-cache` or `./data/models` on a connected host, then rsync the directory onto the air-gapped target before `docker compose up`.
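The mirroring step can be scripted so that the same image list is pulled, retagged, and pushed consistently. The sketch below only prints the commands for review. The registry host, image tags, target host, and target path in the usage note are placeholders, and `mirror_ref` and `mirror_commands` are hypothetical helper names.

```shell
#!/bin/sh
# Image mirroring sketch. Helper names, registry host and tags are placeholders.

# mirror_ref REGISTRY IMAGE: image reference inside the private registry.
mirror_ref() {
  printf '%s/%s\n' "${1%/}" "$2"
}

# mirror_commands REGISTRY IMAGE...
# Print the pull, tag and push commands for each image.
mirror_commands() {
  registry="$1"
  shift
  for image in "$@"; do
    target="$(mirror_ref "$registry" "$image")"
    echo "docker pull $image"
    echo "docker tag $image $target"
    echo "docker push $target"
  done
}
```

On the connected host, something like `mirror_commands registry.internal.example:5000 vllm/vllm-openai:<tag> ollama/ollama:<tag>` prints the commands, using whichever tags `docker-compose.yml` pins. The model cache can then be copied with, for example, `rsync -a ./data/hf-cache/ user@airgapped-host:/path/to/selfhosted-chat-api/data/hf-cache/`, where the user, host, and path are placeholders.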