Deployment Guide

This guide covers going from a freshly provisioned GPU host to a working production deployment of the gateway plus any supported open-source LLM backend. Read BACKENDS.md for per-backend specifics.

Prerequisites

Ubuntu 22.04+ or equivalent (WSL2 works for dev)
Docker Engine 24+ and Docker Compose v2
NVIDIA drivers + container toolkit for GPU profiles (skip for CPU-only via Ollama / llama.cpp)
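
The version floors above can be checked with a few lines of shell. This is a minimal sketch, not something shipped in the repository: `major_ge` is a hypothetical helper, and the `docker` calls are left as comments so the snippet only defines a function until you run them yourself.

```shell
#!/usr/bin/env bash
# major_ge VERSION MIN_MAJOR: succeed if VERSION's major component is >= MIN_MAJOR.
# Hypothetical helper for illustration; accepts "24.0.7", "v2.27.0", or "24".
major_ge() {
  local version="${1#v}"        # drop an optional leading "v"
  local major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

# Usage on a host with Docker installed:
#   major_ge "$(docker version --format '{{.Server.Version}}')" 24 || echo "Docker Engine 24+ required"
#   major_ge "$(docker compose version --short)" 2 || echo "Docker Compose v2 required"
```

An empty version string (for example when Docker is not installed) also fails the comparison, so the warning is printed in that case too.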

1. Verify GPU access

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi

Fix the GPU runtime first if the second command fails.

2. Clone and configure

git clone <YOUR_REPO_URL> selfhosted-chat-api
cd selfhosted-chat-api
make env-vllm            # or env-ollama / env-tgi / env-sglang / env-llamacpp / env-localai / env-external
$EDITOR .env             # set API_KEYS, MODEL_NAME, any backend knobs

Minimum production checklist in .env:

API_KEYS= set to one or more long random tokens
API_HOST=127.0.0.1 (keep the API bound to loopback and let Nginx on the same host proxy to it; bind more widely only if the proxy runs elsewhere)
BACKEND_KIND matches the profile you start
CORS_ORIGINS restricted to your frontends
RATE_LIMIT_ENABLED=true unless you front with a real limiter
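
The checklist above can be turned into a quick pre-flight check. The sketch below is an illustration only: `env_get` and `check_env` are hypothetical helpers, the 32-character minimum is an arbitrary threshold, and it assumes `.env` holds plain unquoted `KEY=value` lines and that `*` is how a wildcard CORS origin would be written.

```shell
#!/usr/bin/env bash
# env_get FILE KEY: print the value of KEY from a KEY=value file (last match wins).
env_get() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# check_env FILE PROFILE: print one line per checklist item that looks wrong.
# Prints nothing when every check passes. Thresholds are illustrative assumptions.
check_env() {
  local file="$1" profile="$2" keys
  keys="$(env_get "$file" API_KEYS)"
  [ "${#keys}" -ge 32 ] || echo "API_KEYS: empty or shorter than 32 characters"
  [ "$(env_get "$file" API_HOST)" = "127.0.0.1" ] || echo "API_HOST: not bound to 127.0.0.1"
  [ "$(env_get "$file" BACKEND_KIND)" = "$profile" ] || echo "BACKEND_KIND: does not match profile $profile"
  case "$(env_get "$file" CORS_ORIGINS)" in
    ""|"*") echo "CORS_ORIGINS: unset or wildcard" ;;
  esac
  [ "$(env_get "$file" RATE_LIMIT_ENABLED)" = "true" ] || echo "RATE_LIMIT_ENABLED: not true"
  return 0
}

# Usage:
#   check_env .env vllm
#   head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'   # one way to mint a 64-hex-character token
```

Skip the RATE_LIMIT_ENABLED warning if a dedicated limiter sits in front of the API, as the checklist allows.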

3. Start the stack

make up BACKEND=vllm            # or ollama / tgi / sglang / llamacpp / localai / none

Under the hood this runs docker compose --profile <backend> up -d --build.

4. Verify

make health
# {"status": "ok", "backend_kind": "vllm", "backend": "http://vllm:8001/v1", "backend_ok": true}

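# API_KEY must hold one of the tokens from API_KEYS, e.g. export API_KEY=<token>
curl http://127.0.0.1:8000/v1/models -H "Authorization: Bearer $API_KEY"

Smoke-test an inference call (MODEL_NAME must be set in your shell to the value from .env):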

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "'"$MODEL_NAME"'",
    "messages": [{"role": "user", "content": "Reply with exactly: deployment works"}],
    "temperature": 0
  }'
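
Backends can need some time to load model weights, so the checks above may fail if they run immediately after the stack starts. A small retry loop is one way to wait. This is a sketch: `retry` is a hypothetical helper, and the usage line assumes `/readyz` is served on port 8000 without authentication and returns a non-2xx status until the backend is ready.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds.
# Returns 0 on the first success, 1 once ATTEMPTS tries have all failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"   # no sleep after the final try
    i=$((i + 1))
  done
  return 1
}

# Usage: wait up to about five minutes for the gateway to answer.
#   retry 60 5 curl -fsS -o /dev/null http://127.0.0.1:8000/readyz
```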

5. Put Nginx in front

For anything reachable beyond the host:

Use deploy/nginx/selfhosted-chat-api.conf.
Terminate TLS at Nginx (a Let's Encrypt or other ACME-issued certificate; add mTLS if clients must authenticate).
Keep API_HOST=127.0.0.1 so the API never listens on a public interface.
Restrict ingress with a cloud firewall / security group.
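
To confirm the API is not listening on a public interface, you can filter the host's listening sockets. The sketch below assumes the API port is 8000 (the port used in the curl examples) and that the published port shows up in `ss` output; `public_listeners` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# public_listeners PORT: read `ss -H -ltn` output on stdin and print every local
# address listening on PORT that is not a loopback address.
public_listeners() {
  awk -v port="$1" '
    {
      addr = $4                                        # "Local Address:Port" column
      if (addr !~ (":" port "$")) next                 # different port
      if (addr ~ /^127\./ || addr ~ /^\[::1\]/) next   # loopback is fine
      print addr
    }'
}

# Usage:
#   ss -H -ltn | public_listeners 8000
```

Any output means the port is reachable on a non-loopback address. An empty result is not proof on its own, since Docker can publish ports without a visible listening socket, so also test from another machine.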

6. Plug into monitoring

Scrape /metrics with Prometheus (no auth by default; restrict at the proxy if metrics should be private).
Tail JSON logs from docker compose logs -f api into your log platform.
Wire /livez and /readyz into your orchestrator's liveness/readiness probes if you deploy under Kubernetes.

7. Upgrade

git pull
docker compose --profile <backend> build --pull
docker compose --profile <backend> up -d

8. Rollback

docker compose logs --tail=200 api
docker compose logs --tail=200 <backend>
docker compose --profile <backend> down
git checkout <last-known-good-commit>
docker compose --profile <backend> up -d --build
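
The rollback needs a known-good commit, which is easiest to have at hand if it is recorded before each upgrade. A minimal sketch, assuming you run it from the repository root; `record_good_commit` and the `.last-good-commit` file name are hypothetical, and the file should be git-ignored.

```shell
#!/usr/bin/env bash
# record_good_commit [FILE]: save the current HEAD so a later rollback knows
# which commit to return to. FILE defaults to .last-good-commit (hypothetical name).
record_good_commit() {
  git rev-parse HEAD > "${1:-.last-good-commit}"
}

# Before upgrading, once the running stack has passed its health checks:
#   record_good_commit
#   git pull
# To roll back later:
#   git checkout "$(cat .last-good-commit)"
```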

Sizing notes

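See MODELS.md for a GPU-class cheat sheet and curated model catalog. As a rule of thumb, a 24 GB card serves a 7B–8B model at FP16 comfortably. FP16 weights for a 12B–14B model take roughly 24–28 GB on their own, so on that card they need quantised weights and a shorter context. For 30B+ on the same class, drop to quantised GGUF.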
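
The arithmetic behind these figures is parameters times bytes per parameter. The sketch below computes only the weight footprint in decimal GB; `weights_gb` is a hypothetical helper, and the 0.5 bytes per parameter used for 4-bit is an idealised figure, since real GGUF quantisations store somewhat more than 4 bits per weight.

```shell
#!/usr/bin/env bash
# weights_gb PARAMS_BILLIONS BYTES_PER_PARAM: approximate size of the model
# weights in decimal GB. Ignores KV cache, activations, and runtime overhead.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

weights_gb 8 2      # 8B at FP16 (2 bytes per parameter)     -> 16.0
weights_gb 14 2     # 14B at FP16                            -> 28.0
weights_gb 30 0.5   # 30B at an idealised 4 bits per weight  -> 15.0
```

Whatever is left after the weights has to hold the KV cache, which grows with context length and concurrency, so treat these numbers as a lower bound on what a model needs.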

Air-gapped or registry-mirrored deployments

Mirror the vllm/vllm-openai, ollama/ollama, etc. images into your private registry and replace the image references in docker-compose.yml.
Pre-populate ./data/hf-cache or ./data/models on a connected host, then rsync the directory onto the air-gapped target before docker compose up.
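
The mirroring step can be scripted as a pull, retag, and push. This is a sketch under assumptions: `mirror_ref` and `mirror_image` are hypothetical helpers, `registry.example.internal` is a placeholder registry, and the image tags are left for you to fill in with the ones you actually deploy.

```shell
#!/usr/bin/env bash
# mirror_ref REGISTRY IMAGE: print IMAGE re-homed under REGISTRY.
mirror_ref() {
  printf '%s/%s\n' "${1%/}" "$2"
}

# mirror_image REGISTRY IMAGE: pull IMAGE, retag it for REGISTRY, and push it.
# Run on a connected host that is already logged in to REGISTRY.
mirror_image() {
  local target
  target="$(mirror_ref "$1" "$2")"
  docker pull "$2" && docker tag "$2" "$target" && docker push "$target"
}

# Usage (registry host, tags, and paths are placeholders):
#   mirror_image registry.example.internal vllm/vllm-openai:<tag>
#   mirror_image registry.example.internal ollama/ollama:<tag>
#   rsync -a ./data/hf-cache/ <user>@<target-host>:<repo-path>/data/hf-cache/
```

If the target has no reachable registry at all, `docker save` on the connected host and `docker load` on the target move images as tar archives instead.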
