Deployment Guide

This guide covers going from a freshly provisioned GPU host to a working production deployment of the gateway plus any supported open-source LLM backend. Read BACKENDS.md for per-backend specifics.

Prerequisites

  • Ubuntu 22.04+ or equivalent (WSL2 works for dev)
  • Docker Engine 24+ and Docker Compose v2
  • NVIDIA drivers + container toolkit for GPU profiles (skip for CPU-only via Ollama / llama.cpp)

1. Verify GPU access

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi

If the first command succeeds but the second fails, Docker cannot reach the GPU; fix the NVIDIA container toolkit setup before continuing.

2. Clone and configure

git clone <YOUR_REPO_URL> selfhosted-chat-api
cd selfhosted-chat-api
make env-vllm            # or env-ollama / env-tgi / env-sglang / env-llamacpp / env-localai / env-external
$EDITOR .env             # set API_KEYS, MODEL_NAME, any backend knobs

Minimum production checklist in .env:

  • API_KEYS= set to one or more long random tokens
  • API_HOST=127.0.0.1 (keep the API bound to loopback and expose it only through Nginx on the same host; see step 5)
  • BACKEND_KIND matches the profile you start
  • CORS_ORIGINS restricted to your frontends
  • RATE_LIMIT_ENABLED=true unless you front with a real limiter
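The checklist above can be checked mechanically before the stack starts. The following is a minimal sketch, not part of the repository: env_value and check_env are hypothetical helper names, the file is assumed to use plain unquoted KEY=value lines, and the 32-character minimum for API_KEYS is an illustrative threshold rather than a project requirement.

```shell
# Hypothetical helper: print the value of KEY from an env file.
# $1 = file, $2 = key; if the key appears more than once, the last one wins.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Hypothetical pre-flight check for the checklist above.
# Prints one line per finding; returns non-zero if any hard check fails.
check_env() {
  ce_file=${1:-.env}
  ce_fail=0

  ce_keys=$(env_value "$ce_file" API_KEYS)
  # 32 characters is an illustrative threshold, not a project requirement.
  if [ "${#ce_keys}" -lt 32 ]; then
    echo "API_KEYS is missing or shorter than 32 characters"
    ce_fail=1
  fi

  if [ "$(env_value "$ce_file" API_HOST)" != "127.0.0.1" ]; then
    echo "API_HOST is not 127.0.0.1"
    ce_fail=1
  fi

  if [ -z "$(env_value "$ce_file" BACKEND_KIND)" ]; then
    echo "BACKEND_KIND is not set"
    ce_fail=1
  fi

  if [ "$(env_value "$ce_file" CORS_ORIGINS)" = "*" ]; then
    echo "CORS_ORIGINS is a wildcard"
    ce_fail=1
  fi

  # Warning only: disabling the limiter is fine behind an external one.
  if [ "$(env_value "$ce_file" RATE_LIMIT_ENABLED)" != "true" ]; then
    echo "warning: RATE_LIMIT_ENABLED is not true"
  fi

  return "$ce_fail"
}
```

A typical use is check_env .env && make up BACKEND=vllm. One way to produce a suitable token is openssl rand -hex 32, which prints 64 hexadecimal characters.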

3. Start the stack

make up BACKEND=vllm            # or ollama / tgi / sglang / llamacpp / localai / none

Under the hood this runs docker compose --profile <backend> up -d --build.

4. Verify

make health
# {"status": "ok", "backend_kind": "vllm", "backend": "http://vllm:8001/v1", "backend_ok": true}

curl http://127.0.0.1:8000/v1/models -H "Authorization: Bearer $API_KEY"
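Large models can take a while to load, so the first health check may report the backend as not ready. The sketch below polls until backend_ok is true. It assumes make health prints the JSON payload shown above; backend_ok and wait_healthy are hypothetical helper names, and the default attempt count and delay are arbitrary.

```shell
# Succeeds when the health JSON on stdin contains "backend_ok": true.
# This pattern-matches the payload shape shown above; use jq for stricter parsing.
backend_ok() {
  grep -Eq '"backend_ok"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical helper: poll `make health` until the backend reports ready.
# $1 = maximum attempts (default 30), $2 = seconds between attempts (default 5).
wait_healthy() {
  wh_tries=${1:-30}
  wh_delay=${2:-5}
  wh_i=0
  while [ "$wh_i" -lt "$wh_tries" ]; do
    if make health 2>/dev/null | backend_ok; then
      return 0
    fi
    wh_i=$((wh_i + 1))
    sleep "$wh_delay"
  done
  echo "backend not ready after $wh_tries attempts" >&2
  return 1
}
```

Calling wait_healthy 60 10 before the smoke test gives the backend up to ten minutes to finish loading.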

Smoke-test an inference call (API_KEY must hold one of the tokens from API_KEYS, and MODEL_NAME must be set in your shell):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "'"$MODEL_NAME"'",
    "messages": [{"role": "user", "content": "Reply with exactly: deployment works"}],
    "temperature": 0
  }'

5. Put Nginx in front

For anything reachable beyond the host:

  • Use deploy/nginx/selfhosted-chat-api.conf.
  • Terminate TLS at Nginx (certificates from Let's Encrypt or another ACME CA; add mTLS if clients should authenticate with certificates).
  • Keep API_HOST=127.0.0.1 so the API never listens on a public interface.
  • Restrict ingress with a cloud firewall / security group.
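On a Linux host, the loopback binding can be confirmed from the socket table. The sketch below filters ss output for listeners on a given port that are bound to anything other than loopback. non_loopback_listeners is a hypothetical helper name, and port 8000 is taken from the curl examples in this guide.

```shell
# Reads `ss -ltnH` output on stdin and prints every listener on port $1
# whose local address is not 127.0.0.1 or [::1].
non_loopback_listeners() {
  awk -v port="$1" '
    {
      addr = $4                      # 4th column: Local Address:Port
      n = split(addr, parts, ":")
      if (parts[n] != port) next     # ignore other ports (e.g. Nginx on 443)
      host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
      if (host != "127.0.0.1" && host != "[::1]") print addr
    }'
}
```

Running ss -ltnH | non_loopback_listeners 8000 should print nothing when the API is reachable only through Nginx; any output names an address that exposes the API directly.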

6. Plug into monitoring

  • Scrape /metrics with Prometheus (no auth by default; restrict at the proxy if metrics should be private).
  • Tail JSON logs from docker compose logs -f api into your log platform.
  • Wire /livez and /readyz into your orchestrator's liveness/readiness probes if you deploy under Kubernetes.

7. Upgrade

git pull
docker compose --profile <backend> build --pull
docker compose --profile <backend> up -d

8. Rollback

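# Capture recent logs for the post-mortem before tearing anything down
docker compose logs --tail=200 api
docker compose logs --tail=200 <backend>
# Stop the stack, return to the last known-good commit, and rebuild
docker compose --profile <backend> down
git checkout <last-known-good-commit>
docker compose --profile <backend> up -d --build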
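The rollback depends on knowing a last known-good commit. One way to always have one, sketched here as an assumption rather than an existing project feature, is to record HEAD each time a deployment passes verification. The .last-good-commit file and both function names are hypothetical.

```shell
# Hypothetical state file; not part of the repository.
GOOD_COMMIT_FILE=${GOOD_COMMIT_FILE:-.last-good-commit}

# Run after a deployment has passed step 4: remember the current commit.
# The file is only overwritten if git reports a commit successfully.
record_good_commit() {
  rgc_sha=$(git rev-parse HEAD) && printf '%s\n' "$rgc_sha" > "$GOOD_COMMIT_FILE"
}

# Print the recorded commit, or fail if none has been recorded yet.
last_good_commit() {
  if [ ! -s "$GOOD_COMMIT_FILE" ]; then
    echo "no known-good commit recorded in $GOOD_COMMIT_FILE" >&2
    return 1
  fi
  cat "$GOOD_COMMIT_FILE"
}
```

With this in place, the checkout step becomes git checkout "$(last_good_commit)". Adding the state file to .gitignore keeps it out of the working tree status.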

Sizing notes

See MODELS.md for a GPU-class cheat sheet and curated model catalog. As a rule of thumb, a 24 GB card serves a 7B–8B model at FP16 comfortably, can be pushed to 12B–14B with care, and needs quantised GGUF for 30B+.
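The arithmetic behind these figures is simple: weight memory is roughly the parameter count times the bytes per parameter. The sketch below computes that estimate in decimal GB. weights_gb is a hypothetical helper, and the result covers weights only; KV cache, activations and runtime overhead come on top, and real quantised formats use somewhat more than their nominal bit width.

```shell
# Approximate memory for model weights alone, in decimal GB.
# $1 = parameters in billions, $2 = bits per parameter (16 for FP16, 8, 4, ...)
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 8 16    # 8B at FP16   -> 16.0
weights_gb 14 16   # 14B at FP16  -> 28.0
weights_gb 14 8    # 14B at 8-bit -> 14.0
weights_gb 30 4    # 30B at 4-bit -> 15.0
```

By this estimate an 8B model at FP16 leaves about 8 GB of a 24 GB card for cache and overhead, while 14B at FP16 would not fit at all. That suggests the 12B–14B range relies on 8-bit or lower precision; this reading is an inference from the arithmetic, not a statement taken from MODELS.md.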

Air-gapped or registry-mirrored deployments

  • Mirror the vllm/vllm-openai, ollama/ollama, etc. images into your private registry and replace the image references in docker-compose.yml.
  • Pre-populate ./data/hf-cache or ./data/models on a connected host, then rsync the directory onto the air-gapped target before docker compose up.
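The two steps above can be sketched as follows. This is illustrative only: mirror_ref, mirror_image and sync_cache are hypothetical helper names, registry.internal:5000 is a placeholder registry host, and the remote directory passed to sync_cache is whatever path the checkout uses on the target.

```shell
# Hypothetical helper: map an upstream image reference to its mirrored name.
# $1 = private registry host, $2 = upstream image reference
mirror_ref() {
  echo "$1/$2"
}

# Pull, retag and push one image. Run on a host that can reach both registries.
mirror_image() {
  mi_target=$(mirror_ref "$1" "$2")
  docker pull "$2" &&
    docker tag "$2" "$mi_target" &&
    docker push "$mi_target"
}

# Copy a pre-populated cache directory to the target host.
# $1 = local directory, $2 = user@host, $3 = directory on the target
# Trailing slashes make rsync copy the directory contents, not the directory itself.
sync_cache() {
  rsync -a "$1/" "$2:$3/"
}
```

For example, mirror_image registry.internal:5000 vllm/vllm-openai:latest pushes registry.internal:5000/vllm/vllm-openai:latest; use whichever tag docker-compose.yml references, and point the image entries there at the mirrored name. The cache step would then look like sync_cache ./data/hf-cache deploy@target /path/to/selfhosted-chat-api/data/hf-cache, with the user, host and path replaced by your own.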