- `api`: gateway / compatibility layer (always on)
- one of `vllm`, `ollama`, `llamacpp`, `tgi`, `sglang`, `localai`: inference runtime selected by compose profile

    make up BACKEND=vllm   # docker compose --profile vllm up -d --build
    make down              # docker compose down
    make ps                # container status
    make logs              # tail api + backend
    make health            # pretty-print /health

Rebuild only the gateway:

    docker compose build api
    docker compose up -d api

| Signal | Where |
|---|---|
| Liveness | GET /livez (always 200 when the process is up) |
| Readiness | GET /readyz (200 when the backend is reachable) |
| Health detail | GET /health |
| Metrics | GET /metrics (Prometheus text) |
| Logs | docker compose logs -f api (JSON, one event per line, includes request_id) |
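
The readiness endpoint above lends itself to a small wait loop, for example in a deploy script that should not send traffic until the backend is reachable. The following is a minimal sketch, not part of the repository: the function names `probe_status` and `wait_for_ready` and the default timeout and interval are assumptions; the address is the one used elsewhere in this document.

```shell
#!/bin/sh
# probe_status URL: print the HTTP status code for URL ("000" if unreachable).
# Hypothetical helper, not shipped with the gateway.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# wait_for_ready URL [TIMEOUT_S] [INTERVAL_S]
# Poll URL until it answers 200. Returns 0 on success, 1 once roughly
# TIMEOUT_S seconds of sleeping have accumulated (probe time is not counted).
wait_for_ready() {
  url=$1
  timeout=${2:-300}
  interval=${3:-5}
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(probe_status "$url")" = "200" ]; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}

# Example:
#   wait_for_ready http://127.0.0.1:8000/readyz 600 10 && echo "backend ready"
```

Polling `/readyz` rather than `/livez` matters here: `/livez` returns 200 as soon as the gateway process is up, even while the backend is still loading weights.
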

    curl -s -D- http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $API_KEY" \
      -H "x-request-id: debug-$(date +%s)" \
      -d '{"model":"'"$MODEL"'","messages":[{"role":"user","content":"ping"}]}'

The `x-request-id` value is echoed back in the response and shows up in `chat_api.http` log records, making it easy to correlate a client call with the server log line.

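
To pull the matching server-side records for one request, a plain substring filter over the JSON log stream is usually enough. This is a sketch under assumptions: `filter_request_id` is a hypothetical helper, and it matches the ID anywhere in the line rather than parsing the `request_id` field, so an ID that is a prefix of another ID will match both.

```shell
#!/bin/sh
# filter_request_id ID: keep only the stdin lines that contain ID verbatim.
# Hypothetical helper; does a fixed-string match, not JSON parsing.
filter_request_id() {
  grep -F -- "$1"
}

# Example (assumes the stack is running):
#   docker compose logs --no-log-prefix api | filter_request_id "debug-1700000000"
```
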

- `./data/hf-cache`: Hugging Face weights & kernel caches (vLLM, TGI, SGLang)
- `./data/vllm-cache`: vLLM compile cache
- `./data/ollama`: Ollama model store
- `./data/models`: llama.cpp GGUF files
- `./data/localai-models`: LocalAI model catalog

Back these up; restoring them on a new host skips the first-boot download.
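
One way to back up and restore these directories is a gzip tarball. The sketch below is an illustration only: `backup_dir`, `restore_dir`, and the example paths are assumptions, and it is probably safest to run the backup with the stack stopped so that no cache file is written mid-archive.

```shell
#!/bin/sh
# backup_dir SRC_DIR DEST_TARBALL
# Archive SRC_DIR (stored under its own directory name) as a gzip tarball.
backup_dir() {
  src=$1
  dest=$2
  tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# restore_dir TARBALL TARGET_PARENT
# Unpack the archive so the original directory reappears under TARGET_PARENT.
restore_dir() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Example (hypothetical paths):
#   backup_dir ./data /backups/llm-data.tar.gz
#   restore_dir /backups/llm-data.tar.gz .    # recreates ./data on the new host
```
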
Gateway is up, backend isn't. Check:

    docker compose logs --tail=200 <backend>
    docker compose logs --tail=200 api

Typical causes: weights still downloading, wrong `BACKEND_BASE_URL`, OOM on model load, GPU runtime not available to the container.

API_KEYS is set and the request used no key or the wrong one. Send
Authorization: Bearer $API_KEY or x-api-key: $API_KEY.
In-process rate limiter hit. Raise RATE_LIMIT_RPM/RATE_LIMIT_BURST, or
disable the in-process limiter (RATE_LIMIT_ENABLED=false) and do limiting
at the proxy.
Backend doesn't expose embeddings (TGI in particular). Switch BACKEND_KIND
or run a dedicated embeddings server and point a second gateway at it.
Either the backend returned an error (check docker compose logs api for
claude_stream_upstream_error) or the model emitted no content. Try the same
prompt against /v1/chat/completions to isolate the issue.
Reduce VLLM_MAX_MODEL_LEN, lower VLLM_GPU_MEMORY_UTILIZATION, or choose a
smaller / quantised model. See MODELS.md for a sizing table.
Expected. Weights download on first boot and kernels compile. Subsequent restarts hit the on-disk caches.
Backend:

    docker compose down
    make env-ollama          # switch env template
    make up BACKEND=ollama

Model on the same backend:

    $EDITOR .env             # update MODEL_NAME
    docker compose up -d     # or restart the specific service

- Gateway: edit `api/requirements.txt`, then `docker compose build api && docker compose up -d api`.
- Inference runtime: pin its image tag in `docker-compose.yml`, pull, redeploy.

- Rotate `API_KEYS` periodically; multiple keys are supported so you can rotate without downtime.
- Never log prompts unless you accept the PII/retention risk (`LOG_PROMPTS=false` by default).
- Keep the inference backend on the internal Docker network.
- Put TLS termination at Nginx/Caddy in front of the gateway.
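
For key rotation, a fresh random key can be generated with standard tools. This is a minimal sketch: `gen_api_key` is a hypothetical helper, and the 32-byte length is an assumption rather than a gateway requirement. How multiple keys are delimited inside `API_KEYS` is not stated in this section, so check the env template before editing the value.

```shell
#!/bin/sh
# gen_api_key: print 32 random bytes as 64 lowercase hex characters.
# Hypothetical helper; key length is an illustrative choice.
gen_api_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Example:
#   NEW_KEY=$(gen_api_key)
```

A zero-downtime rotation would then add the new key alongside the old one, move clients over, and remove the old key afterwards.
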