Operations Manual

Services

api — gateway / compatibility layer (always on)
one of: vllm, ollama, llamacpp, tgi, sglang, localai — inference runtime selected by compose profile

Standard commands

make up BACKEND=vllm          # docker compose --profile vllm up -d --build
make down                     # docker compose down
make ps                       # container status
make logs                     # tail api + backend
make health                   # pretty-print /health

Rebuild only the gateway:

docker compose build api
docker compose up -d api

Observability

| Signal | Where |
| --- | --- |
| Liveness | `GET /livez` (always 200 when the process is up) |
| Readiness | `GET /readyz` (200 when the backend is reachable) |
| Health detail | `GET /health` |
| Metrics | `GET /metrics` (Prometheus text) |
| Logs | `docker compose logs -f api` (JSON, one event per line, includes `request_id`) |
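
For scripts that must block until the stack is usable, the readiness endpoint can be polled. This is a minimal sketch, assuming the gateway listens on http://127.0.0.1:8000 as in the rest of this manual. The function name wait_ready and its retry defaults are illustrative and not part of the repository.

```shell
# Poll a readiness URL until it answers with a 2xx status or attempts run out.
# wait_ready is a hypothetical helper; the defaults below are illustrative.
wait_ready() {
  url="${1:-http://127.0.0.1:8000/readyz}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

A deploy script could then run wait_ready before sending traffic and stop with a non-zero exit code if the backend never comes up.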

Probing a request end-to-end

curl -s -D- http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -H "x-request-id: debug-$(date +%s)" \
  -d '{"model":"'"$MODEL"'","messages":[{"role":"user","content":"ping"}]}'

The gateway echoes the x-request-id header back in the response and records it in chat_api.http log records, so a client call can be matched to its server log line.
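
To pull the matching log lines for one request, the log stream can be filtered on the id. A minimal sketch: filter_request_id is a hypothetical helper, and the id shown in the usage comment is a made-up example of what the curl call above would generate.

```shell
# Keep only log lines that mention a given request id.
# Reads log lines on stdin; grep -F treats the id as a literal string.
filter_request_id() {
  grep -F -- "$1"
}

# Usage against the running stack (requires Docker, so left commented out):
# docker compose logs --no-color --no-log-prefix api | filter_request_id "debug-1700000000"
```

This is a plain substring match, so an id that is a prefix of another id matches both. Unique ids, such as the timestamped one above, avoid that.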

Where data lives

./data/hf-cache — Hugging Face weights & kernel caches (vLLM, TGI, SGLang)
./data/vllm-cache — vLLM compile cache
./data/ollama — Ollama model store
./data/models — llama.cpp GGUF files
./data/localai-models — LocalAI model catalog

Back these up; restoring them on a new host skips the first-boot download.
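
One way to take such a backup is a compressed tarball of the whole data directory. This is a sketch under assumptions: backup_data is a hypothetical helper, the archive path is chosen by the caller, and stopping the stack first is advisable so files do not change mid-archive.

```shell
# Archive a data directory into a gzip-compressed tarball.
# backup_data is a hypothetical helper; both paths are supplied by the caller.
backup_data() {
  src_dir="$1"      # directory to archive, e.g. ./data
  archive="$2"      # destination tarball; an absolute path is safest
  [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
  # -C switches to the parent so the archive stores relative paths (data/...)
  tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Restore on a new host by unpacking in the project root:
# tar -xzf /path/to/archive.tar.gz
```

Because the archive stores paths relative to the project root, unpacking it there recreates ./data with the same layout.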

Common problems

`/health` returns 503

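The gateway is up but cannot reach the backend. Check the logs of both, replacing <backend> with the active backend service (for example vllm):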

docker compose logs --tail=200 <backend>
docker compose logs --tail=200 api

Typical causes: weights downloading, wrong BACKEND_BASE_URL, OOM on model load, GPU runtime not available to the container.

401 unauthorized

API_KEYS is set and the request used no key or the wrong one. Send Authorization: Bearer $API_KEY or x-api-key: $API_KEY.

429 too many requests

In-process rate limiter hit. Raise RATE_LIMIT_RPM/RATE_LIMIT_BURST, or disable the in-process limiter (RATE_LIMIT_ENABLED=false) and do limiting at the proxy.

501 from `/v1/embeddings`

Backend doesn't expose embeddings (TGI in particular). Switch BACKEND_KIND or run a dedicated embeddings server and point a second gateway at it.
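
The status-code entries above (503, 401, 429, 501) can be folded into a small triage helper for scripts or on-call notes. A sketch only: explain_status is a hypothetical function, and its messages paraphrase this manual rather than anything the gateway prints.

```shell
# Map a gateway HTTP status code to the first thing to check.
# explain_status is a hypothetical helper summarising the entries above.
explain_status() {
  case "$1" in
    401) echo "auth: send Authorization: Bearer \$API_KEY or x-api-key: \$API_KEY" ;;
    429) echo "rate limit: raise RATE_LIMIT_RPM/RATE_LIMIT_BURST or limit at the proxy" ;;
    501) echo "unsupported: the backend does not expose this endpoint" ;;
    503) echo "backend down: check docker compose logs for the backend and api" ;;
    *)   echo "no entry for status $1" ;;
  esac
}

# Fetch only the status code of /health (needs the stack running):
# explain_status "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health)"
```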

Claude streaming is empty

Either the backend returned an error (check docker compose logs api for claude_stream_upstream_error) or the model emitted no content. Try the same prompt against /v1/chat/completions to isolate the issue.

OOM on model load

With vLLM, reduce VLLM_MAX_MODEL_LEN, lower VLLM_GPU_MEMORY_UTILIZATION, or choose a smaller or quantised model. See MODELS.md for a sizing table.

Slow first request

Expected. Weights download on first boot and kernels compile. Subsequent restarts hit the on-disk caches.

Changing backend or model

Backend:

docker compose down
make env-ollama           # switch env template
make up BACKEND=ollama

Model on the same backend:

$EDITOR .env              # update MODEL_NAME
docker compose up -d      # or restart the specific service
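
For scripted model switches, the .env edit can be done without an interactive editor. A minimal sketch: set_env_var is a hypothetical helper, not part of the repository, and it assumes .env holds plain KEY=VALUE lines.

```shell
# Set KEY=VALUE in an env file, replacing any existing assignment of KEY.
# set_env_var is a hypothetical helper; the file defaults to ./.env.
set_env_var() {
  key="$1"; value="$2"; file="${3:-.env}"
  tmp="$(mktemp)" || return 1
  if [ -f "$file" ]; then
    # Copy every line except existing assignments of KEY
    # (grep exits 1 when nothing is left, which is not an error here)
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the original file's permissions are kept
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

The new assignment is appended at the end of the file, so key order may change. After calling, for example, set_env_var MODEL_NAME with the new model id, bring the stack up again as shown above.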

Updating dependencies

Gateway: edit api/requirements.txt, then docker compose build api && docker compose up -d api.
Inference runtime: pin its image tag in docker-compose.yml, pull, redeploy.

Security hygiene

Rotate API_KEYS periodically; multiple keys are supported so you can rotate without downtime.
Never log prompts unless you accept the PII/retention risk (LOG_PROMPTS=false by default).
Keep the inference backend on the internal Docker network.
Put TLS termination at Nginx/Caddy in front of the gateway.
