Operations Manual

Services

  • api — gateway / compatibility layer (always on)
  • one of: vllm, ollama, llamacpp, tgi, sglang, localai — inference runtime selected by compose profile

Standard commands

make up BACKEND=vllm          # docker compose --profile vllm up -d --build
make down                     # docker compose down
make ps                       # container status
make logs                     # tail api + backend
make health                   # pretty-print /health

Rebuild only the gateway:

docker compose build api
docker compose up -d api

Observability

Signal         Where
Liveness       GET /livez (always 200 when the process is up)
Readiness      GET /readyz (200 when the backend is reachable)
Health detail  GET /health
Metrics        GET /metrics (Prometheus text)
Logs           docker compose logs -f api (JSON, one event per line, includes request_id)
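A deploy script usually needs to block until the readiness endpoint reports 200 before sending traffic. The following is a minimal sketch of such a wait loop. The helper names wait_until and probe_ready are hypothetical and not part of this repository, and the address 127.0.0.1:8000 is assumed from the probe example in this manual.

```shell
# wait_until ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS are used up.
wait_until() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# probe_ready URL: succeed only when URL answers with HTTP 200.
# curl exits non-zero when the connection itself fails, which also counts as "not ready".
probe_ready() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1") || return 1
  [ "$code" = "200" ]
}
```

For example, `wait_until 30 2 probe_ready http://127.0.0.1:8000/readyz` polls every two seconds for roughly a minute and returns non-zero if the backend never becomes reachable.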

Probing a request end-to-end

curl -s -D- http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -H "x-request-id: debug-$(date +%s)" \
  -d '{"model":"'"$MODEL"'","messages":[{"role":"user","content":"ping"}]}'

The x-request-id value is echoed back in the response headers and appears in chat_api.http log records, which makes it easy to correlate a client call with its server log line.
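To pull the matching server-side records out of the log stream, a small filter is enough. This is a sketch that assumes the request ID appears as a quoted JSON string value in each log line; logs_for_request is a hypothetical helper name.

```shell
# logs_for_request ID: read log lines on stdin and print those that contain
# ID as a quoted JSON string (so debug-1 does not also match debug-10).
logs_for_request() {
  grep -F -- "\"$1\""
}
```

Typical use would be `docker compose logs --no-log-prefix api | logs_for_request "$REQUEST_ID"`, where REQUEST_ID holds the value sent in the x-request-id header.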

Where data lives

  • ./data/hf-cache — Hugging Face weights & kernel caches (vLLM, TGI, SGLang)
  • ./data/vllm-cache — vLLM compile cache
  • ./data/ollama — Ollama model store
  • ./data/models — llama.cpp GGUF files
  • ./data/localai-models — LocalAI model catalog

Back these up; restoring them on a new host skips the first-boot download.
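One way to do this is a plain tarball of the data directory. The sketch below assumes all of the directories above sit under a single ./data parent. backup_data and restore_data are hypothetical helper names, and stopping the stack first (make down) is a suggested precaution against archiving a half-written download.

```shell
# backup_data SRC_DIR OUT_FILE: archive SRC_DIR (e.g. ./data) into a gzip tarball.
# The archive stores paths relative to SRC_DIR's parent, so it unpacks as "data/...".
backup_data() {
  src=$1; out=$2
  [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
  tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}

# restore_data ARCHIVE DEST_PARENT: unpack ARCHIVE under DEST_PARENT,
# creating DEST_PARENT if it does not exist yet.
restore_data() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

On the old host this could be `backup_data ./data data-backup.tgz`; on the new host, `restore_data data-backup.tgz .` from the repository root recreates ./data before the first make up.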

Common problems

/health returns 503

Gateway is up, backend isn't. Check:

docker compose logs --tail=200 <backend>
docker compose logs --tail=200 api

Typical causes: weights downloading, wrong BACKEND_BASE_URL, OOM on model load, GPU runtime not available to the container.

401 unauthorized

API_KEYS is set and the request used no key or the wrong one. Send Authorization: Bearer $API_KEY or x-api-key: $API_KEY.

429 too many requests

In-process rate limiter hit. Raise RATE_LIMIT_RPM/RATE_LIMIT_BURST, or disable the in-process limiter (RATE_LIMIT_ENABLED=false) and do limiting at the proxy.

501 from /v1/embeddings

Backend doesn't expose embeddings (TGI in particular). Switch BACKEND_KIND to a backend that serves embeddings, or run a dedicated embeddings server and point a second gateway at it.

Claude streaming is empty

Either the backend returned an error (check docker compose logs api for claude_stream_upstream_error) or the model emitted no content. Try the same prompt against /v1/chat/completions to isolate the issue.

OOM on model load

Reduce VLLM_MAX_MODEL_LEN, lower VLLM_GPU_MEMORY_UTILIZATION, or choose a smaller / quantised model. See MODELS.md for a sizing table.

Slow first request

Expected. Weights download on first boot and kernels compile. Subsequent restarts hit the on-disk caches.
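The HTTP status codes covered in this section can be folded into a small lookup for on-call scripts. This is only a sketch: triage_status is a hypothetical helper, and its hints simply restate the entries above.

```shell
# triage_status CODE: print the first thing to check for a gateway HTTP status.
triage_status() {
  case "$1" in
    401) echo "auth: send Authorization: Bearer \$API_KEY or x-api-key: \$API_KEY" ;;
    429) echo "rate limit: raise RATE_LIMIT_RPM/RATE_LIMIT_BURST or limit at the proxy" ;;
    501) echo "embeddings: backend does not expose /v1/embeddings" ;;
    503) echo "backend down: check docker compose logs for the backend and api" ;;
    *)   echo "not covered here: check docker compose logs api" ;;
  esac
}
```

A wrapper around curl could capture the status with `-w '%{http_code}'` and pass it to triage_status to print the hint next to the failing call.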

Changing backend or model

Backend:

docker compose down
make env-ollama           # switch env template
make up BACKEND=ollama

Model on the same backend:

$EDITOR .env              # update MODEL_NAME
docker compose up -d      # or restart the specific service
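For unattended model switches, the .env edit can be scripted instead of done in an editor. A minimal sketch, assuming .env holds plain KEY=VALUE lines and that the key contains no regex metacharacters; set_env_var is a hypothetical helper name.

```shell
# set_env_var FILE KEY VALUE: drop any existing KEY=... line from FILE and
# append KEY=VALUE at the end. All other lines are kept unchanged.
set_env_var() {
  file=$1; key=$2; value=$3
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  tmpfile=$(mktemp) || return 1
  # grep -v exits 1 when nothing is left to print; that is not an error here.
  grep -v "^${key}=" "$file" > "$tmpfile" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmpfile"
  # Write through the original file so its permissions are preserved.
  cat "$tmpfile" > "$file" && rm -f "$tmpfile"
}
```

Something like `set_env_var .env MODEL_NAME "$NEW_MODEL" && docker compose up -d` would then apply the change, with NEW_MODEL set to the model identifier your backend expects.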

Updating dependencies

  • Gateway: edit api/requirements.txt, then docker compose build api && docker compose up -d api.
  • Inference runtime: pin its image tag in docker-compose.yml, pull, redeploy.

Security hygiene

  • Rotate API_KEYS periodically; multiple keys are supported so you can rotate without downtime.
  • Never log prompts unless you accept the PII/retention risk (LOG_PROMPTS=false by default).
  • Keep the inference backend on the internal Docker network.
  • Put TLS termination at Nginx/Caddy in front of the gateway.