Every backend exposes an OpenAI-compatible HTTP surface. The gateway normalises health paths, capability flags, and auth headers; beyond that it is a thin, fast proxy.
Select a backend by setting BACKEND_KIND in .env and launching with the
matching compose profile.
| BACKEND_KIND | Docker profile | Internal base URL | Best suited for |
|---|---|---|---|
| `vllm` | `--profile vllm` | `http://vllm:8001/v1` | Large-model throughput, tensor parallelism |
| `ollama` | `--profile ollama` | `http://ollama:11434/v1` | Laptops, quick model swaps, GGUF |
| `llamacpp` | `--profile llamacpp` | `http://llamacpp:8001/v1` | Pure GGUF serving, CPU + GPU split |
| `tgi` | `--profile tgi` | `http://tgi:8001/v1` | Hugging Face-native deployments |
| `sglang` | `--profile sglang` | `http://sglang:8001/v1` | Long-context + structured output workloads |
| `localai` | `--profile localai` | `http://localai:8001/v1` | All-in-one container with model catalog |
| `lmstudio` | (run on host) | `http://host.docker.internal:1234/v1` | Desktop dev against LM Studio |
| `openai` | `--profile none` | any URL | External OpenAI-compatible endpoints |
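The mapping in the table can be sketched as a small lookup. This is an illustration only, not the gateway's actual code: `DEFAULT_BASE_URLS` and `resolve_base_url` are hypothetical names, and the rule that an explicit `BACKEND_BASE_URL` overrides the default is an assumption.

```python
# Hypothetical lookup mirroring the table above; not the gateway's real code.
DEFAULT_BASE_URLS = {
    "vllm": "http://vllm:8001/v1",
    "ollama": "http://ollama:11434/v1",
    "llamacpp": "http://llamacpp:8001/v1",
    "tgi": "http://tgi:8001/v1",
    "sglang": "http://sglang:8001/v1",
    "localai": "http://localai:8001/v1",
    "lmstudio": "http://host.docker.internal:1234/v1",
}


def resolve_base_url(env: dict) -> str:
    """Return the backend base URL for an environment mapping such as os.environ."""
    kind = env.get("BACKEND_KIND", "").strip().lower()
    override = env.get("BACKEND_BASE_URL", "").strip()
    if override:
        # Assumption: an explicit URL always wins over the per-backend default.
        return override.rstrip("/")
    if kind == "openai":
        # The table lists "any URL" for this kind, so there is no default.
        raise ValueError("BACKEND_KIND=openai requires BACKEND_BASE_URL")
    try:
        return DEFAULT_BASE_URLS[kind]
    except KeyError:
        raise ValueError(f"unknown BACKEND_KIND: {kind!r}") from None
```

Calling `resolve_base_url(os.environ)` at startup would fail fast on a misspelled backend name instead of surfacing later as a connection error.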
Upstream: https://github.com/vllm-project/vllm
Launch:
make env-vllm
$EDITOR .env # set MODEL_NAME, API_KEYS
make up BACKEND=vllm

Knobs worth tuning in .env:
- `VLLM_DTYPE`: `half` (FP16) by default. `bfloat16` on Ampere+ is often better.
- `VLLM_MAX_MODEL_LEN`: hard cap on context length per request.
- `VLLM_GPU_MEMORY_UTILIZATION`: fraction of GPU memory the engine may claim.
Streaming, tools, and embeddings all work when the chosen model supports them.
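On the client side, streaming responses from an OpenAI-compatible endpoint arrive as server-sent events: one `data: {...}` line per chunk, terminated by `data: [DONE]`. The sketch below collects the text deltas from such lines. `iter_stream_text` is a hypothetical helper, and it assumes the lines have already been read from the HTTP response.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Some chunks carry only a role or a null delta; guard against both.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments with `"".join(...)` reconstructs the full assistant message.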
Upstream: https://ollama.com/
Launch:
make env-ollama
make up BACKEND=ollama
docker exec -it ollama ollama pull llama3.1:8b-instruct

Set `MODEL_NAME` to whatever tag you pulled (e.g. `llama3.1:8b-instruct`).
Ollama exposes an OpenAI-compatible API at `/v1`. Embeddings are supported via
`/v1/embeddings`. The first request after a model pull may be slow while the
engine loads the weights.
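Because the first request can be slow, a client or deploy script may want to send a throwaway warm-up request and retry before real traffic arrives. The following is a minimal sketch of that idea. `wait_until_ready` and its `probe` callable are hypothetical, and `probe` is assumed to perform one cheap request and return `True` on success.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: the backend is not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Taking `sleep` as a parameter keeps the helper testable without real delays. In practice `probe` could wrap a short chat completion call with a generous timeout.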
Upstream: https://github.com/ggerganov/llama.cpp
Launch:
make env-llamacpp
mkdir -p data/models
cp /path/to/my-model.Q4_K_M.gguf data/models/model.gguf
make up BACKEND=llamacpp

`BACKEND_API_KEY` in .env is passed to `llama-server --api-key`; the gateway
propagates it as an `Authorization` header to the backend. Tools are partial
(model-dependent); embeddings work when the GGUF has an embedding head.
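The header propagation described above might look roughly like this. `backend_headers` is a hypothetical helper, and the `Bearer` scheme is an assumption based on the usual OpenAI-style convention, not a copy of the gateway's code.

```python
from typing import Dict, Optional


def backend_headers(backend_api_key: Optional[str]) -> Dict[str, str]:
    """Build the headers sent from the gateway to the backend."""
    headers = {"Content-Type": "application/json"}
    if backend_api_key:
        # Assumption: the backend expects an OpenAI-style bearer token.
        headers["Authorization"] = f"Bearer {backend_api_key}"
    return headers
```

Omitting the header when no key is set keeps the same helper usable for backends that run without authentication.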
Upstream: https://github.com/huggingface/text-generation-inference
Launch:
make env-tgi
make up BACKEND=tgi

TGI does not expose `/v1/embeddings`. The gateway returns a structured 501 if
you try. Use a dedicated embeddings server (e.g. text-embeddings-inference)
or switch `BACKEND_KIND`.
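As an illustration of what a structured 501 could look like, here is a minimal sketch. The function name `embeddings_guard` and the error body layout (an OpenAI-style `error` object) are assumptions; the gateway's real payload may differ.

```python
from typing import Optional, Tuple


def embeddings_guard(backend_kind: str, supports_embeddings: bool) -> Optional[Tuple[int, dict]]:
    """Return (status, body) when embeddings are unsupported, else None."""
    if supports_embeddings:
        return None  # caller proceeds to proxy the request
    body = {
        "error": {
            # Hypothetical field layout, modelled on OpenAI-style error objects.
            "message": f"Backend '{backend_kind}' does not support /v1/embeddings.",
            "type": "not_implemented",
            "code": "embeddings_unsupported",
        }
    }
    return 501, body
```

A route handler would call the guard first and return its result immediately when it is not `None`.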
Upstream: https://github.com/sgl-project/sglang
Launch:
make env-sglang
make up BACKEND=sglang

SGLang is a strong fit for long-context and structured-output workloads. It supports OpenAI-compatible chat, completions, embeddings, and tools.
Upstream: https://github.com/mudler/LocalAI
Launch:
make env-localai
make up BACKEND=localai

LocalAI bundles a model catalog and exposes the OpenAI surface on port 8080 (mapped to 8001 in this repo). Configure models via the LocalAI catalog; the gateway does not manage weights for this backend.
Run LM Studio on the host and enable its Local Server. Then:
make env-external
$EDITOR .env # set BACKEND_KIND=lmstudio, BACKEND_BASE_URL=http://host.docker.internal:1234/v1
make up BACKEND=none

For any other external OpenAI-compatible endpoint:
make env-external
$EDITOR .env # set BACKEND_KIND=openai, BACKEND_BASE_URL, BACKEND_API_KEY if needed
make up BACKEND=none

This mode is useful when the inference engine already runs somewhere: another host, a managed service with an OpenAI-compatible surface, or a cluster behind its own load balancer.
The backend abstraction lives in api/app/backends.py. To add a new backend:
- Add a `BackendCapabilities(...)` entry in `_PROFILES` describing what the runtime supports.
- Add a display name in `_DISPLAY`.
- Add a non-default health path in `_HEALTH_PATHS` if the runtime does not expose `/health`.
- Add a compose service under a new profile in `docker-compose.yml`.
- Add an env template to `deploy/env/<name>.env`.
- Cover it in `api/tests/test_backends.py`.
The gateway routes do not need to change; they consult the profile at runtime.