Backends

Every backend exposes an OpenAI-compatible HTTP surface. The gateway normalises health paths, capability flags, and auth headers; beyond that it is a thin, fast proxy.

Select a backend by setting BACKEND_KIND in .env and launching with the matching compose profile.

`BACKEND_KIND`	Docker profile	Internal base URL	Best-suited for
`vllm`	`--profile vllm`	`http://vllm:8001/v1`	Large-model throughput, tensor parallelism
`ollama`	`--profile ollama`	`http://ollama:11434/v1`	Laptops, quick model swaps, GGUF
`llamacpp`	`--profile llamacpp`	`http://llamacpp:8001/v1`	Pure GGUF serving, CPU + GPU split
`tgi`	`--profile tgi`	`http://tgi:8001/v1`	Hugging Face-native deployments
`sglang`	`--profile sglang`	`http://sglang:8001/v1`	Long-context + structured output workloads
`localai`	`--profile localai`	`http://localai:8001/v1`	All-in-one container with model catalog
`lmstudio`	(run on host)	`http://host.docker.internal:1234/v1`	Desktop dev against LM Studio
`openai`	`--profile none`	any URL	External OpenAI-compatible endpoints

vLLM

Upstream: https://github.com/vllm-project/vllm

Launch:

make env-vllm
$EDITOR .env     # set MODEL_NAME, API_KEYS
make up BACKEND=vllm

Knobs worth tuning in .env:

VLLM_DTYPE: half (FP16) by default. bfloat16 on Ampere+ is often better.
VLLM_MAX_MODEL_LEN: hard cap on context length per request.
VLLM_GPU_MEMORY_UTILIZATION: fraction of GPU memory the engine may claim.

Streaming, tools, and embeddings all work when the chosen model supports them.

Ollama

Upstream: https://ollama.com/

Launch:

make env-ollama
make up BACKEND=ollama
docker exec -it ollama ollama pull llama3.1:8b-instruct

Set MODEL_NAME to whatever tag you pulled (e.g. llama3.1:8b-instruct). Ollama exposes an OpenAI-compatible API at /v1. Embeddings are supported via /v1/embeddings. First request after a model pull may be slow while the engine warms the weights.

llama.cpp (`llama-server`)

Upstream: https://github.com/ggerganov/llama.cpp

Launch:

make env-llamacpp
mkdir -p data/models
cp /path/to/my-model.Q4_K_M.gguf data/models/model.gguf
make up BACKEND=llamacpp

BACKEND_API_KEY in .env is passed to llama-server --api-key; the gateway propagates it as an Authorization header to the backend. Tools are partial (model-dependent); embeddings work when the GGUF has an embedding head.

Text Generation Inference (TGI)

Upstream: https://github.com/huggingface/text-generation-inference

Launch:

make env-tgi
make up BACKEND=tgi

TGI does not expose /v1/embeddings. The gateway returns a structured 501 if you try. Use a dedicated embeddings server (e.g. text-embeddings-inference) or switch BACKEND_KIND.

SGLang

Upstream: https://github.com/sgl-project/sglang

Launch:

make env-sglang
make up BACKEND=sglang

SGLang is a strong fit for long-context and structured-output workloads. It supports OpenAI-compatible chat, completions, embeddings, and tools.

LocalAI

Upstream: https://github.com/mudler/LocalAI

Launch:

make env-localai
make up BACKEND=localai

LocalAI bundles a model catalog and exposes the OpenAI surface on port 8080 (mapped to 8001 in this repo). Configure models via the LocalAI catalog — the gateway does not manage weights for this backend.

LM Studio

Run LM Studio on the host and enable its Local Server. Then:

make env-external
$EDITOR .env   # set BACKEND_KIND=lmstudio, BACKEND_BASE_URL=http://host.docker.internal:1234/v1
make up BACKEND=none

External (any OpenAI-compatible endpoint)

make env-external
$EDITOR .env   # set BACKEND_KIND=openai, BACKEND_BASE_URL, BACKEND_API_KEY if needed
make up BACKEND=none

This mode is useful when the inference engine already runs somewhere — another host, a managed service with an OpenAI-compatible surface, or a cluster behind its own load balancer.

Adding a new backend

The abstraction lives in api/app/backends.py:

Add a BackendCapabilities(...) entry in _PROFILES describing what the runtime supports.
Add a display name in _DISPLAY.
Add a non-default health path in _HEALTH_PATHS if the runtime does not expose /health.
Add a compose service under a new profile in docker-compose.yml.
Add an env template to deploy/env/<name>.env.
Cover it in api/tests/test_backends.py.

The gateway routes do not need to change — they consult the profile at runtime.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backends

vLLM

Ollama

llama.cpp (`llama-server`)

Text Generation Inference (TGI)

SGLang

LocalAI

LM Studio

External (any OpenAI-compatible endpoint)

Adding a new backend

FilesExpand file tree

BACKENDS.md

Latest commit

History

BACKENDS.md

File metadata and controls

Backends

vLLM

Ollama

llama.cpp (llama-server)

Text Generation Inference (TGI)

SGLang

LocalAI

LM Studio

External (any OpenAI-compatible endpoint)

Adding a new backend

llama.cpp (`llama-server`)