Skip to content

Latest commit

 

History

History
159 lines (112 loc) · 4.6 KB

File metadata and controls

159 lines (112 loc) · 4.6 KB

Backends

Every backend exposes an OpenAI-compatible HTTP surface. The gateway normalises health paths, capability flags, and auth headers; beyond that it is a thin, fast proxy.

Select a backend by setting BACKEND_KIND in .env and launching with the matching compose profile.

BACKEND_KIND Docker profile Internal base URL Best-suited for
vllm --profile vllm http://vllm:8001/v1 Large-model throughput, tensor parallelism
ollama --profile ollama http://ollama:11434/v1 Laptops, quick model swaps, GGUF
llamacpp --profile llamacpp http://llamacpp:8001/v1 Pure GGUF serving, CPU + GPU split
tgi --profile tgi http://tgi:8001/v1 Hugging Face-native deployments
sglang --profile sglang http://sglang:8001/v1 Long-context + structured output workloads
localai --profile localai http://localai:8001/v1 All-in-one container with model catalog
lmstudio (run on host) http://host.docker.internal:1234/v1 Desktop dev against LM Studio
openai --profile none any URL External OpenAI-compatible endpoints

vLLM

Upstream: https://github.com/vllm-project/vllm

Launch:

make env-vllm
$EDITOR .env     # set MODEL_NAME, API_KEYS
make up BACKEND=vllm

Knobs worth tuning in .env:

  • VLLM_DTYPE: half (FP16) by default. bfloat16 on Ampere+ is often better.
  • VLLM_MAX_MODEL_LEN: hard cap on context length per request.
  • VLLM_GPU_MEMORY_UTILIZATION: fraction of GPU memory the engine may claim.

Streaming, tools, and embeddings all work when the chosen model supports them.

Ollama

Upstream: https://ollama.com/

Launch:

make env-ollama
make up BACKEND=ollama
docker exec -it ollama ollama pull llama3.1:8b-instruct

Set MODEL_NAME to whatever tag you pulled (e.g. llama3.1:8b-instruct). Ollama exposes an OpenAI-compatible API at /v1. Embeddings are supported via /v1/embeddings. First request after a model pull may be slow while the engine warms the weights.

llama.cpp (llama-server)

Upstream: https://github.com/ggerganov/llama.cpp

Launch:

make env-llamacpp
mkdir -p data/models
cp /path/to/my-model.Q4_K_M.gguf data/models/model.gguf
make up BACKEND=llamacpp

BACKEND_API_KEY in .env is passed to llama-server --api-key; the gateway propagates it as an Authorization header to the backend. Tools are partial (model-dependent); embeddings work when the GGUF has an embedding head.

Text Generation Inference (TGI)

Upstream: https://github.com/huggingface/text-generation-inference

Launch:

make env-tgi
make up BACKEND=tgi

TGI does not expose /v1/embeddings. The gateway returns a structured 501 if you try. Use a dedicated embeddings server (e.g. text-embeddings-inference) or switch BACKEND_KIND.

SGLang

Upstream: https://github.com/sgl-project/sglang

Launch:

make env-sglang
make up BACKEND=sglang

SGLang is a strong fit for long-context and structured-output workloads. It supports OpenAI-compatible chat, completions, embeddings, and tools.

LocalAI

Upstream: https://github.com/mudler/LocalAI

Launch:

make env-localai
make up BACKEND=localai

LocalAI bundles a model catalog and exposes the OpenAI surface on port 8080 (mapped to 8001 in this repo). Configure models via the LocalAI catalog — the gateway does not manage weights for this backend.

LM Studio

Run LM Studio on the host and enable its Local Server. Then:

make env-external
$EDITOR .env   # set BACKEND_KIND=lmstudio, BACKEND_BASE_URL=http://host.docker.internal:1234/v1
make up BACKEND=none

External (any OpenAI-compatible endpoint)

make env-external
$EDITOR .env   # set BACKEND_KIND=openai, BACKEND_BASE_URL, BACKEND_API_KEY if needed
make up BACKEND=none

This mode is useful when the inference engine already runs somewhere — another host, a managed service with an OpenAI-compatible surface, or a cluster behind its own load balancer.


Adding a new backend

The abstraction lives in api/app/backends.py:

  1. Add a BackendCapabilities(...) entry in _PROFILES describing what the runtime supports.
  2. Add a display name in _DISPLAY.
  3. Add a non-default health path in _HEALTH_PATHS if the runtime does not expose /health.
  4. Add a compose service under a new profile in docker-compose.yml.
  5. Add an env template to deploy/env/<name>.env.
  6. Cover it in api/tests/test_backends.py.

The gateway routes do not need to change — they consult the profile at runtime.