A curated, opinionated short-list of open-weight models that pair well with this stack, organised by task and GPU class. Everything below ships under a permissive-enough license for self-hosted deployment, but read the actual license before shipping anything to users.
This catalog is intentionally short. A long list of "everything available" is not useful — pick one baseline per task, measure, then iterate.
Legend:
- Class: recommended GPU memory envelope at FP16 or Q4 GGUF.
- Runtime: which
BACKEND_KINDworks cleanly without heroics.
| Model | Class | Runtime | Notes |
|---|---|---|---|
Qwen/Qwen2.5-7B-Instruct |
16 GB | vllm, sglang, tgi, ollama | Repo default. Strong at chat, RAG, and JSON extraction. |
meta-llama/Llama-3.1-8B-Instruct |
16 GB | vllm, sglang, tgi, ollama | Gated (needs HF token). Tool calling works. |
mistralai/Mistral-7B-Instruct-v0.3 |
16 GB | vllm, sglang, tgi, ollama | Permissive, solid general baseline. |
Qwen/Qwen2.5-14B-Instruct |
24 GB | vllm, sglang, tgi | Upgrade step for chat quality on a 24 GB card. |
microsoft/Phi-3.5-mini-instruct |
8 GB | vllm, ollama | Compact, fast, surprisingly capable. |
| Model | Class | Runtime | Notes |
|---|---|---|---|
Qwen/Qwen2.5-7B-Instruct |
16 GB | vllm, sglang | 128K context configurations exist; set VLLM_MAX_MODEL_LEN carefully. |
meta-llama/Llama-3.1-8B-Instruct |
16 GB | vllm, sglang | 128K native context. |
mistralai/Mistral-Nemo-Instruct-2407 |
24 GB | vllm, sglang | 128K native context. |
| Model | Class | Runtime | Notes |
|---|---|---|---|
Qwen/Qwen2.5-Coder-7B-Instruct |
16 GB | vllm, sglang, tgi, ollama | Strong all-purpose code model. |
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct |
24 GB | vllm, sglang | MoE-light, excellent fill-in-middle. |
bigcode/starcoder2-7b |
16 GB | vllm, tgi | Base model; use Instruct finetunes for chat. |
| Model | Class | Runtime | Notes |
|---|---|---|---|
Qwen/Qwen2.5-7B-Instruct |
16 GB | vllm, sglang | Good tool-calling adherence. |
meta-llama/Llama-3.1-8B-Instruct |
16 GB | vllm, sglang | Native tool formats supported. |
NousResearch/Hermes-3-Llama-3.1-8B |
16 GB | vllm, sglang | Designed for structured/tool output. |
| Model | Class | Runtime | Notes |
|---|---|---|---|
BAAI/bge-small-en-v1.5 |
2 GB | ollama, localai | Fast, tiny, 384-dim. |
BAAI/bge-m3 |
6 GB | ollama, localai | Multilingual, multi-granularity. |
intfloat/e5-large-v2 |
4 GB | ollama, localai | High-quality English embeddings. |
nomic-ai/nomic-embed-text-v1.5 |
2 GB | ollama | Long-context text embeddings. |
TGI does not expose embeddings; run a separate
text-embeddings-inference
server and point /v1/embeddings at it (BACKEND_KIND=openai,
BACKEND_BASE_URL=http://tei:8002/v1).
| Model | Class | Runtime | Notes |
|---|---|---|---|
TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
CPU | ollama, llamacpp | Sanity-check the pipeline. |
microsoft/Phi-3-mini-4k-instruct |
CPU-ok | ollama, llamacpp | Usable on a modern laptop CPU. |
Qwen/Qwen2.5-1.5B-Instruct |
CPU-ok | ollama, llamacpp | Tiny but capable for lightweight tasks. |
Set MODEL_NAME in .env to a Hugging Face model id, then restart:
docker compose down
docker compose --profile vllm up -ddocker exec -it ollama ollama pull qwen2.5:7b-instruct
# then set MODEL_NAME=qwen2.5:7b-instruct in .env and restart the api container:
docker compose restart apiDrop a .gguf file into ./data/models/ and set LLAMACPP_MODEL_FILE in
.env. Restart with docker compose restart llamacpp.
| GPU | Safe FP16 | Safe Q4 GGUF |
|---|---|---|
| 12 GB (3060, T4) | 7B | 13B |
| 16 GB (4070 Ti Super, A4000) | 7B–8B | 14B–20B |
| 24 GB (A10G, L4, 3090, 4090) | 12B–14B | 30B–34B |
| 48 GB (A6000, L40) | 30B–34B | 70B |
| 80 GB (A100, H100) | 70B | 70B+ |
FP16 numbers assume standard context and reasonable concurrency. Long-context
workloads (>16K tokens) consume noticeably more memory for the KV cache —
reduce VLLM_MAX_MODEL_LEN before blaming the model.
- Do not deploy a model without reading its license, especially for closed-commercial-use variants.
- Do not chase benchmark leaderboards for a production deployment. Pick a baseline that the team can evaluate against your prompts, and only move off it when you have data.
- Do not mix incompatible quantizations (e.g. GGUF K-quants loaded with the wrong imatrix) and expect determinism.
- Do not enable long contexts you are not actually using — the KV memory budget kills concurrency even when the prompt is short.