Open-source model catalog

A curated, opinionated short-list of open-weight models that pair well with this stack, organised by task and GPU class. Everything below ships under a permissive-enough license for self-hosted deployment, but read the actual license before shipping anything to users.

This catalog is intentionally short. A long list of "everything available" is not useful — pick one baseline per task, measure, then iterate.

Legend:

  • Class: recommended GPU memory envelope at FP16 or Q4 GGUF.
  • Runtime: which BACKEND_KIND values work cleanly without heroics.

General chat / mixed workloads

| Model | Class | Runtime | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | 16 GB | vllm, sglang, tgi, ollama | Repo default. Strong at chat, RAG, and JSON extraction. |
| meta-llama/Llama-3.1-8B-Instruct | 16 GB | vllm, sglang, tgi, ollama | Gated (needs HF token). Tool calling works. |
| mistralai/Mistral-7B-Instruct-v0.3 | 16 GB | vllm, sglang, tgi, ollama | Permissive, solid general baseline. |
| Qwen/Qwen2.5-14B-Instruct | 24 GB | vllm, sglang, tgi | Upgrade step for chat quality on a 24 GB card. |
| microsoft/Phi-3.5-mini-instruct | 8 GB | vllm, ollama | Compact, fast, surprisingly capable. |

Long-context / RAG

| Model | Class | Runtime | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | 16 GB | vllm, sglang | 128K context configurations exist; set VLLM_MAX_MODEL_LEN carefully. |
| meta-llama/Llama-3.1-8B-Instruct | 16 GB | vllm, sglang | 128K native context. |
| mistralai/Mistral-Nemo-Instruct-2407 | 24 GB | vllm, sglang | 128K native context. |

Code

| Model | Class | Runtime | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-Coder-7B-Instruct | 16 GB | vllm, sglang, tgi, ollama | Strong all-purpose code model. |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | 24 GB | vllm, sglang | Lightweight MoE (16B total, about 2.4B active parameters); excellent fill-in-middle. |
| bigcode/starcoder2-7b | 16 GB | vllm, tgi | Base model; use Instruct finetunes for chat. |

Tool / function calling

| Model | Class | Runtime | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | 16 GB | vllm, sglang | Good tool-calling adherence. |
| meta-llama/Llama-3.1-8B-Instruct | 16 GB | vllm, sglang | Native tool formats supported. |
| NousResearch/Hermes-3-Llama-3.1-8B | 16 GB | vllm, sglang | Designed for structured/tool output. |

Embeddings

| Model | Class | Runtime | Notes |
| --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 2 GB | ollama, localai | Fast, tiny, 384-dim. |
| BAAI/bge-m3 | 6 GB | ollama, localai | Multilingual, multi-granularity. |
| intfloat/e5-large-v2 | 4 GB | ollama, localai | High-quality English embeddings. |
| nomic-ai/nomic-embed-text-v1.5 | 2 GB | ollama | Long-context text embeddings. |

TGI does not expose embeddings; run a separate text-embeddings-inference server and point /v1/embeddings at it (BACKEND_KIND=openai, BACKEND_BASE_URL=http://tei:8002/v1).
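As a rough sketch of what a client request to that embeddings server could look like, the snippet below builds an OpenAI-style embeddings call against the base URL from the note above. It assumes the server accepts the OpenAI-compatible payload (`model` plus `input`) on `/v1/embeddings`; the helper name `build_embeddings_request` and the chosen model id are illustrative and not part of this repo. In a real deployment the request may go through the api container rather than straight to the embeddings server.

```python
import json
import urllib.request

# Base URL taken from the note above; adjust to your deployment.
BASE_URL = "http://tei:8002/v1"


def build_embeddings_request(base_url, model, texts):
    """Build (but do not send) an OpenAI-style embeddings request.

    Hypothetical helper for illustration; assumes the server accepts
    the OpenAI-compatible payload {"model": ..., "input": [...]}.
    """
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embeddings_request(BASE_URL, "BAAI/bge-small-en-v1.5", ["hello world"])
print(req.get_method(), req.full_url)

# To actually send it (requires the server to be reachable):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     vectors = [item["embedding"] for item in json.load(resp)["data"]]
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.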

Small / CPU-only

| Model | Class | Runtime | Notes |
| --- | --- | --- | --- |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | CPU | ollama, llamacpp | Sanity-check the pipeline. |
| microsoft/Phi-3-mini-4k-instruct | CPU-ok | ollama, llamacpp | Usable on a modern laptop CPU. |
| Qwen/Qwen2.5-1.5B-Instruct | CPU-ok | ollama, llamacpp | Tiny but capable for lightweight tasks. |

How to change the model

vLLM / TGI / SGLang

Set MODEL_NAME in .env to a Hugging Face model id, then restart the stack (the example uses the vllm profile):

docker compose down
docker compose --profile vllm up -d
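If the .env edit needs to be scripted, one possible approach is a small text rewrite like the sketch below. It assumes plain `KEY=value` lines with no quoting or `export` prefixes; the helper `set_env_var` and the sample file contents are hypothetical, not something this repo ships.

```python
import re


def set_env_var(env_text, key, value):
    """Return env_text with KEY=value set.

    Replaces an existing uncommented assignment, or appends a new line
    if the key is absent. Hypothetical helper for illustration.
    """
    line = f"{key}={value}"
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    if pattern.search(env_text):
        # A callable replacement avoids backslash/group interpretation.
        return pattern.sub(lambda _match: line, env_text)
    sep = "" if env_text == "" or env_text.endswith("\n") else "\n"
    return f"{env_text}{sep}{line}\n"


# Sample contents only; a real .env will hold more settings.
example = "BACKEND_KIND=vllm\nMODEL_NAME=Qwen/Qwen2.5-7B-Instruct\n"
updated = set_env_var(example, "MODEL_NAME", "mistralai/Mistral-7B-Instruct-v0.3")
print(updated)
```

To apply it to a file, read the text with `pathlib.Path(".env").read_text()`, pass it through `set_env_var`, and write the result back before restarting the stack.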

Ollama

docker exec -it ollama ollama pull qwen2.5:7b-instruct
# then set MODEL_NAME=qwen2.5:7b-instruct in .env and restart the api container:
docker compose restart api
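Before restarting, it can help to confirm that the pull actually landed. A minimal sketch, assuming Ollama's `GET /api/tags` endpoint, which lists local models under a `models` array with a `name` field per entry; the helper `model_is_pulled` is hypothetical, and the host and port in the commented section are assumptions about your setup.

```python
import json


def model_is_pulled(tags_json, model_name):
    """Return True if model_name appears in an Ollama /api/tags response body."""
    payload = json.loads(tags_json)
    names = {entry.get("name") for entry in payload.get("models", [])}
    return model_name in names


# Trimmed sample of the response shape; real responses carry more fields.
sample = '{"models": [{"name": "qwen2.5:7b-instruct"}]}'
print(model_is_pulled(sample, "qwen2.5:7b-instruct"))

# Against a live server (host and port are assumptions; 11434 is Ollama's default):
# import urllib.request
# with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=10) as resp:
#     print(model_is_pulled(resp.read().decode("utf-8"), "qwen2.5:7b-instruct"))
```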

llama.cpp

Drop a .gguf file into ./data/models/ and set LLAMACPP_MODEL_FILE in .env. Restart with docker compose restart llamacpp.


Sizing cheat-sheet

| GPU memory (example cards) | Safe FP16 | Safe Q4 GGUF |
| --- | --- | --- |
| 12 GB (3060) | 7B | 13B |
| 16 GB (T4, 4070 Ti Super, A4000) | 7B–8B | 14B–20B |
| 24 GB (A10G, L4, 3090, 4090) | 12B–14B | 30B–34B |
| 48 GB (A6000, L40) | 30B–34B | 70B |
| 80 GB (A100, H100) | 70B | 70B+ |

FP16 numbers assume standard context and reasonable concurrency. Long-context workloads (>16K tokens) need noticeably more memory because the KV cache grows linearly with the number of cached tokens; reduce VLLM_MAX_MODEL_LEN before blaming the model.
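The memory arithmetic behind this can be sketched as below. It is a back-of-the-envelope estimate, not what any runtime reports: it ignores activations, runtime overhead, and fragmentation. The architecture values are those published in the Llama-3.1-8B config (32 layers, 8 KV heads, head dimension 128), and the 4.5 bits per weight for Q4 is an assumption that matches Q4_0-style quantization; other Q4 variants differ.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate memory for the model weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Values from the published meta-llama/Llama-3.1-8B-Instruct config;
# look up the equivalents in config.json for any other model.
LLAMA_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

# n_tokens is the total across all concurrent sequences, not per request.
for tokens in (4_096, 16_384, 131_072):
    gib = kv_cache_bytes(tokens, **LLAMA_8B) / GIB
    print(f"{tokens:>7} cached tokens -> {gib:.1f} GiB of FP16 KV cache")

# Weights: ~8.0e9 parameters at 16 bits vs. an assumed ~4.5 bits per weight for Q4.
print(f"FP16 weights: {weight_bytes(8.0e9, 16) / GIB:.1f} GiB")
print(f"Q4 weights:   {weight_bytes(8.0e9, 4.5) / GIB:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a single 128K-token sequence needs about 16 GiB of KV cache on top of the weights. Treat the cheat-sheet as a starting point and run this kind of check for each candidate model and context length before committing to a card.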


What not to do

  • Do not deploy a model without reading its license, especially for variants that restrict commercial use.
  • Do not chase benchmark leaderboards for a production deployment. Pick a baseline that your team can evaluate against your own prompts, and only move off it when you have data.
  • Do not mix incompatible quantizations (e.g. GGUF K-quants built with different imatrix calibration data) and expect identical outputs.
  • Do not enable long contexts you are not actually using — the KV memory budget kills concurrency even when the prompt is short.