Open-source model catalog

A curated, opinionated short-list of open-weight models that pair well with this stack, organised by task and GPU class. Everything below ships under a permissive-enough license for self-hosted deployment, but read the actual license before shipping anything to users.

This catalog is intentionally short. A long list of "everything available" is not useful — pick one baseline per task, measure, then iterate.

Legend:

Class: recommended GPU memory envelope at FP16 or Q4 GGUF.
Runtime: which BACKEND_KIND works cleanly without heroics.

General chat / mixed workloads

Model	Class	Runtime	Notes
`Qwen/Qwen2.5-7B-Instruct`	16 GB	vllm, sglang, tgi, ollama	Repo default. Strong at chat, RAG, and JSON extraction.
`meta-llama/Llama-3.1-8B-Instruct`	16 GB	vllm, sglang, tgi, ollama	Gated (needs HF token). Tool calling works.
`mistralai/Mistral-7B-Instruct-v0.3`	16 GB	vllm, sglang, tgi, ollama	Permissive, solid general baseline.
`Qwen/Qwen2.5-14B-Instruct`	24 GB	vllm, sglang, tgi	Upgrade step for chat quality on a 24 GB card.
`microsoft/Phi-3.5-mini-instruct`	8 GB	vllm, ollama	Compact, fast, surprisingly capable.

Long-context / RAG

Model	Class	Runtime	Notes
`Qwen/Qwen2.5-7B-Instruct`	16 GB	vllm, sglang	128K context configurations exist; set `VLLM_MAX_MODEL_LEN` carefully.
`meta-llama/Llama-3.1-8B-Instruct`	16 GB	vllm, sglang	128K native context.
`mistralai/Mistral-Nemo-Instruct-2407`	24 GB	vllm, sglang	128K native context.

Code

Model	Class	Runtime	Notes
`Qwen/Qwen2.5-Coder-7B-Instruct`	16 GB	vllm, sglang, tgi, ollama	Strong all-purpose code model.
`deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`	24 GB	vllm, sglang	MoE-light, excellent fill-in-middle.
`bigcode/starcoder2-7b`	16 GB	vllm, tgi	Base model; use Instruct finetunes for chat.

Tool / function calling

Model	Class	Runtime	Notes
`Qwen/Qwen2.5-7B-Instruct`	16 GB	vllm, sglang	Good tool-calling adherence.
`meta-llama/Llama-3.1-8B-Instruct`	16 GB	vllm, sglang	Native tool formats supported.
`NousResearch/Hermes-3-Llama-3.1-8B`	16 GB	vllm, sglang	Designed for structured/tool output.

Embeddings

Model	Class	Runtime	Notes
`BAAI/bge-small-en-v1.5`	2 GB	ollama, localai	Fast, tiny, 384-dim.
`BAAI/bge-m3`	6 GB	ollama, localai	Multilingual, multi-granularity.
`intfloat/e5-large-v2`	4 GB	ollama, localai	High-quality English embeddings.
`nomic-ai/nomic-embed-text-v1.5`	2 GB	ollama	Long-context text embeddings.

TGI does not expose embeddings; run a separate text-embeddings-inference server and point /v1/embeddings at it (BACKEND_KIND=openai, BACKEND_BASE_URL=http://tei:8002/v1).

Small / CPU-only

Model	Class	Runtime	Notes
`TinyLlama/TinyLlama-1.1B-Chat-v1.0`	CPU	ollama, llamacpp	Sanity-check the pipeline.
`microsoft/Phi-3-mini-4k-instruct`	CPU-ok	ollama, llamacpp	Usable on a modern laptop CPU.
`Qwen/Qwen2.5-1.5B-Instruct`	CPU-ok	ollama, llamacpp	Tiny but capable for lightweight tasks.

How to change the model

vLLM / TGI / SGLang

Set MODEL_NAME in .env to a Hugging Face model id, then restart:

docker compose down
docker compose --profile vllm up -d

Ollama

docker exec -it ollama ollama pull qwen2.5:7b-instruct
# then set MODEL_NAME=qwen2.5:7b-instruct in .env and restart the api container:
docker compose restart api

llama.cpp

Drop a .gguf file into ./data/models/ and set LLAMACPP_MODEL_FILE in .env. Restart with docker compose restart llamacpp.

Sizing cheat-sheet

GPU	Safe FP16	Safe Q4 GGUF
12 GB (3060, T4)	7B	13B
16 GB (4070 Ti Super, A4000)	7B–8B	14B–20B
24 GB (A10G, L4, 3090, 4090)	12B–14B	30B–34B
48 GB (A6000, L40)	30B–34B	70B
80 GB (A100, H100)	70B	70B+

FP16 numbers assume standard context and reasonable concurrency. Long-context workloads (>16K tokens) consume noticeably more memory for the KV cache — reduce VLLM_MAX_MODEL_LEN before blaming the model.

What not to do

Do not deploy a model without reading its license, especially for closed-commercial-use variants.
Do not chase benchmark leaderboards for a production deployment. Pick a baseline that the team can evaluate against your prompts, and only move off it when you have data.
Do not mix incompatible quantizations (e.g. GGUF K-quants loaded with the wrong imatrix) and expect determinism.
Do not enable long contexts you are not actually using — the KV memory budget kills concurrency even when the prompt is short.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Open-source model catalog

General chat / mixed workloads

Long-context / RAG

Code

Tool / function calling

Embeddings

Small / CPU-only

How to change the model

vLLM / TGI / SGLang

Ollama

llama.cpp

Sizing cheat-sheet

What not to do

FilesExpand file tree

MODELS.md

Latest commit

History

MODELS.md

File metadata and controls

Open-source model catalog

General chat / mixed workloads

Long-context / RAG

Code

Tool / function calling

Embeddings

Small / CPU-only

How to change the model

vLLM / TGI / SGLang

Ollama

llama.cpp

Sizing cheat-sheet

What not to do