diff --git a/models/index.html b/models/index.html index 1d5cf8e7..c0b99822 100644 --- a/models/index.html +++ b/models/index.html @@ -608,6 +608,7 @@ ✍️ Writing 🔍 Search 🧮 Reasoning + 🖥️ Local models ⚙️ How to configure @@ -1174,6 +1175,127 @@
+
+ + +
+
+
+
🖥️
+
+

Best models to run locally

+

Open-weight models you can run on hardware you own — no API key, no monthly bill, no data leaving your machine. Ranked by practical capability on consumer GPU and Apple Silicon hardware.

+
+
+ +
+ +
+
🥇
+
+

Gemma 4 31B

+
Google · Apache 2.0 · 256K context · ~20 GB VRAM (Q4)
+
The best single-GPU open model in 2026. 84.3% on GPQA Diamond, 80.0% on LiveCodeBench v6, 89.2% on AIME 2026 math. Dense architecture (all 31B active every call) gives it consistent quality without the coordination overhead of MoE. Genuinely multimodal — text and images. Runs on an RTX 3090/4090 or an M2/M3 Pro MacBook. Apache 2.0 means you can fine-tune and deploy commercially. The cloud-model quality gap is now thin at this tier.
+
+ GPQA 84.3% + Apache 2.0 + 20 GB VRAM +
+
+
+ 84.3% + GPQA Diamond +
+
+ + +
+
🥈
+
+

Qwen 3.5 27B

+
Alibaba · Apache 2.0 · 800K context · ~16 GB VRAM (Q4)
+
Best coding benchmark of any model that fits on a 16 GB GPU — 72.4% SWE-bench Verified, which beats models twice its size. The 800K token context window handles large codebases in a single pass. Dual-mode: fast direct answers or slow chain-of-thought reasoning when you need it. Runs on an RTX 4080, M2/M3 Max, or any machine with 16 GB VRAM. Instruction-following (IFBench 76.5%) beats GPT-5.2. The pragmatic local pick for developers.
+
+ SWE 72.4% + 800K context + 16 GB VRAM +
+
+
+ 72.4% + SWE-bench +
+
+ + +
+
🥉
+
+

DeepSeek R1 32B (distill)

+
DeepSeek · MIT · 128K context · ~17 GB VRAM (Q4)
+
The strongest reasoning model you can run on a single RTX 4090. This is the 32B knowledge-distilled version of the 671B DeepSeek R1 — same chain-of-thought training, fraction of the compute. 62.1% GPQA Diamond, 72.0% AIME 2025, 85.4% HumanEval. It approaches the full model on math and logical deduction. MIT license, free to fine-tune. The right pick when you need to solve hard problems locally — theorem-level math, complex debugging, multi-step analysis — on hardware you already own.
+
+ MIT license + Chain-of-thought + 17 GB VRAM +
+
+
+ 62.1% + GPQA Diamond +
+
+ + +
+
4
+
+

Llama 4 Scout

+
Meta · Llama 4 Community License · 10M context · ~24 GB VRAM (Q4)
+
One trick that nothing else matches: a 10 million token context window — fit entire codebases, entire books, months of logs in a single prompt. MoE architecture (109B total, only 17B active) keeps inference fast despite the scale. Natively multimodal with text and image support. MMLU 74.3%, HumanEval 81.2%. The context window alone makes it worth running if your use case involves huge documents or large repo Q&A. Note: not true open source — the Llama 4 Community License restricts deployment at 700M+ MAU.
+
+ 10M context + Multimodal + 24 GB VRAM +
+
+
+ 10M + token context +
+
+ + +
+
5
+
+

Phi-4 Reasoning 14B

+
Microsoft · MIT · 32–64K context · ~8 GB VRAM (Q4)
+
The best model for machines with limited VRAM — an 8 GB GPU or a MacBook with 16 GB RAM. At only 14B parameters, Phi-4 Reasoning outperforms the DeepSeek R1 70B distill on several reasoning benchmarks, and the 3.8B mini variant runs on phones. Trained with Microsoft's compute-optimal recipe that prioritizes reasoning ability over raw parameter count. Short context (32–64K) is the main constraint; not suitable for large documents. But for logic puzzles, code review, math, and structured analysis, it's the most accessible local reasoning model available.
+
+ 8 GB VRAM + MIT license + Laptop-friendly +
+
+
+ 8 GB + min VRAM +
+
+
+ +
+ 💡 + Running locally means you own the model and the data. Use Ollama or llama.cpp to serve any of these, then point Hermes at your local server: set provider: openai with base_url: http://localhost:11434/v1. Your API key can be any string. +
+
+
+ +
+ @@ -1258,6 +1380,11 @@
OpenRouter
One API key, every model. Switch between Claude, GPT, Gemini instantly.
+
+
🖥️ Run privately
+
Gemma 4 31B or Qwen 3.5 27B
+
No API key, no data leaving your machine. Best two on consumer hardware.
+
@@ -1326,7 +1453,7 @@ }); // Update active tab on scroll - var sections = ['overall','coding','writing','search','reasoning','howto']; + var sections = ['overall','coding','writing','search','reasoning','local','howto']; window.addEventListener('scroll', function() { var scrollY = window.scrollY + 120; var current = sections[0];