diff --git a/models/index.html b/models/index.html
index 1d5cf8e7..c0b99822 100644
--- a/models/index.html
+++ b/models/index.html
@@ -608,6 +608,7 @@
✍️ Writing
🔍 Search
🧮 Reasoning
+ 🖥️ Local models
⚙️ How to configure
@@ -1174,6 +1175,127 @@
+
+
+
+
+
+
+
+
+
+
+
🥇
+
+
Gemma 4 31B
+
Google · Apache 2.0 · 256K context · ~20 GB VRAM (Q4)
+
The best single-GPU open model in 2026. 84.3% on GPQA Diamond, 80.0% on LiveCodeBench v6, 89.2% on AIME 2026 math. Its dense architecture (all 31B parameters active on every token) gives consistent quality without the expert-routing overhead of MoE. Genuinely multimodal: text and images. Runs on an RTX 3090/4090 (24 GB) or an M2/M3 Pro MacBook with 32 GB or more of unified memory. Apache 2.0 means you can fine-tune and deploy commercially. The quality gap to cloud models is now thin at this tier.
+
+ GPQA 84.3%
+ Apache 2.0
+ 20 GB VRAM
+
+
+
+ 84.3%
+ GPQA Diamond
+
+
+
+
+
+
🥈
+
+
Qwen 3.5 27B
+
Alibaba · Apache 2.0 · 800K context · ~16 GB VRAM (Q4)
+
Best coding benchmark of any model that fits on a 16 GB GPU: 72.4% SWE-bench Verified, which beats models twice its size. The 800K-token context window can take a large codebase in a single pass, although a context that long needs far more KV-cache memory than a 16 GB card has to spare. Dual-mode: fast direct answers, or slow chain-of-thought reasoning when you need it. Runs on an RTX 4080, an M2/M3 Max, or any machine with 16 GB of VRAM. Instruction-following (IFBench 76.5%) beats GPT-5.2. The pragmatic local pick for developers.
+
+ SWE 72.4%
+ 800K context
+ 16 GB VRAM
+
+
+
+ 72.4%
+ SWE-bench
+
+
+
+
+
+
🥉
+
+
DeepSeek R1 32B (distill)
+
DeepSeek · MIT · 128K context · ~17 GB VRAM (Q4)
+
The strongest reasoning model you can run on a single RTX 4090. This is DeepSeek-R1-Distill-Qwen-32B: Qwen2.5-32B fine-tuned on reasoning traces generated by the full 671B DeepSeek R1. It inherits R1's chain-of-thought style at a fraction of the compute. 62.1% GPQA Diamond, 72.6% AIME 2024, 85.4% HumanEval. It approaches the full model on math and logical deduction. MIT license, free to fine-tune. The right pick when you need to solve hard problems locally (theorem-level math, complex debugging, multi-step analysis) on hardware you already own.
+
+ MIT license
+ Chain-of-thought
+ 17 GB VRAM
+
+
+
+ 62.1%
+ GPQA Diamond
+
+
+
+
+
+
4
+
+
Llama 4 Scout
+
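Meta · Llama 4 Community License · 10M context · ~60 GB VRAM (Q4)
+
One trick that nothing else matches: a 10 million token context window, enough for entire codebases, entire books, or months of logs in a single prompt. Its MoE architecture (109B total, only 17B active per token) keeps inference fast despite the scale. However, all 109B parameters still have to sit in memory: about 60 GB at Q4, well beyond a single 24 GB consumer GPU (Meta targets a single 80 GB H100 with Int4 quantization). Natively multimodal with text and image support. MMLU-Pro 74.3%, HumanEval 81.2%. The context window alone makes it worth running if your use case involves huge documents or large-repo Q&A, though prompts in the millions of tokens need a KV cache far larger than the weights. Note: it is not true open source. The Llama 4 Community License requires companies with more than 700M monthly active users to request a separate license from Meta.
+
+ 10M context
+ Multimodal
+ ~60 GB VRAM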
+
+
+
+ 10M
+ token context
+
+
+
+
+
+
5
+
+
Phi-4 Reasoning 14B
+
Microsoft · MIT · 32–64K context · ~8 GB VRAM (Q4)
+
The best model for machines with limited VRAM: an 8 GB GPU or a MacBook with 16 GB RAM. On an 8 GB card, Q4 is a tight fit, so a slightly smaller quant or partial CPU offload may be needed. At only 14B parameters, Phi-4 Reasoning outperforms the DeepSeek R1 70B distill on several reasoning benchmarks, and the 3.8B mini variant runs on phones. Microsoft trained it by supervised fine-tuning of Phi-4 on a curated set of reasoning prompts, using demonstrations generated by o3-mini. The recipe favors reasoning ability over raw parameter count. Short context (32–64K) is the main constraint, so it is not suitable for large documents. But for logic puzzles, code review, math, and structured analysis, it is the most accessible local reasoning model available.
+
+ 8 GB VRAM
+ MIT license
+ Laptop-friendly
+
+
+
+ 8 GB
+ min VRAM
+
+
+
+
+
+
💡
+
Running locally means you own the model and the data. Serve any of these with Ollama or llama.cpp, then point Hermes at your local server: set provider: openai with base_url: http://localhost:11434/v1 (Ollama's default; llama.cpp's llama-server listens on http://localhost:8080/v1 by default). Your API key can be any string.
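You can test this setup outside Hermes by sending one request to the same OpenAI-compatible endpoint. The sketch below is minimal and makes some assumptions:

* It runs in Node 18+ (or a browser) for the global `fetch`.
* It uses Ollama's default port.
* The helper names and the model tag you pass in are hypothetical. Substitute the tag your server actually lists.

```javascript
// Sketch: talk to a local OpenAI-compatible server (Ollama by default).
// Assumption: Ollama's default port; llama.cpp's llama-server defaults to :8080.
const DEFAULT_BASE_URL = "http://localhost:11434/v1";

// Hypothetical helper: builds the URL and fetch options for a chat completion.
function buildChatRequest(model, prompt, { baseUrl = DEFAULT_BASE_URL, apiKey = "local" } = {}) {
  // Strip trailing slashes so we never produce ".../v1//chat/completions".
  const url = baseUrl.replace(/\/+$/, "") + "/chat/completions";
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Ollama ignores the key, but OpenAI-style clients expect one to be sent.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model, // hypothetical tag; use whatever your server lists
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Hypothetical helper: sends the request and returns the first choice's text.
async function chatLocal(model, prompt, options) {
  const { url, init } = buildChatRequest(model, prompt, options);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Local server returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

For example, `await chatLocal("your-model-tag", "Say hello")` returns the reply text. Pass `{ baseUrl: "http://localhost:8080/v1" }` when serving with llama-server instead. Keeping request construction separate from sending makes the URL and payload easy to inspect when a server rejects them.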
+
+
+
+
+
+
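The VRAM figures in the cards above come from rough arithmetic. The weight size in bytes is parameters × bits per weight ÷ 8. The KV cache for the context window comes on top of that. The sketch below is minimal and assumes two things:

* Q4 is about 4.5 bits per weight, which matches llama.cpp's Q4_0 block format. Q4_K_M is slightly larger.
* The KV cache is stored in fp16.

The helper names are hypothetical. The model shape in the usage note is also hypothetical and is not any listed model's real configuration.

```javascript
// Sketch of the memory arithmetic behind "~N GB VRAM (Q4)".
// Assumption: Q4 ≈ 4.5 bits/weight (Q4_0: 32 four-bit weights + one fp16 scale per block).
function weightsGB(totalParamsBillions, bitsPerWeight = 4.5) {
  // For MoE models pass the TOTAL parameter count: every expert must be
  // resident in memory even though only a few are active per token.
  // Billions of params × bytes per param = GB (1e9 bytes).
  return (totalParamsBillions * bitsPerWeight) / 8;
}

// KV cache grows linearly with context: one K and one V vector per layer,
// per KV head, per token. bytesPerElem = 2 assumes an fp16 cache.
function kvCacheGB({ layers, kvHeads, headDim, contextTokens, bytesPerElem = 2 }) {
  const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem;
  return (bytesPerToken * contextTokens) / 1e9;
}
```

Some worked examples:

* `weightsGB(31)` gives about 17.4 GB for a 31B dense model. That leaves a few GB on a 24 GB card for the KV cache and runtime buffers.
* For a 109B-total, 17B-active MoE design, `weightsGB(109)` is about 61 GB. Active parameters set the compute per token, not the memory footprint.
* `kvCacheGB` with a hypothetical shape of 48 layers, 8 KV heads, and head size 128, at 1M tokens, comes to roughly 197 GB. This is why multi-million-token prompts stay out of reach on consumer hardware even when the weights fit.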
@@ -1258,6 +1380,11 @@
OpenRouter
One API key, every model. Switch between Claude, GPT, Gemini instantly.
+
+
🖥️ Run privately
+
Gemma 4 31B or Qwen 3.5 27B
+
No API key, no data leaving your machine. Best two on consumer hardware.
+
@@ -1326,7 +1453,7 @@
});
// Update active tab on scroll
- var sections = ['overall','coding','writing','search','reasoning','howto'];
+ var sections = ['overall','coding','writing','search','reasoning','local','howto'];
window.addEventListener('scroll', function() {
var scrollY = window.scrollY + 120;
var current = sections[0];