⚡ Hermes Home Why Hermes Features Compare Install 🧠 ELI5 — What is Hermes? 🤖 Best Models 2026 🌎 Community Get started →
🏆 Overall 💻 Coding ✍️ Writing 🔍 Search 🧮 Reasoning 🖥️ Local models ⚙️ How to configure
2026 Model Guide

Best AI models for Hermes

Top picks across coding, writing, search, and reasoning — so you know exactly what to plug in and why.

Data from SWE-bench Pro, GPQA Diamond, Chatbot Arena, and BenchLM. Updated May 2, 2026. Source →

🏆

Overall best models

Great at everything. If you only pick one, pick from here. These handle coding, writing, research, and reasoning with minimal trade-offs.

🥇

Claude Opus 4.7

Anthropic · 1M context · $5 / $25 per 1M tokens · Released Apr 16, 2026
Anthropic's most capable generally available model as of April 2026. 70% on CursorBench (vs 58% for Opus 4.6), 98.5% XBOW visual-acuity (vs 54.5%), 3x more resolved production tasks at Rakuten. Catches its own logical faults mid-planning. 3x higher image resolution — 3.75MP vs 1.15MP. Substantially better at multi-session memory and long agentic work. Use it for your most demanding tasks where quality matters more than speed.
New Anthropic flagship Best all-rounder 1M context
70% CursorBench
🥈

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
#1 on Chatbot Arena (1,505 Elo) — now the top-ranked publicly available model. Also #1 on GPQA Diamond reasoning (94.1%), #1 on ARC-AGI-2 (77.1%), and #1 on creative writing Arena. Exceptional value at $2/$12 per 1M tokens. The 2M context handles entire codebases and book-length documents.
Arena #1 (1505 Elo) Best value 2M context
1492 Arena Elo
🥉

GPT-5.5

OpenAI · 1M context · $5 / $30 per 1M tokens · Released Apr 23, 2026
OpenAI's new flagship. Massive long-context leap: recall at 512K–1M tokens jumped from ~21% to 74%, making the million-token window actually usable. ARC-AGI-2 up 11.7 points to 85.0%. Terminal-Bench 2.0 at 82.7% and SWE-bench Pro at 58.6% tie it with Kimi K2.6 at the top of the coding charts. GPQA Diamond at 93.6%. Priced at 2× GPT-5.4 — use GPT-5.4 ($2.50/$15) for cost-sensitive work and GPT-5.5 when you need the full capability jump.
New OpenAI flagship ARC-AGI-2: 85% 1M context
85.0% ARC-AGI-2
4

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
The fast, smart everyday workhorse. 87.5% on GPQA Diamond, 79.6% on SWE-bench Verified, 1,462 Elo on Chatbot Arena. Nearly as capable as Opus at lower cost and higher speed. The right choice for high-volume tasks — daily summaries, coding, research — where you need quality without always paying Opus prices.
Great value Low latency Reliable
1462 Arena Elo
5

DeepSeek V4-Flash

DeepSeek · 1M context · $0.14 / $0.28 per 1M tokens · MIT license · Released Apr 24, 2026
The new extreme budget pick. Released April 24, 2026 alongside V4-Pro. 284B total / 13B active MoE — at $0.14 per 1M tokens it costs 20× less than GPT-5.4 while staying in the same conversation. MIT license, 1M token context, self-hostable. DeepSeek V3.2 (its predecessor) retires July 2026. If you're cost-optimizing at scale, nothing at this price point comes close.
MIT license Ultra cheap Self-hostable
$0.14 per 1M input
💡 Starting out? Claude Sonnet 4.6 is the best first pick — strong, affordable, and the model this project was built and tested with. Get your API key at console.anthropic.com.
💻

Best for coding

Ranked on SWE-bench Pro (1,865 real GitHub issues, multi-language, standardised scaffold — the current clean benchmark) and SWE-bench Verified. These fix bugs and ship features, not just autocomplete.

🥇

Claude Opus 4.7

Anthropic · 1M context · $5 / $25 per 1M tokens
Powers Claude Code and Cursor — the two most-used AI coding tools. 70% on CursorBench (vs 58% for Opus 4.6), 90.9% on BigLaw Bench at high effort, 10-15% task success lift at Factory, 10%+ recall improvement on complex PRs at CodeRabbit. Where Opus 4.7 shines specifically is deep multi-file reasoning with self-correction: it catches its own logical faults during planning before reporting back. The 1M context window fits entire codebases.
Powers Claude Code Self-correcting 1M context
70% CursorBench
🥈

Kimi K2.6

Moonshot AI · 262K context · $0.60 / $3.00 per 1M tokens · Open weights
Leads SWE-bench Pro at 58.6% — beating GPT-5.4 (57.7%) and every other closed model. The only major open coding model with native image and video input (MoonViT-3D encoder). Supports agent swarms up to 300 parallel sub-agents with 4,000 coordinated tool calls and 12+ hours of sustained autonomous execution. Demonstrated real-world gains: 185% throughput improvement on a production financial matching engine, and 15% task success lift reported by Factory.ai. Self-hostable under a modified MIT license.
SWE-Pro #1 (58.6%) 300-agent swarm Open weights
58.6% SWE-bench Pro
🥉

GPT-5.5

OpenAI · 1M context · $5 / $30 per 1M tokens · Terminal-Bench 2.0: 82.7%
Terminal-Bench 2.0 at 82.7% — strongest agentic CLI and DevOps performance of any model. SWE-bench Pro at 58.6% ties Kimi K2.6 for #1 on real-world code fixes. The rebuilt inference stack co-designed with NVIDIA GB300 hardware delivers 20%+ throughput improvement, partially offsetting the 2× price jump over GPT-5.4. For intensive agentic code pipelines where you need both coding depth and shell automation, GPT-5.5 is the pick. Note: GPT-5.4 ($2.50/$15) remains the better value for most everyday coding.
Terminal-Bench: 82.7% SWE-Pro #1 (58.6%) Agentic pipelines
82.7% Terminal-Bench 2.0
4

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
79.6% on SWE-bench Verified — within 1.2 points of Opus at 40% lower cost and significantly faster. The right everyday coding model for iterative development, unit tests, code explanation, and high-volume agentic loops where paying Opus prices for every call doesn't make sense. Outperforms the now-deprecated Sonnet 4.5 on every benchmark.
Best value Fast iteration High volume
79.6% SWE-bench Verified
5

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
80.6% on SWE-bench Verified and 54.2% on SWE-bench Pro at the most competitive price of any frontier model. The massive 2M context window means you can load an entire large codebase and reason across it in one pass — no chunking, no retrieval. The cheapest path to top-tier coding performance.
2M context Cheapest frontier Full-codebase
80.6% SWE-bench Verified
✍️

Best for writing

Creative writing, copywriting, long-form docs, email drafts. Ranked on EQ-Bench Creative Writing Elo (sycophancy-resistant, community-verified) and Chatbot Arena creative writing scores.

🥇

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
Leads EQ-Bench Creative Writing at 1,936 Elo — higher than Opus (1,932) on the benchmark specifically designed to resist sycophancy and measure genuine literary quality. Best voice consistency over long documents: tone, register, and style stay coherent across sessions. At 85% lower cost than Opus, it's the smart pick for high-volume writing — drafts, summaries, long-form content pipelines.
EQ-Bench CW #1 (1936) Best value Voice consistency
1936 EQ-Bench CW Elo
🥈

Claude Opus 4.7

Anthropic · 1M context · $5 / $25 per 1M tokens
The upgraded heir to Opus 4.6 on literary and instruction-following benchmarks. More direct and opinionated tone than 4.6 — fewer hedges, more conviction. Described as "best model in the world for building dashboards and data-rich interfaces" by Val Town. Raises the bar on professional output quality — interfaces, slides, long-form docs. Best for projects demanding precision and creative depth where spending more per token is worth it.
Professional output More opinionated Literary depth
#2 EQ-Bench CW
🥉

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
#1 on Chatbot Arena creative writing Elo (1,487) — human raters prefer it for fiction and blogs — and best for AI-tell avoidance across independent evals. The 2M context and 65K output limit are unmatched for long-form projects: entire chapters, full reports, long narrative arcs. Strong multilingual creative writing in 40+ languages. 12x cheaper on input than Opus.
Arena CW #1 (1487) 2M context AI-tell avoidance
1487 Arena CW Elo
4

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
Solid for structured and commercial writing: technical docs, reports, pitch decks, email campaigns. Excellent instruction-following (~92% IFEval). Ranked ~9th on Arena creative writing — noticeably behind Claude and Gemini for fiction and literary prose, but a strong default when your output needs precise formatting or structured argument flow over stylistic voice.
Structured docs 92% IFEval Technical writing
~9th Arena CW rank
5

Kimi K2.6

Moonshot AI · 262K context · $0.60 / $3.00 per 1M tokens · Open weights
~1,700 EQ-Bench Creative Writing Elo — roughly 87% of Sonnet's literary quality at 80% lower cost. The budget pick for high-volume content: product descriptions, social copy, blog drafts, and content pipelines where you need coherent writing at scale without paying frontier prices on every call. Upgraded from K2.5 with a larger 262K context window and native image input. API live via Moonshot platform.
Budget CW pick ~1700 EQ-Bench CW Volume content
$0.60 per 1M input
🧮

Best for reasoning and analysis

Hard math, PhD-level science, complex multi-step logic, knowledge work, and second-brain tasks. Ranked on GPQA Diamond (PhD expert baseline: 65%), Humanity's Last Exam, and ARC-AGI-2.

🥇

Gemini 3.1 Pro

Google DeepMind · 1–2M context · 94.1% GPQA Diamond
Leads GPQA Diamond at 94.1% and ARC-AGI-2 visual reasoning at 77.1% — the highest score on both of the hardest published reasoning benchmarks. Near-perfect AIME 2025 math (98%+). Strong across physics, chemistry, biology, and multi-domain expert knowledge. The 2M context makes it uniquely capable for research requiring both depth and breadth in one pass.
GPQA #1 (94.1%) ARC-AGI-2 #1 (77.1%) 2M context
94.1% GPQA Diamond
🥈

GPT-5.4

OpenAI · 1.1M context · ~92% GPQA Diamond
~92% on GPQA Diamond, 41.6% on Humanity's Last Exam, and ~92% on IFEval strict compliance. The best-balanced reasoning model: strong on scientific knowledge, reliable at following precise analytical instructions, and capable across coding and writing tasks simultaneously. The GPT-5.4 Pro variant adds extended reasoning for the genuinely hard problems.
GPQA ~92% 92% IFEval Balanced
93.6% GPQA Diamond
🥉

Claude Opus 4.7

Anthropic · 1M context · 128K max output
21% fewer errors on OfficeQA Pro document reasoning (Databricks), 13% resolution lift on 93-task coding benchmark (Morph), and 90.9% on BigLaw Bench (Harvey). Now accepts images up to 3.75MP — 3x more than Opus 4.6 — making it the strongest pick for visual research, dense diagrams, and chart analysis. Best for research requiring both depth and 1M-token coherence. Claude Sonnet 4.6 leads GDPval-AA retrieval (1,633 Elo) if throughput matters.
Best Anthropic model 1M context 3.75MP vision
90.9% BigLaw Bench
4

Gemini 3 Flash (Thinking)

Google DeepMind · 1M context · $0.50 / $3 per 1M tokens
89.8% on GPQA Diamond at just $0.50/$3 per 1M tokens — the best reasoning value by a large margin, outperforming models that cost 20x more. The thinking mode shows its chain-of-thought for auditing. 0.34s time-to-first-token. If you run many hard reasoning calls per day and can't justify frontier pricing, nothing else comes close at this price point.
89.8% GPQA Best value Auditable thinking
$0.50 per 1M input
5

Qwen 3.6-Plus

Alibaba · 1M context · 88.2% GPQA Diamond · Competitive pricing
88.2% on GPQA Diamond with a 1M token context window at a fraction of frontier pricing — a genuine dark-horse for knowledge work. Strong on structured knowledge retrieval and multi-step analytical tasks. The same model that leads Terminal-Bench for coding also performs well on reasoning benchmarks, making it unusually versatile. Available via Alibaba Cloud and OpenRouter.
GPQA 88.2% 1M context Value pick
88.2% GPQA Diamond
🖥️

Best models to run locally

Open-weight models you can run on hardware you own — no API key, no monthly bill, no data leaving your machine. Ranked by practical capability on consumer GPU and Apple Silicon hardware.

🥇

Gemma 4 31B

Google · Apache 2.0 · 256K context · ~20 GB VRAM (Q4)
The best single-GPU open model in 2026. 84.3% on GPQA Diamond, 80.0% on LiveCodeBench v6, 89.2% on AIME 2026 math. Dense architecture (all 31B active every call) gives it consistent quality without the coordination overhead of MoE. Genuinely multimodal — text and images. Runs on an RTX 3090/4090 or an M2/M3 Pro MacBook. Apache 2.0 means you can fine-tune and deploy commercially. The cloud-model quality gap is now thin at this tier.
GPQA 84.3% Apache 2.0 20 GB VRAM
84.3% GPQA Diamond
🥈

Qwen3.6 27B

Alibaba · Apache 2.0 · 128K context · ~16 GB VRAM (Q4)
Released April 22, 2026. 77.2% SWE-bench Verified and 87.8% GPQA Diamond — surpassing the previous Qwen3.6 35B MoE on every major benchmark with a single dense 27B checkpoint. Terminal-Bench 2.0: 59.3%. Multimodal (vision + language), unified thinking/non-thinking mode. Apache 2.0. Runs on a 16 GB Mac Mini or RTX 4090. For now use llama.cpp or Unsloth Studio — Ollama GGUFs don't yet pair the vision projector. Supersedes Qwen3.6 35B released just a week prior.
SWE 77.2% 128K context 16 GB VRAM
77.2% SWE-bench
🥉

DeepSeek R1 32B (distill)

DeepSeek · MIT · 128K context · ~20 GB VRAM (Q4)
The strongest battle-tested reasoning model on a single RTX 4090. This is the 32B knowledge-distilled version of the 671B DeepSeek R1 — same chain-of-thought training, fraction of the compute. 62.1% GPQA Diamond, 72.6% AIME 2024, 94.3% MATH-500. Released January 2025 and still the most downloaded reasoning model on Ollama (82M+ pulls) — genuinely proven at scale. MIT license, free to fine-tune. Run it with ollama run deepseek-r1:32b.
MIT license Chain-of-thought 20 GB VRAM
62.1% GPQA Diamond
4

Llama 4 Scout

Meta · Llama 4 Community License · 10M ctx (128K via Ollama) · ~24 GB VRAM (Q4, 67 GB download)
One trick nothing else matches: a 10 million token context window — fit entire codebases, books, or months of logs in a single prompt. MoE architecture (109B total, only 17B active per token) keeps inference fast despite the scale. Natively multimodal: text and images. MMLU Pro 74.3%, GPQA Diamond 57.2%, DocVQA 94.4%. Note: Ollama serves it at 128K context by default — the full 10M requires a multi-GPU server build. The 67 GB Q4 download fits a single RTX 4090 (24 GB) but it is tight. Not true open source — the Llama 4 Community License restricts deployment at 700M+ MAU.
10M context Multimodal 24 GB VRAM
10M token context
5

Phi-4 Reasoning 14B

Microsoft · MIT · 32–64K context · ~8 GB VRAM (Q4)
The best reasoning model for machines with limited VRAM — an 8 GB GPU or a MacBook with 16 GB RAM. The "plus" variant (Phi-4-reasoning-plus) adds RL training on top of the base SFT and is the one to use: 81.3% AIME 2024 and 68.9% GPQA Diamond, which outperforms the DeepSeek R1 70B distill (5× its size) on both benchmarks. Run it with ollama run phi4-reasoning:14b-plus. Short context (32K, tested to 64K) is the main limit — not suitable for large documents. For math, structured analysis, and code review on laptop hardware, nothing at this weight class comes close.
8 GB VRAM MIT license Laptop-friendly
8 GB min VRAM
💡 Running locally means you own the model and the data. Use Ollama or llama.cpp to serve any of these, then point Hermes at your local server: set provider: openai with base_url: http://localhost:11434/v1. Your API key can be any string.
⚙️

How to set up a model in Hermes

Each model needs an API key and a provider setting. Here's the quick version for each major provider.

Anthropic (Claude models)

Get your key at console.anthropic.com, then in Hermes settings set provider: anthropic and ANTHROPIC_API_KEY in your environment. Model names: claude-opus-4-7, claude-opus-4-6, claude-sonnet-4-6.

OpenAI (GPT models)

Get your key at platform.openai.com, then set provider: openai and OPENAI_API_KEY. Current models: gpt-5-5 (flagship, $5/$30), gpt-5-4 (affordable, $2.50/$15), gpt-5-4-mini (small), gpt-5-4-nano (nano).

Google (Gemini models)

Get your key at aistudio.google.com, then set provider: google and GOOGLE_API_KEY. Model names: gemini-3-1-pro-preview, gemini-3-pro, gemini-3-flash.

OpenRouter (all models via one key)

The easiest way to try multiple models without multiple accounts. Get a key at openrouter.ai, set provider: openrouter and OPENROUTER_API_KEY. Access Claude, GPT, Gemini, DeepSeek, Llama and more with one key.

Self-hosted (DeepSeek V4-Flash, Qwen 3.6-Plus, Gemma 4)

Run models locally with llama.cpp or Ollama. DeepSeek V4-Flash (MIT, $0.14/$0.28 per 1M — released Apr 24, 2026) and Qwen 3.6-Plus are the top open-weight picks for coding. Gemma 4 26B MoE (Apache 2.0, 82.3% GPQA Diamond with only 3.8B active parameters) is the best edge/self-hosted reasoning option. Point Hermes at your local server: set provider: openai with a base_url like http://localhost:11434/v1. Your API key can be any string.

Pick by use case
Not sure which model to start with? Match your task to a pick.
🤖 Daily assistant
Claude Sonnet 4.6
Fast, affordable, strong across everything. The best starting point.
💻 Complex coding
Claude Opus 4.7
Newest Anthropic flagship. Powers Claude Code. Self-correcting reasoning.
✍️ Creative writing
Claude Sonnet 4.6
EQ-Bench CW #1 (1936). Best voice consistency, 85% cheaper than Opus.
🔍 Search & news
Gemini 3.1 Pro
Native Google Search grounding. 2M context for long research.
🧮 Hard reasoning
Gemini 3.1 Pro
GPQA #1 at 94.1%, ARC-AGI-2 #1 at 77.1%.
💰 Budget pick
Gemini 3 Flash (Thinking)
$0.50/1M, 89.8% GPQA Diamond. Best reasoning per dollar by far.
📄 Huge documents
Gemini 3.1 Pro
Up to 2M token context. Load entire codebases or books in one pass.
🗝️ Try everything
OpenRouter
One API key, every model. Switch between Claude, GPT, Gemini instantly.
🖥️ Run privately
Gemma 4 31B or Qwen3.6 27B
No API key, no data leaving your machine. Best two on consumer hardware.
🤖 Agentic CLI/DevOps
GPT-5.5
Terminal-Bench 2.0: 82.7%. New OpenAI flagship, Apr 23 2026. $5/$30 per 1M.

Ready to get started?

Self-host Hermes in under five minutes and bring your own API key.