docs: update models page — Kimi K2.6 replaces K2.5 (58.6% SWE-Pro #1 coding, writing section updated)

nesquena-hermes
2026-04-21 04:19:29 +00:00
parent 68c56e6a2f
commit d33a5d72da
+25 -23
@@ -617,7 +617,7 @@
<div class="hero-badge">2026 Model Guide</div>
<h1>Best AI models for <em>Hermes</em></h1>
<p class="hero-sub">Top picks across coding, writing, search, and reasoning — so you know exactly what to plug in and why.</p>
<p class="hero-note">Data from SWE-bench Pro, GPQA Diamond, Chatbot Arena, and BenchLM. Updated April 20, 2026. <a href="https://lmarena.ai/leaderboard" target="_blank">Source →</a></p>
<p class="hero-note">Data from SWE-bench Pro, GPQA Diamond, Chatbot Arena, and BenchLM. Updated April 21, 2026. <a href="https://lmarena.ai/leaderboard" target="_blank">Source →</a></p>
</section>
<!-- =========================================
@@ -771,8 +771,26 @@
</div>
</div>
<div class="model-card" style="--model-color: #58a6ff;">
<div class="model-card" style="--model-color: #ff9040;">
<div class="rank-badge silver">🥈</div>
<div class="model-info">
<h3>Kimi K2.6</h3>
<div class="model-meta">Moonshot AI &middot; 262K context &middot; $0.60 / $3.00 per 1M tokens &middot; Open weights</div>
<div class="model-why">Leads SWE-bench Pro at 58.6%, ahead of GPT-5.4 (57.7%) and every closed model. The only major open coding model with native image and video input (MoonViT-3D encoder). Supports agent swarms of up to 300 parallel sub-agents with 4,000 coordinated tool calls and 12+ hours of sustained autonomous execution. Demonstrated real-world gains: a 185% throughput improvement on a production financial matching engine, and a 15% task-success lift reported by Factory.ai. Self-hostable under a modified MIT license.</div>
<div class="model-pills">
<span class="pill orange">SWE-Pro #1 (58.6%)</span>
<span class="pill blue">300-agent swarm</span>
<span class="pill green">Open weights</span>
</div>
</div>
<div class="model-score">
<span class="score-val">58.6%</span>
<span class="score-label">SWE-bench Pro</span>
</div>
</div>
<div class="model-card" style="--model-color: #58a6ff;">
<div class="rank-badge bronze">🥉</div>
<div class="model-info">
<h3>GPT-5.4</h3>
<div class="model-meta">OpenAI &middot; 1.1M context &middot; $2.50 / $15 per 1M tokens</div>
@@ -790,7 +808,7 @@
</div>
<div class="model-card" style="--model-color: #3fb950;">
<div class="rank-badge bronze">🥉</div>
<div class="rank-badge">4</div>
<div class="model-info">
<h3>Claude Sonnet 4.6</h3>
<div class="model-meta">Anthropic &middot; 200K context &middot; $3 / $15 per 1M tokens</div>
@@ -808,7 +826,7 @@
</div>
<div class="model-card" style="--model-color: #8b949e;">
<div class="rank-badge">4</div>
<div class="rank-badge">5</div>
<div class="model-info">
<h3>Gemini 3.1 Pro</h3>
<div class="model-meta">Google DeepMind &middot; 12M context &middot; $2 / $12 per 1M tokens</div>
@@ -825,22 +843,6 @@
</div>
</div>
<div class="model-card" style="--model-color: #ff9040;">
<div class="rank-badge">5</div>
<div class="model-info">
<h3>Qwen 3.6-Plus</h3>
<div class="model-meta">Alibaba &middot; 1M context &middot; Emerging agentic pick &middot; OpenRouter available</div>
<div class="model-why">Leads Terminal-Bench at 61.6% — ahead of both GPT-5.4 and Gemini 3.1 Pro on CLI and DevOps automation. 88.2% on GPQA Diamond. The 1M token context fits large codebases cleanly. An emerging dark-horse for agentic coding pipelines with strong independent eval scores. Available now via Alibaba Cloud and OpenRouter.</div>
<div class="model-pills">
<span class="pill orange">Terminal-Bench #1 (61.6%)</span>
<span class="pill blue">1M context</span>
<span class="pill green">Agentic emerging</span>
</div>
</div>
<div class="model-score">
<span class="score-val">61.6%</span>
<span class="score-label">Terminal-Bench</span>
</div>
</div>
</div>
</div>
@@ -937,9 +939,9 @@
<div class="model-card" style="--model-color: #ff9040;">
<div class="rank-badge">5</div>
<div class="model-info">
<h3>Kimi K2.5</h3>
<div class="model-meta">Moonshot AI &middot; 128K context &middot; $0.60 / $2.50 per 1M tokens</div>
<div class="model-why">~1,700 EQ-Bench Creative Writing Elo — roughly 87% of Sonnet's literary quality at 80% lower cost. The budget pick for high-volume content: product descriptions, social copy, blog drafts, content pipelines where you need coherent writing at scale without paying frontier prices on every call. API is live and available now.</div>
<h3>Kimi K2.6</h3>
<div class="model-meta">Moonshot AI &middot; 262K context &middot; $0.60 / $3.00 per 1M tokens &middot; Open weights</div>
<div class="model-why">~1,700 EQ-Bench Creative Writing Elo: roughly 87% of Sonnet's literary quality at 80% lower cost. The budget pick for high-volume content: product descriptions, social copy, blog drafts, and content pipelines where you need coherent writing at scale without paying frontier prices on every call. Upgraded from K2.5 with a larger 262K context window and native image input. The API is live on the Moonshot platform.</div>
<div class="model-pills">
<span class="pill orange">Budget CW pick</span>
<span class="pill green">~1,700 EQ-Bench CW</span>