hermes-webui/models
Nathan Esquenazi a68276fefc docs: synthesise second-agent data into models page
Coding section:
- Gemini 3.1 Pro: add Terminal-Bench 78.4% (highest of any frontier model on CLI/DevOps)
  Score badge updated to show Terminal-Bench rather than SWE-bench Verified
- GPT-5.4: note Terminal-Bench 75.1% in description, consolidate pill text
- #5 DeepSeek V3.2 → Qwen 3.6-Plus: leads Terminal-Bench at 61.6%, 88.2% GPQA,
  1M context, available now on Alibaba Cloud + OpenRouter

Writing section (reordered based on EQ-Bench CW scores):
- #1 Claude Sonnet 4.6 (1936 EQ-Bench CW — highest, best voice consistency)
- #2 Claude Opus 4.6 (Mazur 8.53, IF Arena #1, 1M context for literary depth)
- #3 Gemini 3.1 Pro (Arena CW #1 1487, AI-tell avoidance, 2M context)
- #4 GPT-5.4 (noted as ~9th on Arena CW, better for structured/commercial writing)
- #5 Meta Muse Spark → Kimi K2.5 ($0.60/.50, ~1700 EQ-Bench CW, live API)
  Muse Spark removed — no commercial API available yet

Reasoning section:
- Gemini 3.1 Pro GPQA: 95.45% → 94.1% (more conservative/recent figure, consistent
  with both agents' data)
- Added ARC-AGI-2 77.1% for Gemini 3.1 Pro (#1 on visual reasoning too)
- Opus 4.6: added note that Sonnet leads GDPval-AA (1633 Elo #1) for throughput
- #5 DeepSeek V3.2 → Qwen 3.6-Plus (88.2% GPQA, 1M context, same model as coding)

Quick picker:
- Creative writing: Opus → Sonnet 4.6 (EQ-Bench #1, 85% cheaper)
- Hard reasoning: 95.45% → 94.1%, add ARC-AGI-2 mention
- Budget pick: DeepSeek V3.2 → Gemini 3 Flash Thinking ($0.50/1M, 89.8% GPQA)

Setup boxes:
- Self-hosted: Muse Spark → Qwen 3.6-Plus + Gemma 4 26B MoE (Apache 2.0,
  82.3% GPQA with 3.8B active params, best edge/self-hosted reasoning)

Overall section: unchanged (top 5 still correct per both agents)
Search section: unchanged (no new data from either agent)
2026-04-12 02:49:07 +00:00