mirror of
https://github.com/nesquena/hermes-webui.git
synced 2026-05-27 12:10:40 +00:00
a68276fefc
Coding section:
- Gemini 3.1 Pro: add Terminal-Bench 78.4% (highest of any frontier model on CLI/DevOps)
  Score badge updated to show Terminal-Bench rather than SWE-bench Verified
- GPT-5.4: note Terminal-Bench 75.1% in description, consolidate pill text
- #5 DeepSeek V3.2 → Qwen 3.6-Plus: leads Terminal-Bench at 61.6%, 88.2% GPQA, 1M context, available now on Alibaba Cloud + OpenRouter

Writing section (reordered based on EQ-Bench CW scores):
- #1 Claude Sonnet 4.6 (1936 EQ-Bench CW: highest, best voice consistency)
- #2 Claude Opus 4.6 (Mazur 8.53, IF Arena #1, 1M context for literary depth)
- #3 Gemini 3.1 Pro (Arena CW #1 1487, AI-tell avoidance, 2M context)
- #4 GPT-5.4 (noted as ~9th on Arena CW, better for structured/commercial writing)
- #5 Meta Muse Spark → Kimi K2.5 ($0.60/$….50, ~1700 EQ-Bench CW, live API)
  Muse Spark removed: no commercial API available yet

Reasoning section:
- Gemini 3.1 Pro GPQA: 95.45% → 94.1% (more conservative/recent figure, consistent with both agents' data)
- Added ARC-AGI-2 77.1% for Gemini 3.1 Pro (#1 on visual reasoning too)
- Opus 4.6: added note that Sonnet leads GDPval-AA (1633 Elo #1) for throughput
- #5 DeepSeek V3.2 → Qwen 3.6-Plus (88.2% GPQA, 1M context, same model as coding)

Quick picker:
- Creative writing: Opus → Sonnet 4.6 (EQ-Bench #1, 85% cheaper)
- Hard reasoning: 95.45% → 94.1%, add ARC-AGI-2 mention
- Budget pick: DeepSeek V3.2 → Gemini 3 Flash Thinking ($0.50/1M, 89.8% GPQA)

Setup boxes:
- Self-hosted: Muse Spark → Qwen 3.6-Plus + Gemma 4 26B MoE (Apache 2.0, 82.3% GPQA with 3.8B active params, best edge/self-hosted reasoning)

Overall section: unchanged (top 5 still correct per both agents)

Search section: unchanged (no new data from either agent)