Open reference for evaluating AI browser agents, computer-use systems, and coding agents across the public benchmarks teams actually compare on.
Live site: leaderboard.steel.dev — full rankings, methodology notes, and per-result detail.
Maintained by Steel, the open-source browser API for AI agents.
The tables below show the current top entries on each tracked benchmark. Each section links to the full leaderboard with sources, methodology, and additional context.
Browser agents · Agent scope · 19 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Alumnium (new) | Alumnium | 98.5% |
| 2 | Surfer 2 | H Company | 97.1% |
| 3 | Magnitude | Magnitude | 93.9% |
| 4 | Surfer-H + Holo1 | H Company | 92.2% |
| 5 | Browserable | Browserable | 90.4% |
Research/search · Mixed scope · 82 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | GPT-5.5 Pro | OpenAI | 90.1% |
| 2 | GPT-5.4 Pro | OpenAI | 89.3% |
| 3 | MiroThinker-H1 | MiroMind | 88.2% |
| 4 | Claude Mythos Preview | Anthropic | 86.9% |
| 5 | Kimi K2.6 | Moonshot AI | 86.3% |
Browser agents · Agent scope · 49 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | WebTactix (DeepSeek v3.2) | WebTactix | 74.3% |
| 2 | OpAgent | CodeFuse AI | 71.6% |
| 3 | ColorBrowserAgent | MadeAgents | 71.2% |
| 4 | Claude Code + GBOX MCP | GBOX AI | 68.0% |
| 5 | DeepSky Agent | DeepSky | 66.9% |
Coding · Model scope · 16 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Mythos | Anthropic | 93.9% |
| 2 | Claude Opus 4.8 (new) | Anthropic | 88.6% |
| 3 | Claude Opus 4.7 | Anthropic | 87.6% |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% |
Coding · Model scope · 15 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | gpt-5 (high) | OpenAI | 88.0% |
| 2 | gpt-5 (medium) | OpenAI | 86.7% |
| 3 | o3-pro (high) | OpenAI | 84.9% |
| 4 | gemini-2.5-pro-preview-06-05 (32k think) | | 83.1% |
| 5 | o3 (high) | OpenAI | 81.3% |
Computer use · Agent scope · 17 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Opus 4.8 (new) | Anthropic | 83.4% |
| 2 | Mythos Preview (new) | Anthropic | 79.6% |
| 3 | OSAgent | TheAGI Company | 76.26% |
| 4 | GPT-5.4 (new) | OpenAI | 75.0% |
| 5 | Claude Opus 4.6 | Anthropic | 72.7% |
Model evals / reasoning · Agent scope · 21 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | OPS-Agentic-Search (new) | Alibaba Cloud | 92.36% |
| 1 | openJiuwen-deepagent (new) | Suzhou AI Lab / Shuqian Tech | 92.36% |
| 3 | openJiuwen-deepagent (GPT5/Gemini) | openJiuwen | 91.69% |
| 4 | Lemon Agent | Lenovo CTO Org | 91.36% |
| 5 | JoinAI V2.2 | JoinAI-CMCC | 90.7% |
Browser agents · Agent scope · 7 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 33.3% |
| 2 | GLM-5 (new) | Z.ai | 24.2% |
| 3 | Gemini 3 Flash | | 19.0% |
| 4 | Claude Haiku 4.5 | Anthropic | 18.3% |
| 5 | GPT-5.4 | OpenAI | 6.5% |
Browser agents · Agent scope · 11 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Mythos 5 (browser-use) (new) | Anthropic | 51.9% |
| 1 | Claude Opus 4.8 (browser-use) (new) | Anthropic | 51.9% |
| 3 | Claude Mythos Preview (browser-use) (new) | Anthropic | 47.4% |
| 4 | Claude Sonnet 4.6 (browser-use) (new) | Anthropic | 45.2% |
| 5 | Claude Opus 4.6 CUA (new) | Anthropic | 36.3% |
Browser agents · Agent scope · 22 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Browser Use Cloud (bu-max) (new) | Browser-Use | 97.0% |
| 2 | GPT-5.4 Native Computer Use | OpenAI | 93.0% |
| 3 | ABP + Claude Opus 4.6 | theredsix | 90.53% |
| 4 | TinyFish | TinyFish AI | 90.0% |
| 5 | UI-TARS-2 | ByteDance / VLM-Research | 88.2% |
Model evals / reasoning · Model scope · 12 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Step-3.5-Flash | StepFun | 88.2% |
| 2 | GLM-4.7 | Z.ai | 87.4% |
| 3 | MiMo-V2-Flash | Xiaomi | 80.3% |
| 4 | GLM-4.7-Flash | Z.ai | 79.5% |
| 5 | MiniMax M2 | MiniMax | 77.2% |
Model evals / reasoning · Model scope · 10 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | AgentRL w/ Qwen2.5-32B-Instruct | Tsinghua University | 70.4% |
| 2 | AgentRL w/ Qwen2.5-14B-Instruct | Tsinghua University | 67.7% |
| 3 | AgentRL w/ GLM-4-9B-0414 | Tsinghua University | 65.0% |
| 4 | AgentRL w/ Qwen2.5-7B-Instruct | Tsinghua University | 62.0% |
| 5 | AgentRL w/ Qwen2.5-3B-Instruct | Tsinghua University | 60.0% |
- Within-benchmark only. Scores measure different things on different benchmarks; don't compare a WebVoyager number to a SWE-bench Verified number.
- Source-linked. Each score links to the original report — paper, blog post, or official leaderboard. Treat anything self-reported with the appropriate dose of skepticism.
- Scope matters. Agent pages reflect full system setups (model + tools + policy). Model pages emphasize base model capability under a stated harness. Mixed pages combine both — read the per-row notes on the site before drawing conclusions.
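The within-benchmark rule can be sketched in a few lines of Python. This is an illustration only, not the site's actual ranking code: the `benchmark`, `system`, and `score` keys and the placeholder names are hypothetical, and the tie handling (tied scores share a rank, and the next rank is skipped) is inferred from the tied rows in the tables above rather than taken from CONTRIBUTING.md.

```python
from collections import defaultdict


def rank_within_benchmark(entries):
    """Rank entries separately per benchmark; never across benchmarks.

    `entries` is a list of dicts with hypothetical keys
    "benchmark", "system", and "score" (a percentage as a float).
    Returns {benchmark: [(rank, system, score), ...]}.
    """
    by_benchmark = defaultdict(list)
    for entry in entries:
        by_benchmark[entry["benchmark"]].append(entry)

    ranked = {}
    for benchmark, rows in by_benchmark.items():
        # Highest score first; sorted() is stable, so ties keep input order.
        rows = sorted(rows, key=lambda r: r["score"], reverse=True)
        out = []
        for i, row in enumerate(rows):
            if i > 0 and row["score"] == rows[i - 1]["score"]:
                rank = out[-1][0]  # tie: share the previous rank
            else:
                rank = i + 1       # otherwise rank is the 1-based position
            out.append((rank, row["system"], row["score"]))
        ranked[benchmark] = out
    return ranked


# Placeholder data: names are invented, scores mirror a tied table above.
example = [
    {"benchmark": "benchmark-a", "system": "system-x", "score": 92.36},
    {"benchmark": "benchmark-a", "system": "system-y", "score": 92.36},
    {"benchmark": "benchmark-a", "system": "system-z", "score": 91.69},
    {"benchmark": "benchmark-b", "system": "system-x", "score": 33.3},
]
result = rank_within_benchmark(example)
```

Here `system-x` holds rank 1 on both placeholder benchmarks, but its two scores sit in separate lists and are never compared with each other.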
We welcome new entries, corrections, methodology notes, and new benchmark pages. See
`CONTRIBUTING.md` for the evidence standard, JSON schema expectations, ranking
rules, and new-leaderboard checklist.
At minimum, every submitted score needs a public source URL that directly supports the benchmark,
system name, score, and setup notes. If you update leaderboard data, run `npm run update-readme`
before opening a pull request.
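As a rough pre-submission check, the minimum evidence requirement could be expressed like this. The field names (`benchmark`, `system`, `score`, `source_url`, `setup_notes`) are assumptions chosen for illustration; the real schema is defined in CONTRIBUTING.md and may differ. The check only confirms that the fields are present and that the URL looks like a public web link. Whether the source actually supports the claimed score still needs a human reviewer.

```python
from urllib.parse import urlparse

# Hypothetical field names; consult CONTRIBUTING.md for the real schema.
REQUIRED_FIELDS = ("benchmark", "system", "score", "source_url", "setup_notes")


def missing_evidence(entry):
    """Return a list of problems; an empty list means the basics are present."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        # Treat None and blank strings as missing; a numeric score of 0 is fine.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing {field}")

    url = entry.get("source_url")
    if isinstance(url, str) and url.strip():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("source_url is not an http(s) URL")
    return problems


complete = {
    "benchmark": "benchmark-a",          # placeholder name
    "system": "system-x",                # placeholder name
    "score": 92.36,
    "source_url": "https://example.com/report",
    "setup_notes": "placeholder harness description",
}
no_source = {k: v for k, v in complete.items() if k != "source_url"}
bad_url = dict(complete, source_url="ftp://example.com/report")
```

Calling `missing_evidence(complete)` returns an empty list, while the other two entries each report one problem.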
License: MIT
