steel-dev/leaderboard

AI Browser Agent Leaderboards

Open reference for evaluating AI browser agents, computer-use systems, and coding agents across the public benchmarks teams actually compare on.

Live site: leaderboard.steel.dev — full rankings, methodology notes, and per-result detail.

Maintained by Steel, the open-source browser API for AI agents.

Top results by benchmark

The tables below show the current top entries on each tracked benchmark. Each section links to the full leaderboard with sources, methodology, and additional context.

Browser agents · Agent scope · 19 entries tracked

Rank System Organization Score
1 Alumnium (new) Alumnium 98.5%
2 Surfer 2 H Company 97.1%
3 Magnitude Magnitude 93.9%
4 Surfer-H + Holo1 H Company 92.2%
5 Browserable Browserable 90.4%

See all 19 entries →


Research/search · Mixed scope · 82 entries tracked

Rank System Organization Score
1 GPT-5.5 Pro OpenAI 90.1%
2 GPT-5.4 Pro OpenAI 89.3%
3 MiroThinker-H1 MiroMind 88.2%
4 Claude Mythos Preview Anthropic 86.9%
5 Kimi K2.6 Moonshot AI 86.3%

See all 82 entries →


Browser agents · Agent scope · 49 entries tracked

Rank System Organization Score
1 WebTactix (DeepSeek v3.2) WebTactix 74.3%
2 OpAgent CodeFuse AI 71.6%
3 ColorBrowserAgent MadeAgents 71.2%
4 Claude Code + GBOX MCP GBOX AI 68.0%
5 DeepSky Agent DeepSky 66.9%

See all 49 entries →


Coding · Model scope · 16 entries tracked

Rank System Organization Score
1 Claude Mythos Anthropic 93.9%
2 Claude Opus 4.8 (new) Anthropic 88.6%
3 Claude Opus 4.7 Anthropic 87.6%
4 Claude Opus 4.5 Anthropic 80.9%
5 Claude Opus 4.6 Anthropic 80.8%

See all 16 entries →


Coding · Model scope · 15 entries tracked

Rank System Organization Score
1 gpt-5 (high) OpenAI 88.0%
2 gpt-5 (medium) OpenAI 86.7%
3 o3-pro (high) OpenAI 84.9%
4 gemini-2.5-pro-preview-06-05 (32k think) Google 83.1%
5 o3 (high) OpenAI 81.3%

See all 15 entries →


Computer use · Agent scope · 17 entries tracked

Rank System Organization Score
1 Claude Opus 4.8 (new) Anthropic 83.4%
2 Mythos Preview (new) Anthropic 79.6%
3 OSAgent TheAGI Company 76.26%
4 GPT-5.4 (new) OpenAI 75.0%
5 Claude Opus 4.6 Anthropic 72.7%

See all 17 entries →


Model evals / reasoning · Agent scope · 21 entries tracked

Rank System Organization Score
1 OPS-Agentic-Search (new) Alibaba Cloud 92.36%
1 openJiuwen-deepagent (new) Suzhou AI Lab / Shuqian Tech 92.36%
3 openJiuwen-deepagent (GPT5/Gemini) openJiuwen 91.69%
4 Lemon Agent Lenovo CTO Org 91.36%
5 JoinAI V2.2 JoinAI-CMCC 90.7%

See all 21 entries →


Browser agents · Agent scope · 7 entries tracked

Rank System Organization Score
1 Claude Sonnet 4.6 Anthropic 33.3%
2 GLM-5 (new) Z.ai 24.2%
3 Gemini 3 Flash Google 19.0%
4 Claude Haiku 4.5 Anthropic 18.3%
5 GPT-5.4 OpenAI 6.5%

See all 7 entries →


Browser agents · Agent scope · 11 entries tracked

Rank System Organization Score
1 Claude Mythos 5 (browser-use) (new) Anthropic 51.9%
1 Claude Opus 4.8 (browser-use) (new) Anthropic 51.9%
3 Claude Mythos Preview (browser-use) (new) Anthropic 47.4%
4 Claude Sonnet 4.6 (browser-use) (new) Anthropic 45.2%
5 Claude Opus 4.6 CUA (new) Anthropic 36.3%

See all 11 entries →


Browser agents · Agent scope · 22 entries tracked

Rank System Organization Score
1 Browser Use Cloud (bu-max) (new) Browser-Use 97.0%
2 GPT-5.4 Native Computer Use OpenAI 93.0%
3 ABP + Claude Opus 4.6 theredsix 90.53%
4 TinyFish TinyFish AI 90.0%
5 UI-TARS-2 ByteDance / VLM-Research 88.2%

See all 22 entries →


Model evals / reasoning · Model scope · 12 entries tracked

Rank System Organization Score
1 Step-3.5-Flash StepFun 88.2%
2 GLM-4.7 Z.ai 87.4%
3 MiMo-V2-Flash Xiaomi 80.3%
4 GLM-4.7-Flash Z.ai 79.5%
5 MiniMax M2 MiniMax 77.2%

See all 12 entries →


Model evals / reasoning · Model scope · 10 entries tracked

Rank System Organization Score
1 AgentRL w/ Qwen2.5-32B-Instruct Tsinghua University 70.4%
2 AgentRL w/ Qwen2.5-14B-Instruct Tsinghua University 67.7%
3 AgentRL w/ GLM-4-9B-0414 Tsinghua University 65.0%
4 AgentRL w/ Qwen2.5-7B-Instruct Tsinghua University 62.0%
5 AgentRL w/ Qwen2.5-3B-Instruct Tsinghua University 60.0%

See all 10 entries →

How to read these tables

  • Within-benchmark only. Scores measure different things on different benchmarks; don't compare a WebVoyager number to a SWE-bench Verified number.
  • Source-linked. Each score links to the original report — paper, blog post, or official leaderboard. Treat anything self-reported with the appropriate dose of skepticism.
  • Scope matters. Agent pages reflect full system setups (model + tools + policy). Model pages emphasize base model capability under a stated harness. Mixed pages combine both — read the per-row notes on the site before drawing conclusions.
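
Some tables above show two entries sharing rank 1 followed by rank 3. That pattern matches standard competition ranking, where tied scores share a rank and the next rank is skipped. The sketch below illustrates that pattern only; the authoritative ranking rules are in CONTRIBUTING.md, and the `Entry` shape and `rankEntries` name are hypothetical, not taken from this repository.

```typescript
// Hypothetical entry shape for illustration; not the repository's schema.
interface Entry {
  system: string;
  score: number; // percentage as shown in the tables, e.g. 92.36
}

interface RankedEntry extends Entry {
  rank: number;
}

// Standard competition ranking ("1, 1, 3"): sort by score descending,
// give tied scores the same rank, and skip ranks after a tie.
function rankEntries(entries: Entry[]): RankedEntry[] {
  const sorted = [...entries].sort((a, b) => b.score - a.score);
  const ranked: RankedEntry[] = [];
  sorted.forEach((entry, i) => {
    const prev = ranked[i - 1];
    const tied = prev !== undefined && prev.score === entry.score;
    ranked.push({ ...entry, rank: tied ? prev.rank : i + 1 });
  });
  return ranked;
}

// Scores copied from one of the tables above.
const exampleRanks = rankEntries([
  { system: "Lemon Agent", score: 91.36 },
  { system: "OPS-Agentic-Search", score: 92.36 },
  { system: "openJiuwen-deepagent", score: 92.36 },
  { system: "openJiuwen-deepagent (GPT5/Gemini)", score: 91.69 },
]).map((e) => `${e.rank} ${e.system}`);

console.log(exampleRanks.join("\n"));
```

Ranks are computed per benchmark, so a rank on one table says nothing about standing on another.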

Contributing

We welcome new entries, corrections, methodology notes, and new benchmark pages. See CONTRIBUTING.md for the evidence standard, JSON schema expectations, ranking rules, and new-leaderboard checklist.

At minimum, every submitted score needs a public source URL that directly supports the benchmark, system name, score, and setup notes. If you update leaderboard data, run npm run update-readme before opening a pull request.
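
As a rough illustration of the minimum evidence requirement, a pre-submission check could look like the sketch below. The field names (`benchmark`, `system`, `score`, `sourceUrl`, `setupNotes`) and the `missingEvidence` helper are assumptions made for this example; the actual JSON schema is described in CONTRIBUTING.md and may differ.

```typescript
// Hypothetical submission shape; see CONTRIBUTING.md for the real schema.
interface Submission {
  benchmark: string;
  system: string;
  score: number;
  sourceUrl: string;
  setupNotes: string;
}

// Returns the names of fields that are missing or obviously malformed.
// An empty array means the minimum evidence appears to be present.
function missingEvidence(s: Partial<Submission>): string[] {
  const problems: string[] = [];
  if (!s.benchmark?.trim()) problems.push("benchmark");
  if (!s.system?.trim()) problems.push("system");
  if (typeof s.score !== "number" || !isFinite(s.score)) problems.push("score");
  // Loose check only: the source must at least look like a public http(s) URL.
  if (!s.sourceUrl || !/^https?:\/\/\S+$/i.test(s.sourceUrl)) problems.push("sourceUrl");
  if (!s.setupNotes?.trim()) problems.push("setupNotes");
  return problems;
}

// Placeholder values; example.com stands in for a real public source.
console.log(
  missingEvidence({
    benchmark: "ExampleBench",
    system: "Example Agent",
    score: 90.1,
    setupNotes: "Single run, default harness.",
  })
);
```

A check like this only confirms that a URL is present. Whether the linked source actually supports the claimed score still needs human review.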

License

MIT