AI Browser Agent Leaderboards

Open reference for evaluating AI browser agents, computer-use systems, and coding agents across the public benchmarks teams actually compare on.

Live site: leaderboard.steel.dev — full rankings, methodology notes, and per-result detail.

Maintained by Steel, the open-source browser API for AI agents.

Top results by benchmark

The tables below show the current top entries on each tracked benchmark. Each section links to the full leaderboard with sources, methodology, and additional context.

WebVoyager

Browser agents · Agent scope · 19 entries tracked

Rank	System	Organization	Score
1	Alumnium (new)	Alumnium	98.5%
2	Surfer 2	H Company	97.1%
3	Magnitude	Magnitude	93.9%
4	Surfer-H + Holo1	H Company	92.2%
5	Browserable	Browserable	90.4%

See all 19 entries →

BrowseComp

Research/search · Mixed scope · 82 entries tracked

Rank	System	Organization	Score
1	GPT-5.5 Pro	OpenAI	90.1%
2	GPT-5.4 Pro	OpenAI	89.3%
3	MiroThinker-H1	MiroMind	88.2%
4	Claude Mythos Preview	Anthropic	86.9%
5	Kimi K2.6	Moonshot AI	86.3%

See all 82 entries →

WebArena

Browser agents · Agent scope · 49 entries tracked

Rank	System	Organization	Score
1	WebTactix (DeepSeek v3.2)	WebTactix	74.3%
2	OpAgent	CodeFuse AI	71.6%
3	ColorBrowserAgent	MadeAgents	71.2%
4	Claude Code + GBOX MCP	GBOX AI	68.0%
5	DeepSky Agent	DeepSky	66.9%

See all 49 entries →

SWE-bench Verified

Coding · Model scope · 16 entries tracked

Rank	System	Organization	Score
1	Claude Mythos	Anthropic	93.9%
2	Claude Opus 4.8 (new)	Anthropic	88.6%
3	Claude Opus 4.7	Anthropic	87.6%
4	Claude Opus 4.5	Anthropic	80.9%
5	Claude Opus 4.6	Anthropic	80.8%

See all 16 entries →

Aider

Coding · Model scope · 15 entries tracked

Rank	System	Organization	Score
1	gpt-5 (high)	OpenAI	88.0%
2	gpt-5 (medium)	OpenAI	86.7%
3	o3-pro (high)	OpenAI	84.9%
4	gemini-2.5-pro-preview-06-05 (32k think)	Google	83.1%
5	o3 (high)	OpenAI	81.3%

See all 15 entries →

OSWorld

Computer use · Agent scope · 17 entries tracked

Rank	System	Organization	Score
1	Claude Opus 4.8 (new)	Anthropic	83.4%
2	Mythos Preview (new)	Anthropic	79.6%
3	OSAgent	TheAGI Company	76.26%
4	GPT-5.4 (new)	OpenAI	75.0%
5	Claude Opus 4.6	Anthropic	72.7%

See all 17 entries →

GAIA

Model evals / reasoning · Agent scope · 21 entries tracked

Rank	System	Organization	Score
1	OPS-Agentic-Search (new)	Alibaba Cloud	92.36%
1	openJiuwen-deepagent (new)	Suzhou AI Lab / Shuqian Tech	92.36%
3	openJiuwen-deepagent (GPT5/Gemini)	openJiuwen	91.69%
4	Lemon Agent	Lenovo CTO Org	91.36%
5	JoinAI V2.2	JoinAI-CMCC	90.7%

See all 21 entries →

ClawBench

Browser agents · Agent scope · 7 entries tracked

Rank	System	Organization	Score
1	Claude Sonnet 4.6	Anthropic	33.3%
2	GLM-5 (new)	Z.ai	24.2%
3	Gemini 3 Flash	Google	19.0%
4	Claude Haiku 4.5	Anthropic	18.3%
5	GPT-5.4	OpenAI	6.5%

See all 7 entries →

HealthAdminBench

Browser agents · Agent scope · 11 entries tracked

Rank	System	Organization	Score
1	Claude Mythos 5 (browser-use) (new)	Anthropic	51.9%
1	Claude Opus 4.8 (browser-use) (new)	Anthropic	51.9%
3	Claude Mythos Preview (browser-use) (new)	Anthropic	47.4%
4	Claude Sonnet 4.6 (browser-use) (new)	Anthropic	45.2%
5	Claude Opus 4.6 CUA (new)	Anthropic	36.3%

See all 11 entries →

Online-Mind2Web

Browser agents · Agent scope · 22 entries tracked

Rank	System	Organization	Score
1	Browser Use Cloud (bu-max) (new)	Browser-Use	97.0%
2	GPT-5.4 Native Computer Use	OpenAI	93.0%
3	ABP + Claude Opus 4.6	theredsix	90.53%
4	TinyFish	TinyFish AI	90.0%
5	UI-TARS-2	ByteDance / VLM-Research	88.2%

See all 22 entries →

τ-bench

Model evals / reasoning · Model scope · 12 entries tracked

Rank	System	Organization	Score
1	Step-3.5-Flash	StepFun	88.2%
2	GLM-4.7	Z.ai	87.4%
3	MiMo-V2-Flash	Xiaomi	80.3%
4	GLM-4.7-Flash	Z.ai	79.5%
5	MiniMax M2	MiniMax	77.2%

See all 12 entries →

AgentBench

Model evals / reasoning · Model scope · 10 entries tracked

Rank	System	Organization	Score
1	AgentRL w/ Qwen2.5-32B-Instruct	Tsinghua University	70.4%
2	AgentRL w/ Qwen2.5-14B-Instruct	Tsinghua University	67.7%
3	AgentRL w/ GLM-4-9B-0414	Tsinghua University	65.0%
4	AgentRL w/ Qwen2.5-7B-Instruct	Tsinghua University	62.0%
5	AgentRL w/ Qwen2.5-3B-Instruct	Tsinghua University	60.0%

See all 10 entries →

How to read these tables

Within-benchmark only. Scores measure different things on different benchmarks; don't compare a WebVoyager number to a SWE-bench Verified number.
Source-linked. Each score links to the original report — paper, blog post, or official leaderboard. Treat anything self-reported with the appropriate dose of skepticism.
Scope matters. Agent pages reflect full system setups (model + tools + policy). Model pages emphasize base model capability under a stated harness. Mixed pages combine both — read the per-row notes on the site before drawing conclusions.
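
The tables above rank entries inside a single benchmark, and tied scores share a rank (the two 92.36% GAIA rows are both rank 1, and the next row is rank 3). The following is a minimal TypeScript sketch of that convention. The `Entry` shape and the `rankWithinBenchmark` helper are hypothetical names used for illustration only; the authoritative ranking rules are the ones described in CONTRIBUTING.md and may differ.

```typescript
// Hypothetical entry shape for illustration; not the repository's actual schema.
interface Entry {
  benchmark: string;
  system: string;
  score: number; // percentage, e.g. 92.36
}

interface RankedEntry extends Entry {
  rank: number;
}

// Rank the entries of ONE benchmark, highest score first.
// Tied scores share a rank and the next rank skips ahead (1, 1, 3).
function rankWithinBenchmark(entries: Entry[], benchmark: string): RankedEntry[] {
  // filter() returns a new array, so sorting does not mutate the input.
  const sorted = entries
    .filter((e) => e.benchmark === benchmark)
    .sort((a, b) => b.score - a.score);

  const ranked: RankedEntry[] = [];
  sorted.forEach((entry, index) => {
    const previous = ranked[index - 1];
    const rank =
      previous !== undefined && previous.score === entry.score
        ? previous.rank
        : index + 1;
    ranked.push({ ...entry, rank });
  });
  return ranked;
}

// Three GAIA rows from the table above, plus one row from another benchmark
// that must not be mixed into the GAIA ranking.
const sample: Entry[] = [
  { benchmark: "GAIA", system: "Lemon Agent", score: 91.36 },
  { benchmark: "GAIA", system: "OPS-Agentic-Search", score: 92.36 },
  { benchmark: "WebVoyager", system: "Alumnium", score: 98.5 },
  { benchmark: "GAIA", system: "openJiuwen-deepagent", score: 92.36 },
];

const gaiaRanks = rankWithinBenchmark(sample, "GAIA").map((e) => e.rank);
console.log(gaiaRanks); // ranks 1, 1, 3
```

Filtering by benchmark before sorting is the point of the sketch: the WebVoyager row never competes with the GAIA rows, even though its score is numerically higher.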

Contributing

We welcome new entries, corrections, methodology notes, and new benchmark pages. See CONTRIBUTING.md for the evidence standard, JSON schema expectations, ranking rules, and new-leaderboard checklist.

At minimum, every submitted score needs a public source URL that directly supports the benchmark, system name, score, and setup notes. If you update leaderboard data, run the command npm run update-readme before opening a pull request.
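
As a rough illustration of that minimum, the TypeScript sketch below checks that a submission carries a benchmark, system name, score, setup notes, and an http(s) source URL. The `Submission` field names, the `checkSubmission` helper, the 0 to 100 percentage range, and the example URL are all assumptions made for this sketch; the real JSON schema and evidence standard are defined in CONTRIBUTING.md. A check like this is only syntactic: it cannot tell whether the linked source actually supports the claimed score.

```typescript
// Hypothetical submission shape; field names are assumptions,
// not the repository's actual JSON schema.
interface Submission {
  benchmark: string;
  system: string;
  score: number; // assumed to be a percentage
  setupNotes: string;
  sourceUrl: string;
}

// Return a list of problems; an empty list means the minimum checks pass.
function checkSubmission(s: Submission): string[] {
  const problems: string[] = [];
  if (s.benchmark.trim() === "") problems.push("benchmark is required");
  if (s.system.trim() === "") problems.push("system name is required");
  if (!Number.isFinite(s.score) || s.score < 0 || s.score > 100) {
    problems.push("score must be a percentage between 0 and 100");
  }
  if (s.setupNotes.trim() === "") problems.push("setup notes are required");
  // Shallow format check only: the URL must start with http:// or https://.
  if (!/^https?:\/\/\S+$/i.test(s.sourceUrl.trim())) {
    problems.push("source URL must be a public http(s) link");
  }
  return problems;
}

// Placeholder values for illustration; example.com is not a real source.
const incomplete: Submission = {
  benchmark: "WebArena",
  system: "Example Agent",
  score: 71.2,
  setupNotes: "",
  sourceUrl: "see paper",
};

console.log(checkSubmission(incomplete));
```

Returning every problem at once, rather than failing on the first, lets a contributor fix a pull request in one pass.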

License

MIT
