Benchmarking for Evaluating Web Experience (Core Web Vitals, etc)
Large language models (LLMs) have shown significant progress on software engineering tasks, leading to the development of coding agents. However, current benchmarks like SWE-Bench and Polyglot are limited by their focus on small bug fixes (average 12 lines of code) and lack representation of web development—which constitutes 50% of software jobs and generates 40% of industry revenue.
Web performance optimization presents unique challenges compared to traditional bug-fixing: there are no predefined "correct" answers, solutions must address site-specific bottlenecks, and success is measured by continuous improvement in metrics like Largest Contentful Paint (LCP), Cumulative Layout Shift (CLS), and Interaction to Next Paint (INP).
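For orientation, the three metrics named above have publicly documented "good" and "poor" thresholds. The sketch below only encodes those public thresholds; the `rate` helper and the `THRESHOLDS` table are illustrative names, and nothing here is taken from how the benchmark itself scores an agent.

```python
# Public Core Web Vitals thresholds as (good_upper_bound, poor_lower_bound).
# Illustrative only: the benchmark's own scoring is not described in this README.
THRESHOLDS = {
    "LCP": (2500.0, 4000.0),  # Largest Contentful Paint, milliseconds
    "INP": (200.0, 500.0),    # Interaction to Next Paint, milliseconds
    "CLS": (0.1, 0.25),       # Cumulative Layout Shift, unitless score
}


def rate(metric: str, value: float) -> str:
    """Classify a single measurement as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"
```

Because lower is better for all three metrics, a single comparison ladder covers every case.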
CWV-Bench bridges this gap by evaluating coding agents on their ability to improve real website performance. Unlike traditional benchmarks that test against engineered test cases, CWV-Bench assesses whether agents can diagnose complex rendering pipeline bottlenecks, implement optimizations without introducing regressions, and reason about performance trade-offs in real-world scenarios.
Tested on Ubuntu 24.04 LTS and macOS (Apple Silicon) with Python 3.12.
git clone https://github.com/behavior-in-the-wild/web-experience-benchmark.git
cd web-experience-benchmark
# Use Python 3.12 — 3.14+ is too new for some dependencies
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install tldextract pelican # extra hosting deps not in requirements.txt
playwright install chromium
# bore tunnel (for PSI measurements)
cargo install bore-cli
Each framework requires its own runtime. Install only what you need:
Node.js (Express, React, Next.js, Vue, Hexo)
# macOS
brew install node
# Ubuntu
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs
Hugo + Go (host_hugo.sh)
# macOS
brew install hugo go
# Ubuntu — use official Hugo binary (apt version is outdated)
sudo apt install -y golang-go
wget https://github.com/gohugoio/hugo/releases/latest/download/hugo_extended_linux_amd64.tar.gz
tar -xzf hugo_extended_linux_amd64.tar.gz && sudo mv hugo /usr/local/bin/
Ruby + Jekyll (host_jekyll.sh)
# macOS (system Ruby is read-only; use Homebrew)
brew install ruby
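# Call Homebrew's ruby by full path: a bare `ruby` here would still resolve to the system Ruby
export PATH="/opt/homebrew/opt/ruby/bin:/opt/homebrew/lib/ruby/gems/$(/opt/homebrew/opt/ruby/bin/ruby -e 'puts RUBY_VERSION.match(/^\d+\.\d+/)[0]').0/bin:$PATH"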
gem install jekyll bundler
# Ubuntu
sudo apt install -y ruby-full build-essential
gem install jekyll bundler
See harness/host_files/README.md for the full hosting setup reference, including a verification checklist.
Note: vllm in requirements.txt is Linux/GPU only and will be skipped on macOS.
curl -fsSL https://opencode.ai/install | bash # OpenCode
npm install -g @openai/codex # Codex
pip install aider-chat # Aider
The benchmark harness evaluates coding agents on CWV optimization tasks. See harness/README.md for full documentation.
# Run all agents on first 10 repos, 4 parallel jobs
./harness/evaluate.sh --parallel 4 --limit 10
# Patch-only mode (skip CWV measurement — fastest for agent testing)
SKIP_CWV_MEASURE=1 ./harness/evaluate.sh --parallel 8 --limit 50
Create harness/.env with your credentials:
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_DEPLOYMENT_NAME=gpt-4.1
GOOGLE_PAGESPEED_INSIGHTS_API_KEY=... # for PSI measurements
Agent templates: template_opencode_os.sh (open-source models), template_opencode.sh, template_claudecode.sh, template_codex.sh, template_aider.sh, template_gemini.sh, template_null.sh
Each job writes into its own subdirectory:
harness/out/<YYYYMMDD_HHMMSS>/results/
└── {ID}_{AGENT}/
    ├── {ID}_{AGENT}.patch       # code patch written by agent
    ├── agent.log                # agent stdout/stderr
    ├── usage.json               # token counts, cost, wall time
    ├── host.log                 # HTTP server logs
    ├── screenshot.png           # screenshot of patched site
    ├── visual.json              # AI visual regression result
    ├── mobile.json              # CWV metrics (mobile)
    ├── desktop.json             # CWV metrics (desktop)
    ├── init_psi_mobile.json     # PageSpeed before patch (if enabled)
    ├── init_psi_desktop.json
    ├── final_psi_mobile.json    # PageSpeed after patch (if enabled)
    └── final_psi_desktop.json
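The layout above lends itself to a small post-processing step. As a minimal sketch, assuming job directories are named `{ID}_{AGENT}` with an integer ID (so the first underscore separates the two parts), the hypothetical helper below gathers `mobile.json` and `desktop.json` from every job. The internal schema of those files is not documented here, so they are loaded as opaque JSON.

```python
import json
from pathlib import Path

# File names taken from the results layout; their JSON schema is not assumed.
METRIC_FILES = ("mobile.json", "desktop.json")


def collect_results(results_dir):
    """Map each job directory name to its ID, agent, and any CWV JSON found."""
    collected = {}
    for job_dir in sorted(Path(results_dir).iterdir()):
        if not job_dir.is_dir():
            continue
        # Assumption: names follow {ID}_{AGENT}, so split on the first underscore.
        job_id, _, agent = job_dir.name.partition("_")
        metrics = {}
        for name in METRIC_FILES:
            path = job_dir / name
            if path.exists():
                metrics[path.stem] = json.loads(path.read_text(encoding="utf-8"))
        collected[job_dir.name] = {"id": job_id, "agent": agent, "metrics": metrics}
    return collected
```

Calling `collect_results("harness/out/<YYYYMMDD_HHMMSS>/results")` with a real run directory would return one entry per job. Jobs run with `SKIP_CWV_MEASURE=1` would presumably have an empty `metrics` dictionary.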
Run the full benchmark suite against self-hosted open-source models via vLLM. Models are served one at a time with automatic GPU management.
# Run all 6 compatible models on 100 samples
bash harness/opensource_models/run_os_models.sh \
--csv harness/SAMPLE/input_100.csv \
2>&1 | tee harness/out/run_full.log
# Smoke test — 1 sample, all models
bash harness/opensource_models/run_os_models.sh \
--csv harness/SAMPLE/input_1_test.csv \
--parallel 1
# Single model
bash harness/opensource_models/run_os_models.sh gemma \
--csv harness/SAMPLE/input_100.csv
Supported models (A100/SM80 compatible): gemma-4-31b-it, glm-4.7-flash, qwen3-coder-next, gpt-oss-120b, devstral-2-123b, minimax-m2.7
Token accounting in usage/summary.json:
- prompt_tokens: input tokens
- completion_tokens: non-reasoning output tokens
- reasoning_tokens: thinking tokens (counted separately)
- total_tokens: sum of all three
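The accounting rule above can be checked mechanically. The sketch below assumes a usage record is a flat mapping with the four field names listed; the actual layout of usage/summary.json is not shown in this README, and both helper names are hypothetical.

```python
# Field names come from the token accounting list; the flat-dict shape is an assumption.
COMPONENT_FIELDS = ("prompt_tokens", "completion_tokens", "reasoning_tokens")


def total_tokens(usage: dict) -> int:
    """Sum input, non-reasoning output, and reasoning tokens."""
    return sum(int(usage.get(field, 0)) for field in COMPONENT_FIELDS)


def is_consistent(usage: dict) -> bool:
    """Return True when the recorded total_tokens equals the sum of its parts."""
    return usage.get("total_tokens") == total_tokens(usage)
```

Since reasoning tokens are reported apart from completion tokens, a mismatch usually suggests they were double counted or dropped.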
Automatic optimization of the agent's Phase 1 (planning) and Phase 2 (execution) prompts using Bayesian search over LLM-generated instruction candidates, scored by measured CWV improvements. See harness/prompt_optimisation/README.md.
# From repo root
python -m harness.prompt_optimisation.cli select-training-set
python -m harness.prompt_optimisation.cli optimize --algo gepa
python -m harness.prompt_optimisation.cli show --run 20260521_140000
web-experience-benchmark/
├── harness/ # Benchmark harness
│ ├── evaluate.sh # Main benchmark runner
│ ├── agents/ # Agent templates
│ ├── host_files/ # Framework hosting scripts
│ ├── scripts/ # Utility scripts (rerun CSV, batch apply, etc.)
│ ├── opensource_models/ # vLLM serving + multi-model runner
│ ├── prompt_optimisation/ # MIPRO-style prompt optimization system
│ └── SAMPLE/ # Input CSVs and repo snapshots
├── scripts/ # Standalone analysis scripts
├── requirements.txt
└── README.md
harness/SAMPLE/input.csv — 503 web repositories pinned to specific commits.
| Column | Description |
|---|---|
| ID | Unique integer identifier |
| REPO_ID | owner/repo on GitHub |
| FRAMEWORK | e.g. Jekyll, Express, Static HTML |
| COMMIT_ID | Pinned commit SHA |
| HOST_FILE_PATH | Path to framework hosting script |
| CWV_MOBILE / CWV_DESKTOP | Baseline CWV JSON |
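To get a feel for the dataset, the columns above can be read with Python's standard `csv` module. This is a minimal sketch that assumes the file has a header row using exactly the column names listed; `count_frameworks` is an illustrative helper, not part of the harness.

```python
import csv
from collections import Counter


def count_frameworks(csv_path):
    """Count how many repositories use each FRAMEWORK value."""
    # newline="" lets the csv module handle quoted fields with embedded
    # newlines, which matters if the baseline CWV JSON cells span lines.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row["FRAMEWORK"] for row in csv.DictReader(handle))
```

For example, `count_frameworks("harness/SAMPLE/input.csv").most_common(5)` would list the five most frequent frameworks in the input set.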
@software{web_experience_benchmark_2025,
title={{Towards Benchmarking and Optimizing Web Experiences}},
author={{Behavior in the Wild}},
year={2025},
url={https://github.com/behavior-in-the-wild/web-experience-benchmark}
}