This project generates a dataset of Google Form interactions using direct Playwright Python for browser control.
Tool traces are validated and normalized through an MCP server by default, then written to tool_trace.jsonl.
pip install -r requirements.txt
python -m playwright install chromium
# Linux only (if needed):
python -m playwright install --with-deps chromium

HPC-friendly setup wrapper (no sudo, project-local venv/browser cache):
bash scripts/hpc_setup.sh

Runtime readiness check:
python3 scripts/verify_runtime_setup.py
python3 scripts/preflight_baseline_eval.py

Headless baseline wrapper:
bash scripts/run_baselines_headless.sh --smoke-test-all-forms --overwrite-existing

For cluster workflow (canonical directory policy, model install, Slurm/headless usage), see:
README_HPC.md
Canonical thesis-primary runbook now separates two benchmark families:
- Family A: direct Playwright MCP tool use for Qwen text_llm and vlm
- Family B: native computer-use for OpenCUA-32B
The combined orchestrator runs both families sequentially:
CONFIG_PATH=configs/baselines/track_baseline_models.json \
DIRECT_PROVIDER=opencua_local \
bash scripts/run_track_baseline_matrix.sh

Reference efficiency is compared against the matching scripted Playwright run for the same form_id and run_XXXX, using tool_trace.jsonl event counts and structured run annotations rather than video parsing.
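As a rough illustration of an event-count comparison, the sketch below counts events in two tool_trace.jsonl files and divides one by the other. The function names and the ratio itself are assumptions made for illustration; the project's actual efficiency metric may be defined differently and also draws on run annotations, which this sketch ignores.

```python
from pathlib import Path


def count_trace_events(trace_path: Path) -> int:
    """Count non-empty lines (one event per line) in a tool_trace.jsonl file."""
    with trace_path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def event_ratio(agent_trace: Path, reference_trace: Path) -> float:
    """Hypothetical metric: agent event count divided by reference event count."""
    reference_events = count_trace_events(reference_trace)
    if reference_events == 0:
        raise ValueError(f"reference trace has no events: {reference_trace}")
    return count_trace_events(agent_trace) / reference_events
```

Under this assumed definition, a ratio above 1.0 would mean the agent used more tool events than the scripted reference for the same form_id and run_XXXX.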
If PLAYWRIGHT_SKIP_FFMPEG_INSTALL is set, video recording may fail.
mcp_server interaction mode also requires node + npx to run the official Playwright MCP server.
The default MCP command now forces --browser chromium to avoid system Chrome dependency.
Default MCP launch no longer enables --caps=vision, which improves compatibility in WSL/restricted environments.
When Python Playwright is installed, the runner automatically passes its Chromium executable path to MCP (--executable-path) to avoid a mismatch with the Node-installed browser.
In WSL, the runner also adds --no-sandbox to the MCP browser launch.
If @playwright/mcp is not already cached, the first mcp_server run may need network access for npx package resolution.
The runner now auto-installs Node Playwright chromium for mcp_server mode unless --no-mcp-browser-install is provided.
The runner also performs an MCP package preflight (npx @playwright/mcp --version) before the run starts in mcp_server mode.
For more stable startup in WSL/offline scenarios, install MCP globally once: npm i -g @playwright/mcp.
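The launch behaviour described in the notes above can be pictured as a small command builder. This is a sketch only: build_browser_mcp_cmd and looks_like_wsl are hypothetical names, the WSL check is a common heuristic and not necessarily the one the runner uses, and the runner's real default command may contain additional arguments.

```python
import platform
from typing import List, Optional


def looks_like_wsl() -> bool:
    """Heuristic (assumption): WSL kernels report 'microsoft' in the release string."""
    return "microsoft" in platform.uname().release.lower()


def build_browser_mcp_cmd(
    chromium_path: Optional[str] = None,
    in_wsl: bool = False,
) -> List[str]:
    """Assemble an npx command line for the Playwright MCP server."""
    # Force Chromium so the launch does not depend on a system Chrome install.
    cmd = ["npx", "@playwright/mcp", "--browser", "chromium"]
    if chromium_path:
        # Reuse the Chromium binary that Python Playwright installed.
        cmd += ["--executable-path", chromium_path]
    if in_wsl:
        cmd.append("--no-sandbox")
    return cmd


if __name__ == "__main__":
    print(build_browser_mcp_cmd(in_wsl=looks_like_wsl()))
```

Note that --caps=vision is deliberately absent from the assembled command, matching the default described above.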
python3 src/engine/runner.py \
--form-id conf_interest \
--dataset-root data/forms \
--num-runs 1

Answers are matched automatically from:
data/answers/<form_id>/runs.json
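The lookup can be sketched as follows, using the defaults and fallback filenames listed under --answers-root and --answers-file in the flag list. The function name resolve_answers_file and the order in which the fallbacks are tried are assumptions; the runner's own implementation may differ.

```python
from pathlib import Path
from typing import Sequence

# Fallback filenames taken from the --answers-file description.
FALLBACK_NAMES = ("runs.jsonl", "runs.ndjson")


def resolve_answers_file(
    form_id: str,
    answers_root: str = "data/answers",
    primary: str = "runs.json",
    fallbacks: Sequence[str] = FALLBACK_NAMES,
) -> Path:
    """Return the first existing answers file for a form, or raise."""
    form_dir = Path(answers_root) / form_id
    candidates = [primary] + [name for name in fallbacks if name != primary]
    for name in candidates:
        path = form_dir / name
        if path.is_file():
            return path
    tried = ", ".join(candidates)
    raise FileNotFoundError(f"no answers file in {form_dir} (tried: {tried})")
```

For example, resolve_answers_file("conf_interest") would return data/answers/conf_interest/runs.json when that file exists.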
Full dataset run (all forms under src/forms, each auto-matched to data/answers/<form_id>/runs.json):
python3 src/engine/runner.py \
--all-forms \
--dataset-root data/forms \
--skip-existing-video

Smoke test across all forms (runs exactly one answer instance per form, prints pass/fail summary):
python3 src/engine/runner.py \
--smoke-test-all-forms \
--dataset-root data/forms \
--overwrite-existing

Smoke test using full official Playwright MCP browser execution:
python3 src/engine/runner.py \
--smoke-test-all-forms \
--dataset-root data/forms \
--overwrite-existing \
--interaction-mode mcp_server

By default the browser runs headed (visible). Use --headless to disable UI.
Mouse overlay is enabled by default for video clarity. Use --no-mouse-overlay to disable it.
Screenshots are optional and disabled by default. Use --screenshots to save observations/*.png.
Trace mode defaults to mcp and auto-starts the bundled server src/engine/mcp_trace_server.py.
Interaction mode defaults to local.
Use --interaction-mode mcp_server to execute browser interaction via the official Playwright MCP server.
The engine accepts two formats:
- JSON: either a single run (a list of answer entries) or a multi-run object with runs.
- JSONL: one run per line, each line is a JSON object describing a run.
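A minimal loader for the two formats might look like the sketch below. It assumes the format is chosen by file extension and that the runs value of a multi-run object is a list; the name load_runs is hypothetical, and the engine's own parsing may be stricter.

```python
import json
from pathlib import Path
from typing import Any, List


def load_runs(path: Path) -> List[Any]:
    """Return a list of runs from a JSON or JSONL answers file."""
    text = path.read_text(encoding="utf-8")
    if path.suffix in {".jsonl", ".ndjson"}:
        # JSONL: one run object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    if isinstance(data, dict) and "runs" in data:
        # Multi-run object (assumed to hold a list under "runs").
        return list(data["runs"])
    if isinstance(data, list):
        # Single run: the top-level list is the run's answer entries.
        return [data]
    raise ValueError(f"unrecognized answers layout in {path}")
```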
Each answer entry must contain:
- label (question label text to match)
- widget_type (short_text, paragraph_text, single_choice, multi_choice, date, time)
- value (string or list, depending on widget type)
Optional run metadata can be included and is carried into annotations.json.
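The required entry fields can be checked before a run with something like the sketch below. Which widget types take a list is not stated here, so the sketch assumes only multi_choice does and that every other type takes a string; validate_entry is a hypothetical helper, not part of the engine.

```python
from typing import Any, Dict, List

WIDGET_TYPES = {
    "short_text", "paragraph_text", "single_choice",
    "multi_choice", "date", "time",
}
# Assumption: only multi_choice carries a list value.
LIST_VALUED = {"multi_choice"}


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one answer entry (empty if valid)."""
    errors = [f"missing field: {key}"
              for key in ("label", "widget_type", "value")
              if key not in entry]
    if errors:
        return errors
    widget_type = entry["widget_type"]
    if widget_type not in WIDGET_TYPES:
        errors.append(f"unknown widget_type: {widget_type!r}")
    elif widget_type in LIST_VALUED:
        if not isinstance(entry["value"], list):
            errors.append(f"{widget_type} expects a list value")
    elif not isinstance(entry["value"], str):
        errors.append(f"{widget_type} expects a string value")
    return errors
```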
Each run generates:
- data/forms/<form_id>/runs/run_XXXX/<form_id>_run_XXXX.webm
- data/forms/<form_id>/runs/run_XXXX/annotations.json
- data/forms/<form_id>/runs/run_XXXX/answers_instance.json
- data/forms/<form_id>/runs/run_XXXX/tool_trace.jsonl
- data/forms/<form_id>/runs/run_XXXX/observations/step_XXXX_pre.png
- data/forms/<form_id>/runs/run_XXXX/observations/step_XXXX_post.png
- data/forms/<form_id>/runs/run_XXXX/observations/submit_pre.png
- data/forms/<form_id>/runs/run_XXXX/observations/submit_post.png
annotations.json includes form/run identifiers, video path, run parameters, macro actions, submit timing, and trace pointers.
tool_trace.jsonl is JSONL with Playwright MCP-style action names.
In local mode it records low-level interaction actions (browser_mouse_click_xy, browser_mouse_move_xy, browser_type, browser_press_key, browser_mouse_wheel).
In mcp_server mode it records official MCP tool calls (browser_navigate, browser_run_code, browser_wait_for, browser_take_screenshot, browser_close).
Each action now also includes required-field metadata when detectable: required, required_attr, required_marker.
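To inspect a trace, the action names can be tallied per file. The sketch below assumes each JSONL event stores its action name under an "action" key; that key is a guess, so pass the real field name via name_key after checking a trace file.

```python
import json
from collections import Counter
from pathlib import Path


def tally_actions(trace_path: Path, name_key: str = "action") -> Counter:
    """Count trace events per action name in a tool_trace.jsonl file."""
    counts: Counter = Counter()
    with trace_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get(name_key, "<unknown>")] += 1
    return counts
```

In local mode the tally would be expected to show names such as browser_type and browser_press_key, while in mcp_server mode it would show calls such as browser_navigate and browser_run_code.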
Useful flags:
- --num-runs: limit how many runs to generate in one execution.
- --start-index: force the starting run index.
- --resume: continue from the next missing run index.
- --skip-existing: skip runs whose output directory already exists.
- --skip-existing-video: skip runs whose output directory already contains a .webm.
- --overwrite-existing: delete an existing run directory and regenerate it.
- --all-forms: run all form specs in src/forms.
- --smoke-test-all-forms: run one test run per form and continue through failures, then print summary.
- --form-url: override the URL from the spec file.
- --answers-root: base directory for automatic answer matching (default: data/answers).
- --answers-file: primary filename to look for in each form answer directory (default: runs.json).
  - Fallback if missing: runs.jsonl, runs.ndjson (fails with explicit error if none exist).
- --headless: run without visible browser UI.
- --slow-mo: add a delay (ms) to Playwright actions.
- --type-delay-ms: delay (ms) between typed characters.
- --action-delay-ms: delay (ms) after each action for visibility.
- --viewport-width / --viewport-height: set the browser viewport.
- --timeout-ms: set Playwright timeout for waits.
- --screenshots: enable per-step and submit screenshots.
- --no-mouse-overlay: disable the visible mouse overlay.
- --interaction-mode: browser action backend, local (default) or mcp_server.
- --trace-mode: trace backend, mcp (default) or local.
- --mcp-server-cmd: override MCP server command (defaults to bundled trace server).
- --mcp-tool-name: MCP tool used for event normalization (default: record_action).
- --mcp-timeout-ms: MCP request timeout (default: 5000).
- --browser-mcp-cmd: override official Playwright MCP browser command used in mcp_server mode.
- --browser-mcp-timeout-ms: timeout for browser MCP tool calls (default: 120000).
- --no-mcp-browser-install: disable automatic Node Playwright browser install preflight.
- --mcp-browser-install-timeout-s: timeout for browser preflight install (default: 600).
- --no-mcp-verify-trace: disable MCP action-schema validation for trace events.
- --no-mcp-strict: keep validation on but do not fail the run on validation errors.
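One way to read "next missing run index" for --resume is the lowest index that has no run directory yet. The sketch below implements that reading and assumes that run_XXXX means a four-digit, zero-padded index starting at 1; next_missing_run_index is a hypothetical helper, and the runner may instead continue after the highest existing index.

```python
import re
from pathlib import Path

# Assumption: run directories are named run_0001, run_0002, ...
RUN_DIR_RE = re.compile(r"^run_(\d{4})$")


def next_missing_run_index(runs_dir: Path, first_index: int = 1) -> int:
    """Return the lowest run index >= first_index with no run directory."""
    existing = set()
    if runs_dir.is_dir():
        for child in runs_dir.iterdir():
            match = RUN_DIR_RE.match(child.name)
            if match and child.is_dir():
                existing.add(int(match.group(1)))
    index = first_index
    while index in existing:
        index += 1
    return index
```

For a form whose runs directory holds run_0001, run_0002 and run_0004, this returns 3, so the gap is filled before any new index is created.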
Overwrite existing run:
python3 src/engine/runner.py \
--form-id conf_interest \
--dataset-root data/forms \
--num-runs 1 \
--start-index 1 \
--overwrite-existing