Audit-grade Flutter testing for AI agents — drive real iPhones, Androids & the web, then grade what ships.
mcp-phone-controll is an MCP server that gives agents safe, structured access to real Android + iOS devices and Flutter web — and, uniquely, an opinionated audit suite that grades the code, tests, and runtime an agent produces. 147 tools. Works with Claude Desktop / Code, Cursor, or any MCP host — including local/SLM models via the OpenAI-compat HTTP adapter.
- An audit suite no other Flutter MCP has. Pure-compute senior-engineer rubrics that grade what the agent ships: audit_code_seniority, audit_security (OWASP MASVS), audit_performance (animation / scroll / rebuild jank), audit_accessibility, audit_localization, audit_dependencies, audit_test_quality, audit_web_app, all gated by a 9-domain audit_release_readiness composite that returns a ship / hold / block verdict. No device needed; works for any model.
- Composes, doesn't reinvent. It drives devices (adb / WebDriverAgent / Patrol) and grades the result. For web driving it composes with the model-agnostic Chrome DevTools MCP / Playwright MCP rather than shipping its own browser driver, and defers SDK plumbing to Google's built-in dart mcp-server and mobile flows to Maestro. → The Stack
- Runtime graders: you capture, we grade. run_lighthouse (web vitals), ingest_frame_timeline (jank score from a VM-Timeline / Chrome trace), ingest_har (per-action network / Firestore cost), ingest_maestro_report.
pip install mcp-phone-controll # or: uvx mcp-phone-controll
claude mcp add phone-controll -- python -m mcp_phone_controll
# optional — model-agnostic web driving (compose, don't reinvent):
claude mcp add chrome-devtools --scope user -- npx -y chrome-devtools-mcp@latest
claude mcp add playwright --scope user -- npx -y @playwright/mcp@latest

Then call describe_capabilities from your agent. Full setup (venv-pinned, device prereqs): First 15 minutes.
→ The Stack · Navigation latency · Performance rubric · SLM / local-model setup · Senior-tester discipline · Comparison vs other MCPs · Web before/after playbook · FAQ · Configuration · Tools by category · Architecture
Round-trip reduction. Testing dragged because every step was its own request → model-reasoning → next-request cycle (total time ≈ round-trips × per-turn cost). v0.15.0 collapses many steps into one call.
- 🆕 batch (v0.15.0): run an ordered list of tool calls server-side in ONE round-trip; the model reasons once for a known flow, not once per step. Each step gets the full pipeline (image-cap / trace / truncate); stop_on_error + a per-step trace. → Navigation latency
- 🆕 Diagnostics-on-failure (v0.15.0): a failed batch / tap_and_verify step folds a capped screenshot + recent error logs into the result, so diagnosing costs no extra round-trip.
- 🆕 wait_until (v0.15.0): block server-side until an element is visible or gone (spinner/dialog disappears); one call instead of an agent poll-loop.
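
To make the round-trip saving concrete, here is a minimal sketch of a server-side batch runner. It is not the project's implementation: the step shape ({"tool": ..., "args": ...}), the run_batch name, and the stand-in tools are all assumptions made for illustration. It only shows the idea described above: run an ordered list in one call, record a per-step trace, and stop at the first failure when stop_on_error is set.

```python
from typing import Any, Callable

# Hypothetical registry type: tool name -> callable taking keyword arguments.
ToolRegistry = dict[str, Callable[..., Any]]


def run_batch(steps: list[dict[str, Any]], tools: ToolRegistry,
              stop_on_error: bool = True) -> dict[str, Any]:
    """Run tool calls in order and return one result carrying a per-step trace."""
    trace: list[dict[str, Any]] = []
    for index, step in enumerate(steps):
        entry: dict[str, Any] = {"index": index, "tool": step["tool"]}
        try:
            entry["result"] = tools[step["tool"]](**step.get("args", {}))
            entry["ok"] = True
        except Exception as exc:  # a failed step is recorded in the trace, not raised
            entry["ok"] = False
            entry["error"] = f"{type(exc).__name__}: {exc}"
        trace.append(entry)
        if not entry["ok"] and stop_on_error:
            break  # later steps are skipped
    return {"ok": all(e["ok"] for e in trace), "steps_run": len(trace), "trace": trace}


# Stand-in tools so the sketch runs without a device (assumptions, not real tools).
def fake_tap(text: str) -> str:
    return f"tapped {text}"


def fake_wait(text: str) -> str:
    raise TimeoutError(f"{text!r} never appeared")


demo_tools: ToolRegistry = {"tap": fake_tap, "wait_until": fake_wait}
demo_steps = [
    {"tool": "tap", "args": {"text": "Login"}},
    {"tool": "wait_until", "args": {"text": "Home"}},
    {"tool": "tap", "args": {"text": "Settings"}},
]
outcome = run_batch(demo_steps, demo_tools)
print(outcome["steps_run"], outcome["ok"])  # 2 False
```

With the rule of thumb above (total time ≈ round-trips × per-turn cost), the three demo steps cost one model turn instead of three, and the failing second step still comes back with its error in the trace.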
Action-primitive hardening, all from real-device field feedback (iPhone 17 Pro sim / WDA 13, Galaxy S25). The action tools failed on the two screens that mattered; these fixes target each cause.
- 🆕 iOS tap self-heals a dead WDA session (v0.14.0): tap used to stay pinned to a dead session id forever (surviving WDA restart, reselect, even a sim reboot). Any recoverable session error now drops the cached session and re-handshakes once. Fixes the /wda/tap/0 unknown command symptom too.
- 🆕 Deep-tree actions + snapshotMaxDepth (v0.14.0): tap / swipe / tap_text no longer abort with call depth exceed 10 on deep Flutter/Compose trees.
- 🆕 tap(bounds=…) + dump_ui artifact spill + zoom_screenshot (v0.14.0): tap an element by its bounds when it has no stable selector; dump_ui writes the full XML to an artifact when large (no more truncation dead-end); crop+upscale a region to read tiny UI. → Navigation latency
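
The self-healing behaviour in the first bullet (drop the cached session, re-handshake once) follows a common retry-once pattern. The sketch below illustrates that pattern only; SessionError, SelfHealingClient, and the fake backend are hypothetical names, and nothing here reflects the actual WDA client code.

```python
from typing import Callable, Optional


class SessionError(Exception):
    """Stand-in for a recoverable session failure, e.g. an invalid session id."""


class SelfHealingClient:
    def __init__(self, open_session: Callable[[], str],
                 send_tap: Callable[[str, int, int], str]) -> None:
        self._open_session = open_session
        self._send_tap = send_tap
        self._session_id: Optional[str] = None  # cached between calls

    def _ensure_session(self) -> str:
        if self._session_id is None:
            self._session_id = self._open_session()
        return self._session_id

    def tap(self, x: int, y: int) -> str:
        try:
            return self._send_tap(self._ensure_session(), x, y)
        except SessionError:
            self._session_id = None  # drop the dead cached session
            # One re-handshake and one retry; a second failure propagates.
            return self._send_tap(self._ensure_session(), x, y)


# Fake backend: the first session handed out is dead, the second one works.
_issued = iter(["s1", "s2"])


def fake_open() -> str:
    return next(_issued)


def fake_send_tap(session_id: str, x: int, y: int) -> str:
    if session_id == "s1":
        raise SessionError("invalid session id")
    return f"tap({x},{y}) via {session_id}"


client = SelfHealingClient(fake_open, fake_send_tap)
result = client.tap(10, 20)
print(result)  # tap(10,20) via s2
```

Retrying exactly once keeps a genuinely broken backend from looping forever while still recovering from the stale-session case.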
The navigation-latency arc. For device agents, the slow default is screenshot → reason over pixels → compute x,y → tap → screenshot to confirm — ~1–2k image tokens and a vision round-trip per step. v0.13.0 makes the structured (no-vision) path the easy default for any work that isn't a visual check.
- 🆕 tap by selector (v0.13.0): tap(resource_id=… | text=… | class_name=…) resolves and taps server-side in one call: no screenshot, no pixel reasoning, no coordinate math. Get selectors from extract_ui_graph; x,y becomes the fallback. A miss fails with capture_diagnostics rather than tapping a guess. → Navigation latency
- 🆕 Optional UI-hierarchy cache (v0.13.0): MCP_UI_CACHE_TTL_MS caches the read-only dump on a stable screen so repeated extract_ui_graph / dump_ui don't re-hit the device. Off by default; only observations are cached (taps always resolve live), so it can never cause a wrong tap.
- 🆕 estimate_tokens (v0.12.0): a context-budget guard: estimate a string or file, pass budget_tokens, and get a fits / headroom verdict + a recommendation (proceed / proceed_with_caution / flush_context). tiktoken when installed, else a calibrated heuristic. → SLM guide
- 🆕 Tool-tier scoping on the HTTP adapter (v0.11.0): GET /tools?tier=basic|intermediate|expert (and MCP_TOOL_TIER) so a 4B-class local LLM gets a reasoning-sized surface and pulls in the long tail on demand via describe_capabilities.
- 🆕 audit_performance (v0.9.0): static jank audit: animation anti-patterns (controller-in-build, setState-in-listener, animated Opacity), scroll/virtualization (ListView(children:) vs .builder), rebuild cost. → rubric
- 🆕 ingest_frame_timeline + ingest_har (v0.10.0): runtime graders: jank score from a captured frame timeline; per-action network/Firestore cost from a HAR.
- 🆕 Flutter web (v0.5–0.8): audit_web_app, run_lighthouse, web debug sessions (start_debug_session(serial="chrome") → dump_widget_tree via DWDS), run_unit_tests(platform="chrome").
- 🔧 Composition: registers + documents Chrome DevTools MCP (debug/tooling) and Playwright MCP (visual/SLM) as the model-agnostic browser-driving layer; tracks Google's MCP now being built into the SDK (dart mcp-server).
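
The UI-hierarchy cache bullet above describes a time-to-live cache that serves only read-only observations. Here is a minimal sketch of that idea, assuming a TTL of 0 means "off" (consistent with the off-by-default behaviour). UiDumpCache and its injectable clock are hypothetical, and how the server actually reads MCP_UI_CACHE_TTL_MS is not shown in this README.

```python
import time
from typing import Callable, Optional


class UiDumpCache:
    """Cache a read-only UI dump for ttl_ms milliseconds; ttl_ms=0 disables caching."""

    def __init__(self, fetch: Callable[[], str], ttl_ms: int = 0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # hits the device; assumed to be expensive
        self._ttl_s = ttl_ms / 1000.0
        self._clock = clock              # injectable so the demo needs no sleeping
        self._value: Optional[str] = None
        self._stored_at = 0.0

    def get(self) -> str:
        now = self._clock()
        fresh = self._value is not None and (now - self._stored_at) < self._ttl_s
        if not fresh:
            self._value = self._fetch()
            self._stored_at = now
        return self._value


# Demo with a fake device and a fake clock.
calls: list[int] = []


def fake_fetch() -> str:
    calls.append(1)
    return f"<hierarchy dump #{len(calls)}>"


fake_now = [0.0]
cache = UiDumpCache(fake_fetch, ttl_ms=500, clock=lambda: fake_now[0])
first = cache.get()    # fetches from the fake device
second = cache.get()   # served from the cache
fake_now[0] = 0.6      # 600 ms later the entry has expired
third = cache.get()    # fetches again
print(len(calls))  # 2
```

In this sketch an action path would call fake_fetch directly rather than cache.get, which mirrors the stated guarantee that taps always resolve against the live screen.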
Previous milestones: v0.4.0 Maestro composition (audit_maestro_flow + ingest_maestro_report) · v0.3.0 the audit suite + senior-tester loop (design_test_plan + audit_test_quality) · v0.2.x PyPI release, multi-device locking, Patrol, AR/vision. Full history: CHANGELOG.md.
Mobile QA still loses 30–50% of its engineering hours to flaky selector maintenance (Drizz industry survey, 2026). Agents can close that loop — but until now there was no production-grade MCP that gave them safe, structured access to real phones. This is that MCP:
- Cross-session device locking so 4 concurrent Claude windows don't collide on the same Galaxy S25.
- Tiered tool surface (BASIC / INTERMEDIATE / EXPERT, 147 tools total) so 4B-class local LLMs aren't overwhelmed and Claude Desktop's tool-count ceiling doesn't drop your server.
- Defense-in-depth image cap that survived three production "2000 px API limit" incidents, including the case where an overnight bot bypassed take_screenshot and used raw adb screencap.
- Patrol-first Flutter integration with system=true for OS dialogs, tap_and_verify for the verify-after-action discipline, and YAML test plans the agent can author and re-run.
- Production-ready out of the gate: CycloneDX SBOM, pip-audit gating, structured JSON logs, Prometheus /metrics, k8s /health + /ready, Docker image, GitHub Action wrapper, 7 ADRs documenting load-bearing decisions.
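
As a rough illustration of the image cap in the list above, the sketch below computes the target size for an oversized screenshot. It assumes the 2000 px limit applies to the longer side and that the remedy is proportional downscaling; capped_size is a hypothetical helper, not a function from this project.

```python
def capped_size(width: int, height: int, max_side: int = 2000) -> tuple[int, int]:
    """Return dimensions scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the cap
    # Integer arithmetic avoids float rounding pushing a side over the cap.
    return (max(1, width * max_side // longest),
            max(1, height * max_side // longest))


print(capped_size(1080, 2400))  # (900, 2000)
print(capped_size(800, 600))    # (800, 600)
```

Applying a check like this to every image that leaves the server, not only inside take_screenshot, is the "defense-in-depth" part: output from a raw adb screencap would pass through the same cap.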
| Path | What |
|---|---|
| packages/phone-controll/ | The flagship MCP. 147 tools spanning device control, build/install/launch, Patrol-driven Flutter UI tests, Flutter web (web debug sessions + run_lighthouse + audit_web_app), AR/Vision, declarative YAML test plans, cross-session device locking, the audit suite (seniority/security/performance/i18n/dependencies/a11y/test-quality/web + 9-domain composite), the senior-tester loop (design_test_plan + audit_test_quality), and runtime graders (ingest_frame_timeline / ingest_har / ingest_maestro_report). |
| packages/<future>/ | Future MCPs slot in here using the same shape (see docs/adding_an_mcp.md). |
| examples/templates/ | Shared YAML test-plan templates (smoke, ump-decline, ar-anchor, flutter-test-smoke). |
| examples/agent_loop.py | Reference autonomous Plan→Build→Test→Verify loop using any OpenAI-compat local LLM. |
| skills/ | Symlinks to the Claude Code skills that ship with these MCPs. |
| scripts/ | Fresh-laptop installer, doctor, and ops scripts. |
| docs/ | Architecture, framework-extension recipe, MCP-extension recipe. |
- Atomic cross-MCP refactors — change shared types in one PR.
- One venv, one CI, one set of pre-commit hooks boots everything.
- The HTTP adapter's existing sub-router pattern (e.g. /dev-session/*) lets future packages register their own routers without coordinating across repos.
- Easy to extract later: git filter-repo --subdirectory-filter packages/<name> peels any package back into its own repo.
git clone <this repo> ~/Desktop/flutter-dev-agents
cd ~/Desktop/flutter-dev-agents/packages/phone-controll
uv venv --python 3.11
uv pip install -e ".[dev,ar,http]"
pytest # full unit suite, no toolchain needed
# Register the MCP with Claude Code
claude mcp add phone-controll -- \
/Users/$(whoami)/Desktop/flutter-dev-agents/packages/phone-controll/.venv/bin/python \
-m mcp_phone_controll

For a step-by-step "open VS Code → drive a real phone" walkthrough that exercises every Tier A–F tool, see docs/walkthrough-vscode-test.md.
See packages/phone-controll/README.md for the full list. Briefly:
- Android: adb (brew install --cask android-platform-tools)
- iOS: Xcode + CLT, pymobiledevice3 remote tunneld running for developer-tier services
- Flutter: flutter on PATH; for Patrol: dart pub global activate patrol_cli
- AR (optional): [ar] extra installs OpenCV
- HTTP adapter (optional): [http] extra installs FastAPI + uvicorn
Run check_environment from any Claude Code session — it returns a structured doctor report with concrete fix commands for any red items.
- Native macOS for the human factory: real devices via USB, iOS simulators, multiple VS Code windows, multi-Claude concurrent sessions. Each Claude session owns its devices via the MCP's filesystem-coordinated locks.
- Linux container (planned, deferred): headless Android emulator + Flutter + Patrol + the MCP, for CI runners. See docs/architecture.md.
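
The filesystem-coordinated device locks mentioned above can be pictured with an atomic create-if-absent lock file per device. This is a sketch of that general technique under stated assumptions: the lock directory, the file naming, and the acquire_device_lock / release_device_lock helpers are hypothetical, and stale-lock recovery (for example after a crashed session) is left out.

```python
import os
import tempfile
from pathlib import Path


def acquire_device_lock(lock_dir: Path, serial: str, owner: str) -> bool:
    """Try to claim a device; return True if this owner now holds the lock."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{serial}.lock"
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)  # record who holds the device
    return True


def release_device_lock(lock_dir: Path, serial: str, owner: str) -> bool:
    """Release the lock only if this owner holds it."""
    lock_path = lock_dir / f"{serial}.lock"
    try:
        if lock_path.read_text() != owner:
            return False
    except FileNotFoundError:
        return False
    lock_path.unlink()
    return True


with tempfile.TemporaryDirectory() as tmp:
    locks = Path(tmp)
    first = acquire_device_lock(locks, "emulator-5554", "claude-1")
    second = acquire_device_lock(locks, "emulator-5554", "claude-2")
    wrong_release = release_device_lock(locks, "emulator-5554", "claude-2")
    released = release_device_lock(locks, "emulator-5554", "claude-1")
    third = acquire_device_lock(locks, "emulator-5554", "claude-2")

print(first, second, wrong_release, released, third)  # True False False True True
```

Because the lock lives on the filesystem rather than in one process's memory, separate sessions on the same machine see the same state without needing a coordinator process.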
- packages/phone-controll/ v0.15.2: 147 tools live on PyPI, 1145 hermetic unit tests + real-device tests (gated on MCP_REAL_DEVICE=1). Field-tested across real Flutter projects (docs/v030-field-test.md); the v0.13–v0.15 tools (selector-tap, dump_ui spill, deep-tree dump, batch, diagnostics-on-failure, wait_until, zoom_screenshot) live-verified on the iPhone 17 Pro simulator.
- First-real-device patch release shipped May 2026: fixed iOS 17+ --rsd routing, WDA team_id signing, Polish NBSP tap_text, raw-adb screencap recovery loop. See CHANGELOG.md.
- Multi-window VS Code orchestration + debug sessions + WDA setup + cross-session device locks all in place.
A typical day on the factory laptop:
Claude #1 in checkaiapp/
→ open_project_in_ide("checkaiapp") # spawns its own VS Code window
→ select_device(R3CYA05CHXB) # acquires the lock on the Galaxy
→ start_debug_session(project_path=...) # `flutter run --machine`, returns vm_service_uri
→ ...edit code, restart_debug_session, read_debug_log, repeat...
→ run_patrol_test (or run_test_plan with dev_iteration.yaml)
→ stop_debug_session, release_device, close_ide_window
Claude #2 in another_app/ → emulator-5554, its own VS Code, its own debug
Claude #3 in third_app/ → iPhone simulator UDID, its own VS Code, its own debug
Three independent debug sessions, three IDE windows, three locked devices, no collisions. The HTTP adapter exposes both the unified /tools/* surface and a focused /dev-session/* sub-router for agents that only care about the dev-iteration loop.
See examples/templates/dev_iteration.yaml for a runnable plan template; docs/ios_setup.md for the iPhone prerequisites (Developer Mode, DDI, tunneld, WebDriverAgent).
See docs/adding_a_framework.md and docs/adding_an_mcp.md for the extension recipes. Both stay small (a few new files each) thanks to the Clean Architecture boundaries.
Mirrors CI exactly — install once, never push a red build again:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files # one-time baseline; CI parity check

Three gates: ruff (lint+autofix), pytest -q (fast suite, no tests/agent), generate_tool_catalogue --check (refuses if docs/tools.md drifts from the live registry). See .pre-commit-config.yaml.
A shippable visual-asset brief pack lives in docs/design/ — six self-contained briefs (logo, social preview, landing page, architecture diagram, demo video, pitch deck) each with concrete specs + a Claude-designer prompt. Total ~12 person-days of design work to ship the full pack; the first 3 briefs (~7 days) cover 80% of the launch surface.
Apache License 2.0 — see LICENSE. Inbound contributions follow the same license; no separate CLA.