flutter-dev-agents

Audit-grade Flutter testing for AI agents — drive real iPhones, Androids & the web, then grade what ships.

mcp-phone-controll is an MCP server that gives agents safe, structured access to real Android + iOS devices and Flutter web — and, uniquely, an opinionated audit suite that grades the code, tests, and runtime an agent produces. 147 tools. Works with Claude Desktop / Code, Cursor, or any MCP host — including local/SLM models via the OpenAI-compat HTTP adapter.

What makes it different

An audit suite no other Flutter MCP has. Pure-compute senior-engineer rubrics that grade what the agent ships — audit_code_seniority, audit_security (OWASP MASVS), audit_performance (animation / scroll / rebuild jank), audit_accessibility, audit_localization, audit_dependencies, audit_test_quality, audit_web_app — all gated by a 9-domain audit_release_readiness composite that returns a ship / hold / block verdict. No device needed; works for any model.
Composes, doesn't reinvent. It drives devices (adb / WebDriverAgent / Patrol) and grades the result. For web driving it composes with the model-agnostic Chrome DevTools MCP / Playwright MCP rather than shipping its own browser driver, and defers SDK plumbing to Google's built-in dart mcp-server and mobile flows to Maestro. → The Stack
Runtime graders — you capture, we grade. run_lighthouse (web vitals), ingest_frame_timeline (jank score from a VM-Timeline / Chrome trace), ingest_har (per-action network / Firestore cost), ingest_maestro_report.

Quickstart

pip install mcp-phone-controll                # or: uvx mcp-phone-controll
claude mcp add phone-controll -- python -m mcp_phone_controll

# optional — model-agnostic web driving (compose, don't reinvent):
claude mcp add chrome-devtools --scope user -- npx -y chrome-devtools-mcp@latest
claude mcp add playwright      --scope user -- npx -y @playwright/mcp@latest

Then call describe_capabilities from your agent. Full setup (venv-pinned, device prereqs): First 15 minutes.

→ The Stack · Navigation latency · Performance rubric · SLM / local-model setup · Senior-tester discipline · Comparison vs other MCPs · Web before/after playbook · FAQ · Configuration · Tools by category · Architecture

What's new in v0.15.0 (June 2026)

Round-trip reduction. Testing dragged because every step was its own request → model-reasoning → next-request cycle (total time ≈ round-trips × per-turn cost). v0.15.0 collapses many steps into one call.
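The relation in the parenthetical above can be made concrete with a small calculation. The step count and the per-turn cost below are illustrative assumptions, not measurements from this project.

```python
def total_time_s(round_trips: int, per_turn_cost_s: float) -> float:
    """Approximate wall-clock time as round-trips x per-turn cost."""
    return round_trips * per_turn_cost_s


# Hypothetical flow: 8 known steps, 5 s per request/reason/respond turn.
STEPS = 8
PER_TURN_COST_S = 5.0

one_call_per_step = total_time_s(STEPS, PER_TURN_COST_S)  # 8 round-trips
single_batch_call = total_time_s(1, PER_TURN_COST_S)      # 1 round-trip

print(f"per-step: {one_call_per_step:.0f}s, batched: {single_batch_call:.0f}s")
# prints: per-step: 40s, batched: 5s
```

The batched figure leaves out the device time the steps themselves take, which is paid either way; only the per-turn model cost is collapsed.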

🆕 batch (v0.15.0) — run an ordered list of tool calls server-side in ONE round-trip; the model reasons once for a known flow, not once per step. Each step gets the full pipeline (image-cap / trace / truncate); stop_on_error + a per-step trace. → Navigation latency
🆕 Diagnostics-on-failure (v0.15.0) — a failed batch/tap_and_verify step folds a capped screenshot + recent error logs into the result, so diagnosing costs no extra round-trip.
🆕 wait_until (v0.15.0) — block server-side until an element is visible or gone (spinner/dialog disappears) — one call instead of an agent poll-loop.
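The batch behaviour described above (ordered steps, stop_on_error, a per-step trace) can be sketched in a few lines. This is a minimal in-memory illustration, not the server's implementation: the field names (tool, args, ok, trace) and the stand-in tools are assumptions, and the image-cap, truncation, and diagnostics stages are not modeled.

```python
from typing import Any, Callable

ToolFn = Callable[..., Any]


def run_batch(steps: list[dict], tools: dict[str, ToolFn],
              stop_on_error: bool = True) -> dict:
    """Run tool calls in order and return a per-step trace."""
    trace = []
    for index, step in enumerate(steps):
        name, args = step["tool"], step.get("args", {})
        try:
            result = tools[name](**args)
            trace.append({"step": index, "tool": name, "ok": True, "result": result})
        except Exception as exc:  # a failed step is recorded, not raised
            trace.append({"step": index, "tool": name, "ok": False, "error": str(exc)})
            if stop_on_error:
                break
    return {"ok": all(t["ok"] for t in trace), "completed": len(trace), "trace": trace}


# Stand-in tools for the demo; real tools would talk to a device.
def fake_tap(text: str) -> str:
    return f"tapped {text}"


def fake_wait_until(text: str) -> str:
    raise TimeoutError(f"{text!r} still visible")


tools = {"tap": fake_tap, "wait_until": fake_wait_until}
steps = [
    {"tool": "tap", "args": {"text": "Sign in"}},
    {"tool": "wait_until", "args": {"text": "Loading"}},
    {"tool": "tap", "args": {"text": "Continue"}},  # not run: previous step failed
]
report = run_batch(steps, tools)
print(report["ok"], report["completed"])  # prints: False 2
```

With stop_on_error=False the third step would still run, and the trace would show which steps failed.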

What's new in v0.14.0 (June 2026)

Action-primitive hardening, all from real-device field feedback (iPhone 17 Pro sim / WDA 13, Galaxy S25). The action tools failed on the two screens that mattered; these fixes target each cause.

🆕 iOS tap self-heals a dead WDA session (v0.14.0) — tap used to stay pinned to a dead session id forever (surviving WDA restart, reselect, even a sim reboot). Any recoverable session error now drops the cached session and re-handshakes once. Fixes the /wda/tap/0 unknown command symptom too.
🆕 Deep-tree actions + snapshotMaxDepth (v0.14.0) — tap/swipe/tap_text no longer abort with call depth exceed 10 on deep Flutter/Compose trees.
🆕 tap(bounds=…) + dump_ui artifact spill + zoom_screenshot (v0.14.0) — tap an element by its bounds when it has no stable selector; dump_ui writes the full XML to an artifact when large (no more truncation dead-end); crop+upscale a region to read tiny UI. → Navigation latency
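The self-healing tap in the first item above follows a common pattern: cache a session, and on a recoverable session error drop the cache, re-handshake, and retry exactly once. The sketch below shows that pattern against a fake driver; the class names and the error type are hypothetical and do not reflect the WebDriverAgent client used by the project.

```python
from typing import Optional


class SessionError(Exception):
    """Stand-in for a recoverable 'session is dead' error."""


class FakeDriver:
    """Hypothetical transport: sessions created before a restart are dead."""

    def __init__(self) -> None:
        self.generation = 0
        self.handshakes = 0

    def create_session(self) -> str:
        self.handshakes += 1
        return f"session-{self.generation}"

    def tap(self, session_id: str, x: int, y: int) -> str:
        if session_id != f"session-{self.generation}":
            raise SessionError(f"invalid session id {session_id}")
        return f"tap({x},{y}) via {session_id}"


class SelfHealingClient:
    def __init__(self, driver: FakeDriver) -> None:
        self.driver = driver
        self._session_id: Optional[str] = None

    def _session(self) -> str:
        if self._session_id is None:
            self._session_id = self.driver.create_session()
        return self._session_id

    def tap(self, x: int, y: int) -> str:
        try:
            return self.driver.tap(self._session(), x, y)
        except SessionError:
            self._session_id = None  # drop the cached (dead) session
            # Re-handshake and retry once; a second failure propagates.
            return self.driver.tap(self._session(), x, y)


driver = FakeDriver()
client = SelfHealingClient(driver)
first = client.tap(10, 20)
driver.generation += 1  # simulate a restart that invalidates the old session
second = client.tap(10, 20)
print(first, "|", second)
# prints: tap(10,20) via session-0 | tap(10,20) via session-1
```

Retrying only once keeps a persistently broken driver from looping; the second error reaches the caller unchanged.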

What's new in v0.13.0 (June 2026)

The navigation-latency arc. For device agents, the slow default is screenshot → reason over pixels → compute x,y → tap → screenshot to confirm — ~1–2k image tokens and a vision round-trip per step. v0.13.0 makes the structured (no-vision) path the easy default for any work that isn't a visual check.

🆕 tap by selector (v0.13.0) — tap(resource_id=… | text=… | class_name=…) resolves and taps server-side in one call: no screenshot, no pixel reasoning, no coordinate math. Get selectors from extract_ui_graph; x,y becomes the fallback. A miss fails with capture_diagnostics rather than tapping a guess. → Navigation latency
🆕 Optional UI-hierarchy cache (v0.13.0) — MCP_UI_CACHE_TTL_MS caches the read-only dump on a stable screen so repeated extract_ui_graph/dump_ui don't re-hit the device. Off by default; only observations are cached (taps always resolve live), so it can never cause a wrong tap.
🆕 estimate_tokens (v0.12.0) — a context-budget guard: estimate a string or file, pass budget_tokens, and get a fits/headroom verdict + a recommendation (proceed / proceed_with_caution / flush_context). tiktoken when installed, else a calibrated heuristic. → SLM guide
🆕 Tool-tier scoping on the HTTP adapter (v0.11.0) — GET /tools?tier=basic|intermediate|expert (and MCP_TOOL_TIER) so a 4B-class local LLM gets a reasoning-sized surface and pulls in the long tail on demand via describe_capabilities.
🆕 audit_performance (v0.9.0) — static jank audit: animation anti-patterns (controller-in-build, setState-in-listener, animated Opacity), scroll/virtualization (ListView(children:) vs .builder), rebuild cost. → rubric
🆕 ingest_frame_timeline + ingest_har (v0.10.0) — runtime graders: jank score from a captured frame timeline; per-action network/Firestore cost from a HAR.
🆕 Flutter web (v0.5–0.8) — audit_web_app, run_lighthouse, web debug sessions (start_debug_session(serial="chrome") → dump_widget_tree via DWDS), run_unit_tests(platform="chrome").
🔧 Composition — registers + documents Chrome DevTools MCP (debug/tooling) and Playwright MCP (visual/SLM) as the model-agnostic browser-driving layer; tracks Google's MCP now being built into the SDK (dart mcp-server).
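To illustrate the selector-tap item above, here is a minimal sketch of resolving a selector to a tap point from a UI hierarchy dump without a screenshot. It assumes an Android-style dump in the shape produced by uiautomator dump, where bounds are written as [left,top][right,bottom]; the function name, the exception, and the sample XML are hypothetical, and iOS hierarchies are not covered.

```python
import re
import xml.etree.ElementTree as ET

# Sample in the shape of an Android `uiautomator dump` hierarchy.
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" resource-id="" text=""
        bounds="[0,0][1080,2340]">
    <node class="android.widget.Button" resource-id="com.example:id/login"
          text="Sign in" bounds="[100,1800][980,1960]" />
  </node>
</hierarchy>"""

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


class SelectorMiss(LookupError):
    """Raised instead of tapping a guessed coordinate."""


def resolve_tap_point(xml_dump, resource_id=None, text=None, class_name=None):
    """Return the centre (x, y) of the first node matching every given selector."""
    wanted = {"resource-id": resource_id, "text": text, "class": class_name}
    wanted = {key: value for key, value in wanted.items() if value is not None}
    if not wanted:
        raise ValueError("give at least one selector")
    for node in ET.fromstring(xml_dump).iter("node"):
        if all(node.get(key) == value for key, value in wanted.items()):
            match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
            if match is None:
                continue  # node has no usable bounds; keep looking
            left, top, right, bottom = map(int, match.groups())
            return (left + right) // 2, (top + bottom) // 2
    raise SelectorMiss(f"no node matches {wanted}")


point = resolve_tap_point(SAMPLE_DUMP, resource_id="com.example:id/login")
print(point)  # prints: (540, 1880)
```

Raising on a miss, rather than falling back to a coordinate, mirrors the "fails rather than tapping a guess" behaviour described above.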

Previous milestones: v0.4.0 Maestro composition (audit_maestro_flow + ingest_maestro_report) · v0.3.0 the audit suite + senior-tester loop (design_test_plan + audit_test_quality) · v0.2.x PyPI release, multi-device locking, Patrol, AR/vision. Full history: CHANGELOG.md.

Why it matters

Mobile QA still loses 30–50% of its engineering hours to flaky selector maintenance (Drizz industry survey, 2026). Agents can close that loop — but until now there was no production-grade MCP that gave them safe, structured access to real phones. This is that MCP:

Cross-session device locking so 4 concurrent Claude windows don't collide on the same Galaxy S25.
Tiered tool surface (BASIC / INTERMEDIATE / EXPERT, 147 tools total) so 4B-class local LLMs aren't overwhelmed and Claude Desktop's tool-count ceiling doesn't drop your server.
Defense-in-depth image cap that survived three production "2000 px API limit" incidents — including the case where an overnight bot bypassed take_screenshot and used raw adb screencap.
Patrol-first Flutter integration with system=true for OS dialogs, tap_and_verify for the verify-after-action discipline, and YAML test plans the agent can author and re-run.
Production-ready out of the gate: CycloneDX SBOM, pip-audit gating, structured JSON logs, Prometheus /metrics, k8s /health + /ready, Docker image, GitHub Action wrapper, 7 ADRs documenting load-bearing decisions.
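As an illustration of the image cap mentioned in the list above, the arithmetic of a size cap can be sketched as follows. The 2000 px figure comes from the incidents named there; applying it to the longest side while preserving aspect ratio is an assumption, and the function name is hypothetical. Defense-in-depth would mean running a check like this at every point where an image can leave the server, not only inside take_screenshot.

```python
def capped_size(width: int, height: int, max_side: int = 2000) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is <= max_side.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the cap: leave untouched
    return (max(1, width * max_side // longest),
            max(1, height * max_side // longest))


print(capped_size(1440, 3120))  # prints: (923, 2000)
print(capped_size(1080, 1920))  # prints: (1080, 1920)
```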

What's here

| Path | What |
| --- | --- |
| `packages/phone-controll/` | The flagship MCP. 147 tools spanning device control, build/install/launch, Patrol-driven Flutter UI tests, Flutter web (web debug sessions + `run_lighthouse` + `audit_web_app`), AR/Vision, declarative YAML test plans, cross-session device locking, the audit suite (seniority/security/performance/i18n/dependencies/a11y/test-quality/web + 9-domain composite), the senior-tester loop (`design_test_plan` + `audit_test_quality`), and runtime graders (`ingest_frame_timeline` / `ingest_har` / `ingest_maestro_report`). |
| `packages/<future>/` | Future MCPs slot in here using the same shape (see `docs/adding_an_mcp.md`). |
| `examples/templates/` | Shared YAML test-plan templates (smoke, ump-decline, ar-anchor, flutter-test-smoke). |
| `examples/agent_loop.py` | Reference autonomous Plan→Build→Test→Verify loop using any OpenAI-compat local LLM. |
| `skills/` | Symlinks to the Claude Code skills that ship with these MCPs. |
| `scripts/` | Fresh-laptop installer, doctor, and ops scripts. |
| `docs/` | Architecture, framework-extension recipe, MCP-extension recipe. |

Why a monorepo

Atomic cross-MCP refactors — change shared types in one PR.
One venv, one CI pipeline, and one set of pre-commit hooks cover every package.
The HTTP adapter's existing sub-router pattern (e.g. /dev-session/*) lets future packages register their own routers without coordinating across repos.
Easy to extract later: git filter-repo --subdirectory-filter packages/<name> peels any package back into its own repo.

Getting started (developer machine, macOS)

git clone <this repo> ~/Desktop/flutter-dev-agents
cd ~/Desktop/flutter-dev-agents/packages/phone-controll
uv venv --python 3.11
source .venv/bin/activate                 # so pytest below resolves to the venv
uv pip install -e ".[dev,ar,http]"
pytest                                    # full unit suite, no toolchain needed

# Register the MCP with Claude Code
claude mcp add phone-controll -- \
  /Users/$(whoami)/Desktop/flutter-dev-agents/packages/phone-controll/.venv/bin/python \
  -m mcp_phone_controll

For a step-by-step "open VS Code → drive a real phone" walkthrough that exercises every Tier A–F tool, see docs/walkthrough-vscode-test.md.

External prerequisites

See packages/phone-controll/README.md for the full list. Briefly:

Android: adb (brew install --cask android-platform-tools)
iOS: Xcode + Command Line Tools, plus a running pymobiledevice3 remote tunneld (required for developer-tier services)
Flutter: flutter on PATH; for Patrol: dart pub global activate patrol_cli
AR (optional): [ar] extra installs OpenCV
HTTP adapter (optional): [http] extra installs FastAPI + uvicorn

Run check_environment from any Claude Code session — it returns a structured doctor report with concrete fix commands for any red items.
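To show what a structured doctor report with fix commands might look like, here is a minimal sketch built on shutil.which. It is not the check_environment implementation: the report shape and the function name are assumptions. The adb and Patrol fix commands are taken from the prerequisites list above; the Flutter hint is a generic placeholder.

```python
import shutil

CHECKS = [
    {"name": "adb", "binary": "adb",
     "fix": "brew install --cask android-platform-tools"},
    {"name": "flutter", "binary": "flutter",
     "fix": "install Flutter and add it to PATH"},  # placeholder hint
    {"name": "patrol", "binary": "patrol",
     "fix": "dart pub global activate patrol_cli"},
]


def doctor_report(checks=CHECKS, which=shutil.which):
    """Check that each binary is on PATH; attach a fix command to red items."""
    items = []
    for check in checks:
        path = which(check["binary"])
        item = {"name": check["name"], "ok": path is not None, "path": path}
        if path is None:
            item["fix"] = check["fix"]
        items.append(item)
    return {"ok": all(item["ok"] for item in items), "items": items}


if __name__ == "__main__":
    for item in doctor_report()["items"]:
        status = "ok" if item["ok"] else f"MISSING, try: {item['fix']}"
        print(f"{item['name']}: {status}")
```

Passing which as a parameter keeps the check testable without touching the real PATH.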

Topologies

Native macOS for the human factory: real devices via USB, iOS simulators, multiple VS Code windows, multi-Claude concurrent sessions. Each Claude session owns its devices via the MCP's filesystem-coordinated locks.
Linux container (planned, deferred): headless Android emulator + Flutter + Patrol + the MCP, for CI runners. See docs/architecture.md.
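The filesystem-coordinated locks mentioned above can be illustrated with an atomic lock-file sketch. This is one common way to coordinate independent processes through the filesystem, not necessarily how the MCP does it: the lock-file layout, the function names, and the JSON payload are assumptions, and stale-lock recovery (for example when a session crashes) is left out.

```python
import json
import os
import tempfile
from pathlib import Path


class DeviceBusy(RuntimeError):
    """Another session already holds the device."""


def acquire(lock_dir: Path, serial: str, owner: str) -> Path:
    """Create <serial>.lock atomically; fail if another session holds it."""
    lock_path = lock_dir / f"{serial}.lock"
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        holder = json.loads(lock_path.read_text())["owner"]
        raise DeviceBusy(f"{serial} is locked by {holder}") from None
    with os.fdopen(fd, "w") as handle:
        json.dump({"owner": owner, "pid": os.getpid()}, handle)
    return lock_path


def release(lock_path: Path) -> None:
    lock_path.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    lock = acquire(Path(tmp), "emulator-5554", owner="claude-1")
    try:
        acquire(Path(tmp), "emulator-5554", owner="claude-2")
        second_session_blocked = False
    except DeviceBusy:
        second_session_blocked = True
    release(lock)
    reacquired = acquire(Path(tmp), "emulator-5554", owner="claude-2").exists()

print(second_session_blocked, reacquired)  # prints: True True
```

A production version would also need to handle a reader arriving between file creation and the JSON write, and to expire locks whose owning process has died.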

Status

packages/phone-controll/ v0.15.2 — 147 tools live on PyPI, 1145 hermetic unit tests + real-device tests (gated on MCP_REAL_DEVICE=1). Field-tested across real Flutter projects (docs/v030-field-test.md); the v0.13–v0.15 tools (selector-tap, dump_ui spill, deep-tree dump, batch, diagnostics-on-failure, wait_until, zoom_screenshot) live-verified on the iPhone 17 Pro simulator.
First-real-device patch release shipped May 2026 — fixed iOS 17+ --rsd routing, WDA team_id signing, Polish NBSP tap_text, raw-adb screencap recovery loop. See CHANGELOG.md.
Multi-window VS Code orchestration + debug sessions + WDA setup + cross-session device locks all in place.

Real-developer multi-project workflow

A typical day on the factory laptop:

Claude #1 in checkaiapp/
  → open_project_in_ide("checkaiapp")     # spawns its own VS Code window
  → select_device("R3CYA05CHXB")          # acquires the lock on the Galaxy
  → start_debug_session(project_path=...)  # `flutter run --machine`, returns vm_service_uri
  → ...edit code, restart_debug_session, read_debug_log, repeat...
  → run_patrol_test (or run_test_plan with dev_iteration.yaml)
  → stop_debug_session, release_device, close_ide_window

Claude #2 in another_app/                  → emulator-5554, its own VS Code, its own debug
Claude #3 in third_app/                    → iPhone simulator UDID, its own VS Code, its own debug

Three independent debug sessions, three IDE windows, three locked devices, no collisions. The HTTP adapter exposes both the unified /tools/* surface and a focused /dev-session/* sub-router for agents that only care about the dev-iteration loop.

See examples/templates/dev_iteration.yaml for a runnable plan template, and docs/ios_setup.md for the iPhone prerequisites (Developer Mode, DDI, tunneld, WebDriverAgent).

Contributing

See docs/adding_a_framework.md and docs/adding_an_mcp.md for the extension recipes. Both stay small (a few new files each) thanks to the Clean Architecture boundaries.

Pre-commit hooks

The hooks mirror CI exactly. Install them once, and a red build is caught before you push:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files   # one-time baseline; CI parity check

Three gates: ruff (lint + autofix), pytest -q (the fast suite, excluding tests/agent), and generate_tool_catalogue --check (fails if docs/tools.md drifts from the live registry). See .pre-commit-config.yaml.
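The third gate is a drift check: regenerate the document from the source of truth and fail if it differs from the committed file. The sketch below shows that general pattern only. The real generate_tool_catalogue script and the format of docs/tools.md are not shown in this README, so the rendering function, the registry shape, and the sample tool summaries here are hypothetical.

```python
def render_catalogue(registry: dict[str, str]) -> str:
    """Render a deterministic (sorted) catalogue from a name -> summary registry."""
    lines = ["# Tools", ""]
    lines += [f"- {name}: {summary}" for name, summary in sorted(registry.items())]
    return "\n".join(lines) + "\n"


def check_catalogue(registry: dict[str, str], committed_text: str) -> int:
    """Return 0 if the committed docs match the registry, 1 on drift.

    A non-zero exit code is what makes a pre-commit hook fail.
    """
    return 0 if render_catalogue(registry) == committed_text else 1


committed = render_catalogue({"tap": "tap an element"})
live_registry = {"tap": "tap an element", "batch": "run several tool calls"}

print(check_catalogue({"tap": "tap an element"}, committed))  # prints: 0
print(check_catalogue(live_registry, committed))              # prints: 1
```

Sorting the registry before rendering keeps the output stable, so the comparison only fails on real changes.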

Design

A shippable visual-asset brief pack lives in docs/design/ — six self-contained briefs (logo, social preview, landing page, architecture diagram, demo video, pitch deck) each with concrete specs + a Claude-designer prompt. Total ~12 person-days of design work to ship the full pack; the first 3 briefs (~7 days) cover 80% of the launch surface.

License

Apache License 2.0 — see LICENSE. Inbound contributions follow the same license; no separate CLA.
