michal-giza/flutter-dev-agents

flutter-dev-agents

Audit-grade Flutter testing for AI agents — drive real iPhones, Androids & the web, then grade what ships.

mcp-phone-controll is an MCP server that gives agents safe, structured access to real Android + iOS devices and Flutter web — and, uniquely, an opinionated audit suite that grades the code, tests, and runtime an agent produces. 147 tools. Works with Claude Desktop / Code, Cursor, or any MCP host — including local/SLM models via the OpenAI-compat HTTP adapter.

What makes it different

  • An audit suite no other Flutter MCP has. Pure-compute senior-engineer rubrics that grade what the agent ships — audit_code_seniority, audit_security (OWASP MASVS), audit_performance (animation / scroll / rebuild jank), audit_accessibility, audit_localization, audit_dependencies, audit_test_quality, audit_web_app — all gated by a 9-domain audit_release_readiness composite that returns a ship / hold / block verdict. No device needed; works for any model.
  • Composes, doesn't reinvent. It drives devices (adb / WebDriverAgent / Patrol) and grades the result. For web driving it composes with the model-agnostic Chrome DevTools MCP / Playwright MCP rather than shipping its own browser driver, and defers SDK plumbing to Google's built-in dart mcp-server and mobile flows to Maestro. → The Stack
  • Runtime graders — you capture, we grade. run_lighthouse (web vitals), ingest_frame_timeline (jank score from a VM-Timeline / Chrome trace), ingest_har (per-action network / Firestore cost), ingest_maestro_report.
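
To make the ship / hold / block gate concrete, here is a minimal sketch of how per-domain results could roll up into one verdict. The roll-up rule (the worst domain wins) and the names `Verdict` and `composite_verdict` are assumptions for illustration; the source does not state how audit_release_readiness combines its nine domains.

```python
from enum import Enum


class Verdict(str, Enum):
    SHIP = "ship"
    HOLD = "hold"
    BLOCK = "block"


# Severity order used to pick the worst verdict across domains (assumed rule).
_SEVERITY = {Verdict.SHIP: 0, Verdict.HOLD: 1, Verdict.BLOCK: 2}


def composite_verdict(domain_verdicts: dict[str, Verdict]) -> Verdict:
    """Return the most severe verdict among all audited domains."""
    if not domain_verdicts:
        # Nothing was audited: hold rather than pass silently.
        return Verdict.HOLD
    return max(domain_verdicts.values(), key=_SEVERITY.__getitem__)


# Hypothetical per-domain results; the real audits return richer reports.
results = {
    "security": Verdict.SHIP,
    "performance": Verdict.HOLD,
    "accessibility": Verdict.SHIP,
}
print(composite_verdict(results).value)  # hold
```

A worst-domain rule keeps the gate conservative: one held domain is enough to stop a release even when every other domain passes.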

Quickstart

pip install mcp-phone-controll                # or: uvx mcp-phone-controll
claude mcp add phone-controll -- python -m mcp_phone_controll

# optional — model-agnostic web driving (compose, don't reinvent):
claude mcp add chrome-devtools --scope user -- npx -y chrome-devtools-mcp@latest
claude mcp add playwright      --scope user -- npx -y @playwright/mcp@latest

Then call describe_capabilities from your agent. Full setup (venv-pinned, device prereqs): First 15 minutes.

The Stack · Navigation latency · Performance rubric · SLM / local-model setup · Senior-tester discipline · Comparison vs other MCPs · Web before/after playbook · FAQ · Configuration · Tools by category · Architecture

What's new in v0.15.0 (June 2026)

Round-trip reduction. Testing dragged because every step was its own request → model-reasoning → next-request cycle (total time ≈ round-trips × per-turn cost). v0.15.0 collapses many steps into one call.

  • 🆕 batch (v0.15.0) — run an ordered list of tool calls server-side in ONE round-trip; the model reasons once for a known flow, not once per step. Each step gets the full pipeline (image-cap / trace / truncate); stop_on_error + a per-step trace. → Navigation latency
  • 🆕 Diagnostics-on-failure (v0.15.0) — a failed batch/tap_and_verify step folds a capped screenshot + recent error logs into the result, so diagnosing costs no extra round-trip.
  • 🆕 wait_until (v0.15.0) — block server-side until an element is visible or gone (spinner/dialog disappears) — one call instead of an agent poll-loop.
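
The batch idea above can be sketched as a small server-side loop: run the steps in order, record a per-step trace, and stop at the first failure when stop_on_error is set. This is an illustration only. `run_batch`, `fake_tap`, and the step and trace shapes are hypothetical and are not the server's actual schema.

```python
from typing import Any, Callable

Tool = Callable[..., Any]


def run_batch(tools: dict[str, Tool], steps: list[dict], stop_on_error: bool = True) -> dict:
    """Run steps in order and return a per-step trace."""
    trace = []
    for index, step in enumerate(steps):
        name, args = step["tool"], step.get("args", {})
        try:
            result = tools[name](**args)
            trace.append({"step": index, "tool": name, "ok": True, "result": result})
        except Exception as exc:  # a failed step is recorded, not raised
            trace.append({"step": index, "tool": name, "ok": False, "error": str(exc)})
            if stop_on_error:
                break
    return {"ok": all(t["ok"] for t in trace), "completed": len(trace), "trace": trace}


def fake_tap(text: str) -> str:
    """Stand-in tool: fails when the element is not on screen."""
    if text == "Missing":
        raise LookupError(f"no element with text {text!r}")
    return f"tapped {text}"


tools = {"tap": fake_tap}
steps = [
    {"tool": "tap", "args": {"text": "Login"}},
    {"tool": "tap", "args": {"text": "Missing"}},
    {"tool": "tap", "args": {"text": "Submit"}},
]
report = run_batch(tools, steps)
print(report["ok"], report["completed"])  # False 2
```

The caller sends one request and receives the whole trace, so the model reasons once over the outcome instead of once per step.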

What's new in v0.14.0 (June 2026)

Action-primitive hardening, all from real-device field feedback (iPhone 17 Pro sim / WDA 13, Galaxy S25). The action tools failed on the two screens that mattered; these fixes target each cause.

  • 🆕 iOS tap self-heals a dead WDA session (v0.14.0) — tap used to stay pinned to a dead session id forever (surviving WDA restart, reselect, even a sim reboot). Any recoverable session error now drops the cached session and re-handshakes once. Fixes the /wda/tap/0 unknown command symptom too.
  • 🆕 Deep-tree actions + snapshotMaxDepth (v0.14.0) — tap/swipe/tap_text no longer abort with call depth exceed 10 on deep Flutter/Compose trees.
  • 🆕 tap(bounds=…) + dump_ui artifact spill + zoom_screenshot (v0.14.0) — tap an element by its bounds when it has no stable selector; dump_ui writes the full XML to an artifact when large (no more truncation dead-end); crop+upscale a region to read tiny UI. → Navigation latency
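
The self-healing behaviour described for tap (drop the cached session on a recoverable error, re-handshake once, then give up) follows a common pattern. The sketch below shows that pattern with stand-in callables; `SessionError`, `SelfHealingClient`, and the fake backend are hypothetical and do not reflect the project's WDA client.

```python
class SessionError(Exception):
    """Stand-in for a recoverable 'session is dead' error."""


class SelfHealingClient:
    def __init__(self, open_session, send):
        self._open_session = open_session  # callable returning a new session id
        self._send = send                  # callable(session_id, command) -> result
        self._session_id = None

    def call(self, command: str):
        for attempt in range(2):  # the original try plus exactly one re-handshake
            if self._session_id is None:
                self._session_id = self._open_session()
            try:
                return self._send(self._session_id, command)
            except SessionError:
                self._session_id = None  # never stay pinned to a dead session
                if attempt == 1:
                    raise


# Fake backend: session "s1" is dead, "s2" works.
ids = iter(["s1", "s2"])


def open_session():
    return next(ids)


def send(session_id, command):
    if session_id == "s1":
        raise SessionError("invalid session id")
    return f"{command} ok via {session_id}"


client = SelfHealingClient(open_session, send)
print(client.call("tap"))  # tap ok via s2
```

Retrying exactly once bounds the cost of a genuinely broken device: a second failure surfaces to the caller instead of looping.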

What's new in v0.13.0 (June 2026)

The navigation-latency arc. For device agents, the slow default is screenshot → reason over pixels → compute x,y → tap → screenshot to confirm — ~1–2k image tokens and a vision round-trip per step. v0.13.0 makes the structured (no-vision) path the easy default for any work that isn't a visual check.

  • 🆕 tap by selector (v0.13.0) — tap(resource_id=… | text=… | class_name=…) resolves and taps server-side in one call: no screenshot, no pixel reasoning, no coordinate math. Get selectors from extract_ui_graph; x,y becomes the fallback. A miss fails with capture_diagnostics rather than tapping a guess. → Navigation latency
  • 🆕 Optional UI-hierarchy cache (v0.13.0) — MCP_UI_CACHE_TTL_MS caches the read-only dump on a stable screen so repeated extract_ui_graph/dump_ui don't re-hit the device. Off by default; only observations are cached (taps always resolve live), so it can never cause a wrong tap.
  • 🆕 estimate_tokens (v0.12.0) — a context-budget guard: estimate a string or file, pass budget_tokens, and get a fits/headroom verdict + a recommendation (proceed / proceed_with_caution / flush_context). tiktoken when installed, else a calibrated heuristic. → SLM guide
  • 🆕 Tool-tier scoping on the HTTP adapter (v0.11.0) — GET /tools?tier=basic|intermediate|expert (and MCP_TOOL_TIER) so a 4B-class local LLM gets a reasoning-sized surface and pulls in the long tail on demand via describe_capabilities.
  • 🆕 audit_performance (v0.9.0) — static jank audit: animation anti-patterns (controller-in-build, setState-in-listener, animated Opacity), scroll/virtualization (ListView(children:) vs .builder), rebuild cost. → rubric
  • 🆕 ingest_frame_timeline + ingest_har (v0.10.0) — runtime graders: jank score from a captured frame timeline; per-action network/Firestore cost from a HAR.
  • 🆕 Flutter web (v0.5–0.8) — audit_web_app, run_lighthouse, web debug sessions (start_debug_session(serial="chrome") → dump_widget_tree via DWDS), run_unit_tests(platform="chrome").
  • 🔧 Composition — registers + documents Chrome DevTools MCP (debug/tooling) and Playwright MCP (visual/SLM) as the model-agnostic browser-driving layer; tracks Google's MCP now being built into the SDK (dart mcp-server).
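
The UI-hierarchy cache described above is a time-to-live cache applied only to read-only observations. A minimal sketch follows, assuming MCP_UI_CACHE_TTL_MS is a millisecond value and that 0 means off; `UiDumpCache`, `ttl_from_env`, and `fetch_dump` are hypothetical names, not the server's internals.

```python
import os
import time


def ttl_from_env() -> int:
    """Read the TTL in milliseconds; a missing variable means caching is off."""
    return int(os.environ.get("MCP_UI_CACHE_TTL_MS", "0"))


class UiDumpCache:
    """Cache a read-only UI dump for a short TTL; a TTL of 0 disables caching."""

    def __init__(self, fetch, ttl_ms: int, clock=time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_ms / 1000.0
        self._clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        fresh = self._stamp is not None and (now - self._stamp) < self._ttl_s
        if not fresh:
            self._value = self._fetch()  # re-hit the device
            self._stamp = now
        return self._value

    def invalidate(self):
        """Call after any action (tap, swipe) so the next read is live."""
        self._stamp = None


calls = []


def fetch_dump():
    calls.append(1)
    return f"<hierarchy dump #{len(calls)}>"


fake_now = [0.0]
cache = UiDumpCache(fetch_dump, ttl_ms=500, clock=lambda: fake_now[0])
cache.get()
cache.get()          # served from cache, no device hit
fake_now[0] = 1.0    # one second later the entry has expired
cache.get()
print(len(calls))    # 2
```

Because only the dump is cached and actions bypass it, a stale entry can at worst delay an observation; it cannot redirect a tap.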

Previous milestones: v0.4.0 Maestro composition (audit_maestro_flow + ingest_maestro_report) · v0.3.0 the audit suite + senior-tester loop (design_test_plan + audit_test_quality) · v0.2.x PyPI release, multi-device locking, Patrol, AR/vision. Full history: CHANGELOG.md.


Why it matters

Mobile QA still loses 30–50% of its engineering hours to flaky selector maintenance (Drizz industry survey, 2026). Agents can close that loop — but until now there was no production-grade MCP that gave them safe, structured access to real phones. This is that MCP:

  • Cross-session device locking so 4 concurrent Claude windows don't collide on the same Galaxy S25.
  • Tiered tool surface (BASIC / INTERMEDIATE / EXPERT, 147 tools total) so 4B-class local LLMs aren't overwhelmed and Claude Desktop's tool-count ceiling doesn't drop your server.
  • Defense-in-depth image cap that survived three production "2000 px API limit" incidents — including the case where an overnight bot bypassed take_screenshot and used raw adb screencap.
  • Patrol-first Flutter integration with system=true for OS dialogs, tap_and_verify for the verify-after-action discipline, and YAML test plans the agent can author and re-run.
  • Production-ready out of the gate: CycloneDX SBOM, pip-audit gating, structured JSON logs, Prometheus /metrics, k8s /health + /ready, Docker image, GitHub Action wrapper, 7 ADRs documenting load-bearing decisions.
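
The image cap mentioned above comes down to one size calculation that every outgoing image passes through, whatever tool produced it. The sketch below computes the capped dimensions only; `capped_size` is a hypothetical helper, and the 2000 px default is taken from the "2000 px API limit" wording rather than from the project's code.

```python
def capped_size(width: int, height: int, max_side: int = 2000) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Integer arithmetic avoids float rounding pushing a side over the limit.
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


print(capped_size(1440, 3120))  # (923, 2000)
print(capped_size(800, 600))    # (800, 600)
```

Applying the same function at the response boundary, not only inside take_screenshot, is what makes the cap hold when an agent produces an image by another route such as raw adb screencap.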

What's here

| Path | What |
| --- | --- |
| packages/phone-controll/ | The flagship MCP. 147 tools spanning device control, build/install/launch, Patrol-driven Flutter UI tests, Flutter web (web debug sessions + run_lighthouse + audit_web_app), AR/Vision, declarative YAML test plans, cross-session device locking, the audit suite (seniority/security/performance/i18n/dependencies/a11y/test-quality/web + 9-domain composite), the senior-tester loop (design_test_plan + audit_test_quality), and runtime graders (ingest_frame_timeline / ingest_har / ingest_maestro_report). |
| packages/<future>/ | Future MCPs slot in here using the same shape (see docs/adding_an_mcp.md). |
| examples/templates/ | Shared YAML test-plan templates (smoke, ump-decline, ar-anchor, flutter-test-smoke). |
| examples/agent_loop.py | Reference autonomous Plan→Build→Test→Verify loop using any OpenAI-compat local LLM. |
| skills/ | Symlinks to the Claude Code skills that ship with these MCPs. |
| scripts/ | Fresh-laptop installer, doctor, and ops scripts. |
| docs/ | Architecture, framework-extension recipe, MCP-extension recipe. |

Why a monorepo

  • Atomic cross-MCP refactors — change shared types in one PR.
  • One venv, one CI, one set of pre-commit hooks boots everything.
  • The HTTP adapter's existing sub-router pattern (e.g. /dev-session/*) lets future packages register their own routers without coordinating across repos.
  • Easy to extract later: git filter-repo --subdirectory-filter packages/<name> peels any package back into its own repo.

Getting started (developer machine, macOS)

git clone <this repo> ~/Desktop/flutter-dev-agents
cd ~/Desktop/flutter-dev-agents/packages/phone-controll
uv venv --python 3.11
uv pip install -e ".[dev,ar,http]"
pytest                                    # full unit suite, no toolchain needed

# Register the MCP with Claude Code
claude mcp add phone-controll -- \
  /Users/$(whoami)/Desktop/flutter-dev-agents/packages/phone-controll/.venv/bin/python \
  -m mcp_phone_controll

For a step-by-step "open VS Code → drive a real phone" walkthrough that exercises every Tier A–F tool, see docs/walkthrough-vscode-test.md.

External prerequisites

See packages/phone-controll/README.md for the full list. Briefly:

  • Android: adb (brew install --cask android-platform-tools)
  • iOS: Xcode + CLT, pymobiledevice3 remote tunneld running for developer-tier services
  • Flutter: flutter on PATH; for Patrol: dart pub global activate patrol_cli
  • AR (optional): [ar] extra installs OpenCV
  • HTTP adapter (optional): [http] extra installs FastAPI + uvicorn

Run check_environment from any Claude Code session — it returns a structured doctor report with concrete fix commands for any red items.

Topologies

  • Native macOS for the human factory: real devices via USB, iOS simulators, multiple VS Code windows, multi-Claude concurrent sessions. Each Claude session owns its devices via the MCP's filesystem-coordinated locks.
  • Linux container (planned, deferred): headless Android emulator + Flutter + Patrol + the MCP, for CI runners. See docs/architecture.md.
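
Filesystem-coordinated locks of the kind mentioned above are commonly built on atomic file creation. The following is a minimal sketch of that technique, not the MCP's implementation: `DeviceLock`, the lock-file layout, and the owner strings are hypothetical, and stale-lock recovery is left out.

```python
import os
import tempfile
from pathlib import Path


class DeviceLock:
    """One lock file per device serial; whoever creates the file owns the device."""

    def __init__(self, lock_dir: Path, serial: str, owner: str):
        self.path = lock_dir / f"{serial}.lock"
        self.owner = owner

    def acquire(self) -> bool:
        try:
            # O_CREAT | O_EXCL makes creation atomic: exactly one process succeeds.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as handle:
            handle.write(self.owner)
        return True

    def release(self) -> None:
        # Only the recorded owner may remove the lock file.
        if self.path.exists() and self.path.read_text() == self.owner:
            self.path.unlink()


lock_dir = Path(tempfile.mkdtemp())
first = DeviceLock(lock_dir, "R3CYA05CHXB", owner="claude-1")
second = DeviceLock(lock_dir, "R3CYA05CHXB", owner="claude-2")
print(first.acquire(), second.acquire())  # True False
first.release()
print(second.acquire())                   # True
```

A lock directory needs no daemon, so independent sessions on the same machine can coordinate without knowing about each other.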

Status

  • packages/phone-controll/ v0.15.2: 147 tools, live on PyPI, 1145 hermetic unit tests + real-device tests (gated on MCP_REAL_DEVICE=1). Field-tested across real Flutter projects (docs/v030-field-test.md); the v0.13–v0.15 tools (selector-tap, dump_ui spill, deep-tree dump, batch, diagnostics-on-failure, wait_until, zoom_screenshot) live-verified on the iPhone 17 Pro simulator.
  • First-real-device patch release shipped May 2026 — fixed iOS 17+ --rsd routing, WDA team_id signing, Polish NBSP tap_text, raw-adb screencap recovery loop. See CHANGELOG.md.
  • Multi-window VS Code orchestration + debug sessions + WDA setup + cross-session device locks all in place.

Real-developer multi-project workflow

A typical day on the factory laptop:

Claude #1 in checkaiapp/
  → open_project_in_ide("checkaiapp")     # spawns its own VS Code window
  → select_device(R3CYA05CHXB)            # acquires the lock on the Galaxy
  → start_debug_session(project_path=...)  # `flutter run --machine`, returns vm_service_uri
  → ...edit code, restart_debug_session, read_debug_log, repeat...
  → run_patrol_test (or run_test_plan with dev_iteration.yaml)
  → stop_debug_session, release_device, close_ide_window

Claude #2 in another_app/                  → emulator-5554, its own VS Code, its own debug
Claude #3 in third_app/                    → iPhone simulator UDID, its own VS Code, its own debug

Three independent debug sessions, three IDE windows, three locked devices, no collisions. The HTTP adapter exposes both the unified /tools/* surface and a focused /dev-session/* sub-router for agents that only care about the dev-iteration loop.

See examples/templates/dev_iteration.yaml for a runnable plan template; docs/ios_setup.md for the iPhone prerequisites (Developer Mode, DDI, tunneld, WebDriverAgent).

Contributing

See docs/adding_a_framework.md and docs/adding_an_mcp.md for the extension recipes. Both stay small (a few new files each) thanks to the Clean Architecture boundaries.

Pre-commit hooks

Mirrors CI exactly — install once, never push a red build again:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files   # one-time baseline; CI parity check

Three gates: ruff (lint+autofix), pytest -q (fast suite, no tests/agent), generate_tool_catalogue --check (refuses if docs/tools.md drifts from the live registry). See .pre-commit-config.yaml.

Design

A shippable visual-asset brief pack lives in docs/design/ — six self-contained briefs (logo, social preview, landing page, architecture diagram, demo video, pitch deck) each with concrete specs + a Claude-designer prompt. Total ~12 person-days of design work to ship the full pack; the first 3 briefs (~7 days) cover 80% of the launch surface.

License

Apache License 2.0 — see LICENSE. Inbound contributions follow the same license; no separate CLA.

About

The first MCP server for autonomous Flutter testing on real iPhones and Android devices. 110 tools across Android (uiautomator2+adb), iOS (WebDriverAgent+pymobiledevice3), Flutter (Patrol + flutter run --machine). Works with Claude Desktop, Claude Code, Cursor.
