A lightweight macOS desktop automation toolkit — control the real mouse, keyboard,
and screen from the shell. It is packaged as a Cursor Agent Skill
(SKILL.md), but the scripts are plain Bash and work standalone in any terminal.
No MCP server, no Node, no Python. Just system tools (screencapture, osascript, sips)
plus cliclick for precise mouse/keyboard events.
⚠️ Safety warning. These scripts move your real cursor, press real keys, and capture your screen. Run only code you understand. Anything driving the GUI can click the wrong thing if a window isn't focused — review each action. Avoid destructive operations and never let it type secrets.
- 📸 Screenshots (downscaled JPEG, cheap to read) and high-detail region "zoom"
- 🖱️ Mouse: move, single / double / right click at logical coordinates
- ⌨️ Keyboard: type text and press shortcuts (`cmd c`, `return`, arrows, …)
- 🪟 App control: activate apps, read frontmost app / window / mouse / screen size
It is designed to be driven by an AI agent in an observe → act → verify loop, but every script is usable by hand.
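One pass of that loop might look like the sketch below. It only uses the script names and argument shapes listed in this README; the `run` wrapper, the `click_verified` helper, and the `SCRIPTS` and `DRY_RUN` variables are hypothetical names introduced for illustration. A real agent would stop to read the zoom capture before clicking, whereas this sketch simply runs the steps in order.

```bash
# Location of the toolkit scripts (assumed to be ./scripts, as in the usage example).
SCRIPTS="${SCRIPTS:-scripts}"

# Print the command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf 'would run: %s\n' "$*"
  else
    "$@"
  fi
}

# click_verified X Y: observe, hover and inspect the target, click, observe again.
click_verified() {
  local x="$1" y="$2"
  run "$SCRIPTS/context.sh"           # observe: which app / window is frontmost?
  run "$SCRIPTS/move.sh" "$x" "$y"    # hover without clicking
  run "$SCRIPTS/zoom.sh" "$x" "$y"    # inspect: is the cursor on the target?
  run "$SCRIPTS/click.sh" "$x" "$y"   # act
  run "$SCRIPTS/screenshot.sh"        # verify: did the UI change as expected?
}
```

Calling `DRY_RUN=1 click_verified 735 480` prints the five commands without touching the mouse, which is a cheap way to review a planned action before running it for real.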
- macOS (Apple Silicon or Intel)
- cliclick: `brew install cliclick`
- Grant the controlling app (e.g. Cursor, Terminal, iTerm) these in
  System Settings → Privacy & Security:
  - Screen Recording (for screenshots)
  - Accessibility (for clicks / keystrokes)
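A quick preflight can catch a missing tool before anything tries to drive the GUI. The sketch below checks only for the tools named in this README; `missing_tools` and `preflight` are hypothetical helper names, and the sketch does not attempt to verify the Screen Recording or Accessibility permissions, which still have to be granted by hand.

```bash
# missing_tools TOOL...: print each named tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# preflight: succeed only on macOS with every required tool available.
preflight() {
  local missing
  if [ "$(uname -s)" != "Darwin" ]; then
    echo "macOS is required" >&2
    return 1
  fi
  missing="$(missing_tools screencapture osascript sips cliclick)"
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "missing tools:" $missing >&2
    return 1
  fi
}
```

Running `preflight && scripts/context.sh` would then skip the first real command when the environment is incomplete.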
As a Cursor skill:

```bash
git clone https://github.com/ZcwDev/macos-control.git
cp -R macos-control ~/.cursor/skills/macos-control   # SKILL.md is auto-discovered
```

Standalone:

```bash
git clone https://github.com/ZcwDev/macos-control.git
cd macos-control && chmod +x scripts/*.sh
```

| Script | Purpose |
|---|---|
| `context.sh` | Frontmost app, window title, mouse point, logical screen size |
| `screenshot.sh [path]` | Downscaled JPEG of the screen; prints path + `logical_size` |
| `zoom.sh X Y [W H]` | High-detail capture around a point (cursor shown) to verify a target |
| `move.sh X Y` | Move mouse, no click |
| `click.sh X Y [single\|double\|right]` | Click at logical points |
| `type.sh "text"` | Type literal ASCII (use the clipboard for non-ASCII) |
| `key.sh [cmd ctrl alt shift] KEY` | Keys / shortcuts (`return`, `cmd c`, `cmd shift 4`, …) |
| `app.sh "Name"` | Activate / focus an app |

Coordinates are logical points (what cliclick uses). screenshot.sh prints
logical_size=WxH. Convert a target by fraction of the image, not raw pixels (the image
you view is downscaled to a variable size):

```text
click_x = (target_x_in_image / image_width)  * logical_width
click_y = (target_y_in_image / image_height) * logical_height
```

For small/adjacent targets, refine with the move.sh → zoom.sh → click.sh loop.
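The fraction-based conversion can be sketched in plain shell arithmetic. `to_logical` is a hypothetical helper, not part of the toolkit, and the `logical_size=1470x956` string is a made-up sample standing in for whatever `screenshot.sh` prints on your machine. The image dimensions are also assumed; on macOS they could be read with `sips -g pixelWidth -g pixelHeight <file>`.

```bash
# to_logical TX TY IW IH LW LH
# Scale point (TX,TY) in an IW x IH image onto an LW x LH logical screen,
# rounding to the nearest point. Integer inputs only.
to_logical() {
  local tx="$1" ty="$2" iw="$3" ih="$4" lw="$5" lh="$6"
  echo "$(( (tx * lw + iw / 2) / iw )) $(( (ty * lh + ih / 2) / ih ))"
}

# Split a "logical_size=WxH" token into width and height.
size="logical_size=1470x956"   # sample value (assumption)
dims="${size#logical_size=}"
lw="${dims%x*}"
lh="${dims#*x}"

# A target seen at (640, 360) in a 1280x720 downscaled image.
to_logical 640 360 1280 720 "$lw" "$lh"   # prints "735 478"
```

Multiplying before dividing keeps everything in integer arithmetic, so the result does not depend on floating-point formatting.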

```bash
scripts/app.sh Safari
scripts/screenshot.sh        # read it, estimate a target's fraction
scripts/click.sh 735 480
scripts/type.sh "hello"
scripts/key.sh return
```

- Works best with native macOS apps. Electron-based apps (some chat and editor apps) don't expose accessibility data, so locate targets visually with screenshots + the zoom loop.
- Typing goes through the active input source. Switch to a plain Latin/ABC input source before typing ASCII, or use the clipboard for other text.
- On multi-monitor setups it captures and controls the main display.
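
The clipboard route mentioned above for non-ASCII text could look like the following sketch. It assumes the standard macOS `pbcopy` command and the `key.sh cmd v` shortcut form shown in the script table; `paste_text` and the `PBCOPY` / `KEY_SH` override variables are hypothetical names added so the function can be tried without touching the real clipboard.

```bash
# paste_text TEXT: put TEXT on the clipboard, then paste it with cmd+v.
# PBCOPY and KEY_SH are optional overrides (assumptions for this example).
paste_text() {
  printf '%s' "$1" | "${PBCOPY:-pbcopy}"   # copy without a trailing newline
  "${KEY_SH:-scripts/key.sh}" cmd v        # paste into the focused field
}
```

For example, `paste_text "你好"` would paste the string into whichever app is frontmost. Note that this overwrites the current clipboard contents, so the safety warning about secrets applies to it as well.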