Ultraswarm is a durable multi-worker coding orchestrator for Codex, Claude Code, Cursor Agent, Grok, and shell usage. One standalone Node runner owns decomposition, worker routing, process supervision, isolated Git worktrees, adaptive review, transactional integration, approvals, recovery, and reporting.
Run-output polish:
- Plain terminal report by default (no markdown `#`/`**`/`_` showing as literal chars); pass `--markdown` to keep GitHub-markdown for pasting into a PR.
- Color: the live stream and report verdict are color-coded (green pass / red fail / yellow retry); auto-off when piped, honors `NO_COLOR`/`--no-color`.
- Run wall-clock in the Summary, and short 8-char run-ids (`merge`/`status`/`logs`/… accept an unambiguous prefix).
- Every table in the run report is now a clean aligned terminal table — the per-task list was the last markdown table; it now matches the per-CLI / PLAN PREVIEW / WORKER ROSTER fixed-width style.
- Per-CLI token table now renders as a clean aligned terminal table (the 3.5.15 version was a markdown table whose pipes didn't line up in a raw CLI): fixed-width columns with a `─` separator and right-aligned numbers, matching the PLAN PREVIEW / WORKER ROSTER style.
Per-CLI token breakdown (see the CHANGELOG):
- "Work offloaded" now breaks usage down by CLI: a table of `landed` (tokens that produced integrated work) vs `spent` (all attempts, incl. rejected retries and competition losers) vs `overhead`, with a reconciling total. You can see at a glance which worker burned tokens and how much went to retries/competition (e.g. `Workers used ≈ 383,578 tokens — ≈ 274,485 landed, ≈ 109,093 on retries + competition`).
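The landed / spent / overhead split can be sketched as a small aggregation. This is an illustration only: the attempt record shape (`cli`, `tokens`, `integrated`) is hypothetical, and it assumes overhead is simply spent minus landed, which is what makes the total reconcile.

```javascript
// Hypothetical attempt records; field names are assumptions, not ultraswarm's schema.
function tokenBreakdown(attempts) {
  const perCli = {};
  for (const { cli, tokens, integrated } of attempts) {
    const row = (perCli[cli] ??= { landed: 0, spent: 0, overhead: 0 });
    row.spent += tokens; // every attempt counts as spent
    if (integrated) row.landed += tokens; // only integrated work counts as landed
    row.overhead = row.spent - row.landed; // assumed: retries + competition losers
  }
  const total = { landed: 0, spent: 0, overhead: 0 };
  for (const row of Object.values(perCli)) {
    total.landed += row.landed;
    total.spent += row.spent;
    total.overhead += row.overhead;
  }
  return { perCli, total };
}

const report = tokenBreakdown([
  { cli: "codex", tokens: 1200, integrated: true },
  { cli: "codex", tokens: 400, integrated: false }, // a rejected retry
  { cli: "opencode", tokens: 900, integrated: true },
]);
console.log(report.total); // { landed: 2100, spent: 2500, overhead: 400 }
```

Because overhead is derived rather than counted separately, `landed + overhead` always equals `spent` for each CLI and for the total.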
Real token-usage capture (see the CHANGELOG):
- codex and opencode now report real usage. Their default invocations use `exec --json` / `run --format json`, and ultraswarm parses the structured JSONL usage events, so the report's "Work offloaded" section shows the actual token count (e.g. `Workers reported ≈ 238,656 tokens`) instead of "not reported". No more scraped guesswork (removed in 3.5.13), and no fabrication when a CLI doesn't report: a custom invocation without the JSON flag honestly shows "not reported".
Honest run-report value section (see the CHANGELOG):
- No more scraped token noise — the old "Tokens saved" number was regex-scraped from worker stdout and matched incidental digits (e.g. "≈ 62 tokens" for a run that used thousands). The free-text scrape is gone; token/cost now come only from a worker's structured usage, else nothing is claimed.
- "Work offloaded" reports what's measured — tasks, worker-attempt count, and total external wall-clock. A token figure shows only when a worker actually reported one; otherwise the report says "Token/cost usage: not reported by these CLIs" rather than inventing a misleading count.
Live-stream readability follow-up to v3.5.11 (see the CHANGELOG):
- No more git chatter in the stream: `git worktree add` / `merge --squash` output is captured instead of inherited, so a big swarm's progress lines aren't buried under "Preparing worktree …".
- Consistent glyphs everywhere: routine-path escalation/rejection/blocked lines now carry the same `↑`/`✗`/`⊘` glyphs as the high-risk competition path, so the whole stream scans uniformly.
Readability + accuracy pass on the two human-facing output surfaces (see the CHANGELOG for detail):
- Accurate run-end report — reports "integrated" (not "merged") while a run awaits merge approval, with a staging line making clear nothing lands on your branch until you approve; the headline counts every task (including post-merge regressions) so the numbers reconcile.
- Honest token offload: the offload headline no longer leads with a misleading `≈ N`/`≈ 0`; it shows the exact figure on full coverage, an explicit floor (x of y tasks reported) on partial, and "not measurable here" when no worker reports usage. Retried-but-integrated tasks are named.
- Visible competition retries: when a high-risk competition winner is rejected by adversarial QA, the live stream now logs the judged winner and `✗ … rejected by QA — retrying` instead of silently jumping to the next attempt.
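The three offload-headline cases (full coverage, partial coverage, no coverage) can be sketched as below. The task shape and the exact wording of each string are assumptions for illustration; only the three-way rule comes from the text above.

```javascript
// `tasks` is a hypothetical list where `tokens` is a number only if the worker reported usage.
function offloadHeadline(tasks) {
  const reported = tasks.filter((t) => typeof t.tokens === "number");
  if (reported.length === 0) return "Token usage: not measurable here";
  const sum = reported.reduce((acc, t) => acc + t.tokens, 0);
  const figure = sum.toLocaleString("en-US");
  if (reported.length === tasks.length) {
    return `Workers reported ≈ ${figure} tokens`; // full coverage: exact figure
  }
  // Partial coverage: the sum is only a floor, so say how many tasks it covers.
  return `Workers reported at least ${figure} tokens (${reported.length} of ${tasks.length} tasks reported)`;
}

console.log(offloadHeadline([{ tokens: 1000 }, { tokens: 500 }]));
console.log(offloadHeadline([{ tokens: 1000 }, {}]));
console.log(offloadHeadline([{}]));
```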
Hardening from a full audit of the orchestrator (each fix shipped as its own patch release; see the CHANGELOG for per-version detail):
- Concurrency: fixed a re-entrant limiter deadlock that could hang an entire run and that froze runs deterministically on ≤3-core hosts (CI) whenever a high-risk task fanned out competition/QA work.
- Security: plan `contract.commands` now reject shell metacharacters (no more `npm test; rm -rf ~` reaching the shell); worker env passthrough narrowed from the whole `XDG_*` namespace to named vars.
- Integration: a no-op squash records a clean skip instead of throwing and blocking the whole wave; a failed per-task commit fails loud instead of reporting `ok`; post-run `cleanup` deletes only the current run's branches.
- Recovery: `resume` judges liveness on a persisted orchestrator identity (pid + boot id), so it can't reap a still-running run or be fooled by PID reuse after a reboot; terminal runs are immutable.
- Brain: Anthropic schema calls extract JSON defensively and fall back to raw text so the validate-and-retry loop works; malformed `--plan-file`/`package.json` fail with a clear `USAGE` error.
- Alias workers in competition: user-defined alias workers can now participate in (and be retried within) high-risk competition; they previously tombstoned as "only N usable worker(s)".
- Functional preflight: `preflight` runs a cached exec smoke test per CLI (write a file in an isolated temp dir) and excludes workers that pass `--version` but can't actually run (dead auth, no-op). Routing keys off the functional verdict. See Prerequisites.
- Human-readable output: `preflight`, plan previews, `status`, and `doctor` render aligned tables by default; add `--json` for the old machine output.
- Live progress + every-agent heartbeat: runs stream per-agent dispatch lines, gate results, and a periodic active/idle heartbeat to stderr so every worker stays visible.
- Tokens-saved summary — the final report estimates the implementation tokens that ran on external CLIs off your Claude context (an honest best-effort floor).
- Repo-local worktrees with deps installed: per-task and integration worktrees default to `<repo>/.ultraswarm/worktrees` and have dependencies installed before gates run (detected from the lockfile: pnpm/npm/yarn), so build/test gates resolve `node_modules` even on pnpm workspaces.
- `agent` worker: the Cursor CLI (`agent -p --force`) as a headless shell worker for isolated worktree execution. See Cursor Agent Worker.
- Cursor agent host skill: install with `scripts/install-cursor-skill.sh` so Cursor sessions can orchestrate via the standalone runner. See Cursor Agent.
- `small-harness` worker: SmallHarness as a built-in worker with MCP integration and multi-backend support. See SmallHarness Worker.
- User-defined harness aliases: register your own CLI entries under a new top-level `aliases` config key. Each alias `extends` a built-in (inheriting its binary, timeout, effort flags, and capabilities), overrides only its specialty / models / invocation, and can cap routing with `maxTier`. Generalizes the previously hardcoded `pi-local`; strictly opt-in. See Harness Aliases.
- `pi` worker: the provider-agnostic `pi` coding CLI (Anthropic Claude spread by default). See Local / Private Models.
- `pi-local` worker: an always-on local/private worker that drives Ollama models through the same `pi` binary for fully offline-capable runs.
- Per-task effort levels: the decomposition brain assigns reasoning `effort` per task, independent of model tier, defaulting to `low`, with effort-first QA escalation. See Effort Levels.
- SQLite state and append-only events under `.ultraswarm/state.sqlite`
- Capability and repository-metric worker routing with explanations
- Supervised worker process groups, timeouts, cancellation, redacted bounded logs
- Executable task contracts and forbidden-path policy
- Integration branches that do not modify the checked-out branch
- Separate plan and merge approvals
- Crash/status/log/export commands and stale-base recovery
- Generated Claude, Codex, Grok, and Cursor agent skills from one provenance-locked contract
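As a rough illustration of the Security item above (plan `contract.commands` rejecting shell metacharacters), the check might look like the sketch below. The metacharacter set and the function name `validateCommands` are assumptions; the rule ultraswarm actually applies may be broader or narrower.

```javascript
// Assumed metacharacter set: separators, pipes, substitution, redirection, grouping, escapes, newlines.
const SHELL_META = /[;&|`$<>(){}\\\n\r]/;

// Returns the commands unchanged, or throws if any command contains a metacharacter.
function validateCommands(commands) {
  const rejected = commands.filter((cmd) => SHELL_META.test(cmd));
  if (rejected.length > 0) {
    throw new Error(`contract.commands contains shell metacharacters: ${JSON.stringify(rejected)}`);
  }
  return commands;
}

validateCommands(["npm test", "npm run lint"]); // passes
try {
  validateCommands(["npm test; rm -rf ~"]); // the example from the list above
} catch (err) {
  console.log(err.message);
}
```

Rejecting outright, rather than trying to escape, keeps the rule easy to reason about: a command either is a plain argument list or it is refused.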
Node 22 or newer is required because ultraswarm uses the built-in node:sqlite
API.
git clone https://github.com/fubak/ultraswarm.git ~/projects/ultraswarm
cd ~/projects/ultraswarm
npm install
bash scripts/install-codex-skill.sh

This creates:
~/.agents/skills/ultraswarm -> ~/projects/ultraswarm/hosts/codex/skills/ultraswarm
Restart Codex and invoke $ultraswarm.
Install the plugin:
/plugin marketplace add fubak/ultraswarm
/plugin install ultraswarm@ultraswarm
Invoke /ultraswarm.
Ultraswarm is published in the official xAI Grok plugin marketplace.
- Grok Build can proactively suggest the skill for complex multi-step coding tasks.
- Install directly from the Grok marketplace / plugin browser (search for "ultraswarm").
- Invocation inside Grok: follow the skill (typically `ultraswarm` or `/ultraswarm`).
The skill delegates to the standalone runner (do not re-implement orchestration inside the host).
For direct/shell or non-Grok use:
node ~/projects/ultraswarm/bin/ultraswarm.mjs run ...
# or the installed bin after `npm install -g` equivalent

See the generated Grok host contract: hosts/grok/skills/ultraswarm/SKILL.md.
Plugin source + details: https://github.com/fubak/ultraswarm (manifests in .grok-plugin/ + .claude-plugin/).
- Bump the version in every manifest so they agree (validate Check 3): `package.json`, `package-lock.json` (run `npm install --package-lock-only`), `.claude-plugin/plugin.json` and `.grok-plugin/plugin.json` (keep byte-identical; `cp` one to the other), and both `version` fields in `.claude-plugin/marketplace.json` (`metadata.version` + `plugins[0].version`).
- Update docs + CHANGELOG (move `[Unreleased]` to the new version + date).
- `npm run validate` and `npm test` must pass.
- Push to main.
- Capture the new commit SHA (`git rev-parse HEAD`).
- In the plugin-marketplace repo, update the `sha` for ultraswarm, re-run `python3 scripts/generate-plugin-index.py`, then validate + open PR.

`scripts/validate.sh` now also validates `.grok-plugin/plugin.json` (parse + version match) and enforces that the two manifests are byte-identical.
This addresses review feedback on packaging validation and sync risk.
bash scripts/install-cursor-skill.sh

This creates:
~/.cursor/skills/ultraswarm -> ~/projects/ultraswarm/hosts/agent/skills/ultraswarm
Restart Cursor and invoke the ultraswarm skill. The host prepares plans and
delegates execution to bin/ultraswarm.mjs; it does not implement feature work
directly.
Install the Cursor CLI separately if you also want agent as a worker:
curl https://cursor.com/install -fsS | bash
agent --version

See the full Grok Build (xAI Plugin Marketplace) section (and the maintenance subsection) above. For direct execution outside Grok:
node ~/projects/ultraswarm/bin/ultraswarm.mjs ...

The generated Grok host contract is at hosts/grok/skills/ultraswarm/SKILL.md.
- A Git repository
- Node 22+
- At least two authenticated worker CLIs from `codex`, `gemini`, `grok`, `agy`, `droid`, `opencode`, `pi`, `pi-local`, `small-harness`, and `agent`
- An authenticated `claude` CLI for the default QA/decomposition brain, or `ANTHROPIC_API_KEY` with `ULTRASWARM_BRAIN=anthropic-api`
Check readiness:
# Functionally verify each CLI (cached smoke test — proves a worker can actually write a file,
# not just that `--version` succeeds). Workers shown UNUSABLE are excluded from routing.
node ~/projects/ultraswarm/bin/ultraswarm.mjs preflight
# Policy, gates, and worker capabilities (add --json for machine-readable output):
node ~/projects/ultraswarm/bin/ultraswarm.mjs doctor
node ~/projects/ultraswarm/bin/ultraswarm.mjs workers

`preflight` is the recommended first step: a CLI can pass `--version` yet fail every real run
(dead auth, no-op output). The smoke test catches that and routing skips non-functional workers
automatically. Verdicts are cached in .ultraswarm/functional-probe.json (24h TTL, keyed by
binary version); preflight --smoke forces a re-probe.
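The caching rule described above (24h TTL, keyed by binary version, `--smoke` forcing a re-probe) can be sketched as a freshness check. The cache entry shape and the name `needsProbe` are hypothetical; the real layout of `.ultraswarm/functional-probe.json` is not documented here.

```javascript
const PROBE_TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL, per the text above

// `entry` is a hypothetical cache record: { version, checkedAt (ms since epoch), usable }.
function needsProbe(entry, currentVersion, { now = Date.now(), force = false } = {}) {
  if (force) return true; // e.g. `preflight --smoke`
  if (!entry) return true; // never probed
  if (entry.version !== currentVersion) return true; // binary changed, old verdict is stale
  return now - entry.checkedAt >= PROBE_TTL_MS; // verdict expired
}

const cached = { version: "1.2.3", checkedAt: Date.now() - 60_000, usable: true };
console.log(needsProbe(cached, "1.2.3")); // false: fresh and same version
console.log(needsProbe(cached, "1.3.0")); // true: version changed
```

Keying on the binary version means an upgrade (which may fix or break auth) invalidates the verdict without waiting for the TTL.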
Create a plan:
{
"tasks": [
{
"id": "api-tests",
"description": "Add regression coverage for the API",
"files": ["test/api.test.mjs"],
"complexity_score": 25,
"risk": "routine",
"effort": "low",
"dependencies": [],
"prompt": "Add focused regression tests for invalid request handling.",
"contract": {
"commands": ["npm test"],
"assertions": ["Invalid requests return 400"],
"allowed_paths": ["test"]
}
}
]
}

`cli`, `model_tier`, and `effort` are optional. When `cli`/`model_tier` are omitted,
ultraswarm ranks healthy workers using capability fit and repository-local pass, latency,
and cost history. When effort is omitted it defaults to low (see Effort Levels).
Preview without executing:
node ~/projects/ultraswarm/bin/ultraswarm.mjs run \
--plan-file .ultraswarm-plan.json

Approve the plan and execute:
node ~/projects/ultraswarm/bin/ultraswarm.mjs run \
--plan-file .ultraswarm-plan.json \
--approve-plan

While a run executes, it streams colour-coded progress to stderr: wave headers, a per-agent dispatch
line for every worker the moment it starts (▶ task → cli@tier attempt N [pid …]), gate results
(✓/✗), review verdicts (✔ approved; ✗ … rejected by QA — retrying), escalations (↑), and a
periodic active/idle heartbeat (⏱ active: … · idle: …) so every worker's state stays visible. Colour
auto-disables when output is not a TTY (piped/CI), honours the NO_COLOR convention, and is suppressed
by --no-color.
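The colour rules in the paragraph above reduce to a short precedence check. This is a sketch under the usual reading of the NO_COLOR convention (any non-empty value disables colour); the function name and argument shape are illustrative, not ultraswarm's internals.

```javascript
// Decide whether to emit ANSI colour for a given stream.
function shouldUseColor({ isTTY, env = {}, argv = [] }) {
  if (argv.includes("--no-color")) return false; // explicit flag wins
  if (env.NO_COLOR !== undefined && env.NO_COLOR !== "") return false; // NO_COLOR convention
  return Boolean(isTTY); // piped or redirected output gets no colour
}

// The live stream goes to stderr, so that stream's TTY status is the one checked here.
console.log(
  shouldUseColor({ isTTY: process.stderr.isTTY, env: process.env, argv: process.argv.slice(2) })
);
```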
When it finishes it prints a run report (plain terminal text by default; pass --markdown to emit
GitHub-markdown for pasting into a PR/issue): a per-task table showing which worker landed each task, a
Summary with the run wall-clock, and a Work offloaded section — how many tasks/worker-attempts
ran on external CLIs, their total compute time, and a per-CLI token breakdown (landed vs spent vs
retry/competition overhead) read from each CLI's structured usage. The headline value: the
implementation ran on external CLIs, off your Claude context; Claude only orchestrated and reviewed
(see token reporting).
Per-task and integration worktrees are created under <repo>/.ultraswarm/worktrees (gitignored).
Because a fresh worktree checks out tracked files only (no node_modules), the runner installs
dependencies in each worktree before gates run — inferred from the lockfile (pnpm-lock.yaml →
pnpm install --frozen-lockfile, package-lock.json → npm ci, yarn.lock → yarn install --immutable); repos without a lockfile are left untouched. This is what makes gates resolve
node_modules on pnpm workspaces, where upward module resolution from a sibling worktree does not
reach the symlinked deps. Override the worktree location with --worktree-root <dir>.
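The lockfile-to-install mapping above can be written as a small lookup. The precedence when several lockfiles coexist is an assumption (the order in which the text lists them), and `installCommandFor` is a hypothetical name.

```javascript
// Lockfile -> install command, as listed in the text above; order here is an assumed precedence.
const INSTALLERS = [
  ["pnpm-lock.yaml", "pnpm install --frozen-lockfile"],
  ["package-lock.json", "npm ci"],
  ["yarn.lock", "yarn install --immutable"],
];

// `rootFiles` is the list of file names at the worktree root.
function installCommandFor(rootFiles) {
  const present = new Set(rootFiles);
  const match = INSTALLERS.find(([lockfile]) => present.has(lockfile));
  return match ? match[1] : null; // null: no lockfile, leave the worktree untouched
}

console.log(installCommandFor(["package.json", "pnpm-lock.yaml"])); // pnpm install --frozen-lockfile
console.log(installCommandFor(["package.json"])); // null
```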
Gates are auto-detected from your package.json scripts (build, test, lint). Override which
scripts gate with --gates <names> (e.g. --gates test,lint to drop a worktree-unsafe build) or a
"gates" array in ultraswarm.config.json; an empty list disables gates. The same selection applies
to the integration gate at merge time, so run and merge stay consistent.
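Gate selection as described above might be sketched like this. It assumes that an override naming a script missing from `package.json` is silently dropped, which the text does not specify; `selectGates` is a hypothetical name.

```javascript
const DEFAULT_GATES = ["build", "test", "lint"];

// `override` stands for the list from `--gates` or the config "gates" array; undefined means auto-detect.
function selectGates(packageJson, override) {
  const scripts = packageJson.scripts ?? {};
  const wanted = Array.isArray(override) ? override : DEFAULT_GATES;
  // Assumption: only scripts that actually exist can gate; an empty override yields no gates.
  return wanted.filter((name) => Object.hasOwn(scripts, name));
}

const pkg = { scripts: { build: "tsc", test: "node --test", lint: "eslint ." } };
console.log(selectGates(pkg)); // [ 'build', 'test', 'lint' ]
console.log(selectGates(pkg, ["test", "lint"])); // [ 'test', 'lint' ]
console.log(selectGates(pkg, [])); // []
```

Using the same function for both the per-task gates and the integration gate is one simple way to keep `run` and `merge` consistent.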
The run finishes in awaiting_merge. Your checked-out branch has not changed.
After reviewing status and logs, provide the separate merge approval:
node ~/projects/ultraswarm/bin/ultraswarm.mjs status <run-id>
node ~/projects/ultraswarm/bin/ultraswarm.mjs logs <run-id>
node ~/projects/ultraswarm/bin/ultraswarm.mjs merge <run-id> --approve

The final merge is fast-forward only. If the target branch moved, the run enters
stale_base; recover it with:
node ~/projects/ultraswarm/bin/ultraswarm.mjs resume <run-id>

| Command | Purpose |
|---|---|
| `preflight` | Functionally verify enabled CLIs (cached smoke test); `--smoke` forces a re-probe |
| `run` | Preview or execute a plan |
| `merge <id> --approve` | Approve and fast-forward integrated work |
| `status [id]` | List runs or inspect durable state |
| `logs <id>` | Read append-only events |
| `cancel <id>` | Terminate worker process trees |
| `resume <id>` | Recover awaiting-merge or stale-base state |
| `doctor` | Validate policy, gates, and worker health |
| `workers` | Show worker health and capabilities |
| `explain-routing <task>` | Explain worker rankings |
| `export <id>` | Export run provenance as JSON |

`preflight`, `run` (plan preview), `status`, `doctor`, and `workers` print human-readable
tables by default; add --json for machine-readable output. By default run functionally
verifies the pool (cached smoke test) before assigning; pass --smoke to force a fresh probe or
--no-smoke to fall back to a --version-only check.
The run report renders as plain terminal text by default; pass --markdown to emit GitHub-markdown
(for pasting into a PR or issue). Colour is enabled for interactive terminals and disabled when output
is piped/redirected; turn it off explicitly with --no-color (or NO_COLOR=1). Any command that takes
a run id also accepts an unambiguous prefix — e.g. the 8-character id printed in the report's
Approve merge with: line — in place of the full id (merge/status/logs/cancel/resume/export).
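Accepting an unambiguous prefix in place of a full run id can be sketched as follows. The function name and error messages are illustrative; the only behaviour taken from the text is that a prefix must match exactly one run.

```javascript
// Resolve a full run id from a (possibly shortened) prefix.
function resolveRunId(prefix, runIds) {
  if (runIds.includes(prefix)) return prefix; // an exact id always wins
  const matches = runIds.filter((id) => id.startsWith(prefix));
  if (matches.length === 1) return matches[0];
  if (matches.length === 0) throw new Error(`no run matches "${prefix}"`);
  throw new Error(`ambiguous run id "${prefix}": ${matches.length} matches`);
}

// Made-up ids, for illustration only.
const ids = ["3f9a1c2e7b604d11", "3f9b00aa12cd34ef", "a17c55e09d2b4f60"];
console.log(resolveRunId("a17c55e0", ids)); // a17c55e09d2b4f60
```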
Legacy --plan-file ... --yes syntax remains as a v2 compatibility shim.
--yes maps only to plan approval; it never approves the final merge.
Exit codes are 0 success, 1 runtime failure, 2 usage error, 3 approval
required, and 4 blocked or stale state.
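A wrapper script can branch on these exit codes. The table below restates the codes from the text; the helper names are hypothetical.

```javascript
// Exit codes as documented above.
const EXIT_MEANINGS = {
  0: "success",
  1: "runtime failure",
  2: "usage error",
  3: "approval required",
  4: "blocked or stale state",
};

function describeExit(code) {
  return EXIT_MEANINGS[code] ?? `unknown exit code ${code}`;
}

// A wrapper might treat 3 as "pause for a human" rather than as a failure.
function needsApproval(code) {
  return code === 3;
}

console.log(describeExit(3)); // approval required
```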
Add policy to ultraswarm.config.json:
{
"enabled": ["codex", "gemini"],
"workerEnvAllowlist": ["OPENAI_API_KEY"],
"policy": {
"minimumHealthyWorkers": 2,
"maxParallelWorkers": 4,
"requireCompetitionForRisk": ["high"],
"approvals": {
"beforeExecution": true,
"beforeMerge": true
},
"forbiddenPaths": [".env", ".env.*", "infra/prod/**"],
"maxCostUsd": 10,
"isolation": "native",
"containerImage": null,
"network": "allow"
}
}

Project configuration overrides the global `~/.claude/ultraswarm.config.json`. For container isolation, set containerImage to an image containing the selected worker CLIs. Network denial requires container
isolation and is rejected when configured with native isolation.
Beyond the built-in CLIs, you can register your own named entries under aliases. An alias
extends a built-in (inheriting its binary, timeout, effort flags, and capabilities) and
overrides only what differs — its specialty, its model tiers, and its invocation. This is how
you run several local models, each tuned for a job, through one CLI binary:
{
"enabled": ["codex", "pi-qwen-coder"],
"aliases": {
"pi-qwen-coder": {
"extends": "pi",
"specialty": "local coding, small refactors, unit tests",
"maxTier": "moderate",
"models": {
"simple": { "model": "qwen3-coder:7b", "invocation": "pi -p --provider ollama --model qwen3-coder:7b --config ~/.pi/lean.json \"$(cat .ultraswarm-prompt.txt)\"" }
}
}
}
}

- Lean harness: put whatever makes a CLI's harness leaner directly in the `invocation` (a `--config` pointing at a stripped-down profile, fewer flags, etc.). Local models often do better with less wrapping.
- `maxTier`: caps the tiers an alias will accept. A task above the cap is clamped down (e.g. an expert task on a `maxTier: moderate` alias runs at moderate), so a small local model is never handed work it can't do.
- Opt-in only: nothing is auto-generated. An alias exists only if you declare it, and is active only when it appears in `enabled` (or when `enabled` is omitted entirely).
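The `maxTier` clamp can be illustrated with a few lines. The tier order (simple, moderate, complex, expert) follows the tier names used elsewhere in this document; `clampTier` is a hypothetical helper.

```javascript
const TIERS = ["simple", "moderate", "complex", "expert"]; // lowest to highest

// Clamp a task's tier down to an alias's cap; with no cap the tier is unchanged.
function clampTier(taskTier, maxTier) {
  if (maxTier === undefined) return taskTier;
  const task = TIERS.indexOf(taskTier);
  const cap = TIERS.indexOf(maxTier);
  if (task === -1 || cap === -1) throw new Error("unknown tier");
  return TIERS[Math.min(task, cap)];
}

console.log(clampTier("expert", "moderate")); // moderate
console.log(clampTier("simple", "moderate")); // simple
```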
SmallHarness is a terminal-first coding agent written in Rust that supports multiple AI backends (OpenAI, OpenRouter, Ollama, LM Studio, MLX, llama.cpp). As an ultraswarm worker it brings:
- Multi-backend routing: switch between cloud and local models per-task via overrides
- MCP integration: native Model Context Protocol support for extended tool sets
- Cost tracking: real-time per-turn and session cost accounting
SmallHarness must be installed separately:
cargo install small-harness

Add `small-harness` to `enabled` to activate it. The built-in defaults use the OpenAI backend for simple tasks and OpenRouter (Claude) for moderate/complex/expert. Backend and model are passed via environment variables: SmallHarness reads `BACKEND` and `AGENT_MODEL` from the environment, not CLI flags.
To route simple tasks through a local Ollama model instead, override in ultraswarm.config.json:
{
"enabled": ["codex", "small-harness"],
"overrides": {
"small-harness": {
"models": {
"simple": {
"model": "qwen3-coder:7b",
"invocation": "BACKEND=ollama AGENT_MODEL=qwen3-coder:7b small-harness --allow-tools --print \"$(cat .ultraswarm-prompt.txt)\""
}
}
}
}
}

Tool approval: ultraswarm always passes `--allow-tools` so SmallHarness auto-approves tool calls in one-shot mode. Do not omit this flag in custom invocations or the worker will silently deny every tool call and produce no file changes.

API keys: SmallHarness inherits only the variables in `workerEnvAllowlist`. The built-in defaults need `OPENAI_API_KEY` (simple tier) and `OPENROUTER_API_KEY` (moderate/complex/expert). Add both to your config:
{ "workerEnvAllowlist": ["OPENAI_API_KEY", "OPENROUTER_API_KEY"] }
The Cursor CLI (agent) runs headless tasks via agent -p --force in isolated worktrees.
Ultraswarm uses the same ShellWorkerAdapter as every other worker — no custom interface.
Install the CLI:
curl https://cursor.com/install -fsS | bash
agent --version

Add `agent` to `enabled` to activate it. Built-in tier mapping: simple →
composer-2.5-fast; moderate → gpt-5.4; complex/expert → Claude Sonnet 4.6 /
Opus 4.8. Override models in ultraswarm.config.json via the standard overrides key.
File writes: ultraswarm always passes `--force` so the agent applies edits in one-shot mode. Without `--force`, the CLI only proposes changes and the task fails with `no_changes`.

API key: headless runs need `CURSOR_API_KEY`. Add it to `workerEnvAllowlist`:
{ "workerEnvAllowlist": ["CURSOR_API_KEY"] }
When Cursor is both host and worker, keep at least one other worker enabled so high-risk tasks can satisfy competition policy.
pi and pi-local are both backed by the pi
CLI. pi runs a provider-agnostic Anthropic Claude spread; pi-local is an always-on
worker that routes through Ollama for fully local, private, offline-capable runs.
Ollama is a model backend, not an agentic worker — it cannot edit files or run commands on
its own. pi-local is the harness that drives local models with tool-calling inside an
isolated worktree.
To use pi-local:
- Install and run Ollama.
- Pull the models you want, e.g. `ollama pull qwen3-coder:7b` and `ollama pull qwen3-coder:30b`.
- Register an `ollama` provider and those models in `~/.pi/agent/models.json` (Pi reads provider entries with `baseUrl: http://localhost:11434/v1`, `api: openai-completions`).
- Override the default model IDs in `ultraswarm.config.json` to match the models you pulled (see `ultraswarm.config.advanced.json`).
doctor and workers probe the pi binary, so a green pi-local means "pi is
installed" — not "Ollama is running." If Ollama is down, pi-local tasks fail at execution
time and are reported and retried like any other worker failure.
Local-model requirement:
`pi-local` only works with a local model that emits structured tool-calls through Pi's provider endpoint. Many small local models (and the OpenAI-completions compatibility path) will describe an edit as plain text instead of calling the `write`/`edit` tool; Pi then has nothing to execute and no file is produced, so the task fails its contract. Choose a local model with reliable tool-calling, and treat the default `qwen3-coder` IDs as examples to override. Frontier-hosted providers (the `pi` worker) do not have this limitation.
Reasoning effort is a per-task dial, independent of model tier. The decomposition brain
assigns effort (off/low/medium/high/xhigh) to each task and defaults to low —
most routine tasks produce the same result at low effort, far faster and cheaper. High effort is
reserved for genuinely hard reasoning.
Effort is injected per CLI for the workers that expose the dial (codex, droid, pi); other
workers ignore it. On QA failure, ultraswarm escalates effort first (low → medium → high)
before spending more — the cheapest correction rung first. Routine tasks climb effort within
their model tier; high-risk and complex tasks use the full ladder, stepping up the model tier
only after effort tops out.
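The effort-first escalation ladder might be sketched as below. Several details are assumptions not stated above: the sketch covers only the three rungs named in the text (low, medium, high), effort stays at `high` when the model tier steps up, and `nextRung` and `fullLadder` are hypothetical names.

```javascript
const EFFORTS = ["low", "medium", "high"]; // escalation rungs named in the text
const TIERS = ["simple", "moderate", "complex", "expert"];

// Returns the next { effort, tier } to try after a QA failure, or null when the ladder is exhausted.
// `fullLadder` is true for high-risk and complex tasks, false for routine ones.
function nextRung({ effort, tier }, { fullLadder }) {
  const e = EFFORTS.indexOf(effort);
  const t = TIERS.indexOf(tier);
  if (e === -1 || t === -1) throw new Error("unknown effort or tier");
  if (e < EFFORTS.length - 1) return { effort: EFFORTS[e + 1], tier }; // cheapest rung first
  if (fullLadder && t < TIERS.length - 1) {
    return { effort, tier: TIERS[t + 1] }; // effort topped out: step up the model tier
  }
  return null; // routine tasks stay within their tier
}

console.log(nextRung({ effort: "low", tier: "simple" }, { fullLadder: false }));
console.log(nextRung({ effort: "high", tier: "moderate" }, { fullLadder: true }));
```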
Set effort explicitly on a task in your plan JSON to override, or override effortFlags per CLI
in ultraswarm.config.json (see ultraswarm.config.advanced.json).
Behavior note: because effort defaults to `low`, an expert-tier task runs the expert model at low effort and escalates on failure; it is no longer pinned to high effort. Pin it with `effort: "high"` if you need maximum reasoning up front.
- Worker attempts run in separate worktrees and process groups.
- Accepted task commits are squash-integrated into `ultraswarm/run-<run-id>`, not the checked-out branch.
- Worker environments use an allowlist rather than inheriting secrets.
- Logs redact common credential assignments and rotate at the output limit.
- Task contracts run commands and reject changes outside `allowed_paths`.
- `.ultraswarm/` is ignored by Git and contains SQLite state, worker logs, the functional-probe cache, and per-task worktrees.
- Token usage is read only from a worker's structured output (codex `exec --json`, opencode `run --format json`), never a text scrape. A worker invoked without its JSON flag, or one with no usage parser yet (gemini/grok/agy/droid/pi), runs fine and the report honestly says "Token/cost usage: not reported", never a fabricated number. The figures are usage estimates, not a billing source of truth.
- v2 JSONL journals remain readable files but cannot be resumed as v3 runs.
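The `allowed_paths` rule from the list above can be illustrated with a prefix check. This sketch assumes entries are plain repository-relative directory or file prefixes (as in the plan example's `"test"`), not glob patterns; the function names are hypothetical.

```javascript
// True if `file` (a repo-relative POSIX path) sits at or under one of the allowed prefixes.
function isAllowed(file, allowedPaths) {
  const parts = file.split("/");
  if (file.startsWith("/") || parts.includes("..")) return false; // no absolute or escaping paths
  return allowedPaths.some((dir) => {
    const root = dir.replace(/\/+$/, ""); // tolerate a trailing slash
    return file === root || file.startsWith(root + "/");
  });
}

// Files a worker changed that fall outside the contract.
function outOfBounds(changedFiles, allowedPaths) {
  return changedFiles.filter((f) => !isAllowed(f, allowedPaths));
}

console.log(outOfBounds(["test/api.test.mjs", "src/index.mjs"], ["test"])); // [ 'src/index.mjs' ]
```

Matching on `root + "/"` rather than a bare string prefix keeps a sibling such as `tests/` from slipping through an allowance for `test`.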
npm test
bash scripts/validate.sh
node scripts/generate-host-skills.mjs --check

Edit hosts/host-contract.json or scripts/generate-host-skills.mjs, then run
node scripts/generate-host-skills.mjs. Do not hand-edit generated host skills.
Host install scripts:
- Codex: `bash scripts/install-codex-skill.sh`
- Cursor: `bash scripts/install-cursor-skill.sh`
A pre-commit hook (in .githooks/, auto-enabled by npm install via the prepare script)
blocks commits that introduce host-skill drift — the generated SKILL.md files must stay in
sync with hosts/host-contract.json. Enable it manually with
git config core.hooksPath .githooks. CI (.github/workflows/validate.yml) runs
validate.sh and the full test suite on every PR, and main requires a passing CI run
through a pull request before merge.
MIT