Goal: React quickly to failing CI on main or active branches — diagnose, propose minimal fixes, and escalate when the loop cannot confidently resolve.
Recommended:
/loop 15mduring active development (Grok, Claude Code)/loop 5mwhen main is red and you're shipping- GitHub Action on
workflow_runfailure (event-driven, not polling)
Slower overnight cadence (30–60m) is fine when no one is watching.
ci-triage— Parse CI logs, identify failing job/step, classify failure type (flake, regression, env, config)minimal-fix— Smallest change that addresses the specific failure- Project test/lint skill — Build and test commands for your stack
ci-sweeper-state.md or section in STATE.md:
## CI Sweeper — Active Failures
Last run: 2026-06-09 14:30 UTC
### main @ abc1234
- Job: test-auth
- Failure: AssertionError in test_refresh_token_expiry
- Attempts: 1/3
- Last action: Minimal fix proposed in worktree fix/ci-auth-refresh
- Status: Waiting for verifier + human
### Resolved (last 7d)
- main @ def5678 — lint fix merged via PR #1250Track: commit SHA, failing job, attempt count, worktree/PR link, outcome.
- Discover CI failures on watched branches (main, release/*, active PRs).
- For each new failure:
- Classify: flake vs real regression vs infra.
- If flake (seen before, intermittent): add to Watch, do not auto-fix.
- If actionable: open worktree → implementer drafts fix.
- Verifier sub-agent checks: fix addresses failure, no unrelated changes, tests pass locally.
- Open PR or comment on existing PR with proposed fix.
- If attempts exceed max (e.g. 3): escalate to human with full context.
- Prune resolved failures from active list.
- Verifier must run tests in the worktree before approving.
- Implementer cannot merge — only propose.
- Flake detection: if same test failed and passed on retry without code change, do not auto-fix.
- Infrastructure failures (runner OOM, registry down, secrets missing)
- Failures touching >5 files or core architecture
- Security-sensitive test failures
- Max attempts exceeded on same failure
- Intermittent flakes that need quarantine, not code changes
Grok Build TUI:
/loop 15m Check CI on main and open PRs. For new failures: classify, and if actionable draft minimal fix in worktree with verifier. Update ci-sweeper-state.md. Escalate after 3 attempts.Claude Code:
/goal All tests on main pass and lint is cleanUse /goal for focused "get green" sessions; use scheduled sweeper for ongoing monitoring.
Codex:
- Automation: "Summarise CI failures every 30m" → Triage inbox; separate automation for fix attempts.
GitHub Actions:
- See
examples/github-actions/ci-sweeper.yml— triggers on workflow failure.
| Failure | Mitigation |
|---|---|
| Fix-the-symptom loops | Verifier checks root cause, not just green CI |
| Fighting flakes with retries | Classify flakes; quarantine or skip with ticket |
| Token burn on red main | Pause loop after N failures; batch fixes |
| Wrong branch targeted | Explicit branch allowlist in skill |
| Scenario | Tokens/run | Notes |
|---|---|---|
| No-op (CI green) | ~5k | Required — do not run full sweeper when green |
| Triage / classify | ~50k | Log parse + failure classification |
| Fix attempt (L2) | ~200k | Worktree + implementer + verifier |
Cadence: 5m–15m · Tier: very-high · Suggested daily cap: 1M tokens · Early exit required
npx @cobusgreyling/loop-cost --pattern ci-sweeper --cadence 15m --level L2At 15m cadence without early-exit, worst-case spend exceeds 5M tokens/day. Never run full action paths on every tick.
- Mean time to first proposed fix after CI goes red
- % of failures resolved without human intervention (trivial cases only)
- Repeat failure rate (same job failing again within 48h)
Best entry loop for teams new to loop engineering — high frequency, bounded scope, clear verification.