Summary
Add a self-check capability in which oompa periodically inspects its own structured logs for recurring errors, classifies them, searches for open issues that already cover them, and files a new issue when it finds a novel problem. Issues are created without the `good-for-ai` label -- they require human triage before oompa attempts to fix them.
Motivation
Several bugs have gone undetected for days because oompa silently retries failing operations without escalating.
Oompa already logs these errors via slog, but nobody reads the logs proactively. Oompa should watch its own logs and raise issues when it detects persistent problems.
Proposed Design
New role: `self-check`
A new role that runs on a schedule (e.g., every 30 minutes or hourly) and:
- Scans recent logs for error/warn-level entries using `journalctl --user -u oompa.service --since "1h ago"` or the in-memory event ring buffer
- Groups errors by pattern -- same error message on the same project/PR repeating N times is one problem, not N problems
- Classifies severity:
  - Critical: same error repeating every poll cycle (retry loop) -- e.g., git corruption, cursor not advancing
  - Warning: intermittent errors (happen sometimes but not every cycle)
  - Info: one-off errors that self-resolved
- Checks for duplicates -- before creating an issue, search existing open issues on `qinqon/oompa` for the same error pattern. Use the error message signature (first line, stripped of variable parts like SHAs and timestamps) as the search key
- Creates an issue (without `good-for-ai` label) with:
  - Error pattern and frequency
  - Affected project/PR
  - Sample log lines
  - Suggested severity
  - Duration of the problem (first occurrence → latest occurrence)
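The grouping and classification steps above can be sketched as follows. This is a minimal illustration, not oompa's actual code: the `LogEntry` and `Problem` types, the two regular expressions, and the `cyclesInWindow` parameter (how many poll cycles fit in the lookback window) are all assumptions made for the example.

```go
package main

import (
	"regexp"
	"strings"
	"time"
)

// LogEntry is a hypothetical, simplified view of one error/warn log record.
type LogEntry struct {
	Time    time.Time
	Project string
	PR      int
	Message string
}

// Severity mirrors the three levels listed above.
type Severity string

const (
	Critical Severity = "Critical"
	Warning  Severity = "Warning"
	Info     Severity = "Info"
)

// Problem is one group of entries with the same signature on the same project/PR.
type Problem struct {
	Signature string
	Project   string
	PR        int
	Count     int
	First     time.Time
	Latest    time.Time
}

var (
	// RFC 3339 style timestamps. Assumed format; adjust to the real log output.
	timestampRe = regexp.MustCompile(`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})`)
	// Hex runs of 7 to 40 characters, treated as git SHAs (may over-match hex-like words).
	shaRe = regexp.MustCompile(`\b[0-9a-f]{7,40}\b`)
)

// Signature keeps the first line of a message and masks its variable parts.
func Signature(msg string) string {
	if i := strings.IndexByte(msg, '\n'); i >= 0 {
		msg = msg[:i]
	}
	msg = timestampRe.ReplaceAllString(msg, "<ts>")
	msg = shaRe.ReplaceAllString(msg, "<sha>")
	return strings.TrimSpace(msg)
}

type groupKey struct {
	sig, project string
	pr           int
}

// Group collapses N repetitions of the same error on the same project/PR
// into one Problem, in order of first appearance.
func Group(entries []LogEntry) []Problem {
	index := map[groupKey]int{}
	var problems []Problem
	for _, e := range entries {
		k := groupKey{Signature(e.Message), e.Project, e.PR}
		i, seen := index[k]
		if !seen {
			index[k] = len(problems)
			problems = append(problems, Problem{
				Signature: k.sig, Project: e.Project, PR: e.PR,
				Count: 1, First: e.Time, Latest: e.Time,
			})
			continue
		}
		p := &problems[i]
		p.Count++
		if e.Time.Before(p.First) {
			p.First = e.Time
		}
		if e.Time.After(p.Latest) {
			p.Latest = e.Time
		}
	}
	return problems
}

// Classify maps a repetition count to a severity: fewer than minRepeat hits is
// a one-off, a hit in every poll cycle of the window is a retry loop, and
// anything in between is intermittent.
func Classify(p Problem, cyclesInWindow, minRepeat int) Severity {
	switch {
	case p.Count < minRepeat:
		return Info
	case p.Count >= cyclesInWindow:
		return Critical
	default:
		return Warning
	}
}
```

A caller would run `Group` over the entries in the lookback window, then file issues only for problems that `Classify` rates `Critical` or `Warning`.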
What it should detect
| Pattern | Example | Severity |
|---|---|---|
| Same error every cycle | `failed to amend commit` on every poll | Critical |
| Agent invoked but no cursor advance | Review loop on stale reviews | Critical |
| Worktree operation failures | `git worktree add` failing repeatedly | Critical |
| Cost anomaly | Single PR consuming >$10 in a session | Critical |
| Intermittent API errors | GitHub API rate limit or 5xx | Warning |
| Agent timeouts | Agent killed after timeout | Warning |
| One-off failures | Single failed push that succeeds on retry | Info (ignore) |
Config
```yaml
# In the oompa project config or as a global setting
self-check:
  schedule: "*/30 * * * *"  # every 30 minutes, or "hourly"
  lookback: 1h              # how far back to scan logs
  min-repeat: 3             # minimum repetitions to consider it a pattern
  cost-threshold: 10.0      # alert if single PR costs more than this
```
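On the Go side, these settings could map to a small struct with defaults and validation. The sketch below is an assumption about how that might look: the type name, field names, and validation rules are hypothetical, and YAML decoding is left out because it depends on whichever library oompa already uses.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// SelfCheckConfig mirrors the keys of the self-check YAML block (hypothetical type).
type SelfCheckConfig struct {
	Schedule      string        // five-field cron expression, or "hourly"
	Lookback      time.Duration // how far back to scan logs
	MinRepeat     int           // minimum repetitions to consider it a pattern
	CostThreshold float64       // alert if a single PR costs more than this
}

// DefaultSelfCheckConfig returns the values shown in the YAML example.
func DefaultSelfCheckConfig() SelfCheckConfig {
	return SelfCheckConfig{
		Schedule:      "*/30 * * * *",
		Lookback:      time.Hour,
		MinRepeat:     3,
		CostThreshold: 10.0,
	}
}

// Validate reports every invalid field at once instead of stopping at the first.
func (c SelfCheckConfig) Validate() error {
	var errs []error
	if c.Schedule != "hourly" && len(strings.Fields(c.Schedule)) != 5 {
		errs = append(errs, fmt.Errorf("schedule %q: want \"hourly\" or a five-field cron expression", c.Schedule))
	}
	if c.Lookback <= 0 {
		errs = append(errs, errors.New("lookback must be positive"))
	}
	if c.MinRepeat < 1 {
		errs = append(errs, errors.New("min-repeat must be at least 1"))
	}
	if c.CostThreshold <= 0 {
		errs = append(errs, errors.New("cost-threshold must be positive"))
	}
	return errors.Join(errs...) // nil when errs is empty
}
```

The schedule check only counts fields. It does not verify that each cron field is well-formed.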
Issue format
````markdown
Self-check: recurring error detected
Pattern: `failed to amend commit` on openshift/hypershift PR #8365
Severity: Critical (repeating every poll cycle)
First seen: 2026-05-06T14:00:42Z
Latest: 2026-05-11T06:24:04Z (5 days)
Occurrences: 3,600+ times
Estimated cost: ~$2,000 (agent invoked each cycle at ~$0.60)
Sample log entries
```
May 11 05:51:30 ERROR failed to amend commit pr=8365 error="git commit --amend: exit status 1 (stderr: error: invalid object...)"
May 11 05:53:30 ERROR failed to amend commit pr=8365 error="git commit --amend: exit status 1 (stderr: error: invalid object...)"
```
Suggested action
This error indicates a corrupted git worktree. The worktree at `/tmp/oompa-work/openshift/hypershift/worktrees/...` may need to be deleted and recreated.
````
Why no `good-for-ai` label?
Self-detected issues should be triaged by a human before oompa attempts to fix them. Reasons:
- The fix might require infrastructure changes (deleting worktrees, restarting the service)
- Some errors might be expected/transient and don't need a code fix
- Oompa shouldn't create an infinite loop of self-fixing (detect error → create issue → pick up issue → fail → detect error...)
- A human can add `good-for-ai` after review if the fix is suitable for autonomous implementation
Duplicate detection
Before creating an issue, search for duplicates:
```bash
gh search issues --repo qinqon/oompa "self-check: failed to amend commit" --state open
```
If a matching open issue exists:
- Add a comment with the latest occurrence count and time range
- Don't create a duplicate
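The duplicate check could be driven from Go by shelling out to `gh`, roughly as sketched below. The function names are hypothetical, and the sketch assumes `gh` is installed and authenticated on the host. It requests JSON output via `--json number,title` so the result can be parsed rather than scraped.

```go
package main

import (
	"encoding/json"
	"os/exec"
	"strconv"
)

// Issue holds the fields requested from gh via --json.
type Issue struct {
	Number int    `json:"number"`
	Title  string `json:"title"`
}

// searchArgs builds the arguments for `gh search issues`. The "--" keeps a
// signature that starts with "-" from being parsed as a flag.
func searchArgs(repo, signature string) []string {
	return []string{
		"search", "issues",
		"--repo", repo,
		"--state", "open",
		"--json", "number,title",
		"--", "self-check: " + signature,
	}
}

// commentArgs builds the arguments for `gh issue comment` on an existing issue.
func commentArgs(repo string, number int, body string) []string {
	return []string{"issue", "comment", strconv.Itoa(number), "--repo", repo, "--body", body}
}

// parseIssues decodes the JSON array printed by gh.
func parseIssues(out []byte) ([]Issue, error) {
	var issues []Issue
	if err := json.Unmarshal(out, &issues); err != nil {
		return nil, err
	}
	return issues, nil
}

// FindDuplicate returns the first open issue matching the signature, or nil
// when there is none.
func FindDuplicate(repo, signature string) (*Issue, error) {
	out, err := exec.Command("gh", searchArgs(repo, signature)...).Output()
	if err != nil {
		return nil, err
	}
	issues, err := parseIssues(out)
	if err != nil || len(issues) == 0 {
		return nil, err
	}
	return &issues[0], nil
}
```

When `FindDuplicate` returns an issue, the caller would run `gh` with `commentArgs` to append the latest occurrence count and time range. Otherwise it would go on to create a new issue.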
Safety constraints
- Never create issues with `good-for-ai` label -- humans decide what to auto-fix
- Rate limit issue creation -- max 1 issue per error pattern per day
- Don't alert on known-transient errors -- configurable ignore patterns (e.g., "context canceled" during shutdown)
- Don't self-reference -- if the self-check itself errors, log it but don't create an issue about it (prevents infinite loops)
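The rate limit, ignore patterns, and self-reference rule above could be combined into a single gate that is consulted just before an issue is filed. This is a sketch under assumptions: the `Gate` type is hypothetical, and it assumes each log entry carries the name of the role that emitted it.

```go
package main

import (
	"strings"
	"time"
)

// selfCheckRole is the role name whose own errors must never produce issues.
const selfCheckRole = "self-check"

// Gate decides whether an issue may be filed for an error signature (hypothetical type).
type Gate struct {
	Ignore    []string         // substrings of known-transient errors
	Now       func() time.Time // injectable clock, defaults to time.Now
	lastFiled map[string]time.Time
}

// NewGate returns a gate with the given ignore patterns.
func NewGate(ignore ...string) *Gate {
	return &Gate{Ignore: ignore, Now: time.Now, lastFiled: map[string]time.Time{}}
}

// Allow reports whether an issue may be filed for signature, which was logged
// by the named role. A true result is recorded as a filing, so call it only
// when the issue is about to be created.
func (g *Gate) Allow(role, signature string) bool {
	// Don't self-reference: errors from the self-check are logged, not filed.
	if role == selfCheckRole {
		return false
	}
	// Don't alert on known-transient errors.
	for _, pattern := range g.Ignore {
		if strings.Contains(signature, pattern) {
			return false
		}
	}
	// Rate limit: at most one issue per error pattern per day.
	now := g.Now()
	if last, filed := g.lastFiled[signature]; filed && now.Sub(last) < 24*time.Hour {
		return false
	}
	g.lastFiled[signature] = now
	return true
}
```

The in-memory map resets when the service restarts, so the open-issue search remains the backstop against duplicates. The `good-for-ai` rule needs no code here: the issue-creation path simply never sets that label.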
Implementation sketch
Log source options
1. journalctl -- parse slog output from the systemd journal. Works, but requires text parsing
2. Event ring buffer -- use the existing event system. Only captures events that are emitted, so it might miss raw error logs
3. Dedicated error log -- add a separate error collector that captures all error/warn slog entries in a structured buffer. Most reliable

Option 3 is recommended -- add a `slog.Handler` wrapper that captures error-level entries into a ring buffer, alongside the existing text handler.
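A minimal sketch of such a wrapper follows. It is not oompa's implementation: the type and function names are hypothetical, the buffer size and minimum level are left to the caller, and the wrapped handler here is a plain `slog.NewTextHandler` on stderr standing in for the existing text handler.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"sync"
)

// ring keeps the most recent records and overwrites the oldest once full.
type ring struct {
	mu      sync.Mutex
	records []slog.Record
	next    int  // index of the slot the next record is written to
	full    bool // true once the buffer has wrapped at least once
}

func (b *ring) add(r slog.Record) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.records[b.next] = r
	b.next = (b.next + 1) % len(b.records)
	if b.next == 0 {
		b.full = true
	}
}

// snapshot returns a copy of the buffered records, oldest first.
func (b *ring) snapshot() []slog.Record {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.full {
		return append([]slog.Record(nil), b.records[:b.next]...)
	}
	out := append([]slog.Record(nil), b.records[b.next:]...)
	return append(out, b.records[:b.next]...)
}

// CaptureHandler forwards every record to next and also stores records at or
// above minLevel in a shared ring buffer.
// Limitation: attributes added with Logger.With are applied by next only and
// are not part of the captured records.
type CaptureHandler struct {
	next     slog.Handler
	buf      *ring
	minLevel slog.Level
}

func NewCaptureHandler(next slog.Handler, size int, minLevel slog.Level) *CaptureHandler {
	if size < 1 {
		size = 1
	}
	return &CaptureHandler{next: next, buf: &ring{records: make([]slog.Record, size)}, minLevel: minLevel}
}

func (h *CaptureHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return l >= h.minLevel || h.next.Enabled(ctx, l)
}

func (h *CaptureHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= h.minLevel {
		h.buf.add(r.Clone()) // Clone so the stored copy does not share state
	}
	if !h.next.Enabled(ctx, r.Level) {
		return nil
	}
	return h.next.Handle(ctx, r)
}

func (h *CaptureHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &CaptureHandler{next: h.next.WithAttrs(attrs), buf: h.buf, minLevel: h.minLevel}
}

func (h *CaptureHandler) WithGroup(name string) slog.Handler {
	return &CaptureHandler{next: h.next.WithGroup(name), buf: h.buf, minLevel: h.minLevel}
}

// Records returns the captured records, oldest first.
func (h *CaptureHandler) Records() []slog.Record { return h.buf.snapshot() }

// NewCaptureLogger wires the wrapper around a text handler on stderr and
// captures warn and error records.
func NewCaptureLogger(size int) (*slog.Logger, *CaptureHandler) {
	h := NewCaptureHandler(slog.NewTextHandler(os.Stderr, nil), size, slog.LevelWarn)
	return slog.New(h), h
}
```

The `self-check` role would then call `Records()` on each run and feed the result into grouping and classification. Handlers derived through `WithAttrs` or `WithGroup` share the same buffer, so child loggers are captured as well.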
Non-goals
- This is NOT about fixing bugs automatically -- it's about detecting and reporting them
- This is NOT a replacement for monitoring/alerting -- it's a lightweight self-awareness feature
- This does NOT modify oompa's behavior when errors are detected -- it only creates issues