feat: self-healing -- oompa monitors its own logs for recurring errors and creates fix issues #166

Description

@qinqon

Summary

Add a self-healing capability where oompa periodically inspects its own structured logs for recurring errors, classifies them, checks for existing duplicate issues, and creates new issues when it detects novel problems. Issues are created without the `good-for-ai` label -- they require human triage before oompa attempts to fix them.

Motivation

Several bugs have gone undetected for days because oompa silently retries failing operations without escalating.

Oompa already logs these errors via slog, but nobody reads the logs proactively. Oompa should watch its own logs and raise issues when it detects persistent problems.

Proposed Design

New role: `self-check`

A new role that runs on a schedule (e.g., every 30 minutes or hourly) and:

  1. Scans recent logs for error/warn-level entries using `journalctl --user -u oompa.service --since "1h ago"` or the in-memory event ring buffer
  2. Groups errors by pattern -- same error message on the same project/PR repeating N times is one problem, not N problems
  3. Classifies severity:
    • Critical: same error repeating every poll cycle (retry loop) -- e.g., git corruption, cursor not advancing
    • Warning: intermittent errors (happen sometimes but not every cycle)
    • Info: one-off errors that self-resolved
  4. Checks for duplicates -- before creating an issue, searches existing open issues on `qinqon/oompa` for the same error pattern, using the error message signature (first line, stripped of variable parts like SHAs and timestamps) as the search key
  5. Creates an issue (without `good-for-ai` label) with:
    • Error pattern and frequency
    • Affected project/PR
    • Sample log lines
    • Suggested severity
    • Duration of the problem (first occurrence → latest occurrence)
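
Steps 2 to 4 could be sketched roughly as follows. This is a minimal illustration, not oompa's actual code: the `Entry` and `Group` types, the regular expressions, and the 80% "every cycle" threshold in `Classify` are all assumptions made for the example.

```go
package main

import (
	"regexp"
	"strings"
	"time"
)

// Entry is a hypothetical, simplified view of one captured error/warn record.
type Entry struct {
	Time    time.Time
	Project string
	PR      int
	Message string
}

// Group is one recurring problem: the same signature on the same project/PR.
type Group struct {
	Project, Signature string
	PR, Count          int
	First, Last        time.Time
}

var (
	// Variable parts that make otherwise identical errors look different.
	// The SHA pattern is a heuristic and also masks long decimal numbers.
	timestampRe = regexp.MustCompile(`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z?`)
	shaRe       = regexp.MustCompile(`\b[0-9a-f]{7,40}\b`)
)

// Signature returns the first line of msg with timestamps and SHAs masked.
func Signature(msg string) string {
	first, _, _ := strings.Cut(msg, "\n")
	first = timestampRe.ReplaceAllString(first, "<ts>")
	first = shaRe.ReplaceAllString(first, "<sha>")
	return strings.TrimSpace(first)
}

// GroupEntries collapses N repetitions of a pattern into a single Group.
func GroupEntries(entries []Entry) []*Group {
	type key struct {
		project, sig string
		pr           int
	}
	index := map[key]*Group{}
	var groups []*Group
	for _, e := range entries {
		k := key{e.Project, Signature(e.Message), e.PR}
		g, ok := index[k]
		if !ok {
			g = &Group{Project: e.Project, Signature: k.sig, PR: e.PR, First: e.Time, Last: e.Time}
			index[k] = g
			groups = append(groups, g)
		}
		g.Count++
		if e.Time.Before(g.First) {
			g.First = e.Time
		}
		if e.Time.After(g.Last) {
			g.Last = e.Time
		}
	}
	return groups
}

// Classify maps a repetition count to a severity. cycles is the number of
// poll cycles in the lookback window; treating "seen in at least 80% of
// cycles" as a retry loop is an assumed threshold, not part of the proposal.
func Classify(count, cycles, minRepeat int) string {
	switch {
	case count < minRepeat:
		return "Info"
	case cycles > 0 && count*5 >= cycles*4:
		return "Critical"
	default:
		return "Warning"
	}
}
```

With a 1h lookback and a 2-minute poll interval there are 30 cycles, so `Classify(30, 30, 3)` would report `Critical` while `Classify(5, 30, 3)` would report `Warning`.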

What it should detect

| Pattern | Example | Severity |
|---|---|---|
| Same error every cycle | `failed to amend commit` on every poll | Critical |
| Agent invoked but no cursor advance | Review loop on stale reviews | Critical |
| Worktree operation failures | `git worktree add` failing repeatedly | Critical |
| Cost anomaly | Single PR consuming >$10 in a session | Critical |
| Intermittent API errors | GitHub API rate limit or 5xx | Warning |
| Agent timeouts | Agent killed after timeout | Warning |
| One-off failures | Single failed push that succeeds on retry | Info (ignore) |

Config

```yaml
# In the oompa project config or as a global setting
self-check:
  schedule: "*/30 * * * *"  # every 30 minutes, or "hourly"
  lookback: 1h              # how far back to scan logs
  min-repeat: 3             # minimum repetitions to consider it a pattern
  cost-threshold: 10.0      # alert if single PR costs more than this
```

Issue format

````markdown
## Self-check: recurring error detected

**Pattern:** `failed to amend commit` on openshift/hypershift PR #8365
**Severity:** Critical (repeating every poll cycle)
**First seen:** 2026-05-06T14:00:42Z
**Latest:** 2026-05-11T06:24:04Z (5 days)
**Occurrences:** 3,600+ times
**Estimated cost:** ~$2,000 (agent invoked each cycle at ~$0.60)

### Sample log entries

```
May 11 05:51:30 ERROR failed to amend commit pr=8365 error="git commit --amend: exit status 1 (stderr: error: invalid object...)"
May 11 05:53:30 ERROR failed to amend commit pr=8365 error="git commit --amend: exit status 1 (stderr: error: invalid object...)"
```

### Suggested action

This error indicates a corrupted git worktree. The worktree at `/tmp/oompa-work/openshift/hypershift/worktrees/...` may need to be deleted and recreated.
````

Why no `good-for-ai` label?

Self-detected issues should be triaged by a human before oompa attempts to fix them. Reasons:

  • The fix might require infrastructure changes (deleting worktrees, restarting the service)
  • Some errors might be expected/transient and don't need a code fix
  • Oompa shouldn't create an infinite loop of self-fixing (detect error → create issue → pick up issue → fail → detect error...)
  • A human can add `good-for-ai` after reviewing whether the fix is appropriate for autonomous implementation

Duplicate detection

Before creating an issue, search for duplicates:

```bash
gh search issues --repo qinqon/oompa "self-check: failed to amend commit" --state open
```

If a matching open issue exists:

  • Add a comment with the latest occurrence count and time range
  • Don't create a duplicate
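
The search-then-comment-or-create flow above could be wired up around the `gh` CLI along these lines. This is only a sketch: the helper names (`searchArgs`, `parseMatches`, `nextArgs`, `report`) and the `self-check: <signature>` issue title are assumptions for the example, and the body text is left to the caller.

```go
package main

import (
	"encoding/json"
	"os/exec"
	"strconv"
)

// issueRef holds the fields requested from `gh search issues --json`.
type issueRef struct {
	Number int    `json:"number"`
	Title  string `json:"title"`
}

// searchArgs builds the gh arguments for the duplicate search shown above,
// asking for JSON output so the result can be parsed.
func searchArgs(repo, signature string) []string {
	return []string{
		"search", "issues", "self-check: " + signature,
		"--repo", repo, "--state", "open", "--json", "number,title",
	}
}

// parseMatches decodes the JSON array printed by gh.
func parseMatches(out []byte) ([]issueRef, error) {
	var refs []issueRef
	err := json.Unmarshal(out, &refs)
	return refs, err
}

// nextArgs picks the follow-up command: comment on the first matching open
// issue, or create a new one. No --label flag is passed, so a new issue
// never carries `good-for-ai`.
func nextArgs(repo, signature, body string, matches []issueRef) []string {
	if len(matches) > 0 {
		return []string{"issue", "comment", strconv.Itoa(matches[0].Number),
			"--repo", repo, "--body", body}
	}
	return []string{"issue", "create", "--repo", repo,
		"--title", "self-check: " + signature, "--body", body}
}

// report runs the search and then the chosen follow-up command.
// It requires an authenticated gh binary on PATH.
func report(repo, signature, body string) error {
	out, err := exec.Command("gh", searchArgs(repo, signature)...).Output()
	if err != nil {
		return err
	}
	matches, err := parseMatches(out)
	if err != nil {
		return err
	}
	return exec.Command("gh", nextArgs(repo, signature, body, matches)...).Run()
}
```

Keeping the argument construction separate from the `exec` calls means the decision logic can be unit-tested without network access or a `gh` login.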

Safety constraints

  • Never create issues with `good-for-ai` label -- humans decide what to auto-fix
  • Rate limit issue creation -- max 1 issue per error pattern per day
  • Don't alert on known-transient errors -- configurable ignore patterns (e.g., "context canceled" during shutdown)
  • Don't self-reference -- if the self-check itself errors, log it but don't create an issue about it (prevents infinite loops)
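
The last three constraints could be enforced by a small gate that sits in front of issue creation. The sketch below is illustrative only: the `Gate` type, the `role` argument, and the reading of "per day" as a sliding 24-hour window are assumptions, and state is kept in memory, so it would reset on restart.

```go
package main

import (
	"strings"
	"sync"
	"time"
)

// Gate decides whether a detected pattern may become a new issue.
type Gate struct {
	mu       sync.Mutex
	ignore   []string             // substrings of known-transient errors
	interval time.Duration        // minimum gap between issues per pattern
	last     map[string]time.Time // signature -> time of the last issue
	now      func() time.Time     // replaceable clock
}

// NewGate returns a Gate allowing one issue per pattern per 24 hours.
func NewGate(ignore []string) *Gate {
	return &Gate{
		ignore:   ignore,
		interval: 24 * time.Hour,
		last:     map[string]time.Time{},
		now:      time.Now,
	}
}

// Allow reports whether an issue may be created for signature, which was
// logged by the given role. A true result records the creation time.
func (g *Gate) Allow(role, signature string) bool {
	// Errors from the self-check itself are only logged, never reported.
	if role == "self-check" {
		return false
	}
	// Known-transient errors are skipped.
	for _, pat := range g.ignore {
		if strings.Contains(signature, pat) {
			return false
		}
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	now := g.now()
	if t, ok := g.last[signature]; ok && now.Sub(t) < g.interval {
		return false // already reported within the interval
	}
	g.last[signature] = now
	return true
}
```

For example, with `NewGate([]string{"context canceled"})`, the first `Allow` call for a given signature succeeds and an immediate second call for the same signature is rejected.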

Implementation sketch

Log source options

  1. journalctl -- parse slog output from systemd journal. Works but requires text parsing
  2. Event ring buffer -- use the existing event system. Only captures events that are emitted, so it might miss raw error logs
  3. Dedicated error log -- add a separate error collector that captures all error/warn slog entries in a structured buffer. Most reliable

Option 3 is recommended -- add a `slog.Handler` wrapper that captures error- and warn-level entries into a ring buffer, alongside the existing text handler.
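
A wrapper of this kind might look like the sketch below. The names `CaptureHandler`, `ring`, and `newLogger` are hypothetical, the text handler on stderr stands in for whatever handler oompa already uses, and group names from `WithGroup` are not reflected in the captured records.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"sync"
)

// ring is a fixed-size buffer that overwrites its oldest record when full.
type ring struct {
	mu   sync.Mutex
	buf  []slog.Record
	next int
	full bool
}

func (r *ring) add(rec slog.Record) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.buf[r.next] = rec
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// CaptureHandler copies records at or above min into a ring buffer and
// forwards every record to the wrapped handler.
type CaptureHandler struct {
	next  slog.Handler
	min   slog.Level
	attrs []slog.Attr // attrs added via Logger.With, replayed on capture
	ring  *ring
}

var _ slog.Handler = (*CaptureHandler)(nil)

// NewCaptureHandler wraps next; size must be greater than zero.
func NewCaptureHandler(next slog.Handler, min slog.Level, size int) *CaptureHandler {
	return &CaptureHandler{next: next, min: min, ring: &ring{buf: make([]slog.Record, size)}}
}

func (h *CaptureHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return l >= h.min || h.next.Enabled(ctx, l)
}

func (h *CaptureHandler) Handle(ctx context.Context, rec slog.Record) error {
	if rec.Level >= h.min {
		c := rec.Clone() // Clone so the stored copy does not share state
		c.AddAttrs(h.attrs...)
		h.ring.add(c)
	}
	if h.next.Enabled(ctx, rec.Level) {
		return h.next.Handle(ctx, rec)
	}
	return nil
}

func (h *CaptureHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	merged := append(append([]slog.Attr(nil), h.attrs...), attrs...)
	return &CaptureHandler{next: h.next.WithAttrs(attrs), min: h.min, attrs: merged, ring: h.ring}
}

func (h *CaptureHandler) WithGroup(name string) slog.Handler {
	return &CaptureHandler{next: h.next.WithGroup(name), min: h.min, attrs: h.attrs, ring: h.ring}
}

// Snapshot returns the captured records, oldest first.
func (h *CaptureHandler) Snapshot() []slog.Record {
	r := h.ring
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.full {
		return append([]slog.Record(nil), r.buf[:r.next]...)
	}
	out := append([]slog.Record(nil), r.buf[r.next:]...)
	return append(out, r.buf[:r.next]...)
}

// newLogger wires the capture handler in front of a text handler on stderr.
func newLogger(size int) (*slog.Logger, *CaptureHandler) {
	h := NewCaptureHandler(slog.NewTextHandler(os.Stderr, nil), slog.LevelWarn, size)
	return slog.New(h), h
}
```

The self-check role would then read `Snapshot()` on each run instead of shelling out to `journalctl`, which avoids text parsing and keeps attributes such as `pr` in structured form.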

Non-goals

  • This is NOT about fixing bugs automatically -- it's about detecting and reporting them
  • This is NOT a replacement for monitoring/alerting -- it's a lightweight self-awareness feature
  • This does NOT modify oompa's behavior when errors are detected -- it only creates issues
