Unify retry systems into a single RetryPolicy by thedumbtechguy · Pull Request #4 · radioactive-labs/chrono_forge

thedumbtechguy · 2026-06-03T12:28:17Z

Summary

ChronoForge had three independent retry systems, two backoff algorithms, and three "should we retry?" decision models:

Workflow-level (uncaught errors in perform): should_retry? hardcoded to attempt < 3, backoff from a fixed [1s,5s,30s,2m,10m] array (whose 2m/10m tail was unreachable because the < 3 check fired first), guarded by a contradictory max_attempts == 5.
Step-level (durably_execute, durably_repeat): max_attempts: param with a different algorithm (2**n capped at 32s).
wait_until: retry_on: [classes] allowlist, no count cap.

This PR collapses all of them into one RetryPolicy (attempt cap + exponential-with-jitter backoff + error-class predicate). Today's behaviors become presets of one type.
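As a rough illustration of the three ingredients named above (attempt cap, exponential-with-jitter backoff, error-class predicate), here is a minimal standalone sketch. The class name SketchRetryPolicy, the base: parameter, the rng: argument, and the default values are assumptions made for this example; it is not the ChronoForge implementation.

```ruby
# Hypothetical stand-in for illustration only; not ChronoForge's RetryPolicy.
class SketchRetryPolicy
  attr_reader :max_attempts, :base, :cap, :retry_on

  # base and cap are in seconds; retry_on is a list of error classes.
  def initialize(max_attempts: 3, base: 1.0, cap: 30.0, retry_on: [StandardError])
    @max_attempts = max_attempts
    @base = base
    @cap = cap
    @retry_on = retry_on
  end

  # True when attempts remain and the error matches an allowed class.
  # `attempts` counts attempts already made, including the one that failed.
  def retryable?(error, attempts)
    attempts < max_attempts && retry_on.any? { |klass| error.is_a?(klass) }
  end

  # Equal jitter: half of the capped exponential delay d is fixed and the
  # other half is random, so the result falls between d/2 and d.
  def backoff_for(attempts, rng: Random.new)
    d = [base * (2**(attempts - 1)), cap].min
    (d / 2.0) + rng.rand * (d / 2.0)
  end
end

policy = SketchRetryPolicy.new(max_attempts: 3, cap: 30.0, retry_on: [IOError])
policy.retryable?(IOError.new, 1)       # allowed class, attempts remain
policy.retryable?(ArgumentError.new, 1) # class not in the allowlist
policy.retryable?(IOError.new, 3)       # attempt cap reached
```

An empty retry_on list makes retryable? false for every error, which is one way to model a "retries nothing" preset.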

What changed

New ChronoForge::Executor::RetryPolicy — retryable?(error, attempts) and backoff_for(attempts) are the entire decision surface. Backoff is computed once at re-enqueue (never replayed, so deterministic where it matters).
Per-call retry_policy: on durably_execute, durably_repeat, wait_until, plus a class-level retry_policy DSL. Resolution: per-call → class default → per-site built-in.
Per-site built-ins: steps 3 attempts / cap 30s; workflow-level 10 attempts / cap 600s (a tolerance window of up to ~8.5 min for transient infra errors that surface as uncaught perform errors); wait_until retries nothing by default. wait_until deliberately does not inherit the class default, so a class-wide "retry everything" can't silently retry condition-evaluation bugs.
Deletions: RetryStrategy, the dead should_retry? (the 3-vs-5 contradiction), and the dead retry_method: reschedule arg.
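The resolution order described above (per-call, then class default, then per-site built-in, with wait_until skipping the class default) can be sketched as a small lookup. The function name resolve_policy, the site: symbols, and the use of symbols in place of policy objects are assumptions for this example; only the preset names come from the commit message.

```ruby
# Hypothetical resolver for illustration; symbols stand in for policy objects.
BUILT_INS = {
  step: :step_default,
  workflow: :workflow_default,
  wait: :wait_default
}.freeze

def resolve_policy(site:, per_call: nil, class_default: nil)
  # 1. An explicit per-call policy always wins.
  return per_call if per_call
  # 2. The class default applies everywhere except wait_until.
  return class_default if class_default && site != :wait
  # 3. Otherwise fall back to the built-in for that call site.
  BUILT_INS.fetch(site)
end

resolve_policy(site: :step, class_default: :retry_everything)                 # class default
resolve_policy(site: :wait, class_default: :retry_everything)                 # built-in wait preset
resolve_policy(site: :wait, per_call: :opt_in, class_default: :retry_everything) # per-call override
```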

Notes / decisions

The kwarg is retry_policy: (not retry:) because retry is a Ruby keyword — a retry: param can't be read inside a method without binding.local_variable_get(:retry).
Workflow-level retries fire only on uncaught errors in perform; a step exhausting its own retries stalls the workflow instead (unchanged). The 10 was chosen to ride out realistic transient blips (DB failover, deploy restart) without dragging out the deterministic-bug case — each workflow-level retry replays the whole workflow.
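The keyword clash behind the retry_policy: naming can be reproduced in a few lines. The method names below are hypothetical and exist only to show the difference between the two spellings.

```ruby
# A `retry:` keyword argument parses, but the bare word `retry` in the body
# is the Ruby keyword, so the value has to be fetched by name.
def run_step_with_retry_kwarg(retry: nil)
  binding.local_variable_get(:retry)
end

# With a non-reserved name the parameter reads like any other local.
def run_step_with_policy_kwarg(retry_policy: nil)
  retry_policy
end

run_step_with_retry_kwarg(retry: :some_policy)
run_step_with_policy_kwarg(retry_policy: :some_policy)
```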

⚠️ Breaking changes

durably_execute/durably_repeat no longer accept max_attempts:; wait_until no longer accepts retry_on:. All three take retry_policy: now. Migrate max_attempts: N → retry_policy: RetryPolicy.new(max_attempts: N) and retry_on: [...] → retry_policy: RetryPolicy.new(retry_on: [...]).
Backoff is exponential-with-jitter everywhere. A permanently-failing workflow now reaches failed after 10 attempts (was an effective 4).
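For call sites that still pass the old keywords, the migration rule above amounts to folding max_attempts: and retry_on: into a single retry_policy: value. The sketch below shows that rewrite on a plain options hash. StandInPolicy and migrate_retry_kwargs are hypothetical names invented for this example, and the Struct is only a placeholder for RetryPolicy so the snippet runs on its own.

```ruby
# Placeholder for RetryPolicy so the example is self-contained.
StandInPolicy = Struct.new(:max_attempts, :retry_on, keyword_init: true)

# Moves legacy retry keywords into one retry_policy: entry and leaves
# every other option untouched.
def migrate_retry_kwargs(kwargs, policy_class: StandInPolicy)
  kwargs = kwargs.dup
  policy_args = {}
  policy_args[:max_attempts] = kwargs.delete(:max_attempts) if kwargs.key?(:max_attempts)
  policy_args[:retry_on] = kwargs.delete(:retry_on) if kwargs.key?(:retry_on)
  return kwargs if policy_args.empty?

  kwargs.merge(retry_policy: policy_class.new(**policy_args))
end

migrate_retry_kwargs({max_attempts: 5})
migrate_retry_kwargs({retry_on: [IOError], timeout: 60})
```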

Testing

New RetryPolicy unit tests (truth table for retryable?, backoff growth/cap/jitter bounds, presets) and integration tests (per-call override, class default, wait_until fast-fail vs opt-in retry).
Full suite green: 105 tests, 0 failures; standardrb clean.

A design doc is included at docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md.

🤖 Generated with Claude Code

Collapse three independent retry systems, namely workflow-level uncaught errors, step-level (durably_execute/durably_repeat), and wait_until condition errors, into one RetryPolicy abstraction (attempt cap + exponential-with-jitter backoff + error-class predicate).

- Add ChronoForge::Executor::RetryPolicy with per-site presets (step_default 3/cap30, workflow_default 10/cap600, wait_default retry-nothing).
- Add class-level `retry_policy` DSL and a per-call `retry_policy:` kwarg on durably_execute, durably_repeat, and wait_until. Resolution: per-call -> class default -> per-site built-in. wait_until does not inherit the class default so a class-wide "retry everything" can't silently retry condition-evaluation bugs.
- Delete RetryStrategy, the dead `should_retry?` (the 3-vs-5 contradiction), and the dead `retry_method:` reschedule argument.

BREAKING: durably_execute/durably_repeat drop `max_attempts:` and wait_until drops `retry_on:`; all three now take `retry_policy:`. Backoff is exponential-with-jitter everywhere. Workflow-level uncaught errors now retry up to 10 times (~8.5 min) before failing (was an effective 4); step failures still stall rather than retry at the workflow level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)


thedumbtechguy added 2 commits June 3, 2026 12:27

docs(retry): clarify workflow window is up to ~8.5 min (≈4 min typical with jitter)

f71eb41

Equal jitter puts each wait in [d/2, d], so the cumulative window's expected value is ~4.25 min; ~8.5 min is the unjittered upper bound. Wording-only; no behavior change.
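The window figures can be checked with a little arithmetic. The sketch below assumes a 1s base that doubles on each retry, a 600s cap, and nine waits between ten attempts; those parameters are not stated in the PR and are chosen because they reproduce the ~8.5 min ceiling.

```ruby
# Assumed schedule: 1s base, doubling, 600s cap, 10 attempts => 9 waits.
waits = (0...9).map { |n| [2**n, 600].min }

ceiling_seconds = waits.sum          # every wait lands at its upper end d
floor_seconds = waits.sum / 2.0      # every wait lands at its lower end d/2
midpoint_seconds = waits.sum * 0.75  # mean if the draw were uniform on [d/2, d]

puts format("ceiling:  %.2f min", ceiling_seconds / 60.0)
puts format("floor:    %.2f min", floor_seconds / 60.0)
puts format("midpoint: %.2f min", midpoint_seconds / 60.0)
```

Under these assumptions the ceiling is 511s (about 8.5 min) and the floor is 255.5s (about 4.26 min), which lines up with the ~4.25 min figure quoted above. The midpoint line only applies if the jitter is drawn uniformly over the interval, which the PR does not specify.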

thedumbtechguy wants to merge 2 commits into main from feat/unified-retry-policy
