Unify retry systems into a single RetryPolicy#4
Open
thedumbtechguy wants to merge 2 commits into
Open
Conversation
Collapse three independent retry systems — workflow-level uncaught errors, step-level (durably_execute/durably_repeat), and wait_until condition errors — into one RetryPolicy abstraction (attempt cap + exponential-with-jitter backoff + error-class predicate). - Add ChronoForge::Executor::RetryPolicy with per-site presets (step_default 3/cap30, workflow_default 10/cap600, wait_default retry-nothing). - Add class-level `retry_policy` DSL and a per-call `retry_policy:` kwarg on durably_execute, durably_repeat, and wait_until. Resolution: per-call -> class default -> per-site built-in. wait_until does not inherit the class default so a class-wide "retry everything" can't silently retry condition-evaluation bugs. - Delete RetryStrategy, the dead `should_retry?` (the 3-vs-5 contradiction), and the dead `retry_method:` reschedule argument. BREAKING: durably_execute/durably_repeat drop `max_attempts:` and wait_until drops `retry_on:`; all three now take `retry_policy:`. Backoff is exponential-with-jitter everywhere. Workflow-level uncaught errors now retry up to 10 times (~8.5 min) before failing (was an effective 4); step failures still stall rather than retry at the workflow level. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
…l with jitter) Equal jitter puts each wait in [d/2, d], so the cumulative window's expected value is ~4.25 min; ~8.5 min is the undisturbed upper bound. Wording-only; no behavior change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ChronoForge had three independent retry systems, two backoff algorithms, and three "should we retry?" decision models:
perform) —should_retry?hardcoded toattempt < 3, backoff from a fixed[1s,5s,30s,2m,10m]array (whose2m/10mtail was unreachable because the< 3check bit first), guarded by a contradictorymax_attempts == 5.durably_execute,durably_repeat) —max_attempts:param with a different algorithm (2**ncapped at 32s).wait_until—retry_on: [classes]allowlist, no count cap.This PR collapses all of them into one
RetryPolicy(attempt cap + exponential-with-jitter backoff + error-class predicate). Today's behaviors become presets of one type.What changed
ChronoForge::Executor::RetryPolicy—retryable?(error, attempts)andbackoff_for(attempts)are the entire decision surface. Backoff is computed once at re-enqueue (never replayed, so deterministic where it matters).retry_policy:ondurably_execute,durably_repeat,wait_until, plus a class-levelretry_policyDSL. Resolution: per-call → class default → per-site built-in.3 attempts / cap 30s; workflow-level10 attempts / cap 600s(~8.5 min tolerant window for transient infra errors on uncaughtperformerrors);wait_untilretries nothing by default.wait_untildeliberately does not inherit the class default, so a class-wide "retry everything" can't silently retry condition-evaluation bugs.RetryStrategy, the deadshould_retry?(the 3-vs-5 contradiction), and the deadretry_method:reschedule arg.Notes / decisions
retry_policy:(notretry:) becauseretryis a Ruby keyword — aretry:param can't be read inside a method withoutbinding.local_variable_get(:retry).perform; a step exhausting its own retries stalls the workflow instead (unchanged). The10was chosen to ride out realistic transient blips (DB failover, deploy restart) without dragging out the deterministic-bug case — each workflow-level retry replays the whole workflow.durably_execute/durably_repeatno longer acceptmax_attempts:;wait_untilno longer acceptsretry_on:. All three takeretry_policy:now. Migratemax_attempts: N→retry_policy: RetryPolicy.new(max_attempts: N)andretry_on: [...]→retry_policy: RetryPolicy.new(retry_on: [...]).failedafter 10 attempts (was an effective 4).Testing
RetryPolicyunit tests (truth table forretryable?, backoff growth/cap/jitter bounds, presets) and integration tests (per-call override, class default,wait_untilfast-fail vs opt-in retry).standardrbclean.A design doc is included at
docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md.🤖 Generated with Claude Code