Unify retry systems into a single RetryPolicy #4

Open
thedumbtechguy wants to merge 2 commits into main from feat/unified-retry-policy
@thedumbtechguy

Summary

ChronoForge had three independent retry systems, two backoff algorithms, and three "should we retry?" decision models:

  1. Workflow-level (uncaught errors in perform): should_retry? was hardcoded to attempt < 3, with backoff drawn from a fixed [1s, 5s, 30s, 2m, 10m] array (the 2m/10m tail was unreachable because the < 3 check fired first), alongside a contradictory max_attempts == 5 guard.
  2. Step-level (durably_execute, durably_repeat) — max_attempts: param with a different algorithm (2**n capped at 32s).
  3. wait_until: a retry_on: [classes] allowlist, no count cap.

This PR collapses all of them into one RetryPolicy (attempt cap + exponential-with-jitter backoff + error-class predicate). Today's behaviors become presets of one type.
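
To make the three ingredients concrete, here is a minimal sketch of what such a policy could look like. It is not the code in this PR: the class name SketchRetryPolicy, the base: and cap: parameters, the default retry_on: list, the doubling schedule, and the use of Kernel#rand are all assumptions for illustration. Only retryable?(error, attempts), backoff_for(attempts), max_attempts:, and retry_on: come from the description.

```ruby
# Illustrative stand-in for ChronoForge::Executor::RetryPolicy (NOT the PR's code).
class SketchRetryPolicy
  attr_reader :max_attempts, :base, :cap, :retry_on

  # base:/cap: are in seconds; all defaults here are assumptions.
  def initialize(max_attempts: 3, base: 1.0, cap: 30.0, retry_on: [StandardError])
    @max_attempts = max_attempts
    @base = base
    @cap = cap
    @retry_on = retry_on
  end

  # Error-class predicate plus attempt cap.
  # `attempts` is the number of attempts already made (1-based).
  def retryable?(error, attempts)
    attempts < max_attempts && retry_on.any? { |klass| error.is_a?(klass) }
  end

  # Exponential backoff with equal jitter: half of the capped delay is
  # fixed and half is random, so the result lies in [d/2, d].
  def backoff_for(attempts)
    d = [base * (2.0**(attempts - 1)), cap].min
    d / 2.0 + rand * (d / 2.0)
  end
end
```

With these assumed defaults, `SketchRetryPolicy.new.retryable?(RuntimeError.new, 1)` is true, the same call with `attempts` equal to 3 is false, and a policy built with `retry_on: []` retries nothing.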

What changed

  • New ChronoForge::Executor::RetryPolicy: retryable?(error, attempts) and backoff_for(attempts) are the entire decision surface. Backoff is computed once at re-enqueue (never replayed, so deterministic where it matters).
  • Per-call retry_policy: on durably_execute, durably_repeat, wait_until, plus a class-level retry_policy DSL. Resolution: per-call → class default → per-site built-in.
  • Per-site built-ins: steps 3 attempts / cap 30s; workflow-level 10 attempts / cap 600s (a tolerance window of up to ~8.5 min for transient infra errors on uncaught perform errors); wait_until retries nothing by default. wait_until deliberately does not inherit the class default, so a class-wide "retry everything" can't silently retry condition-evaluation bugs.
  • Deletions: RetryStrategy, the dead should_retry? (the 3-vs-5 contradiction), and the dead retry_method: reschedule arg.
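
The resolution order above can be sketched as a small lookup. This is a hedged illustration only: resolve_policy, its keyword names, and the BUILT_INS table are hypothetical, and symbols stand in for real policy objects. The preset names mirror those in the commit message.

```ruby
# Illustrative only: symbols stand in for RetryPolicy instances.
BUILT_INS = {
  step: :step_default,
  workflow: :workflow_default,
  wait_until: :wait_default
}.freeze

# Hypothetical resolver: per-call -> class default -> per-site built-in.
def resolve_policy(site:, per_call: nil, class_default: nil)
  return per_call if per_call
  # wait_until skips the class default on purpose, so a class-wide
  # "retry everything" cannot reach condition-evaluation errors.
  return class_default if class_default && site != :wait_until
  BUILT_INS.fetch(site)
end
```

Under this sketch, `resolve_policy(site: :step, class_default: :mine)` yields `:mine`, while `resolve_policy(site: :wait_until, class_default: :mine)` still yields `:wait_default`; only an explicit `per_call:` value changes what wait_until uses.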

Notes / decisions

  • The kwarg is retry_policy: (not retry:) because retry is a Ruby keyword — a retry: param can't be read inside a method without binding.local_variable_get(:retry).
  • Workflow-level retries fire only on uncaught errors in perform; a step exhausting its own retries stalls the workflow instead (unchanged). The 10 was chosen to ride out realistic transient blips (DB failover, deploy restart) without dragging out the deterministic-bug case — each workflow-level retry replays the whole workflow.
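
A short demonstration of the keyword issue behind the naming decision. The method names legacy_call and current_call are made up for this example; only the retry: / retry_policy: contrast and binding.local_variable_get(:retry) come from the note above.

```ruby
# Ruby accepts a keyword argument named after a reserved word, but inside
# the method a bare `retry` parses as the keyword, not as the variable.
# The value is only reachable through the binding.
def legacy_call(retry: 3)
  binding.local_variable_get(:retry)
end

# The name chosen in this PR reads like any other local variable.
def current_call(retry_policy: nil)
  retry_policy
end
```

Here `legacy_call(retry: 5)` returns 5 only because of the binding lookup, whereas `current_call(retry_policy: :some_policy)` needs no workaround.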

⚠️ Breaking changes

  • durably_execute/durably_repeat no longer accept max_attempts:; wait_until no longer accepts retry_on:. All three take retry_policy: now. Migrate max_attempts: N -> retry_policy: RetryPolicy.new(max_attempts: N) and retry_on: [...] -> retry_policy: RetryPolicy.new(retry_on: [...]).
  • Backoff is exponential-with-jitter everywhere. A permanently-failing workflow now reaches failed after 10 attempts (was an effective 4).
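
For callers with many call sites, the kwarg migration can be expressed mechanically. The sketch below is hypothetical: migrate_retry_kwargs is not part of ChronoForge, and StubRetryPolicy is a stand-in struct so the snippet runs without the gem loaded. In real code the replacement would be RetryPolicy.new(...) as described above.

```ruby
# Stand-in for ChronoForge::Executor::RetryPolicy so the snippet runs alone.
StubRetryPolicy = Struct.new(:max_attempts, :retry_on, keyword_init: true)

# Hypothetical helper: folds the removed kwargs into a retry_policy: kwarg
# and leaves every other option untouched.
def migrate_retry_kwargs(kwargs)
  rest = kwargs.dup
  legacy = {}
  %i[max_attempts retry_on].each do |key|
    legacy[key] = rest.delete(key) if rest.key?(key)
  end
  return rest if legacy.empty?
  rest.merge(retry_policy: StubRetryPolicy.new(**legacy))
end
```

For example, `migrate_retry_kwargs({max_attempts: 5})` produces a hash whose only key is `:retry_policy`, holding a policy with `max_attempts` set to 5.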

Testing

  • New RetryPolicy unit tests (truth table for retryable?, backoff growth/cap/jitter bounds, presets) and integration tests (per-call override, class default, wait_until fast-fail vs opt-in retry).
  • Full suite green: 105 tests, 0 failures; standardrb clean.

A design doc is included at docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md.

🤖 Generated with Claude Code

Collapse three independent retry systems — workflow-level uncaught
errors, step-level (durably_execute/durably_repeat), and wait_until
condition errors — into one RetryPolicy abstraction (attempt cap +
exponential-with-jitter backoff + error-class predicate).

- Add ChronoForge::Executor::RetryPolicy with per-site presets
  (step_default 3/cap30, workflow_default 10/cap600, wait_default
  retry-nothing).
- Add class-level `retry_policy` DSL and a per-call `retry_policy:`
  kwarg on durably_execute, durably_repeat, and wait_until. Resolution:
  per-call -> class default -> per-site built-in. wait_until does not
  inherit the class default so a class-wide "retry everything" can't
  silently retry condition-evaluation bugs.
- Delete RetryStrategy, the dead `should_retry?` (the 3-vs-5
  contradiction), and the dead `retry_method:` reschedule argument.

BREAKING: durably_execute/durably_repeat drop `max_attempts:` and
wait_until drops `retry_on:`; all three now take `retry_policy:`.
Backoff is exponential-with-jitter everywhere. Workflow-level uncaught
errors now retry up to 10 times (~8.5 min) before failing (was an
effective 4); step failures still stall rather than retry at the
workflow level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…l with jitter)

Equal jitter puts each wait in [d/2, d], so the cumulative window falls
between ~4.25 min and ~8.5 min; ~8.5 min is the undisturbed upper bound.
Wording-only; no behavior change.
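
The ~8.5 min figure can be reproduced with a few lines of arithmetic, under assumptions the PR does not state: a 1s base delay that doubles per retry, 9 waits between 10 attempts, and jitter drawn uniformly from [d/2, d]. The helper name cumulative_window is hypothetical.

```ruby
# Bounds on the total wait across all retries (seconds).
# Assumptions (not from the PR): delays double from `base`, there are
# attempts - 1 waits, and jitter is uniform on [d/2, d].
def cumulative_window(attempts: 10, base: 1, cap: 600)
  delays = (0...(attempts - 1)).map { |n| [base * 2**n, cap].min }
  upper = delays.sum              # every wait at its full delay d
  lower = upper / 2.0             # every wait at d/2
  mean = (lower + upper) / 2.0    # uniform jitter averages 0.75 * d
  { lower: lower, mean: mean, upper: upper }
end

window = cumulative_window
window.each { |name, secs| puts format("%-5s %.1f min", name, secs / 60.0) }
```

Under these assumptions the undisturbed total is 511s (about 8.5 min) and the floor is 255.5s (about 4.3 min); the 600s cap is never reached within 10 attempts.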