…f (v0.3.2)
Two real-world 502 Bad Gateway events on 2026-04-30 killed daily
dogfood runs that should have succeeded. The @octokit/plugin-retry
plugin only attaches to the REST client; GraphQL had no equivalent
until now, so a single transient hiccup propagated as either a raw
HTTP error or — worse — as `Cannot read properties of undefined`
when the HTML 502 body was parsed as a JSON shell.
Fix:
- New `withRetry(fn, opts)` helper in packages/core/src/retry.ts.
60 lines, dependency-free, exports `isRetryableError` so tests can
classify error shapes deterministically.
- Retried error classes: HTTP 5xx, 408, 429; net errors ECONNRESET,
ETIMEDOUT, ENOTFOUND, EAI_AGAIN; GraphqlResponseError whose message
contains 502/503/504/bad gateway/gateway timeout/service unavailable.
- Not retried: 4xx other than 408/429 (caller bugs), plain Error with
no status/code (probably code bugs).
- Exponential backoff with jitter: 500ms × 2ⁿ + 0–100ms jitter, capped
at 4 attempts (~8.5s worst case wall time).
- Wired into ingestSnapshot (USER_QUERY + REPOS_QUERY paginate loop)
and ingestAuditExtras (REPO_EXTRAS_QUERY paginate loop +
USER_OPEN_PRS_QUERY). The per-PR timeline call retries with
maxAttempts: 2 only: its existing per-PR try/catch already degrades
to timeline: null, so wall time stays predictable during a brief
outage.
- `onRetry` hook logs attempt + delay + compact error label via the
existing pino logger so dogfood logs show the retry trail.
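
The retry/no-retry rules listed above can be sketched as a small classifier. This is an illustration only, not the code in packages/core/src/retry.ts: the name `isRetryableSketch` is hypothetical, and it assumes errors expose a numeric `status`, a string `code`, and (for GraphQL response errors) `name === "GraphqlResponseError"`.

```typescript
// Hypothetical sketch of the classification rules; the real isRetryableError may differ.
const RETRYABLE_NET_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "ENOTFOUND", "EAI_AGAIN"]);

// Gateway-style failures that can surface inside a GraphQL error message.
const GATEWAY_MESSAGE = /\b(?:502|503|504)\b|bad gateway|gateway timeout|service unavailable/i;

// Assumed error shape: only the fields the classifier looks at.
type ErrorLike = { status?: unknown; code?: unknown; name?: unknown; message?: unknown };

function isRetryableSketch(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as ErrorLike;

  // An HTTP status decides on its own: 5xx, 408 and 429 are treated as transient,
  // every other status (for example 401 or 404) is treated as a caller bug.
  if (typeof e.status === "number") {
    return (e.status >= 500 && e.status <= 599) || e.status === 408 || e.status === 429;
  }

  // Network-level failures identified by their error code.
  if (typeof e.code === "string" && RETRYABLE_NET_CODES.has(e.code)) return true;

  // GraphQL response errors whose message mentions a gateway failure.
  if (e.name === "GraphqlResponseError" && typeof e.message === "string") {
    return GATEWAY_MESSAGE.test(e.message);
  }

  // Plain Error with no status/code: probably a code bug, so do not retry.
  return false;
}
```

Checking `status` first means an explicit 4xx is never retried even if other fields look transient; whether the real helper orders its checks this way is an assumption.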
Tests:
- 14 tests in packages/core/test/retry.test.ts cover the classifier
(positive + negative cases per error shape) and the backoff/cap math,
including a check that delays grow exponentially between attempts.
No new dependencies. No new public API surface beyond `withRetry` /
`isRetryableError` exports. Zero behavior change on non-transient
errors.
Summary
Two real-world 502 Bad Gateway events on 2026-04-30 killed daily dogfood runs that should have succeeded. The `@octokit/plugin-retry` plugin only attaches to the REST client; GraphQL had no equivalent until now, so a single transient hiccup propagated either as a raw HTTP error or, worse, as `Cannot read properties of undefined` when the HTML 502 body was parsed as a JSON shell.
This PR closes that gap with a tiny dependency-free `withRetry` helper, applied to every `client.graphql` call across both `ingestSnapshot` and `ingestAuditExtras`.
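
For readers who have not opened the diff, a helper of this kind might look roughly like the sketch below. It is not the PR's implementation: `withRetrySketch` is a hypothetical name, and while `maxAttempts` and `onRetry` appear in the PR text, the other option names (`baseDelayMs`, `jitterMs`, `isRetryable`, `sleep`) and the payload passed to `onRetry` are assumptions made for illustration.

```typescript
// Hypothetical options; only maxAttempts and onRetry are named in the PR text.
interface RetrySketchOptions {
  maxAttempts?: number;
  onRetry?: (info: { attempt: number; delayMs: number; error: unknown }) => void;
  baseDelayMs?: number;                      // assumed name
  jitterMs?: number;                         // assumed name
  isRetryable?: (error: unknown) => boolean; // assumed name
  sleep?: (ms: number) => Promise<void>;     // assumed; injectable so tests need not wait
}

async function withRetrySketch<T>(
  fn: () => Promise<T>,
  opts: RetrySketchOptions = {},
): Promise<T> {
  const maxAttempts = opts.maxAttempts ?? 4;
  const baseDelayMs = opts.baseDelayMs ?? 500;
  const jitterMs = opts.jitterMs ?? 100;
  const isRetryable = opts.isRetryable ?? (() => true);
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  let attempt = 1;
  while (true) {
    try {
      return await fn();
    } catch (error) {
      // Give up on the last attempt or on errors classified as non-transient.
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;

      // Exponential backoff (base, 2x base, 4x base, ...) plus random jitter.
      const delayMs = baseDelayMs * Math.pow(2, attempt - 1) + Math.random() * jitterMs;
      if (opts.onRetry) opts.onRetry({ attempt, delayMs, error });
      await sleep(delayMs);
      attempt++;
    }
  }
}
```

Under these assumptions a call site would wrap the request in a thunk, for example `withRetrySketch(() => client.graphql(USER_QUERY, vars))`, so that each attempt issues a fresh request.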
What changed
Retry policy
Backoff: `500ms × 2ⁿ + 0–100ms jitter`. Default 4 attempts (max ~8.5s wall time).
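
As a rough check on the backoff formula, the sketch below tabulates the sleep bounds. It assumes `n` is the zero-based retry index and that one sleep happens between consecutive attempts, so 4 attempts means 3 sleeps. `backoffBounds` is a hypothetical helper, not part of the PR. It counts sleep time only; time spent waiting on the requests themselves is extra, so the totals are not meant to reproduce the wall-time figure above.

```typescript
// Hypothetical helper: lower/upper bounds on sleep time for a given attempt cap.
function backoffBounds(maxAttempts = 4, baseMs = 500, jitterMs = 100) {
  const sleeps: { retry: number; minMs: number; maxMs: number }[] = [];

  // One sleep before each retry; the first attempt is not preceded by a sleep.
  for (let n = 0; n < maxAttempts - 1; n++) {
    const minMs = baseMs * Math.pow(2, n); // no jitter
    sleeps.push({ retry: n + 1, minMs, maxMs: minMs + jitterMs }); // full jitter
  }

  const totalMinMs = sleeps.reduce((sum, s) => sum + s.minMs, 0);
  const totalMaxMs = sleeps.reduce((sum, s) => sum + s.maxMs, 0);
  return { sleeps, totalMinMs, totalMaxMs };
}

const bounds = backoffBounds();
// Under the stated assumptions: sleeps of 500, 1000 and 2000 ms before jitter,
// for a total between 3500 ms and 3800 ms.
console.log(bounds.totalMinMs, bounds.totalMaxMs);
```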
The per-PR timeline retry is capped at `maxAttempts: 2` because its existing per-PR try/catch already degrades to `timeline: null` — we don't want a brief outage to 5x the run time on a 50-PR account.
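
The degrade-to-null pattern described above can be sketched as follows. All names here (`collectTimelines`, `fetchTimeline`, the `Retry` type) are hypothetical, and the retry function is passed in as a parameter purely to keep the example self-contained; the actual ingestion code is presumably structured differently.

```typescript
// Assumed signature of a retry helper: runs fn up to opts.maxAttempts times.
type Retry = <T>(fn: () => Promise<T>, opts: { maxAttempts: number }) => Promise<T>;

interface PrWithTimeline<TL> {
  number: number;
  timeline: TL | null; // null means the timeline could not be fetched
}

async function collectTimelines<TL>(
  prNumbers: number[],
  fetchTimeline: (prNumber: number) => Promise<TL>,
  retry: Retry,
): Promise<PrWithTimeline<TL>[]> {
  const results: PrWithTimeline<TL>[] = [];
  for (const prNumber of prNumbers) {
    try {
      // A small attempt cap keeps the worst case per PR bounded.
      const timeline = await retry(() => fetchTimeline(prNumber), { maxAttempts: 2 });
      results.push({ number: prNumber, timeline });
    } catch {
      // Degrade instead of failing the whole run: this PR just has no timeline.
      results.push({ number: prNumber, timeline: null });
    }
  }
  return results;
}
```

Because a failed timeline is already tolerated, extra attempts here mostly add latency during an outage, which is the trade-off the lower cap reflects.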
Test plan
Reviewer notes