feat(orb): Orb relay failure queue with periodic retry #1386

Merged: JSONbored merged 1 commit into main from feat/orb-relay-retry-queue on Jun 25, 2026

@JSONbored

Owner

Summary

When forwardOrbEvent fails to reach a brokered self-host container (e.g. it is temporarily unreachable or returning a non-2xx response), the delivery was previously dropped silently. This adds a retry queue so transient failures recover automatically.

What ships:

  • Migration 0070: orb_relay_failures table (delivery_id PK, event_name, installation_id, raw_body, attempts, last_attempt_at, created_at, expires_at)
  • storeRelayFailure — idempotent insert (ON CONFLICT DO NOTHING); called from handleOrbWebhook when forwardOrbEvent returns "failed" and the installation_id is known
  • retryFailedRelays — periodic retry worker: deletes rows on success or skip, increments attempts on continued failure; prunes rows that have exhausted 5 attempts or exceeded a 1-hour TTL
  • retry-orb-relay job type — added to src/types.ts and src/queue/processors.ts
  • Cron wiring: src/index.ts enqueues retry-orb-relay every sweep cycle (≈2 min), but only when ORB_BROKER_ENABLED is set, so the Cloudflare Worker deploy is byte-identical until the broker flag is turned on
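
As a rough illustration of the idempotent insert described in the list above, the sketch below models orb_relay_failures as an in-memory map keyed by delivery_id. The column names come from the migration bullet; everything else (the RelayFailureRow and RelayFailureInput types, the Map-backed store, the storeRelayFailureSketch name and signature, and the numeric installation_id) is an assumption made for illustration and is not the code merged in this PR.

```typescript
// Hypothetical row shape mirroring the migration's column list.
interface RelayFailureRow {
  delivery_id: string;
  event_name: string;
  installation_id: number; // numeric type is an assumption
  raw_body: string;
  attempts: number;
  last_attempt_at: number | null; // epoch ms; null until the first retry
  created_at: number; // epoch ms
  expires_at: number; // epoch ms
}

// Hypothetical input shape; the real storeRelayFailure signature is not shown here.
interface RelayFailureInput {
  deliveryId: string;
  eventName: string;
  installationId: number | null;
  rawBody: string;
}

const RELAY_TTL_MS = 60 * 60 * 1000; // 1-hour TTL from the summary

// Returns true when a row was inserted, false when the call was a no-op.
function storeRelayFailureSketch(
  table: Map<string, RelayFailureRow>,
  input: RelayFailureInput,
  now: number,
): boolean {
  // No installation_id means there is no container to retry against.
  if (input.installationId === null) return false;
  // Stand-in for ON CONFLICT DO NOTHING: an existing delivery_id is left untouched.
  if (table.has(input.deliveryId)) return false;
  table.set(input.deliveryId, {
    delivery_id: input.deliveryId,
    event_name: input.eventName,
    installation_id: input.installationId,
    raw_body: input.rawBody,
    attempts: 0,
    last_attempt_at: null,
    created_at: now,
    expires_at: now + RELAY_TTL_MS,
  });
  return true;
}
```

Calling the sketch twice with the same deliveryId leaves the first row unchanged, which is the property that makes a redelivered webhook safe to record again.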

Boundary decisions:

  • Only "failed" returns from forwardOrbEvent are stored — "skipped" events (non-forwardable type, no registered relay) are fast no-ops that don't need retrying
  • retryFailedRelays never throws; a persistently-down container burns its attempt budget and the row is pruned cleanly
  • At-most-5-retries × ≈2-min cadence = up to ~10 min of recovery window before the delivery is abandoned (events older than 1 hour are also pruned regardless of attempt count)
  • On retry, "skipped" now also deletes the row — if a container deregisters its relay URL between the initial failure and the retry, the failure is cleaned up rather than exhausting attempts
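
The boundary decisions above amount to a small per-row state machine, sketched below under stated assumptions. The limits (5 attempts, 1-hour TTL, ≈2-min sweep) come from this description; the "forwarded" success value, the PendingRelay shape, the helper names, and the choice to prune before retrying are all hypothetical and may differ from the actual retryFailedRelays.

```typescript
// "failed" and "skipped" are named in the PR; "forwarded" is an assumed name for success.
type ForwardResult = "forwarded" | "skipped" | "failed";

// Hypothetical minimal view of an orb_relay_failures row.
interface PendingRelay {
  attempts: number;
  expiresAt: number; // epoch ms
}

const MAX_ATTEMPTS = 5;
const SWEEP_INTERVAL_MS = 2 * 60 * 1000; // approximate cron cadence

// Worst-case recovery window: 5 attempts x 2 min = 10 min.
const RECOVERY_WINDOW_MS = MAX_ATTEMPTS * SWEEP_INTERVAL_MS;

// A row is pruned once its attempt budget is spent or its TTL has passed.
function shouldPrune(row: PendingRelay, now: number): boolean {
  return row.attempts >= MAX_ATTEMPTS || row.expiresAt <= now;
}

// Returns the updated row to keep, or null when the row should be deleted.
function applyRetryResult(row: PendingRelay, result: ForwardResult): PendingRelay | null {
  if (result === "failed") {
    // Continued failure: keep the row and spend one attempt.
    return { ...row, attempts: row.attempts + 1 };
  }
  // Success or skip (e.g. the relay was deregistered): nothing left to retry.
  return null;
}
```

With these helpers, a container that stays down is retried on five consecutive sweeps and its row is dropped on the sixth, while a skip or a success removes the row immediately.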

Scope

  • Self-host / Orb / broker path
  • New public API or changed API signature
  • DB schema / migration (0070_orb_relay_failures.sql)
  • Auth / session / CORS path
  • UI change
  • Docs change

Validation

  • npx vitest run test/integration/orb-relay.test.ts — 26 tests (19 existing + 7 new), all pass
  • npx vitest run test/integration/orb-webhook.test.ts — 17 tests, all pass
  • npm run test:ci — exit 0, full gate green

Safety

  • No secrets, tokens, wallets, hotkeys, trust scores, or reward values introduced
  • No auth/CORS changes; no public API surface changed
  • Self-host only; Cloudflare Worker deploy is byte-identical when ORB_BROKER_ENABLED is unset
  • storeRelayFailure stores the raw webhook body in raw_body (same data already stored in orb_webhook_events.payload_hash); no new sensitive data is persisted

When forwardOrbEvent fails to reach a brokered self-host container (e.g.
it is temporarily unreachable), the delivery is now recorded in a new
orb_relay_failures table instead of being silently dropped.

The cron fires a retry-orb-relay job every sweep cycle (≈2 min) when
ORB_BROKER_ENABLED is set. retryFailedRelays re-calls forwardOrbEvent for
each pending row, deletes it on success or on a RELAY_FORWARD_EVENTS skip,
and increments its attempt counter on continued failure. Rows are pruned
after 5 attempts or a 1-hour TTL — whichever comes first.

Boundary behaviour:
- insertions are idempotent (ON CONFLICT DO NOTHING on delivery_id)
- storeRelayFailure is a no-op if the forward was skipped (the event type
  is not forwardable) or if installation_id is not set; only "failed" from
  a registered relay is recorded
- retryFailedRelays never throws; a persistently-down container burns its
  budget (5 attempts) and the row is cleaned up, preventing unbounded growth

The Cloudflare Worker deploy is byte-identical until ORB_BROKER_ENABLED is
set (the cron job is not enqueued unless the flag is truthy).
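
A minimal sketch of the flag gate described above, assuming the flag arrives as an optional string binding. The SweepEnv and SweepJob types and the jobsForSweep helper are hypothetical; only the ORB_BROKER_ENABLED flag and the retry-orb-relay job type come from this PR, and the real truthiness check in src/index.ts may be stricter than plain JavaScript truthiness.

```typescript
// Hypothetical environment shape; only the flag name is taken from the PR.
interface SweepEnv {
  ORB_BROKER_ENABLED?: string;
}

// Hypothetical job shape; the job type string is taken from the PR.
interface SweepJob {
  type: "retry-orb-relay";
}

// Returns the extra jobs a sweep cycle would enqueue for the broker path.
function jobsForSweep(env: SweepEnv): SweepJob[] {
  // Assumption: plain truthiness, so unset or "" disables the job.
  // Note that under this rule the string "false" would still enable it.
  if (!env.ORB_BROKER_ENABLED) return [];
  return [{ type: "retry-orb-relay" }];
}
```

When the flag is unset the function returns an empty list, so the sweep behaves exactly as it did before this change.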
@dosubot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) Jun 25, 2026
@superagent-security


Superagent didn't find any vulnerabilities or security issues in this PR.

@JSONbored JSONbored merged commit 50d0eea into main Jun 25, 2026
17 checks passed
@JSONbored JSONbored deleted the feat/orb-relay-retry-queue branch June 25, 2026 19:39
@codecov

codecov Bot commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 70.58824% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.33%. Comparing base (acd1535) to head (4c94115).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/orb/webhook.ts | 50.00% | 1 Missing and 1 partial ⚠️ |
| src/queue/processors.ts | 0.00% | 2 Missing ⚠️ |
| src/index.ts | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1386      +/-   ##
==========================================
- Coverage   95.36%   95.33%   -0.03%     
==========================================
  Files         191      191              
  Lines       20714    20730      +16     
  Branches     7489     7493       +4     
==========================================
+ Hits        19753    19763      +10     
- Misses        380      383       +3     
- Partials      581      584       +3     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/orb/relay.ts | 100.00% <100.00%> (ø) |
| src/index.ts | 90.69% <0.00%> (-2.16%) ⬇️ |
| src/orb/webhook.ts | 96.66% <50.00%> (-3.34%) ⬇️ |
| src/queue/processors.ts | 87.61% <0.00%> (-0.23%) ⬇️ |