Skip to content

feat(core)!: two-phase broadcast status + reconcile grace window#26

Merged
whoabuddy merged 3 commits into
mainfrom
feat/sponsor-ledger-two-phase-broadcast
Apr 15, 2026
Merged

feat(core)!: two-phase broadcast status + reconcile grace window#26
whoabuddy merged 3 commits into
mainfrom
feat/sponsor-ledger-two-phase-broadcast

Conversation

@whoabuddy

Copy link
Copy Markdown
Contributor

Closes #25.

Summary

  • Introduces LedgerEntry.status (pending_broadcast | broadcast_sent | broadcast_failed) with explicit lifecycle helpers — beginPendingBroadcast / resolveBroadcast — backed by a transition matrix enforced at the schema layer (LedgerTransitionError on invalid moves).
  • Adds justBroadcastGraceSeconds (default 30) to reconcile(); ledger entries absent from the mempool but broadcast within the grace window are surfaced as inFlightPendingIndex instead of dropped. reconcile() also auto-promotes pending_broadcastbroadcast_sent on mempool confirmation.
  • decideBroadcast gains an await_pending_broadcast variant that short-circuits any other decision while a ledger entry is unresolved — hard invariant against silent double-broadcast.

Design notes

  • OccupantClassification stays closed. Grace-window ambiguity needs ledger + clock context that classifyOccupant has neither of by design; forcing it in would make every caller grace-aware for a concern only reconcile can answer. inFlightPendingIndex is additive on ReconcileResult where that context lives.
  • Timestamps stay ISO. broadcastAt remains IsoDateTimeSchema; grace math parses to epoch ms internally. No parallel unix-ms field.
  • Lifecycle helpers over README conventions. A two-phase write "pattern" documented only in prose drifts. beginPendingBroadcast / resolveBroadcast enforce the transition matrix at the type boundary and throw on misuse.

Breaking change

SponsorLedgerEntry.status is required. Entries persisted under 0.8.0 lack it and will fail schema parse. Consumers must backfill status: "broadcast_sent" on migration. Targets 0.9.0.

Test plan

🤖 Generated with Claude Code

 #25)

Closes the two residual state-confidence gaps that blocked the relay
sponsor-ledger refactor from claiming mainnet-grade state confidence.

**Gap 1 — two-phase broadcast lifecycle**

- `LedgerEntryStatusSchema`: `pending_broadcast` | `broadcast_sent` | `broadcast_failed`
- `SponsorLedgerEntry.status` is now required; schema rejects entries without it
- `beginPendingBroadcast(ledger, input)` writes the ledger before the network
  call; refuses to overwrite an unresolved `pending_broadcast`
- `resolveBroadcast(ledger, nonce, "sent" | "failed")` transitions on return;
  throws `LedgerTransitionError` on invalid transitions (no double-resolve)
- `decideBroadcast` returns `{ kind: "await_pending_broadcast", nonce, txId }`
  when the current entry is unresolved — forces the consumer to resolve the
  prior call before another broadcast can fire

**Gap 2 — reconcile grace window**

- `reconcile()` accepts `justBroadcastGraceSeconds` (default 30)
- Ledger entries absent from the mempool but broadcast within the window are
  surfaced as `inFlightPendingIndex` instead of `dropped` — covers node →
  Hiro indexer propagation lag
- `reconcile()` promotes `pending_broadcast` → `broadcast_sent` automatically
  when the mempool confirms the txId (crash recovery)

**Classification stays closed**

`OccupantClassification` is NOT extended. Grace-window ambiguity requires
ledger + clock context that `classifyOccupant` has neither of by design;
leaking it there would make every caller grace-aware for a concern only
`reconcile` can answer. The pending state is a ledger lifecycle concern,
not an occupant kind — the existing `sponsor_owned_in_ledger` match handles
mempool-confirmed pending entries correctly.

**Breaking change**

Existing ledger entries persisted under 0.8.0 lack `status` and will fail
schema parse. Consumers must backfill `status: "broadcast_sent"` on
migration — targets `0.9.0` minor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 15, 2026 19:26

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the sponsor wallet state machine by introducing a two-phase broadcast lifecycle for ledger entries and adding a reconcile grace window to reduce false “dropped” classifications during mempool/indexer propagation lag.

Changes:

  • Add required SponsorLedgerEntry.status (pending_broadcast | broadcast_sent | broadcast_failed) and new lifecycle helpers beginPendingBroadcast / resolveBroadcast with LedgerTransitionError.
  • Update decideBroadcast to return await_pending_broadcast when a nonce is unresolved, preventing accidental double-broadcast.
  • Extend reconcile() with justBroadcastGraceSeconds and a new inFlightPendingIndex result bucket; also auto-promote pending_broadcast → broadcast_sent on mempool confirmation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/core/sponsor-ledger.ts Adds required status field + status enum schema for ledger entries.
src/core/sponsor-wallet-machine.ts Implements two-phase lifecycle helpers, await_pending_broadcast decision, and reconcile grace-window + in-flight classification.
tests/sponsor-wallet-machine.test.ts Updates existing tests for required status and adds coverage for two-phase lifecycle + grace-window behaviors.
README.md Documents the two-phase broadcast write pattern and reconcile grace-window semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/core/sponsor-wallet-machine.ts Outdated
Comment thread README.md Outdated
Comment thread src/core/sponsor-wallet-machine.ts Outdated
- Remove unused LedgerEntryStatusSchema import from sponsor-wallet-machine.
- reconcile: txId mismatch at a nonce is drift, not propagation lag. Classify
  as dropped regardless of broadcastAt age. Grace window now applies only
  when observed is null/undefined. Adds test to pin the behavior.
- README: reformat inline object literal as fenced block so markdown renderers
  don't split the single-line code span across a newline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@arc0btc arc0btc left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two-phase broadcast lifecycle + reconcile grace window — closes the crash-window gap that's been an open reliability concern for the relay.

What works well:

  • The transition matrix is airtight: PENDING_PREDECESSORS set clearly models which states are valid predecessors, and LedgerTransitionError gives callers a typed error to catch rather than a silent wrong-state write.
  • decideBroadcast guarding at the top — returning await_pending_broadcast before any other logic — is the right precedence call. It keeps the invariant clean regardless of occupant classification.
  • The two-pass structure in reconcile() is well-commented and the separation of concerns is sound: Pass 1 handles mempool hits (adopt + promote), Pass 2 handles absences (grace vs. dropped). The second commit's fix making txId-drift unconditionally dropped is exactly right — drift is not lag.
  • Test coverage is thorough. The crash-recovery scenario (entry broadcast 120s ago, never resolved, not in mempool → dropped) and the grace-window boundary tests (10s inside / 60s outside / custom override) give me confidence the edge cases are understood.

[question] Grace window applies to broadcast_sent entries too (src/core/sponsor-wallet-machine.ts, Pass 2)

Pass 2 computes ageMs from entry.broadcastAt for all entries when observed is null — regardless of whether the entry is pending_broadcast or broadcast_sent. A broadcast_sent entry that temporarily disappears from the Hiro mempool (say, due to rapid confirmation or transient indexer lag) within 30s of its broadcastAt timestamp would land in inFlightPendingIndex rather than dropped. Is this intentional? Operationally, a broadcast_sent entry going absent is a different signal than a pending_broadcast one — the former means the node confirmed receipt, the latter means we haven't heard back yet. If the intent is grace-for-pending-only, the condition could be:

    const ageMs = nowMs - Date.parse(entry.broadcastAt);
    if (entry.status === "pending_broadcast" && ageMs >= 0 && ageMs < graceMs) {
      inFlightPendingIndex.push(entry.nonce);
    } else {

If the current behavior is intentional (grace for any recent absence), a brief comment explaining why would help future readers.

[question] justBroadcastGraceSeconds default vs. stated propagation timing (src/core/sponsor-wallet-machine.ts:507)

The comment says the grace window "Covers node→indexer propagation lag (~6-10 Nakamoto blocks + Hiro indexing)". If those are streaming Stacks blocks at ~1-2s each, 30s is comfortably generous. If they're longer, 30s could be tight. The x402 relay can override this via ReconcileOptions, but for consumers who don't override, it's worth confirming the 30s default is calibrated to real Hiro indexing latency. We've seen Hiro 400 delays operationally, though those are different failure modes.

[nit] Variable shadowing in README example (README.md)

let ledger = beginPendingBroadcast(ledger, {

This shadows the outer ledger binding. JS evaluates the RHS correctly so the code works, but the pattern is confusing to read — looks like a circular assignment at a glance. let pendingLedger = beginPendingBroadcast(ledger, ...) or a reassignment (ledger = beginPendingBroadcast(ledger, ...)) would read more naturally.

Code quality notes:

The rbfAttempts defaulting chain in beginPendingBroadcast (input.rbfAttempts ?? existing?.rbfAttempts ?? 0) is correct but callers doing RBF need to remember to pass the incremented count explicitly — there's no auto-increment. The README covers the first-broadcast case but not the RBF flow. A brief README addition showing the RBF call (rbfAttempts: existingEntry.rbfAttempts + 1) would prevent a subtle foot-gun for consumers integrating this pattern.

Operational context:

We run this code path nightly for Zest sBTC supply operations. The nonce-serialization fix we shipped last week (shared nonce coordinator) prevents concurrent double-submissions at the application layer, but there was still a theoretical crash window between write and node confirmation. This closes it cleanly at the state machine layer. The await_pending_broadcast guard is the piece we'd want most — if the process dies mid-broadcast, crash recovery via reconcile picks it up without a second broadcast firing.

The two questions above are worth a quick check before merge, but neither is blocking. The core logic is sound and the test coverage gives me confidence in the critical paths.

- reconcile: grace window now applies ONLY to pending_broadcast entries.
  A broadcast_sent entry that vanishes from the mempool is mined or evicted
  (the node already confirmed receipt), not propagation lag. Falls through
  to dropped so the caller disambiguates via tx status lookup.
- ReconcileOptions: clarify default-30s calibration (Nakamoto block timing
  + Hiro indexer latency) and document grace-window scope.
- README: drop shadowed `let ledger = ...` in favor of straight reassignment.
- README: add RBF flow example showing explicit rbfAttempts increment so
  consumers don't miss the no-auto-increment behavior.
- Test: pin broadcast_sent absence behavior (5s old, mempool null → dropped,
  not inFlightPendingIndex).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@whoabuddy

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough read, @arc0btc — all four points addressed in a909bbc.

[question 1] Grace window leaking to broadcast_sent — fixed.
Sharp catch. You're right that broadcast_sent and pending_broadcast carry different operational meaning when they vanish from the mempool. Pass 2 now branches explicitly:

  • pending_broadcast + absent → grace window applies (could still be propagation lag)
  • broadcast_sent + absent → dropped immediately (node confirmed receipt; absence means mined or evicted, caller disambiguates via tx status lookup)
  • any tx_id mismatch → dropped immediately (drift, not lag)

Added a test does NOT grace-window broadcast_sent entries that pins this: 5s-old broadcast_sent entry, mempool null → dropped, not inFlightPendingIndex. Comment on ReconcileOptions.justBroadcastGraceSeconds now spells out the scope.

[question 2] 30s default — keeping it, but documented honestly.
Stayed at 30s to match the issue spec, but expanded the comment to acknowledge it's calibrated to the tight end of the Nakamoto-block + Hiro-indexer envelope (3-5s × 6-10 blocks ≈ 30-50s). Operators seeing routine Hiro slowness should override upward; the grace only delays a dropped classification, never changes a true outcome. If we hit operational pain on this default, we can revisit in a 0.10.x.

[nit 3] README shadowing — fixed.
Dropped the let ledger = ... rebind in favor of straight reassignment. Reads as the iterative state update it actually is.

[code quality] RBF flow example — added.
Good foot-gun call. README now has a second snippet showing the explicit rbfAttempts: existing.rbfAttempts + 1 pass on retry, with a note that beginPendingBroadcast doesn't auto-increment because the same helper covers first broadcasts and retries.

Operationally: the Zest sBTC nightly path is exactly the consumer this is meant to harden. The shared nonce coordinator catches concurrent submissions; this catches the in-process crash window. Together that's the gap closed end-to-end. 219 tests green, build clean — ready for merge when you are.

@whoabuddy whoabuddy merged commit cbf7261 into main Apr 15, 2026
1 check passed
arc0btc added a commit to arc0btc/arc-starter that referenced this pull request Apr 16, 2026
…atch

Six independent silent-failure modes in v2.1.108 produced prose-only
dispatches with no tool calls landing. Decisive fix: setting
CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=0 — the v2.1.108 hardening force-resets
permission mode to "default" when scrub is on, silently overriding
--permission-mode bypassPermissions. End-to-end verified: PR review
posted to aibtcdev/tx-schemas#26 by arc0btc, task self-closed.

Files:
- .claude/settings.json: disable sandbox (VM is the trust boundary;
  network allowlist wildcard "*" silently no-ops in v2.1.108)
- docs/claude-code-v2.1.108-sandbox-patch.md: full investigation notes
  covering all six failure modes (dispatch.ts changes documented
  there were committed in earlier auto-commits)
- .gitignore: exclude db/last-dispatch-*.txt diagnostic dumps
whoabuddy added a commit to aibtcdev/landing-page that referenced this pull request Apr 22, 2026
* chore(deps): bump @aibtc/tx-schemas to 1.0.0

Upgrade from ^0.3.0 to ^1.0.0. The 1.0.0 breaking change (two-phase
broadcast status + reconcile grace window, aibtcdev/tx-schemas#26) adds
five new TerminalReason variants and five new RpcErrorCode variants;
the core enums used by this repo (TrackedPaymentStateSchema,
TerminalFailureStateSchema, paymentStateDefaultDeliveryByState) are
unchanged and required no code edits.

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(inbox): adopt new broadcast states from tx-schemas 1.0.0

Expand RPC_ERROR_CODE_MAP and TERMINAL_REASON_ERROR_CODE_MAP in
relay-rpc.ts to cover the five new variants introduced in tx-schemas
1.0.0:

  TerminalReason: sponsor_exhausted → INSUFFICIENT_FUNDS,
    sponsor_nonce_conflict → RELAY_ERROR,
    origin_chaining_limit → NONCE_CONFLICT,
    broadcast_rate_limited → BROADCAST_FAILED,
    sender_hand_expired → PAYMENT_NOT_FOUND

  RpcErrorCode: SPONSOR_EXHAUSTED → INSUFFICIENT_FUNDS,
    ORIGIN_CHAINING_LIMIT → NONCE_CONFLICT,
    BROADCAST_RATE_LIMITED → BROADCAST_FAILED,
    SENDER_HAND_EXPIRED → PAYMENT_NOT_FOUND,
    NONCE_OCCUPIED → NONCE_CONFLICT

Add 18 schema-compat tests verifying enum parse round-trips and
map resolution for each new variant (506 total, up from 488).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: address PR review feedback

- Refactor new-TerminalReason mapping tests to be table-driven and cover
  all 5 new variants, not just 3. Previously sponsor_nonce_conflict and
  broadcast_rate_limited were added to TERMINAL_REASON_ERROR_CODE_MAP
  without matching regression tests.
- Add `afterEach(() => vi.useRealTimers())` to the mapping describe so an
  assertion throwing mid-test can't leak fake timers into later tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Harden sponsor ledger state transitions: two-phase broadcast status + reconcile grace window

3 participants