feat(core)!: two-phase broadcast status + reconcile grace window#26
Conversation
#25) Closes the two residual state-confidence gaps that blocked the relay sponsor-ledger refactor from claiming mainnet-grade state confidence. **Gap 1 — two-phase broadcast lifecycle** - `LedgerEntryStatusSchema`: `pending_broadcast` | `broadcast_sent` | `broadcast_failed` - `SponsorLedgerEntry.status` is now required; schema rejects entries without it - `beginPendingBroadcast(ledger, input)` writes the ledger before the network call; refuses to overwrite an unresolved `pending_broadcast` - `resolveBroadcast(ledger, nonce, "sent" | "failed")` transitions on return; throws `LedgerTransitionError` on invalid transitions (no double-resolve) - `decideBroadcast` returns `{ kind: "await_pending_broadcast", nonce, txId }` when the current entry is unresolved — forces the consumer to resolve the prior call before another broadcast can fire **Gap 2 — reconcile grace window** - `reconcile()` accepts `justBroadcastGraceSeconds` (default 30) - Ledger entries absent from the mempool but broadcast within the window are surfaced as `inFlightPendingIndex` instead of `dropped` — covers node → Hiro indexer propagation lag - `reconcile()` promotes `pending_broadcast` → `broadcast_sent` automatically when the mempool confirms the txId (crash recovery) **Classification stays closed** `OccupantClassification` is NOT extended. Grace-window ambiguity requires ledger + clock context that `classifyOccupant` has neither of by design; leaking it there would make every caller grace-aware for a concern only `reconcile` can answer. The pending state is a ledger lifecycle concern, not an occupant kind — the existing `sponsor_owned_in_ledger` match handles mempool-confirmed pending entries correctly. **Breaking change** Existing ledger entries persisted under 0.8.0 lack `status` and will fail schema parse. Consumers must backfill `status: "broadcast_sent"` on migration — targets `0.9.0` minor. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the sponsor wallet state machine by introducing a two-phase broadcast lifecycle for ledger entries and adding a reconcile grace window to reduce false “dropped” classifications during mempool/indexer propagation lag.
Changes:
- Add required
SponsorLedgerEntry.status(pending_broadcast|broadcast_sent|broadcast_failed) and new lifecycle helpersbeginPendingBroadcast/resolveBroadcastwithLedgerTransitionError. - Update
decideBroadcastto returnawait_pending_broadcastwhen a nonce is unresolved, preventing accidental double-broadcast. - Extend
reconcile()withjustBroadcastGraceSecondsand a newinFlightPendingIndexresult bucket; also auto-promotepending_broadcast → broadcast_senton mempool confirmation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/core/sponsor-ledger.ts |
Adds required status field + status enum schema for ledger entries. |
src/core/sponsor-wallet-machine.ts |
Implements two-phase lifecycle helpers, await_pending_broadcast decision, and reconcile grace-window + in-flight classification. |
tests/sponsor-wallet-machine.test.ts |
Updates existing tests for required status and adds coverage for two-phase lifecycle + grace-window behaviors. |
README.md |
Documents the two-phase broadcast write pattern and reconcile grace-window semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Remove unused LedgerEntryStatusSchema import from sponsor-wallet-machine. - reconcile: txId mismatch at a nonce is drift, not propagation lag. Classify as dropped regardless of broadcastAt age. Grace window now applies only when observed is null/undefined. Adds test to pin the behavior. - README: reformat inline object literal as fenced block so markdown renderers don't split the single-line code span across a newline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
arc0btc
left a comment
There was a problem hiding this comment.
Two-phase broadcast lifecycle + reconcile grace window — closes the crash-window gap that's been an open reliability concern for the relay.
What works well:
- The transition matrix is airtight:
PENDING_PREDECESSORSset clearly models which states are valid predecessors, andLedgerTransitionErrorgives callers a typed error to catch rather than a silent wrong-state write. decideBroadcastguarding at the top — returningawait_pending_broadcastbefore any other logic — is the right precedence call. It keeps the invariant clean regardless of occupant classification.- The two-pass structure in
reconcile()is well-commented and the separation of concerns is sound: Pass 1 handles mempool hits (adopt + promote), Pass 2 handles absences (grace vs. dropped). The second commit's fix making txId-drift unconditionallydroppedis exactly right — drift is not lag. - Test coverage is thorough. The crash-recovery scenario (entry broadcast 120s ago, never resolved, not in mempool → dropped) and the grace-window boundary tests (10s inside / 60s outside / custom override) give me confidence the edge cases are understood.
[question] Grace window applies to broadcast_sent entries too (src/core/sponsor-wallet-machine.ts, Pass 2)
Pass 2 computes ageMs from entry.broadcastAt for all entries when observed is null — regardless of whether the entry is pending_broadcast or broadcast_sent. A broadcast_sent entry that temporarily disappears from the Hiro mempool (say, due to rapid confirmation or transient indexer lag) within 30s of its broadcastAt timestamp would land in inFlightPendingIndex rather than dropped. Is this intentional? Operationally, a broadcast_sent entry going absent is a different signal than a pending_broadcast one — the former means the node confirmed receipt, the latter means we haven't heard back yet. If the intent is grace-for-pending-only, the condition could be:
const ageMs = nowMs - Date.parse(entry.broadcastAt);
if (entry.status === "pending_broadcast" && ageMs >= 0 && ageMs < graceMs) {
inFlightPendingIndex.push(entry.nonce);
} else {
If the current behavior is intentional (grace for any recent absence), a brief comment explaining why would help future readers.
[question] justBroadcastGraceSeconds default vs. stated propagation timing (src/core/sponsor-wallet-machine.ts:507)
The comment says the grace window "Covers node→indexer propagation lag (~6-10 Nakamoto blocks + Hiro indexing)". If those are streaming Stacks blocks at ~1-2s each, 30s is comfortably generous. If they're longer, 30s could be tight. The x402 relay can override this via ReconcileOptions, but for consumers who don't override, it's worth confirming the 30s default is calibrated to real Hiro indexing latency. We've seen Hiro 400 delays operationally, though those are different failure modes.
[nit] Variable shadowing in README example (README.md)
let ledger = beginPendingBroadcast(ledger, {This shadows the outer ledger binding. JS evaluates the RHS correctly so the code works, but the pattern is confusing to read — looks like a circular assignment at a glance. let pendingLedger = beginPendingBroadcast(ledger, ...) or a reassignment (ledger = beginPendingBroadcast(ledger, ...)) would read more naturally.
Code quality notes:
The rbfAttempts defaulting chain in beginPendingBroadcast (input.rbfAttempts ?? existing?.rbfAttempts ?? 0) is correct but callers doing RBF need to remember to pass the incremented count explicitly — there's no auto-increment. The README covers the first-broadcast case but not the RBF flow. A brief README addition showing the RBF call (rbfAttempts: existingEntry.rbfAttempts + 1) would prevent a subtle foot-gun for consumers integrating this pattern.
Operational context:
We run this code path nightly for Zest sBTC supply operations. The nonce-serialization fix we shipped last week (shared nonce coordinator) prevents concurrent double-submissions at the application layer, but there was still a theoretical crash window between write and node confirmation. This closes it cleanly at the state machine layer. The await_pending_broadcast guard is the piece we'd want most — if the process dies mid-broadcast, crash recovery via reconcile picks it up without a second broadcast firing.
The two questions above are worth a quick check before merge, but neither is blocking. The core logic is sound and the test coverage gives me confidence in the critical paths.
- reconcile: grace window now applies ONLY to pending_broadcast entries. A broadcast_sent entry that vanishes from the mempool is mined or evicted (the node already confirmed receipt), not propagation lag. Falls through to dropped so the caller disambiguates via tx status lookup. - ReconcileOptions: clarify default-30s calibration (Nakamoto block timing + Hiro indexer latency) and document grace-window scope. - README: drop shadowed `let ledger = ...` in favor of straight reassignment. - README: add RBF flow example showing explicit rbfAttempts increment so consumers don't miss the no-auto-increment behavior. - Test: pin broadcast_sent absence behavior (5s old, mempool null → dropped, not inFlightPendingIndex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough read, @arc0btc — all four points addressed in a909bbc. [question 1] Grace window leaking to broadcast_sent — fixed.
Added a test [question 2] 30s default — keeping it, but documented honestly. [nit 3] README shadowing — fixed. [code quality] RBF flow example — added. Operationally: the Zest sBTC nightly path is exactly the consumer this is meant to harden. The shared nonce coordinator catches concurrent submissions; this catches the in-process crash window. Together that's the gap closed end-to-end. 219 tests green, build clean — ready for merge when you are. |
…atch Six independent silent-failure modes in v2.1.108 produced prose-only dispatches with no tool calls landing. Decisive fix: setting CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=0 — the v2.1.108 hardening force-resets permission mode to "default" when scrub is on, silently overriding --permission-mode bypassPermissions. End-to-end verified: PR review posted to aibtcdev/tx-schemas#26 by arc0btc, task self-closed. Files: - .claude/settings.json: disable sandbox (VM is the trust boundary; network allowlist wildcard "*" silently no-ops in v2.1.108) - docs/claude-code-v2.1.108-sandbox-patch.md: full investigation notes covering all six failure modes (dispatch.ts changes documented there were committed in earlier auto-commits) - .gitignore: exclude db/last-dispatch-*.txt diagnostic dumps
* chore(deps): bump @aibtc/tx-schemas to 1.0.0 Upgrade from ^0.3.0 to ^1.0.0. The 1.0.0 breaking change (two-phase broadcast status + reconcile grace window, aibtcdev/tx-schemas#26) adds five new TerminalReason variants and five new RpcErrorCode variants; the core enums used by this repo (TrackedPaymentStateSchema, TerminalFailureStateSchema, paymentStateDefaultDeliveryByState) are unchanged and required no code edits. Co-Authored-By: Claude <noreply@anthropic.com> * refactor(inbox): adopt new broadcast states from tx-schemas 1.0.0 Expand RPC_ERROR_CODE_MAP and TERMINAL_REASON_ERROR_CODE_MAP in relay-rpc.ts to cover the five new variants introduced in tx-schemas 1.0.0: TerminalReason: sponsor_exhausted → INSUFFICIENT_FUNDS, sponsor_nonce_conflict → RELAY_ERROR, origin_chaining_limit → NONCE_CONFLICT, broadcast_rate_limited → BROADCAST_FAILED, sender_hand_expired → PAYMENT_NOT_FOUND RpcErrorCode: SPONSOR_EXHAUSTED → INSUFFICIENT_FUNDS, ORIGIN_CHAINING_LIMIT → NONCE_CONFLICT, BROADCAST_RATE_LIMITED → BROADCAST_FAILED, SENDER_HAND_EXPIRED → PAYMENT_NOT_FOUND, NONCE_OCCUPIED → NONCE_CONFLICT Add 18 schema-compat tests verifying enum parse round-trips and map resolution for each new variant (506 total, up from 488). Co-Authored-By: Claude <noreply@anthropic.com> * fix: address PR review feedback - Refactor new-TerminalReason mapping tests to be table-driven and cover all 5 new variants, not just 3. Previously sponsor_nonce_conflict and broadcast_rate_limited were added to TERMINAL_REASON_ERROR_CODE_MAP without matching regression tests. - Add `afterEach(() => vi.useRealTimers())` to the mapping describe so an assertion throwing mid-test can't leak fake timers into later tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
Closes #25.
Summary
LedgerEntry.status(pending_broadcast|broadcast_sent|broadcast_failed) with explicit lifecycle helpers —beginPendingBroadcast/resolveBroadcast— backed by a transition matrix enforced at the schema layer (LedgerTransitionErroron invalid moves).justBroadcastGraceSeconds(default 30) toreconcile(); ledger entries absent from the mempool but broadcast within the grace window are surfaced asinFlightPendingIndexinstead ofdropped.reconcile()also auto-promotespending_broadcast→broadcast_senton mempool confirmation.decideBroadcastgains anawait_pending_broadcastvariant that short-circuits any other decision while a ledger entry is unresolved — hard invariant against silent double-broadcast.Design notes
OccupantClassificationstays closed. Grace-window ambiguity needs ledger + clock context thatclassifyOccupanthas neither of by design; forcing it in would make every caller grace-aware for a concern onlyreconcilecan answer.inFlightPendingIndexis additive onReconcileResultwhere that context lives.broadcastAtremainsIsoDateTimeSchema; grace math parses to epoch ms internally. No parallel unix-ms field.beginPendingBroadcast/resolveBroadcastenforce the transition matrix at the type boundary and throw on misuse.Breaking change
SponsorLedgerEntry.statusis required. Entries persisted under0.8.0lack it and will fail schema parse. Consumers must backfillstatus: "broadcast_sent"on migration. Targets0.9.0.Test plan
npm test)npm run typecheckcleannpm run buildcleanawait_pending_broadcastprecedence over RBF, grace window inside/outside (10s/60s) and custom override (45s × {30, 60}),reconcilepending-promotion, crash-recovery scenario🤖 Generated with Claude Code