
fix(cdc-listener): clear outstandingProbeToken on timeout to unwedge probe #1404

Open
jakebromberg wants to merge 1 commit into main from fix/1116-cdc-probe-wedge



@jakebromberg


Closes #1116

Summary

  • runProbeTick's echo-timeout branch dispatched connected=false but left outstandingProbeToken set, so every subsequent interval took the early-return and no new probe ever fired. The liveness signal was dark until process restart — even after postgres-js auto-reconnect re-LISTENed.
  • Fix: clear outstandingProbeToken + outstandingProbeAt before dispatching false on the timeout path, so the next interval re-arms with a fresh NOTIFY. Stop path already resets at lines 215-216; no change needed there.
  • Side-effect: a late-arriving echo (after we've already given up and dispatched false) is now intentionally a no-op. onlisten on auto-reconnect — which already dispatches connected=true — is the proper recovery signal.
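For reviewers who want the wedge and the fix in one place, here is a minimal sketch of the tick logic described above. It is not the code in this PR: only `runProbeTick`, `outstandingProbeToken`, and `outstandingProbeAt` are real names; `ProbeSketch`, `ProbeDeps`, `sendNotify`, `dispatch`, `now`, `timeoutMs`, and `onEcho` are hypothetical stand-ins, and the assumption that a matching echo dispatches `connected=true` is mine.

```typescript
// Hypothetical sketch. Only runProbeTick, outstandingProbeToken and
// outstandingProbeAt are names from the PR; everything else is assumed.
interface ProbeDeps {
  sendNotify: (token: string) => void; // stand-in for issuing the NOTIFY
  dispatch: (connected: boolean) => void; // stand-in for the liveness dispatch
  now: () => number; // injectable clock, in milliseconds
  timeoutMs: number; // how long to wait for the echo
}

class ProbeSketch {
  outstandingProbeToken: string | null = null;
  outstandingProbeAt: number | null = null;
  private seq = 0;
  private readonly deps: ProbeDeps;

  constructor(deps: ProbeDeps) {
    this.deps = deps;
  }

  runProbeTick(): void {
    if (this.outstandingProbeToken !== null && this.outstandingProbeAt !== null) {
      const waited = this.deps.now() - this.outstandingProbeAt;
      if (waited < this.deps.timeoutMs) {
        return; // echo still pending: the early return
      }
      // Timeout path. Clearing both fields BEFORE dispatching false is the fix;
      // without it, every later tick takes the early return above forever.
      this.outstandingProbeToken = null;
      this.outstandingProbeAt = null;
      this.deps.dispatch(false);
      return;
    }
    // Nothing outstanding: arm a fresh probe.
    const token = `probe-${++this.seq}`;
    this.outstandingProbeToken = token;
    this.outstandingProbeAt = this.deps.now();
    this.deps.sendNotify(token);
  }

  // Assumed echo handler: a token that no longer matches (for example an echo
  // arriving after the timeout already fired) is deliberately ignored.
  onEcho(token: string): void {
    if (token !== this.outstandingProbeToken) {
      return;
    }
    this.outstandingProbeToken = null;
    this.outstandingProbeAt = null;
    this.deps.dispatch(true);
  }
}
```

With a fake clock, one tick sends a NOTIFY, a tick after `timeoutMs` dispatches `false` and clears the token, and the following tick sends a second NOTIFY. A late `onEcho` for the first token changes nothing, which matches the side-effect noted above.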

Test plan

  • new wedge-recovery test passes (wedge -> timeout -> reconnect -> fresh NOTIFY -> second wedge also dispatches false)
  • full unit suite green (228 suites, 3152 tests)
  • lint/format/typecheck clean

…probe (#1116)

The echo-timeout branch in `runProbeTick` dispatched `connected=false`
but left `outstandingProbeToken` set. The early-return at the top of
the next tick then short-circuits every subsequent interval — probing
is permanently dead until process restart, even after postgres-js
auto-reconnect re-LISTENs and the `onlisten` hook re-dispatches true.

Clear the token + timestamp before dispatching false so the next
interval re-arms with a fresh NOTIFY. Stop path already resets at
lines 215-216, so no change needed there.

Test coverage:
- New regression test: wedge -> timeout -> simulated reconnect -> a
  fresh NOTIFY goes out -> second wedge also dispatches false.
- Restructured the existing "echo arrives" test to flip false via a
  one-shot NOTIFY rejection rather than the timeout path (late-arriving
  echoes after timeout are intentionally a no-op now; `onlisten` on
  auto-reconnect is the recovery signal, per the issue's note).
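The "one-shot NOTIFY rejection" used by the restructured test can be approximated with a small wrapper like the one below. This is a sketch under assumptions, not the helper from this PR's test file: `Notify` and `rejectOnce` are hypothetical names, and the real test may instead use its mocking library's one-time rejection facility.

```typescript
// Hypothetical test helper; names are illustrative, not from the PR.
type Notify = (token: string) => Promise<void>;

// Wraps a NOTIFY-like function so that only the first call rejects.
// Every later call is forwarded to the wrapped function unchanged.
function rejectOnce(inner: Notify, error: Error): Notify {
  let armed = true;
  return (token: string): Promise<void> => {
    if (armed) {
      armed = false; // disarm so subsequent probes go through
      return Promise.reject(error);
    }
    return inner(token);
  };
}
```

Flipping to `false` this way keeps the "echo arrives" test independent of the timeout path, where a late echo is now ignored by design.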
Development

Successfully merging this pull request may close these issues.

CDC liveness probe wedges permanently after first echo timeout
