fix(enrichment-worker): drain in-flight candidates on SIGTERM#1408
Open
jakebromberg wants to merge 2 commits into
Open
fix(enrichment-worker): drain in-flight candidates on SIGTERM#1408jakebromberg wants to merge 2 commits into
jakebromberg wants to merge 2 commits into
Conversation
The CDC dispatcher invokes `handleCandidate` as `void handleCandidate(...)` — fire-and-forget — so on SIGTERM the worker's shutdown path tore down the PG pool while LML lookups were still pending. The subsequent claim or finalize write would throw on a torn-down connection, leaving the row in `metadata_status='enriching'` until C6 (#895) swept it past `enriching_since + 60s` and triggered a second Discogs lookup whose answer was already retrieved and discarded. Mirrors the backend's `drainInFlightEnrichments` shape (apps/backend/services/metadata/enrichment.service.ts): - `inFlightCandidates` Set in handler.ts; the CDC dispatcher registers each invocation and unregisters via .finally on settle (resolve OR reject) so the registry can never slow-leak past a tick. - `drainInFlightCandidates(deadlineMs)` races `Promise.allSettled(snapshot)` against `setTimeout(deadlineMs)` and returns the current registry size — never throws. - worker.ts shutdown calls stopCdcListener → drain → closeDatabaseConnection so pending lookups can finalize their writes before the pool closes. Bounded by WORKER_DRAIN_DEADLINE_MS (default 30s, env-overridable) so a hung LML call can't block deploy indefinitely. - Sentry captureMessage on non-zero remaining count uses the same `metric: 'in_flight_dropped'` tag shape as the backend so a single alert can fan over both surfaces. Reduces but does not eliminate the BS#895 sweep traffic referenced in the issue body.
…e, unhandled rejection guard, env clamp
- Drop default WORKER_DRAIN_DEADLINE_MS from 30s to 2s and clamp env overrides at the per-step shutdown bound (5s). The deploy action runs `docker stop` without `-t`, so the SIGTERM-to-SIGKILL grace is the Docker default of 10s; with `stop_cdc` and `drain_sweep` already consuming part of that window, anything above ~5s is dead budget. Mirrors apps/backend's `ENRICHMENT_DRAIN_DEADLINE_MS=2_000` rationale (BS#905).
- Chain `.catch(() => {})` before `.finally` on the dispatcher's in-flight promise. handleCandidate wraps its body in try/catch but the outer `await Sentry.startSpan(...)` is unguarded; a Sentry-internal throw (exporter error, disposed hub during shutdown) would otherwise surface as an `unhandledRejection`. Mirrors backend's pattern in `apps/backend/services/metadata/enrichment.service.ts`.
- Add `jest.resetAllMocks()` to drain.test.ts beforeEach so a future test using sticky mocks can't silently inherit a previous case's stub.
- New regression test pins that a Sentry.startSpan throw is swallowed and the registry still drains.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1108
Summary
handleCandidatefire-and-forget (void handleCandidate(...)), so SIGTERM tore down the PG pool while LML lookups were still pending — the subsequent claim/finalize write would throw on a torn-down connection, leaving the row stranded inmetadata_status='enriching'until C6 swept it.inFlightCandidatesregistry inhandler.ts(mirrors the backend'sinFlightEnrichmentsinapps/backend/services/metadata/enrichment.service.ts): each tick auto-registers and unregisters via.finallyso the registry never slow-leaks past a settle.drainInFlightCandidates(deadlineMs)that racesPromise.allSettled(snapshot)againstsetTimeout(deadlineMs)and returns the current registry size — never throws, returns 0 immediately on empty.worker.tsshutdown betweenstopCdcListenerandcloseDatabaseConnection. Bounded byWORKER_DRAIN_DEADLINE_MS(default 30s, env-overridable viaENRICHMENT_WORKER_DRAIN_DEADLINE_MS) so a hung LML call can't block deploy indefinitely.captureMessageon non-zero remaining count uses the samemetric: 'in_flight_dropped'tag shape as the backend's BS#905 path so a single alert can fan over both surfaces.Reduces but does not eliminate the BS#895 sweep traffic referenced in the issue body.
Test plan
tests/unit/apps/enrichment-worker/drain.test.tspin: empty-registry fast-return, register-on-dispatch, unregister-on-resolve, unregister-on-reject, skip-on-filter-reject, drain-awaits-pending, drain-bounds-deadline, drain-survives-rejectionenrichment-workerunit tests pass; full unit suite 3160/3160 greennpm run lint && npm run format:check && npm run typecheckclean