fix(enrichment-worker): drain in-flight candidates on SIGTERM #1408

Open
jakebromberg wants to merge 2 commits into main from fix/1108-enrichment-drain

@jakebromberg

Member

Closes #1108

Summary

  • The CDC dispatcher invokes handleCandidate fire-and-forget (void handleCandidate(...)), so on SIGTERM the shutdown path tore down the PG pool while LML lookups were still pending. The subsequent claim/finalize write then threw on the torn-down connection, leaving the row stranded in metadata_status='enriching' until C6 swept it.
  • Adds an inFlightCandidates registry in handler.ts (mirrors the backend's inFlightEnrichments in apps/backend/services/metadata/enrichment.service.ts): each tick auto-registers and unregisters via .finally so the registry never slow-leaks past a settle.
  • Adds drainInFlightCandidates(deadlineMs) that races Promise.allSettled(snapshot) against setTimeout(deadlineMs) and returns the current registry size — never throws, returns 0 immediately on empty.
  • Wires the drain into worker.ts shutdown between stopCdcListener and closeDatabaseConnection. Bounded by WORKER_DRAIN_DEADLINE_MS (default 30s, env-overridable via ENRICHMENT_WORKER_DRAIN_DEADLINE_MS) so a hung LML call can't block deploy indefinitely.
  • Sentry captureMessage on non-zero remaining count uses the same metric: 'in_flight_dropped' tag shape as the backend's BS#905 path so a single alert can fan over both surfaces.

Reduces but does not eliminate the BS#895 sweep traffic referenced in the issue body.
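
A minimal sketch of the registry and drain described in the bullets above. Only `inFlightCandidates` and `drainInFlightCandidates(deadlineMs)` are names from this PR; the `trackCandidate` helper and all function bodies are assumptions for illustration, not the code in handler.ts. The `.catch(() => {})` guard follows the unhandled-rejection fix described in the follow-up commit.

```typescript
// Hypothetical sketch: names mirror the PR description, bodies are assumed.
const inFlightCandidates = new Set<Promise<void>>();

// Assumed helper: the dispatcher would call this instead of a bare `void handleCandidate(...)`.
function trackCandidate(work: Promise<void>): void {
  // Swallow rejections first so the tracked promise cannot surface as an
  // unhandledRejection, then unregister once it settles (resolve OR reject).
  const tracked: Promise<void> = work
    .catch(() => {})
    .finally(() => {
      inFlightCandidates.delete(tracked);
    });
  inFlightCandidates.add(tracked);
}

// Waits for the current snapshot to settle or for the deadline, whichever
// comes first, and returns how many candidates are still registered.
async function drainInFlightCandidates(deadlineMs: number): Promise<number> {
  if (inFlightCandidates.size === 0) return 0; // fast path: nothing to wait for

  const snapshot = Array.from(inFlightCandidates);
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, deadlineMs);
  });

  // allSettled never rejects, so this function never throws.
  await Promise.race([
    Promise.allSettled(snapshot).then(() => undefined),
    deadline,
  ]);
  if (timer !== undefined) clearTimeout(timer);

  return inFlightCandidates.size;
}
```

Because each tracked promise removes itself before it settles, a drain that finishes ahead of the deadline sees an empty registry, while a hung lookup leaves a non-zero count for the caller to report.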

Test plan

  • new in-flight registry tests in tests/unit/apps/enrichment-worker/drain.test.ts pin: empty-registry fast-return, register-on-dispatch, unregister-on-resolve, unregister-on-reject, skip-on-filter-reject, drain-awaits-pending, drain-bounds-deadline, drain-survives-rejection
  • all 84 enrichment-worker unit tests pass; full unit suite 3160/3160 green
  • npm run lint && npm run format:check && npm run typecheck clean

The CDC dispatcher invokes `handleCandidate` as `void handleCandidate(...)`
— fire-and-forget — so on SIGTERM the worker's shutdown path tore down the
PG pool while LML lookups were still pending. The subsequent claim or
finalize write would throw on a torn-down connection, leaving the row in
`metadata_status='enriching'` until C6 (#895) swept it past
`enriching_since + 60s` and triggered a second Discogs lookup whose answer
was already retrieved and discarded.

Mirrors the backend's `drainInFlightEnrichments` shape
(apps/backend/services/metadata/enrichment.service.ts):

  - `inFlightCandidates` Set in handler.ts; the CDC dispatcher registers
    each invocation and unregisters via .finally on settle (resolve OR
    reject) so the registry can never slow-leak past a tick.
  - `drainInFlightCandidates(deadlineMs)` races
    `Promise.allSettled(snapshot)` against `setTimeout(deadlineMs)` and
    returns the current registry size — never throws.
  - worker.ts shutdown calls stopCdcListener → drain → closeDatabaseConnection
    so pending lookups can finalize their writes before the pool closes.
    Bounded by WORKER_DRAIN_DEADLINE_MS (default 30s, env-overridable) so a
    hung LML call can't block deploy indefinitely.
  - Sentry captureMessage on non-zero remaining count uses the same
    `metric: 'in_flight_dropped'` tag shape as the backend so a single
    alert can fan over both surfaces.

Reduces but does not eliminate the BS#895 sweep traffic referenced in the
issue body.
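
The shutdown ordering above can be sketched as follows. This is an illustration under assumptions, not the code in worker.ts: the `ShutdownDeps` type, the `shutdown` function, and the `reportDropped` callback (a stand-in for the Sentry `captureMessage` call) are hypothetical, and the three step functions are injected so the sketch stays self-contained.

```typescript
// Hypothetical dependency bundle; the real worker imports these directly.
type ShutdownDeps = {
  stopCdcListener: () => Promise<void>;
  drainInFlightCandidates: (deadlineMs: number) => Promise<number>;
  closeDatabaseConnection: () => Promise<void>;
  // Stand-in for the Sentry captureMessage call tagged metric: 'in_flight_dropped'.
  reportDropped: (remaining: number) => void;
};

async function shutdown(deps: ShutdownDeps, deadlineMs: number): Promise<number> {
  // 1. Stop the CDC listener so no new candidates are dispatched.
  await deps.stopCdcListener();

  // 2. Give pending lookups a bounded window to finalize their writes.
  const remaining = await deps.drainInFlightCandidates(deadlineMs);
  if (remaining > 0) {
    deps.reportDropped(remaining);
  }

  // 3. Only now close the pool; anything still pending will be swept later.
  await deps.closeDatabaseConnection();
  return remaining;
}
```

The order matters: draining before the listener stops would let new candidates register after the snapshot, and closing the pool before the drain reintroduces the stranded-row failure.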
…e, unhandled rejection guard, env clamp

- Drop default WORKER_DRAIN_DEADLINE_MS from 30s to 2s and clamp env overrides at the per-step shutdown bound (5s). The deploy action runs `docker stop` without `-t`, so the SIGTERM-to-SIGKILL grace is the Docker default of 10s; with `stop_cdc` and `drain_sweep` already consuming part of that window, anything above ~5s is dead budget. Mirrors apps/backend's `ENRICHMENT_DRAIN_DEADLINE_MS=2_000` rationale (BS#905).
- Chain `.catch(() => {})` before `.finally` on the dispatcher's in-flight promise. handleCandidate wraps its body in try/catch but the outer `await Sentry.startSpan(...)` is unguarded; a Sentry-internal throw (exporter error, disposed hub during shutdown) would otherwise surface as an `unhandledRejection`. Mirrors backend's pattern in `apps/backend/services/metadata/enrichment.service.ts`.
- Add `jest.resetAllMocks()` to drain.test.ts beforeEach so a future test using sticky mocks can't silently inherit a previous case's stub.
- New regression test pins that a Sentry.startSpan throw is swallowed and the registry still drains.
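
A sketch of the env clamp from the first bullet, assuming the override arrives as a raw string. The constant names, the `resolveDrainDeadlineMs` helper, and the fallback to the default on unparsable or negative input are assumptions; the commit message only states the 2s default and the 5s ceiling.

```typescript
// Values taken from the commit message; names are hypothetical.
const DEFAULT_DRAIN_DEADLINE_MS = 2_000;
const MAX_DRAIN_DEADLINE_MS = 5_000; // per-step shutdown bound

function resolveDrainDeadlineMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_DRAIN_DEADLINE_MS;

  const parsed = Number(raw);
  // Assumption: invalid or negative overrides fall back to the default.
  if (!Number.isFinite(parsed) || parsed < 0) return DEFAULT_DRAIN_DEADLINE_MS;

  // Clamp so an override cannot exceed the per-step shutdown bound.
  return Math.min(parsed, MAX_DRAIN_DEADLINE_MS);
}
```

In the worker this would presumably be called once at startup with the value of ENRICHMENT_WORKER_DRAIN_DEADLINE_MS, so that an override of 30000 resolves to 5000 rather than outlasting the 10s Docker grace period.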

Development

Successfully merging this pull request may close these issues.

enrichment-worker SIGTERM tears down DB before in-flight LML writes