Skip to content

PII leak: flowsheet ETL writes DJ_NAME (real name) into shows.legacy_dj_name, surfaced on v2 marker wire #1393

@jakebromberg

Description

@jakebromberg

Summary

flowsheet.dj_name on show_start / show_end / dj_join / dj_leave rows is the value transformToV2 surfaces on the public /flowsheet (v2) wire. For DJs whose auth_user.dj_name is null, the marker dj_name resolver's COALESCE(auth_user.dj_name, shows.legacy_dj_name) falls through to legacy_dj_name. That column is populated by the flowsheet ETL from tubafrenzy's FLOWSHEET_RADIO_SHOW_PROD.DJ_NAME — the user's full real name, not the on-air alias — so the public v2 wire surfaces real names for any affected marker entry.

Per tubafrenzy's own schema docs (https://github.com/WXYC/tubafrenzy/blob/main/docs/database-schema.md):

Column Meaning
DJ_NAME DJ's full name
DJ_HANDLE DJ's on-air alias

And the BS legacy mirror in apps/backend/middleware/legacy/http.mirror.ts (mapShowToTubafrenzy) forwards djName: dj.realName || dj.name into tubafrenzy's DJ_NAME field on every show start.

Leak path

  1. BS legacy mirror writes djName = realName || name into tubafrenzy's FLOWSHEET_RADIO_SHOW_PROD.DJ_NAME on startShow.
  2. Flowsheet ETL pulls rs.DJ_NAME from tubafrenzy and stores it in shows.legacy_dj_name. Three writers, all wrong:
    • jobs/flowsheet-etl/fetch-legacy.ts (incremental)
    • jobs/flowsheet-etl/job.ts importShows reads tuple[2] (DJ_NAME) instead of tuple[4] (DJ_HANDLE) in the bulk-load path
    • jobs/flowsheet-etl/backfill-legacy-ids.ts (legacy-id backfill)
  3. Tubafrenzy webhook insert omits dj_name on show_start/show_end rows (118,919 affected, ongoing) #1371 added marker dj_name resolution on the webhook insert path via COALESCE(auth_user.dj_name, shows.legacy_dj_name). Marker rows previously had dj_name = NULL on the wire; now they hold the polluted legacy_dj_name for DJs whose auth_user.dj_name is null.
  4. transformToV2 surfaces entry.dj_name for show_start / show_end / dj_join / dj_leave on the public v2 wire.

auth_user.name-fallback (the leak path #1286 and #1371 explicitly closed elsewhere) is not the source here — the leak is the legacy_dj_name arm of the COALESCE itself, because the ETL pulls from the wrong source column.

Fix

Two pieces, both in this issue's PR:

  1. Source fix: ETL pulls DJ_HANDLE instead of DJ_NAME in all three writers above. Regression tests pin (a) the SELECT query string and (b) the bulk-load tuple position. LegacyShowRow.djName renamed to djHandle for clarity.
  2. Historical data cleanup: new one-shot job jobs/legacy-dj-name-remediation/ walks every show, rewrites shows.legacy_dj_name from tubafrenzy's DJ_HANDLE, and nulls + re-resolves marker-row flowsheet.dj_name whose value matches the old polluted legacy_dj_name. Idempotent, --dry-run aware, ships with its own Dockerfile (Dockerfile.legacy-dj-name-remediation).

After the fix:

  • Marker dj_name on the v2 wire reflects the on-air handle (from auth_user.dj_name first, DJ_HANDLE second).
  • DJs with neither auth_user.dj_name nor DJ_HANDLE set surface empty dj_name on the wire (asymmetric-fallback case the live path already tolerates).

Relationship to #605

#605 is an open issue from 2026-04-28 about shows.legacy_dj_name not always being populated even when tubafrenzy has a value. Its example table cites real names (Umaesh Kumaran, r.j. igou) — those were the DJ_NAME (real name) values that would have been pulled if the ETL had caught them, applied as a manual workaround.

This issue is independent of #605:

The remediation in step 2 above will scrub the 30 rows #605's manual workaround populated with DJ_NAME (real names) back to DJ_HANDLE (handles), so they too stop leaking PII. #605's coverage concern remains its own issue.

Test plan

  • New regression tests in tests/unit/jobs/flowsheet-etl/fetch-legacy.test.ts and tests/unit/jobs/flowsheet-etl/job.djName.test.ts assert the ETL never reverts to DJ_NAME / tuple[2].
  • Full unit suite (3104 tests) passes.
  • Build + typecheck + lint + Prettier all clean.
  • Remediation runbook: deploy the new job target via Manual Build & Deploy, docker run --rm --env-file .env <image> --dry-run first, then without --dry-run. Output reports (showsScanned, showsUpdated, markerRowsReset) and (reresolved).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingtubafrenzyTouches WXYC/tubafrenzy (webhook, legacy mirror, flowsheet ETL, reconcile)

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions