Skip to content

refactor(rotation-artist-backfill): switch to lml_identity_id fan-out; drop Discogs-specific carving #1381

@jakebromberg

Description

@jakebromberg

Problem

jobs/rotation-artist-backfill/ (landed in PR #1376, closes BS#1361) is structurally Discogs-specific. It:

  1. Reads rotation.discogs_release_id from the rotation table.
  2. Calls GET /api/v1/discogs/release/{id} per row.
  3. Walks release.artists[*].artist_id (Discogs artist IDs).
  4. Calls GET /api/v1/discogs/artist/{id} per artist.
  5. Defends against Discogs's id=0 sentinel locally via isValidArtistId.

That carving was correct given today's world (BS holds Discogs IDs; LML's only refresh-by-ID surface is Discogs-namespaced). It's wrong altitude once WXYC/library-metadata-lookup#525 lands a source-agnostic refresh endpoint and WXYC/Backend-Service#1380 populates rotation.lml_identity_id.

Desired end state

The cron walks rotation.lml_identity_id instead of rotation.discogs_release_id, batches identity_ids, and calls POST /api/v1/cache/refresh-for-identities once per batch (or per concurrency slot). LML handles the source fan-out and the walk-to-artists internally. BS-side code loses:

  • The extractPhase1ArtistIds projection (LML knows the artist credits and walks to them itself).
  • The isValidArtistId sentinel defense (obsolete once this cron stops calling /discogs/artist/{id} directly — the defense lives at the caller per LML#518, and removing the caller removes the need; LML#546 closes residual caller-side gaps elsewhere in LML).
  • The getRelease / getArtistDetails wrappers (replaced by a single refreshForIdentities call).
  • The Discogs-flavored comments in lml-fetch.ts (no longer accurate at this layer).

The deploy gate (currently gates on LML's commit_sha being a descendant of 3e54907) is updated to gate on whichever commit ships LML#525 instead. The fetched_at discriminator from LML#503 — including its bulk projection — is still load-bearing under the hood, but BS doesn't gate on it directly; it gates on the existence of the new endpoint.

Where

  • jobs/rotation-artist-backfill/query.ts — SELECT changes from discogs_release_id to lml_identity_id. Skip rows where lml_identity_id IS NULL.
  • jobs/rotation-artist-backfill/lml-fetch.ts — replace fetchRelease / fetchArtist / extractPhase1ArtistIds with refreshForIdentities(ids: number[]).
  • jobs/rotation-artist-backfill/orchestrate.ts — collapse the two-tier loop into a one-tier batched call. Counters change shape (releases_* and artists_* become identities_* plus warmed_releases / warmed_artists, both derived from per-source release_outcome == "success" / artist outcome == "success" in the LML response). Cache hit/miss visibility is intentionally not surfaced as a BS-side counter — see the "Cache hit/miss visibility" note under "Sentry totals" below.
  • jobs/rotation-artist-backfill/deploy-guard.ts — update GATE_COMMIT_SHA to whatever ships LML#525.
  • jobs/rotation-artist-backfill/README.md — full rewrite of the mechanism section. Drop the "Discogs cache" narrative; describe the identity-based refresh.
  • shared/lml-client/src/index.ts — add refreshForIdentities(identityIds: number[]) wrapper.
  • tests/unit/jobs/rotation-artist-backfill/ — regenerate tests against the new shape.

Rate-limit recalibration (pre-merge gate)

The new endpoint fans out internally, so one BS-side call can spawn N Discogs calls. The BACKFILL_LML_* numbers need recalibrating before this PR can merge:

  • Today: BS at 20 req/min, each call hits Discogs ~once-and-fan-out-to-artists. Effective Discogs req/min ~80 (one release + 3-4 artists).
  • Tomorrow: BS at X req/min, each call multiplexes M identities through LML, each hitting Discogs once-per-release + N-per-release-artists.
  • Constraint: stay under LML's discogs_rate_limit=50 per replica, including all concurrent LML callers (runtime enrichment via lookupMetadata, picker resolution via getRotationTracksFromRelease, this cron). All callers share the same per-replica rate-limit budget. BACKFILL_LML_RPS should leave headroom — a reasonable starting target is ~30/min for the cron alone, leaving ~20/min for runtime callers. Verify the runtime steady-state Discogs req/min in Sentry before sizing the cron's budget.

Concretely: before merging, run a staging load test with the planned BACKFILL_LML_RPS and BACKFILL_LML_BATCH_SIZE AND a simulated runtime-caller baseline, and verify the Discogs req/min observed at LML stays under cap. Update the README runbook with the chosen numbers and the load-test methodology.

BACKFILL_LML_BATCH_SIZE is a hard constant, not a tunable. LML#525 caps POST /api/v1/cache/refresh-for-identities at 50 ids/request (400-on-overflow). The cap is derived from Discogs rate-limit × cold-cache fan-out ≤ Railway's request-timeout ceiling — recalibration only lands alongside an ingress change. BS should encode the value as const LML_REFRESH_BATCH_CAP = 50 in jobs/rotation-artist-backfill/orchestrate.ts and reference that constant rather than reading an env var. If the constant is kept as a tunable knob for one-shot operator flexibility, the startup must assert batchSize <= LML_REFRESH_BATCH_CAP and refuse to run otherwise — the runtime 400 from LML would otherwise show up only after the first batch.

Dashboard migration follow-up

The existing Sentry dashboards built on backfill.releases_* / backfill.artists_* will break the moment this cron switches shape. Filed as WXYC/Backend-Service#1402 — that issue covers the inventory + cutover of saved discover queries, alert rules, and team-shared dashboards keyed on the old attribute names. The dashboard migration must complete before this PR merges, so on-call doesn't lose visibility during the cutover window. BS#1402 is also wired as a blocked-by edge on this issue so the GitHub UI surfaces the dependency.

Constraints

  • Don't run both crons in parallel. Once this migration lands, the old discogs_release_id-based path is removed in the same PR (after BS#1380 backfill is ≥99% complete). Avoid drift between counters and dashboards.
  • Pre-merge rate-limit verification. See above — load-tested numbers required before merge.
  • Dashboard cutover. Sibling dashboard-migration issue (see above) must merge before this PR. The metric-name change is breaking for any alerts that key on the old names.
  • Coordinated deploy. BS#1380's backfill must hit ≥99% coverage on active rotation BEFORE this PR ships.

Acceptance criteria

  • LML#525 merged and deployed to production.

  • BS#1380 backfill at ≥99% coverage on active rotation rows.

  • Sibling dashboard-migration issue WXYC/Backend-Service#1402 closed before this PR merges.

  • Graph-level blocked-by edges wired via the GitHub dependencies API (not just prose): this issue blocked-by LML#525, BS#1380, and BS#1402. Required so the GitHub UI surfaces the blocked state when the queue is walked.

  • query.ts SELECTs lml_identity_id (and skips rows where it's NULL).

  • orchestrate.ts is a one-tier loop batching identity_ids into refreshForIdentities calls.

  • Sentry totals span attribute set updated, derived from per-id CacheRefreshResultItem results in the LML response:

    • backfill.identities_scanned — total identity_ids the cron walked
    • backfill.identities_resolved — count where per-id status != "error" (any useful work happened)
    • backfill.warmed_releases — sum across batches of per-source release_outcome == "success" (cache hit + fresh fetch + tombstone all count, per LML#525's "Per-id result semantics")
    • backfill.warmed_artists — sum across batches of per-source artist outcome == "success"
    • backfill.not_found — count where per-id status == "not_found" (BS holds stale lml_identity_id references)
    • backfill.not_implemented — count where per-id status == "not_implemented" (sources LML doesn't support yet)
    • backfill.lml_error — batch-level call failures (BS side) + count where per-id status == "error"

    Cache hit/miss visibility: the original AC had per-source warmed_hits / warmed_misses buckets. Dropped — LML#525's BulkCacheRefreshResponse (lookup/models.py:100-101 parallel) deliberately omits cache_stats from the response body, and LML's internal recorder methods (record_pg_cache_hit/miss, record_memory_cache_hit, record_api_call per discogs/fallthrough.py:311,317, discogs/memory_cache.py:257, discogs/service.py:*) are aggregate, not per-source-labeled. Hit/miss observability lives on LML-side Sentry dashboards via _project_cache_stats_to_transaction (lookup/router.py:472 parallel) — BS-side counters track warming success, not cache pressure. If future cost/SLO work needs the breakdown surfaced to BS, that's a per-source recorder refactor on the LML side, not a BS-side parsing change.

  • All numeric attributes set via Sentry.startSpan({ name, op, attributes: {...} }) at span creation time — never via late setAttribute, per BS#1081 (late setAttribute indexes numbers as strings and breaks avg() / p95() aggregation).

  • Sentry alert: backfill.not_found / backfill.identities_scanned > 1% rolling 24h ⇒ BS holds stale lml_identity_id references (the corruption signal — invisible without this counter, because today's errors bucket lumps it in with transient failures). Configured via the BS#1402 sibling.

  • Deploy gate updated to the new LML#525 anchor commit.

  • Unit tests rewritten; integration smoke test against staging LML.

  • Pre-merge load test confirms Discogs req/min at LML stays under discogs_rate_limit=50 per replica, including a simulated runtime-caller baseline.

  • Batch size encoded as const LML_REFRESH_BATCH_CAP = 50 in jobs/rotation-artist-backfill/orchestrate.ts, matching LML#525's per-request cap. If retained as an env-overridable knob (BACKFILL_LML_BATCH_SIZE), the cron asserts batchSize <= LML_REFRESH_BATCH_CAP at startup and refuses to run otherwise. README documents the derivation chain (Discogs rate-limit × cold-cache fan-out ≤ Railway request timeout) so future operators don't try to raise it without recalibrating LML's ingress.

  • README narrative section rewritten; "Known caveat" section on timeoutMs gap reviewed (the new endpoint may surface a different timeout shape).

  • PR description references the merged dashboard-migration PR (from BS#1402).

  • Old Discogs-specific helpers (isValidArtistId, extractPhase1ArtistIds, fetchRelease, fetchArtist) deleted.

Suggested approach

  1. Wait for LML#525 (which is blocked on the LML precursor) to land and deploy.
  2. Wait for BS#1380's backfill to reach ≥99% steady state.
  3. Complete BS#1402 (dashboard migration) first.
  4. Branch + worktree per global Git workflow.
  5. Update lml-client first (new endpoint wrapper). Land separately if scope is too large.
  6. Refactor cron. TDD per project CLAUDE.md.
  7. Run staging load test before opening this PR. Numbers go into the README runbook.
  8. Coordinated deploy: BS#1380's backfill must hit ≥99% coverage on active rotation BEFORE this PR merges.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    cross-cache-identityProject tag for the cross-cache-identity initiative (library hook + identity record + normalization)enhancementNew feature or requestlmlTouches library-metadata-lookup

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions