Path B re-run: forward-look frequency analysis after 30 days of concerts data #1373

@jakebromberg

Scope

Re-run Path B of #1368's frequency analysis once the venue-events scraper has been running steadily in production for ≥30 days. Append a "Path B cross-check" section to plans/touring-events/frequency-analysis.md in wxyc-workspace.

Why a separate issue

Per #1368, Path A is the headline deliverable and runs as soon as the scraper (#1343) is deployed. Path B has to wait for two distinct things to settle:

  1. Scraper stability: ≥30 nightly runs of idempotent upserts to confirm the writer behaves as expected against drifting RHP HTML. This is not about forward-window data accumulating: RHP venues publish 60–90 days ahead, so the forward window is populated by day 1. The 30-day clock is about confidence that what we're reading is stable.
  2. Resolver coverage: #1372 (Backfill concerts.headlining_artist_id via local artist + alias resolver) has populated headlining_artist_id, so the SQL joins cleanly on the canonical FK rather than via the brittle raw-name fallback.

Encoding these in a real ticket avoids the "we'll get to it" failure mode and lets the dependency be wired into the issue graph.

Earliest run

Today is 2026-06-08; #1345 just merged. Budget no earlier than 2026-07-10 for the scraper-stability clock, and gate on #1372 having run at least once. If either condition is unmet, do not start.
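
The two gates can be written down as a small check. This is a minimal sketch, not tooling that exists in the repo: `may_start` and its `resolver_has_run` flag are hypothetical names, and the dates are the ones stated above. A strict 30-day clock from 2026-06-08 would land on 2026-07-08, so the 2026-07-10 budget is slightly later than that.

```python
from datetime import date, timedelta

CLOCK_START = date(2026, 6, 8)    # "today" as stated in this issue
EARLIEST_RUN = date(2026, 7, 10)  # budgeted earliest run date from this issue

# Where a strict 30-day stability clock would land (2026-07-08).
STRICT_30_DAY_MARK = CLOCK_START + timedelta(days=30)


def may_start(today: date, resolver_has_run: bool) -> bool:
    """Both gates must hold: the date budget and at least one #1372 resolver pass."""
    return today >= EARLIEST_RUN and resolver_has_run
```

If either argument fails its gate, `may_start` returns `False`, matching the "do not start" rule.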

Procedure

  1. Confirm scraper has been running nightly for ≥30 days (check concerts.scraped_at distribution).
  2. Confirm #1372's resolver pass has run and report headlining_artist_id coverage. There is no hard threshold: touring headliners are typically well-known acts and may resolve well above LML's 24% catalog ceiling, or may underperform if the alias substrate hasn't caught up. Whatever the number is, report it; the raw-vs-FK delta below quantifies the impact directly.
  3. Run all three of #1368's match rules (loose / tight / tightest) against prod (read-only, per-turn auth), under both the canonical-FK join and the raw-name fallback. Six queries total. Canonical-FK shapes for loose + tight:
```sql
-- Loose: artist in library, canonical FK
SELECT date_trunc('day', c.starts_at)::date AS day, COUNT(DISTINCT c.id)
FROM wxyc_schema.concerts c
WHERE c.headlining_artist_id IN (SELECT DISTINCT artist_id FROM wxyc_schema.library)
  AND c.starts_at BETWEEN now() AND now() + interval '90 days'
  AND c.status <> 'cancelled'
GROUP BY 1 ORDER BY 1;

-- Tight: played in trailing 30d, canonical FK
SELECT date_trunc('day', c.starts_at)::date AS day, COUNT(DISTINCT c.id)
FROM wxyc_schema.concerts c
JOIN wxyc_schema.flowsheet f ON f.artist_id = c.headlining_artist_id
WHERE f.start_time >= now() - interval '30 days'
  AND c.starts_at BETWEEN now() AND now() + interval '90 days'
  AND c.status <> 'cancelled'
GROUP BY 1 ORDER BY 1;
```

Tightest (heavy/medium rotation in trailing 90d) joins wxyc_schema.rotation per #1368's methodology block — same canonical-FK shape, swap the flowsheet join for the rotation join.

Raw-name variants follow #1368's methodology (lower + leading-"The" strip on both sides). Run the same three rules with raw-name joins for the A/B comparison; the delta against canonical-FK is the resolver-coverage signal.
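
For reference, the raw-name normalization described above (lowercase plus a leading-"The" strip) can be sketched in Python. This is an illustration only: `normalize_artist_name` is a hypothetical helper, the whitespace trimming is an added assumption, and #1368's methodology block remains the authority on the exact rule.

```python
import re

# Matches a leading "the" followed by whitespace; input is lowercased first.
_LEADING_THE = re.compile(r"^the\s+")


def normalize_artist_name(name: str) -> str:
    """Lowercase, trim, and drop a single leading 'the ' so both sides compare equal."""
    lowered = name.strip().lower()
    return _LEADING_THE.sub("", lowered, count=1)
```

Because the pattern requires whitespace after "the", names that merely start with those letters (for example "Theo Parrish") are left intact apart from lowercasing.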

  4. Append a "Path B cross-check" section to the memo:
    • The six queries' results.
    • Side-by-side table: Path A (one-shot HTML pull) vs Path B raw-name (30 days of scraper data) vs Path B canonical-FK.
    • Resolver-coverage delta: raw-name match count minus canonical-FK match count, per rule. Comment on whether the gap is mostly alias-substrate (richer aliases needed) or duplicate-name artist groups (the known unresolvable case in #1372).
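
The resolver-coverage delta in the last bullet is plain subtraction per rule. A minimal sketch, assuming each variant's results have already been reduced to one match count per rule; the function name and the dict layout are hypothetical, and the counts in the usage comment are placeholders rather than real results.

```python
RULES = ("loose", "tight", "tightest")


def resolver_coverage_delta(
    raw_name: dict[str, int], canonical_fk: dict[str, int]
) -> dict[str, int]:
    """Per-rule delta: raw-name match count minus canonical-FK match count."""
    return {rule: raw_name[rule] - canonical_fk[rule] for rule in RULES}


# Usage with placeholder counts (NOT real query results):
# resolver_coverage_delta(
#     {"loose": 12, "tight": 5, "tightest": 2},
#     {"loose": 9, "tight": 4, "tightest": 2},
# )
# returns {"loose": 3, "tight": 1, "tightest": 0}
```

A positive delta means the raw-name join found concerts that the canonical FK missed, which is the resolver-coverage gap the memo should comment on.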

Acceptance

  • Six Path B queries (three rules × {canonical-FK, raw-name}) run against ≥30 days of concerts data
  • Results appended to the memo as a "Path B cross-check" section
  • Side-by-side comparison table: Path A vs Path B raw-name vs Path B canonical-FK, all three match rules
  • Resolver-coverage delta reported per match rule; this delta is the headline new number whether or not it's large
  • If any rule shows a Path A vs Path B divergence of >25%, investigate root cause (sample-size, calendar drift between Path A's one-shot HTML and Path B's accumulated data, parser drift) and document the cause in the memo
  • Memo notes explicitly that Path B's 30-day window is too small to confirm the bimodal seasonal pattern asserted in Path A
  • Memo TL;DR updated if Path B materially changes the headline rate
  • If the headline rate moves enough to shift the recommended cell in #1368's decision matrix (<1/mo ↔ 1–4/mo ↔ >4/mo), re-run the matrix and update the venue-extension / Bandsintown-leverage recommendation
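
The >25% divergence check and the decision-matrix lookup can be sketched as follows. Two assumptions are made here that the issue does not state: divergence is measured relative to the Path A value, and rates of exactly 1/mo and 4/mo fall into the middle cell. All function names are hypothetical.

```python
def relative_divergence(path_a: float, path_b: float) -> float:
    """|B - A| / |A|, assuming Path A is the baseline (an assumption, see lead-in)."""
    if path_a == 0:
        return 0.0 if path_b == 0 else float("inf")
    return abs(path_b - path_a) / abs(path_a)


def needs_investigation(path_a: float, path_b: float, threshold: float = 0.25) -> bool:
    """True when the divergence strictly exceeds the threshold (>25% by default)."""
    return relative_divergence(path_a, path_b) > threshold


def matrix_cell(matches_per_month: float) -> str:
    """Map a headline rate to a decision-matrix cell; boundary handling is assumed."""
    if matches_per_month < 1:
        return "<1/mo"
    if matches_per_month <= 4:
        return "1–4/mo"
    return ">4/mo"
```

Comparing `matrix_cell` for the Path A and Path B headline rates indicates whether the recommended cell has shifted and the matrix needs a re-run.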

Related
