Skip to content

Frequency analysis: how often will the venue-events scraper surface a WXYC-relevant artist? #1368

@jakebromberg

Description

@jakebromberg

Background

The venue-events scraper (#1343, PRs #1345 + #1348) populates wxyc_schema.concerts nightly from Rockhouse Partners' WordPress schema.org payload. Initial coverage is five rooms: Cat's Cradle Main + Back Room, Haw River Ballroom, Motorco Music Hall (all under the cats-cradle RHP partner), plus Local 506 (local-506 RHP partner).

Before investing in iOS surface work, we want to size the rate at which a user would see a "WXYC artist is playing locally" event. The answer is load-bearing for three downstream decisions:

  • Whether the surface lives in its own tab, an artist-detail badge, or a push channel.
  • Whether five rooms is enough or we need to extend to additional Triangle venues (Pinhook, Cary Theater, Lincoln Theatre, Nightlight, ArtsCenter) or wait on Bandsintown for non-Triangle / non-RHP coverage.
  • Notification cadence ceilings.

A back-of-envelope sizing put this at roughly ~1 surfaceable show every 1–3 days under a loose ("in library") match rule and every 2–4 days under a tight ("played in last 30 days") rule, with ~2x seasonal swing. This ticket replaces those estimates with measured numbers and forces a concrete recommendation via a decision matrix.

Objective

Produce a written analysis (markdown memo, ~4–6 pages with tables) answering:

  1. Concert volume: How many distinct touring acts hit these five rooms per year? Per month? Per venue? What's the seasonal shape?
  2. Lead-time distribution: How far in advance do RHP venues publish? What share of scraped concerts are ≤14d / 15–60d / 61d+ out?
  3. Catalog overlap: What fraction of those touring acts are in wxyc_schema.library? Strict (artist_name) vs. strict-including-alternate_artist_name vs. alias-aware.
  4. Flowsheet overlap: What fraction have been played on wxyc_schema.flowsheet in the trailing 30/90/365 days?
  5. Per-user surfacing rate, stratified by listener type, under at least three match rules:
    • Loose: artist in library (broadest)
    • Tight: artist played on flowsheet in trailing 30 days
    • Tightest: artist in heavy/medium rotation in trailing 90 days
    • Strata to report separately: DJs, DJs with a show in the next 30 days, anonymous iOS app users.
      • DJ strata are per-user — "match" = artist appears in that DJ's trailing-90d flowsheet entries. "Next 30 days" comes from wxyc_schema.schedule + wxyc_schema.shift_covers.
      • Anonymous iOS app users have no per-device taste signal (anonymous_devices stores only identity + rate-limit state, no library/flowsheet linkage), so this stratum is reported as a station-level aggregate proxy — fraction of upcoming concerts whose artist appears in {library / flowsheet last 30d / rotation last 90d} — not a per-user rate.
  6. Headliner vs. headliner+support: How much does counting supporting_artists_raw shift the rate? Many WXYC artists tour as openers, not headliners.
  7. Match-quality sensitivity:
    • Strict (lower(library.artist_name) = lower(touring_artist) after stripping leading "The") vs. strict-including-alternate_artist_name vs. alias-aware via the artist_search_alias substrate (fix(artist-search-alias-consumer): coerce nullable binds to null (closes BS#1300) #1307).
    • Quantify the gap.
    • False-positive rate per rule: manually adjudicate 50 random strict-only matches and 50 random normalized-only matches; report FP%.
    • Record alias-substrate state at run date (e.g., runtime LATERAL flag from PR 5: ON or OFF; if OFF, query the substrate directly).
  8. Tourable-share of catalog: For ~30 randomly-sampled trailing-90-day flowsheet artists (weighted by play count), manually check whether they tour at all (active band, not reissue label / inactive / non-touring genre). The fraction anchors Bandsintown-partnership ROI.
  9. RHP-coverage upper bound: For the same ~30 artists, manually check for any 6-month-out Triangle date via Bandsintown.com / Songkick / artist's own site. The fraction the 5 RHP rooms catch is the "are 5 rooms enough" answer.
  10. Per-venue breakdown: How does overlap differ across CC Main, CC Back Room, Local 506, Haw River, Motorco? Informs venue-extension priority.
  11. Push-cadence ceiling math: For each match rule and listener stratum, compute expected pushes per opt-in user per week. Filter to rules that clear a ≤1 push/week UX ceiling.
  12. Geographic share: Non-Triangle listener share. List the cheap-signal sources up front (App Store Connect download geo, BS /proxy/* access logs, wxyc.org Cloudflare analytics, social-follower geo) and use the most accessible one rather than defaulting to "unknown."

Methodology

Two calibration paths — do both; they cross-check each other.

Path A — Backward-look (no scraper data needed)

  1. Hand-pull or one-shot-run the parser against next 90 days of CC + L506. The parser lives on feature/venues-concerts-schema (post-merge of feat(venue-events-scraper): RHP venue scraper job #1345, 2026-06-09); invoke parseEventPage (jobs/venue-events-scraper/parse.ts:186) via a one-off script against live HTML or fixtures. Runnable from the branch today — Path A does not block on feat(schema): add venues + concerts tables for touring-events #1348 landing.
  2. Normalize artist names; produce three resolved sets:
    • Headliners only (parsed.headlining_artist)
    • Headliners + supporting acts (union with parsed.supporting_artists)
    • Each routed through the existing alias machinery (or substrate query if runtime flag is off)
  3. JOIN against:
    • wxyc_schema.library (in-catalog rate, both strict and strict-including-alternate_artist_name)
    • wxyc_schema.flowsheet filtered by start_time >= now() - interval '30 days' (recently-played rate)
    • wxyc_schema.rotation filtered to active heavy/medium bins (if populated for the period)
  4. Drop concerts with status='cancelled' from the denominator.
  5. Report hit rates with Wilson confidence intervals on the pooled 12-month window; plot per-month point estimates separately.
  6. FP-rate audit: sample 50 strict-only and 50 normalized-only matches; manually adjudicate; report FP%.

Path A sidecar — Wayback seasonal-amplitude check

The seasonal-shape claim in Background (bimodal Sep–Nov + Feb–May peaks) is currently an assertion. Path B can't confirm it for ~12 months. As a cheap proxy:

  1. Pull 4 Wayback Machine snapshots of https://catscradle.com/events/ and https://local506.com/events/, one per quarter across the trailing 12 months.
  2. For each snapshot, count distinct /event/<slug>/ links using the existing extractEventLinks regex (jobs/venue-events-scraper/parse.ts:33).
  3. Plot the per-quarter index-size as a coarse monthly-volume amplitude. Confirms or refutes the bimodal-peaks claim.

Stop at the index page. Archived event detail pages may not carry the "Event Markup for Official Venue Sites" JSON-LD marker (RHP plugin version drift across snapshots) — chasing per-event parsing isn't worth the effort. The amplitude question is answered by index sizes alone.

Path B — Forward-look (after scraper has ≥30 days of data)

Once concerts has ≥30 days of real upserts, run the join directly in SQL. Until the resolver pass (see Related) lands, JOIN on headlining_artist_raw rather than headlining_artist_id — the writer never sets the FK on insert (jobs/venue-events-scraper/writer.ts:139):

-- Per-day surfaceable concert count, library match (raw-name join).
SELECT date_trunc('day', c.starts_at)::date AS day, COUNT(DISTINCT c.id)
FROM wxyc_schema.concerts c
JOIN wxyc_schema.library l
  ON lower(regexp_replace(c.headlining_artist_raw, '^the\s+', '', 'i'))
   = lower(regexp_replace(l.artist_name, '^the\s+', '', 'i'))
WHERE c.starts_at BETWEEN now() AND now() + interval '90 days'
  AND c.status <> 'cancelled'
GROUP BY 1 ORDER BY 1;

-- Recently-played match, raw-name join.
SELECT date_trunc('day', c.starts_at)::date AS day, COUNT(DISTINCT c.id)
FROM wxyc_schema.concerts c
JOIN wxyc_schema.artists a
  ON lower(regexp_replace(c.headlining_artist_raw, '^the\s+', '', 'i'))
   = lower(regexp_replace(a.name, '^the\s+', '', 'i'))
JOIN wxyc_schema.flowsheet f ON f.artist_id = a.id
WHERE f.start_time >= now() - interval '30 days'
  AND c.starts_at BETWEEN now() AND now() + interval '90 days'
  AND c.status <> 'cancelled'
GROUP BY 1 ORDER BY 1;

Also slice by month over the trailing year of concerts data once available, to confirm/refute the bimodal seasonal pattern (Sep–Nov + Feb–May peaks).

Path B re-runs on the canonical FK once the resolver pass lands — see Related.

Time-until-show and per-venue slices

For both paths, bucket results by:

  • Time-until-show: ≤14d / 15–60d / 61d+
  • Venue: cats-cradle / cats-cradle-back-room / haw-river-ballroom / motorco-music-hall / local-506

Tourable-share + RHP-coverage spot-check

Sample 30 artists from the trailing-90-day flowsheet, weighted by play count. For each:

  1. Tourable: does the artist currently tour? Y/N (with reason on N: inactive, reissue, classical, etc.)
  2. Triangle date in next 6 months: Y/N via Bandsintown.com / Songkick / artist's own site.
  3. Caught by RHP slice: Y/N (does the date land at CC / Back Room / Haw River / Motorco / Local 506?).

Report tourable% and RHP-catch% with Wilson CI on n=30.

Sample-size guardrail

If trailing-90 sample at CC + L506 is <50 distinct touring artists, extend the lookback to 180 days for the catalog/flowsheet join rates and flag the extension in the memo.

Push-cadence math

For each match rule R and each listener stratum S:

pushes_per_user_per_week(R, S) = overlap_rate(R, S) × concerts_per_week × P(R matches | concert)

Report which (R, S) combinations clear ≤1 push/user/week.

Match-quality sensitivity

Three strategies, reported side by side:

  1. Strict-name: LOWER(library.artist_name) = LOWER(touring_artist) after stripping leading "The "
  2. Strict-name-with-alt: also match library.alternate_artist_name
  3. Alias-aware: route through artist_search_alias substrate (see fix(artist-search-alias-consumer): coerce nullable binds to null (closes BS#1300) #1307; record runtime flag state at run-date)

For each: hit rate, delta vs. strict-name, manually-adjudicated false-positive rate (n=50 each from strict-only and normalized-only).

If alias delta > 20%, flag as a follow-up to dial in normalization before shipping the iOS surface.

Data sources

Source How to access Notes
wxyc_schema.library, wxyc_schema.flowsheet, wxyc_schema.artists, wxyc_schema.artist_search_alias, wxyc_schema.rotation Read-only via BS EC2 → prod RDS Per-turn auth required before prod-read
wxyc_schema.concerts, wxyc_schema.venues Same; populated by scraper after #1345 lands Empty until scraper has run ≥1 night
Live CC/L506 event HTML for Path A https://catscradle.com/events/ and https://local506.com/events/ Polite UA, ≤4 concurrent
Bandsintown.com / Songkick / artist sites for tourable + RHP-coverage spot-check Manual web n=30 sample
Geo signals App Store Connect download geo; BS /proxy/* access logs (check retention on EC2/CloudWatch); wxyc.org Cloudflare analytics; social-follower geo. Use most accessible. Don't default to "unknown" without trying these first

Output

Single markdown memo committed to plans/touring-events/frequency-analysis.md in the wxyc-workspace meta-repo (sibling to bandsintown-outreach.md). Memo should contain:

  • 1-paragraph TL;DR with the headline rate (loose + tight + tightest)
  • Methodology section reproducing the queries
  • Tables: shows/month per venue, hit rates by match rule (incl. FP%), seasonal slice, time-until-show slice, listener-stratum slice
  • Plot: monthly counts over prior 12 months (seasonal confirmation)
  • Tourable-share + RHP-coverage spot-check numbers with Wilson CI
  • Push-cadence math table, with the (R, S) cells that clear ≤1/week highlighted
  • Sensitivity discussion (match quality, geo, sample size)
  • Recommendation via decision matrix — pre-filled per output:
Trailing-30d hit rate (per anonymous-user stratum) Surface
<1/mo inline artist-detail badge only
1–4/mo badge + opt-in weekly digest
>4/mo dedicated tab + opt-in push (capped at ≤1/week)
  • Recommendation on venue extension (yes/no, which next) and Bandsintown-partnership leverage ratio.

Raw counts and per-row sample data committed as CSV next to the memo. Queries pinned to BS commit SHA.

Constraints

  • Read-only on prod. Do not write to flowsheet/library/concerts. The scraper itself does writes; the analysis does not.
  • Develop queries against the staging clone before running on prod.
  • Do not propose RDS storage bumps; prefer reversible levers.
  • Stay scoped to CC + L506 for the scraped sample. Tourable-share + RHP-coverage spot-check intentionally reaches beyond it for the upper-bound number.

Estimated effort

  • Path A (backward-look + raw HTML pull + memo skeleton): 3–4 hours
  • Path A Wayback sidecar (n=4 snapshots, index-only): ~1 hour
  • Tourable-share + RHP-coverage spot-check (manual, n=30): 1–2 hours
  • Memo writeup + decision matrix: 2 hours
  • Path B (forward-look): tracked as a separate follow-up; see Related.

Related

Acceptance

  • Path A run with current calendar; hit rates reported with Wilson CI on pooled window and per-month point estimates plotted
  • Path A Wayback sidecar: 4-quarter snapshot index counts; confirms or refutes the bimodal seasonal claim
  • Sensitivity table: strict vs strict-with-alt vs alias-aware, including FP% per rule
  • Supporting-acts variant reported alongside headliner-only
  • Per-venue breakdown table
  • Time-until-show bucket table (≤14d / 15–60d / 61d+)
  • Listener-stratum split: DJs / DJs-with-show-this-month / anonymous-app users
  • Tourable-share spot-check (n=30) + RHP-coverage spot-check (n=30) with Wilson CI
  • Push-cadence math table; rules that clear ≤1/week highlighted
  • Lead-time distribution histogram
  • Seasonal-shape plot (monthly counts prior 12 months)
  • Cancellation/reschedule filter applied (status='cancelled' dropped from denominator)
  • Alias-substrate state-as-of-run-date recorded
  • Geographic-share addressed (measured via one of the listed signals; "unknown" only if all listed signals are inaccessible)
  • Memo lands at plans/touring-events/frequency-analysis.md in wxyc-workspace
  • Raw counts committed as CSV; queries pinned to BS commit SHA
  • Recommendation via decision matrix; venue-extension + Bandsintown-leverage call concrete
  • Path B follow-up issue filed and Blocked-by-linked

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions