The venue-events scraper (#1343, PRs #1345 + #1348) populates wxyc_schema.concerts nightly from Rockhouse Partners' WordPress schema.org payload. Initial coverage is five rooms: Cat's Cradle Main + Back Room, Haw River Ballroom, Motorco Music Hall (all under the cats-cradle RHP partner), plus Local 506 (local-506 RHP partner).
Before investing in iOS surface work, we want to size the rate at which a user would see a "WXYC artist is playing locally" event. The answer is load-bearing for three downstream decisions:
Whether the surface lives in its own tab, an artist-detail badge, or a push channel.
Whether five rooms is enough or we need to extend to additional Triangle venues (Pinhook, Cary Theater, Lincoln Theatre, Nightlight, ArtsCenter) or wait on Bandsintown for non-Triangle / non-RHP coverage.
Notification cadence ceilings.
A back-of-envelope sizing put this at roughly 1 surfaceable show every 1–3 days under a loose ("in library") match rule and every 2–4 days under a tight ("played in last 30 days") rule, with a ~2x seasonal swing. This ticket replaces those estimates with measured numbers and forces a concrete recommendation via a decision matrix.
Objective
Produce a written analysis (markdown memo, ~4–6 pages with tables) answering:
Concert volume: How many distinct touring acts hit these five rooms per year? Per month? Per venue? What's the seasonal shape?
Lead-time distribution: How far in advance do RHP venues publish? What share of scraped concerts are ≤14d / 15–60d / 61d+ out?
Catalog overlap: What fraction of those touring acts are in wxyc_schema.library? Strict (artist_name) vs. strict-including-alternate_artist_name vs. alias-aware.
Flowsheet overlap: What fraction have been played on wxyc_schema.flowsheet in the trailing 30/90/365 days?
Per-user surfacing rate, stratified by listener type, under at least three match rules:
Loose: artist in library (broadest)
Tight: artist played on flowsheet in trailing 30 days
Tightest: artist in heavy/medium rotation in trailing 90 days
Strata to report separately: DJs, DJs with a show in the next 30 days, anonymous iOS app users.
DJ strata are per-user — "match" = artist appears in that DJ's trailing-90d flowsheet entries. "Next 30 days" comes from wxyc_schema.schedule + wxyc_schema.shift_covers.
Anonymous iOS app users have no per-device taste signal (anonymous_devices stores only identity + rate-limit state, no library/flowsheet linkage), so this stratum is reported as a station-level aggregate proxy — fraction of upcoming concerts whose artist appears in {library / flowsheet last 30d / rotation last 90d} — not a per-user rate.
Headliner vs. headliner+support: How much does counting supporting_artists_raw shift the rate? Many WXYC artists tour as openers, not headliners.
False-positive rate per rule: manually adjudicate 50 random strict-only matches and 50 random normalized-only matches; report FP%.
Record alias-substrate state at run date (e.g., runtime LATERAL flag from PR 5: ON or OFF; if OFF, query the substrate directly).
Tourable-share of catalog: For ~30 randomly-sampled trailing-90-day flowsheet artists (weighted by play count), manually check whether they tour at all (active band, not reissue label / inactive / non-touring genre). The fraction anchors Bandsintown-partnership ROI.
RHP-coverage upper bound: For the same ~30 artists, manually check for any 6-month-out Triangle date via Bandsintown.com / Songkick / artist's own site. The fraction the 5 RHP rooms catch is the "are 5 rooms enough" answer.
Per-venue breakdown: How does overlap differ across CC Main, CC Back Room, Local 506, Haw River, Motorco? Informs venue-extension priority.
Push-cadence ceiling math: For each match rule and listener stratum, compute expected pushes per opt-in user per week. Filter to rules that clear a ≤1 push/week UX ceiling.
Geographic share: Non-Triangle listener share. List the cheap-signal sources up front (App Store Connect download geo, BS /proxy/* access logs, wxyc.org Cloudflare analytics, social-follower geo) and use the most accessible one rather than defaulting to "unknown."
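The lead-time question above (share of scraped concerts at ≤14d / 15–60d / 61d+) can be sketched as a small bucketing helper. This is only an illustration: the function names and the (scraped_on, starts_on) input shape are assumptions, not part of the scraper.

```python
from collections import Counter
from datetime import date


def lead_time_bucket(scraped_on: date, starts_on: date) -> str:
    """Bucket a concert by days between scrape date and show date."""
    days = (starts_on - scraped_on).days
    if days < 0:
        raise ValueError("show date is before scrape date")
    if days <= 14:
        return "<=14d"
    if days <= 60:
        return "15-60d"
    return "61d+"


def bucket_shares(pairs: list[tuple[date, date]]) -> dict[str, float]:
    """Fraction of concerts per lead-time bucket (hypothetical helper)."""
    counts = Counter(lead_time_bucket(s, e) for s, e in pairs)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}
```

The scrape date is used as the reference point because the question is how far ahead venues publish, not how far away the show is today.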
Methodology
Two calibration paths — do both; they cross-check each other.
The seasonal-shape claim in Background (bimodal Sep–Nov + Feb–May peaks) is currently an assertion. Path B can't confirm it for ~12 months. As a cheap proxy:
Pull 4 Wayback Machine snapshots of https://catscradle.com/events/ and https://local506.com/events/, one per quarter across the trailing 12 months.
For each snapshot, count distinct /event/<slug>/ links using the existing extractEventLinks regex (jobs/venue-events-scraper/parse.ts:33).
Plot the per-quarter index-size as a coarse monthly-volume amplitude. Confirms or refutes the bimodal-peaks claim.
Stop at the index page. Archived event detail pages may not carry the "Event Markup for Official Venue Sites" JSON-LD marker (RHP plugin version drift across snapshots) — chasing per-event parsing isn't worth the effort. The amplitude question is answered by index sizes alone.
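A minimal sketch of the index-size count for one archived snapshot follows. The regex here is a stand-in written for illustration; it is not the repo's extractEventLinks pattern, and the sample HTML (including the Wayback-style prefix) is made up.

```python
import re

# Hypothetical pattern: captures the slug of any href containing /event/<slug>.
# "/events/" index links do not match because the pattern requires "/event/".
EVENT_LINK = re.compile(r'href="[^"]*?/event/([^/"?#]+)')


def count_event_slugs(html: str) -> int:
    """Number of distinct event slugs linked from an index page."""
    return len(set(EVENT_LINK.findall(html)))


sample_html = """
<a href="/web/20240101000000/https://catscradle.com/event/band-a/">Band A</a>
<a href="https://catscradle.com/event/band-a/">Band A (tickets)</a>
<a href="https://catscradle.com/event/band-b/">Band B</a>
<a href="https://catscradle.com/events/">All events</a>
"""
print(count_event_slugs(sample_html))
```

Deduplicating on the slug rather than the full URL keeps Wayback-prefixed and live links to the same event from being counted twice.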
Path B — Forward-look (after scraper has ≥30 days of data)
Once concerts has ≥30 days of real upserts, run the join directly in SQL. Until the resolver pass (see Related) lands, JOIN on headlining_artist_raw rather than headlining_artist_id — the writer never sets the FK on insert (jobs/venue-events-scraper/writer.ts:139):
-- Per-day surfaceable concert count, library match (raw-name join).
SELECT date_trunc('day', c.starts_at)::date AS day, COUNT(DISTINCT c.id)
FROM wxyc_schema.concerts c
JOIN wxyc_schema.library l
  ON lower(regexp_replace(c.headlining_artist_raw, '^the\s+', '', 'i'))
   = lower(regexp_replace(l.artist_name, '^the\s+', '', 'i'))
WHERE c.starts_at BETWEEN now() AND now() + interval '90 days'
  AND c.status <> 'cancelled'
GROUP BY 1
ORDER BY 1;

-- Recently-played match, raw-name join.
SELECT date_trunc('day', c.starts_at)::date AS day, COUNT(DISTINCT c.id)
FROM wxyc_schema.concerts c
JOIN wxyc_schema.artists a
  ON lower(regexp_replace(c.headlining_artist_raw, '^the\s+', '', 'i'))
   = lower(regexp_replace(a.name, '^the\s+', '', 'i'))
JOIN wxyc_schema.flowsheet f ON f.artist_id = a.id
WHERE f.start_time >= now() - interval '30 days'
  AND c.starts_at BETWEEN now() AND now() + interval '90 days'
  AND c.status <> 'cancelled'
GROUP BY 1
ORDER BY 1;
Also slice by month over the trailing year of concerts data once available, to confirm/refute the bimodal seasonal pattern (Sep–Nov + Feb–May peaks).
Path B re-runs on the canonical FK once the resolver pass lands — see Related.
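For spot-checking matches outside the database (for example when adjudicating false positives by hand), the raw-name normalization used in the joins above can be mirrored in a few lines. This is a sketch of the same rule (strip one leading "the" plus whitespace, then lowercase); the helper names and the example artist strings are hypothetical.

```python
import re

# Mirrors lower(regexp_replace(name, '^the\s+', '', 'i')) from the SQL joins.
_LEADING_THE = re.compile(r"^the\s+", re.IGNORECASE)


def normalize_artist(name: str) -> str:
    """Strip a leading 'The ' (any case) and lowercase; no other cleanup."""
    return _LEADING_THE.sub("", name, count=1).lower()


def raw_name_matches(concert_names: list[str], catalog_names: list[str]) -> set[str]:
    """Normalized names present on both sides (hypothetical helper)."""
    catalog = {normalize_artist(n) for n in catalog_names}
    return {normalize_artist(n) for n in concert_names} & catalog
```

Like the SQL, this deliberately does not trim whitespace or fold punctuation, so its hit rate should agree with the raw-name join rather than with the alias-aware rule.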
Sample 30 artists from the trailing-90-day flowsheet, weighted by play count. For each:
Tourable: does the artist currently tour? Y/N (with reason on N: inactive, reissue, classical, etc.)
Triangle date in next 6 months: Y/N via Bandsintown.com / Songkick / artist's own site.
Caught by RHP slice: Y/N (does the date land at CC / Back Room / Haw River / Motorco / Local 506?).
Report tourable% and RHP-catch% with Wilson CI on n=30.
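With n=30 the interval is wide, so it is worth computing it explicitly rather than reporting a bare percentage. A minimal sketch of the Wilson score interval follows; z=1.96 (roughly 95%) is an assumption, since the ticket does not name a confidence level.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative only: 18 of 30 sampled artists judged tourable.
low, high = wilson_interval(18, 30)
print(f"tourable share: 60% (95% CI {low:.1%} to {high:.1%})")
```

For 18 of 30 the interval spans roughly 42% to 75%, which is the kind of spread the memo should state next to the point estimate.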
Sample-size guardrail
If trailing-90 sample at CC + L506 is <50 distinct touring artists, extend the lookback to 180 days for the catalog/flowsheet join rates and flag the extension in the memo.
Push-cadence math
For each match rule R and each listener stratum S:
Bandsintown.com / Songkick / artist sites for tourable + RHP-coverage spot-check | Manual web | n=30 sample
Geo signals | App Store Connect download geo; BS /proxy/* access logs (check retention on EC2/CloudWatch); wxyc.org Cloudflare analytics; social-follower geo. Use most accessible. | Don't default to "unknown" without trying these first
Output
Single markdown memo committed to plans/touring-events/frequency-analysis.md in the wxyc-workspace meta-repo (sibling to bandsintown-outreach.md). Memo should contain:
1-paragraph TL;DR with the headline rate (loose + tight + tightest)
Methodology section reproducing the queries
Tables: shows/month per venue, hit rates by match rule (incl. FP%), seasonal slice, time-until-show slice, listener-stratum slice
Plot: monthly counts over prior 12 months (seasonal confirmation)
Tourable-share + RHP-coverage spot-check numbers with Wilson CI
Push-cadence math table, with the (R, S) cells that clear ≤1/week highlighted
Path A — Backward-look (no scraper data needed)
feature/venues-concerts-schema (post-merge of feat(venue-events-scraper): RHP venue scraper job #1345, 2026-06-09); invoke parseEventPage (jobs/venue-events-scraper/parse.ts:186) via a one-off script against live HTML or fixtures. Runnable from the branch today; Path A does not block on feat(schema): add venues + concerts tables for touring-events #1348 landing.
parsed.headlining_artist)
parsed.supporting_artists)
wxyc_schema.library (in-catalog rate, both strict and strict-including-alternate_artist_name)
wxyc_schema.flowsheet filtered by start_time >= now() - interval '30 days' (recently-played rate)
wxyc_schema.rotation filtered to active heavy/medium bins (if populated for the period)
status='cancelled' from the denominator.
Path A sidecar: Wayback seasonal-amplitude check
The seasonal-shape claim in Background (bimodal Sep–Nov + Feb–May peaks) is currently an assertion. Path B can't confirm it for ~12 months. As a cheap proxy:
Pull 4 Wayback Machine snapshots of https://catscradle.com/events/ and https://local506.com/events/, one per quarter across the trailing 12 months.
For each snapshot, count distinct /event/<slug>/ links using the existing extractEventLinks regex (jobs/venue-events-scraper/parse.ts:33).
Plot the per-quarter index-size as a coarse monthly-volume amplitude. Confirms or refutes the bimodal-peaks claim.
Stop at the index page. Archived event detail pages may not carry the "Event Markup for Official Venue Sites" JSON-LD marker (RHP plugin version drift across snapshots), so chasing per-event parsing isn't worth the effort. The amplitude question is answered by index sizes alone.
Path B — Forward-look (after scraper has ≥30 days of data)
Once concerts has ≥30 days of real upserts, run the join directly in SQL. Until the resolver pass (see Related) lands, JOIN on headlining_artist_raw rather than headlining_artist_id; the writer never sets the FK on insert (jobs/venue-events-scraper/writer.ts:139).
Also slice by month over the trailing year of concerts data once available, to confirm/refute the bimodal seasonal pattern (Sep–Nov + Feb–May peaks).
Path B re-runs on the canonical FK once the resolver pass lands (see Related).
Time-until-show and per-venue slices
For both paths, bucket results by:
cats-cradle / cats-cradle-back-room / haw-river-ballroom / motorco-music-hall / local-506
Tourable-share + RHP-coverage spot-check
Sample 30 artists from the trailing-90-day flowsheet, weighted by play count. For each:
Report tourable% and RHP-catch% with Wilson CI on n=30.
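Drawing the play-count-weighted sample can be sketched as below. The ticket does not say whether sampling is with or without replacement; this sketch assumes without replacement (30 distinct artists) and uses the Efraimidis-Spirakis key method, which is a swapped-in technique rather than anything the source prescribes. The input mapping of artist to play count is hypothetical.

```python
import random
from typing import Optional


def weighted_sample(play_counts: dict[str, int], k: int, seed: Optional[int] = None) -> list[str]:
    """Pick up to k distinct artists, favouring higher play counts.

    Each artist gets key u ** (1 / weight) with u uniform in [0, 1);
    the k largest keys form a weighted sample without replacement.
    """
    rng = random.Random(seed)
    keyed = [
        (rng.random() ** (1.0 / weight), artist)
        for artist, weight in play_counts.items()
        if weight > 0  # zero-play artists can never be drawn
    ]
    keyed.sort(reverse=True)
    return [artist for _, artist in keyed[:k]]
```

Passing a fixed seed and recording it in the memo makes the sample reproducible alongside the pinned queries.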
Sample-size guardrail
If trailing-90 sample at CC + L506 is <50 distinct touring artists, extend the lookback to 180 days for the catalog/flowsheet join rates and flag the extension in the memo.
Push-cadence math
For each match rule R and each listener stratum S:
Report which (R, S) combinations clear ≤1 push/user/week.
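A minimal sketch of the ceiling filter follows. It assumes one push per matched concert, so the weekly rate is just matched concerts divided by the window length in weeks; the source does not spell out the formula, and the counts in the example are placeholders, not measurements.

```python
def pushes_per_week(matched_concerts: int, window_days: int) -> float:
    """Expected pushes per opt-in user per week, one push per matched concert."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return matched_concerts * 7 / window_days


def cells_under_ceiling(
    counts: dict[tuple[str, str], int], window_days: int, ceiling: float = 1.0
) -> dict[tuple[str, str], float]:
    """Keep the (rule, stratum) cells whose weekly rate is at or below the ceiling."""
    rates = {cell: pushes_per_week(n, window_days) for cell, n in counts.items()}
    return {cell: rate for cell, rate in rates.items() if rate <= ceiling}


# Placeholder counts over a 90-day window, keyed by (match rule, stratum).
example = {("loose", "dj"): 40, ("tight", "dj"): 10}
print(cells_under_ceiling(example, window_days=90))
```

With these placeholder numbers only the tight rule clears the ceiling (about 0.78 pushes per week against about 3.1 for the loose rule).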
Match-quality sensitivity
Three strategies, reported side by side:
LOWER(library.artist_name) = LOWER(touring_artist) after stripping leading "The "
library.alternate_artist_name
artist_search_alias substrate (see fix(artist-search-alias-consumer): coerce nullable binds to null (closes BS#1300) #1307; record runtime flag state at run-date)
For each: hit rate, delta vs. strict-name, manually-adjudicated false-positive rate (n=50 each from strict-only and normalized-only).
If alias delta > 20%, flag as a follow-up to dial in normalization before shipping the iOS surface.
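The 20% threshold can be read as either a relative lift or an absolute percentage-point gap; the ticket does not say which. The sketch below assumes a relative lift over the strict-name hit rate, and the counts in the test-style example are placeholders.

```python
def hit_rate(matched: int, total: int) -> float:
    """Share of touring acts matched under one strategy."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matched / total


def relative_delta(rate: float, baseline: float) -> float:
    """Lift of a strategy's hit rate over the strict-name baseline."""
    if baseline <= 0:
        raise ValueError("baseline hit rate must be positive")
    return (rate - baseline) / baseline


def needs_normalization_followup(alias_rate: float, strict_rate: float, threshold: float = 0.20) -> bool:
    """True when the alias-aware lift exceeds the threshold (assumed relative)."""
    return relative_delta(alias_rate, strict_rate) > threshold
```

If the memo instead reports the gap in percentage points, the same helpers apply with relative_delta replaced by a plain subtraction; stating which reading was used avoids ambiguity in the follow-up flag.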
Data sources
wxyc_schema.library, wxyc_schema.flowsheet, wxyc_schema.artists, wxyc_schema.artist_search_alias, wxyc_schema.rotation
wxyc_schema.concerts, wxyc_schema.venues
App Store Connect download geo; BS /proxy/* access logs (check retention on EC2/CloudWatch); wxyc.org Cloudflare analytics; social-follower geo. Use most accessible.
Output
Single markdown memo committed to plans/touring-events/frequency-analysis.md in the wxyc-workspace meta-repo (sibling to bandsintown-outreach.md). Memo should contain:
Raw counts and per-row sample data committed as CSV next to the memo. Queries pinned to BS commit SHA.
Constraints
Estimated effort
Related
feature/venue-events-scraper)
feature/venues-concerts-schema)
artist_search_alias substrate (informs alias-aware match rule and runtime-flag state to record)
concerts.headlining_artist_id via local artist + alias resolver #1372: backfill concerts.headlining_artist_id; gates the clean Path B rerun
concerts data #1373: re-run Path B after ≥30 days of concerts data
Acceptance
status='cancelled' dropped from denominator)
plans/touring-events/frequency-analysis.md in wxyc-workspace
Blocked-by-linked