You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The corpus spans 1990–2001, but every hand-verified page (n=19 measured per project_empirical_failure_modes_n19, with a ~60-page set in progress) is from April or August 1990. Several rates measured against 1990 — per-row edit rate (~76%), type_raw blank rate, confidence anti-calibration — are load-bearing for Phase 2 decisions: re-OCR queue threshold, reconciliation auto-accept thresholds (issue #62), and prompt-iteration eval gating. Carrying those numbers forward to later years implicitly assumes the model fails the same way on 1995 or 2001 handwriting and printed forms. We don't know that yet.
A second full 60-page review per year is unaffordable (11 post-1990 years × 60 ≈ 660 pages of hand work). The design here is a per-year drift spot check that's cheap enough to run every year, with a clear gate that triggers a full 60-page recalibration only when drift is detected.
Sources of drift across the corpus
Four axes that drive drift and are not aligned with calendar years:
Jock cohort. DJs cycle every few semesters. The handwriting clusters that dominate 1990 are not the ones that dominate 1995 or 2001.
Printed form drift. The flowsheet template changed at least once in this window — different column widths, different printed labels. core/page_layout.py's projection-profile detector is the right tool to identify template breakpoints from extracted layouts alone.
Type-column vocabulary. The H/M/L + Std/O/R rotation conventions evolved. Letters that look like "doodles" in 1990 may be valid in 2001.
Paper / scan quality. Earlier pages tend to be photocopied originals; later pages were sometimes faxed or re-scanned. Scan quality changes the model's failure-mode mix.
Design: two buckets per year
Splitting the sample into two buckets with different jobs is the single most important move. Conflating them invalidates the metrics.
Random-stratified bucket is for unbiased measurement against the 1990 baseline. Anomaly-weighted sampling overstates the error rate by construction — anomalies are the hard cases — so the drift test needs honest, distribution-matched estimates.
Anomaly-targeted bucket is for failure-mode discovery — finding patterns the prompt doesn't handle yet. Mining anomalies does not tell us "is this year worse than 1990," but it does tell us "what's broken that we haven't seen before."
Recommended split: ~70% random / ~30% anomaly. For a 13-page spot check that's 9 random + 4 anomaly. Track metrics separately for each bucket; never collapse them into a single number.
Random-stratified bucket: what to stratify on
Pick axes that drive within-year variance in handwriting and layout. Calendar months are a proxy for what we actually care about and miss the real signal.
Axis
Why
How
Jock identity
Each jock has a distinct hand; cohort is the dominant within-year variance
Sample to match the year's airtime distribution by quadrants[*].jock_raw
Page position in file
Page 1 vs page 30: handwriting fatigues over a multi-week file
Stratify into early / mid / late thirds
Day-of-week + overnight vs daytime
Overnight slots are faster/sloppier; weekend jocks differ from weekday regulars
Bucket from page_date_raw + hour_raw
Form-era
Template revisions are a categorical variance axis
Cluster pages by page_layout outputs (header_bottom_y, body_mid_y, column_mid_x) on the extracted corpus
Anomaly bucket: what counts
Each anomaly type should be represented at least once per spot check:
Lowest-decile model confidence. Weak signal (confidence is anti-calibrated per n=19), but still picks up some real hard cases.
oddities non-empty. Two patterns are already documented: overlay annotations (project_oddity_overlay_pattern) and doodle-in-type-column (project_oddity_doodle_entry). Any year showing new oddities tags is worth flagging immediately.
Layout outliers. Pages where core/page_layout.py reports values unusual relative to the year median — template-drift candidates.
Row-count outliers. Pages emitting p90 rows for their hour-block. Both ends matter: too few = under-emit (the rare-but-costly failure mode per n=19); too many = spurious splits.
type_raw vocabulary outliers. Pages where the type column emits letters outside the year's modal alphabet. Useful precisely because the rotation conventions change across years — this fires on legitimate drift, not just errors, and either case is worth verifying.
Cross-prompt disagreement. Re-extract a page with two prompt variants; pages where they disagree on row count or content are the highest-leverage verification (active-learning principle).
Free starting anomalies (1990)
The bbox sweep (project_bbox_sweep_result) named 42 strict-detector pathology candidates across 5 named bundles. These should be in the 1990 verification set automatically — they're known unresolved anomalies the existing detector already surfaces. Re-run the same detector on each subsequent year; equivalent pathology lists become the anomaly bucket's seed for that year.
Sample size and detectable drift
13 pages × ~30 rows ≈ 400 verified rows per spot-check year. Two-proportion z-test against the 1990 baseline (~1800 rows from 60 pages) can detect a row-edit-rate shift of roughly ±10 percentage points (76% → 66% or → 86%) at p<0.05. Smaller drifts are invisible — fine for a screen, but if we want to detect ±5pp shifts the spot check needs to double to ~25 pages.
Track each failure-mode metric separately rather than rolling them into one accuracy number. Per n=19, under-emit (3%) and row-edit (76%) have wildly different rates and wildly different fix costs; collapsing them hides drift in the rare-but-expensive failure modes.
Anti-patterns
Sampling only low-confidence pages. Biases all metrics; signal is weak.
Sampling uniformly across months instead of across jocks. Misses the actual variance axis.
Sampling only "interesting-looking" pages by eye. Confirmation bias.
Mixing anomaly and random into one bucket. Destroys unbiased metrics.
Sampling the same page position (first page, last page) across files. Page-position correlates with handwriting fatigue.
Per-year recipe
Extract the year's pages with the current pipeline (no hand verification yet).
Hand-verification effort budget is the binding constraint: spot checks should be designed to fit in a single Alex session per year (~13 pages plausible; 25 pages stretch). Anything more triggers the "actually do the full 60" branch.
Anomaly-bucket size scales with the failure-mode discovery rate, not coverage. Once a year's anomaly bucket stops surfacing new patterns, downgrade — don't grow it indefinitely.
Acceptance criteria
scripts/sample_year_for_drift.py emits a per-page list of (page, bucket=random|anomaly, stratification keys, anomaly reasons) for a given year without any hand verification.
Form-era cluster boundaries documented for 1990–2001 (output of page_layout cluster analysis on the full extracted corpus).
Drift-test metric definitions match the 1990 baseline numerator/denominator (per-row edit rate, per-page under-emit rate, type_raw blank rate, F1 for continuation / double_height / crossed_out).
Unit tests demonstrate the sampler stays within strata proportions on synthetic input and labels each output page with its bucket and anomaly reason.
Out of scope for this ticket
The full 60-page 1990 calibration itself (in progress separately).
Sequential-testing / multiple-comparison correction across years. Add once we have ≥3 years of data and false-positive drift detection becomes a real risk.
Track-grain reconciliation drift. Reconciliation calibration is keyed off the artist axis, which is decoupled from handwriting drift.
Acting on the drift verdict (queueing a year's full 60-page calibration). The sampler emits the plan; running it is a separate workflow.
Notes for implementer
Treat the random and anomaly buckets as having different downstream consumers. The random bucket feeds the drift gate (statistical test). The anomaly bucket feeds prompt iteration (qualitative review). Keeping the labels distinct in the output file is non-negotiable.
The form-era clustering should be done once across the full extracted corpus, not re-derived per year. A page's era is a fixed property of its scan; the per-year sampler reads the precomputed cluster id.
The sampler is deterministic given a seed. Same year + same seed → same sample. This matters for reproducibility when the same year is sampled multiple times (e.g., after a prompt change shifts row-count outliers).
The anomaly bucket should de-duplicate against the random bucket. If a randomly chosen page also matches an anomaly criterion, leave it in the random bucket and draw a different anomaly.
Related
project_empirical_failure_modes_n19.md — the 1990 baseline this calibrates against
project_bbox_sweep_result.md — the 5 named bundles + 42 pathology candidates that seed the 1990 anomaly bucket
project_oddity_overlay_pattern.md, project_oddity_doodle_entry.md — known oddity patterns
Spike: 5-page Gemini Flash calibration #43 — Gemini Flash calibration spike (separate calibration; cross-linked in case the Flash decision ships first and shifts the model under us)
Context
The corpus spans 1990–2001, but every hand-verified page (n=19 measured per
project_empirical_failure_modes_n19, with a ~60-page set in progress) is from April or August 1990. Several rates measured against 1990 — per-row edit rate (~76%), type_raw blank rate, confidence anti-calibration — are load-bearing for Phase 2 decisions: re-OCR queue threshold, reconciliation auto-accept thresholds (issue #62), and prompt-iteration eval gating. Carrying those numbers forward to later years implicitly assumes the model fails the same way on 1995 or 2001 handwriting and printed forms. We don't know that yet.A second full 60-page review per year is unaffordable (11 post-1990 years × 60 ≈ 660 pages of hand work). The design here is a per-year drift spot check that's cheap enough to run every year, with a clear gate that triggers a full 60-page recalibration only when drift is detected.
Sources of drift across the corpus
Four axes that drive drift and are not aligned with calendar years:
core/page_layout.py's projection-profile detector is the right tool to identify template breakpoints from extracted layouts alone.Design: two buckets per year
Splitting the sample into two buckets with different jobs is the single most important move. Conflating them invalidates the metrics.
Random-stratified bucket is for unbiased measurement against the 1990 baseline. Anomaly-weighted sampling overstates the error rate by construction — anomalies are the hard cases — so the drift test needs honest, distribution-matched estimates.
Anomaly-targeted bucket is for failure-mode discovery — finding patterns the prompt doesn't handle yet. Mining anomalies does not tell us "is this year worse than 1990," but it does tell us "what's broken that we haven't seen before."
Recommended split: ~70% random / ~30% anomaly. For a 13-page spot check that's 9 random + 4 anomaly. Track metrics separately for each bucket; never collapse them into a single number.
Random-stratified bucket: what to stratify on
Pick axes that drive within-year variance in handwriting and layout. Calendar months are a proxy for what we actually care about and miss the real signal.
quadrants[*].jock_rawpage_date_raw+hour_rawpage_layoutoutputs (header_bottom_y,body_mid_y,column_mid_x) on the extracted corpusAnomaly bucket: what counts
Each anomaly type should be represented at least once per spot check:
odditiesnon-empty. Two patterns are already documented: overlay annotations (project_oddity_overlay_pattern) and doodle-in-type-column (project_oddity_doodle_entry). Any year showing new oddities tags is worth flagging immediately.core/page_layout.pyreports values unusual relative to the year median — template-drift candidates.type_rawvocabulary outliers. Pages where the type column emits letters outside the year's modal alphabet. Useful precisely because the rotation conventions change across years — this fires on legitimate drift, not just errors, and either case is worth verifying.Free starting anomalies (1990)
The bbox sweep (
project_bbox_sweep_result) named 42 strict-detector pathology candidates across 5 named bundles. These should be in the 1990 verification set automatically — they're known unresolved anomalies the existing detector already surfaces. Re-run the same detector on each subsequent year; equivalent pathology lists become the anomaly bucket's seed for that year.Sample size and detectable drift
13 pages × ~30 rows ≈ 400 verified rows per spot-check year. Two-proportion z-test against the 1990 baseline (~1800 rows from 60 pages) can detect a row-edit-rate shift of roughly ±10 percentage points (76% → 66% or → 86%) at p<0.05. Smaller drifts are invisible — fine for a screen, but if we want to detect ±5pp shifts the spot check needs to double to ~25 pages.
Track each failure-mode metric separately rather than rolling them into one accuracy number. Per n=19, under-emit (3%) and row-edit (76%) have wildly different rates and wildly different fix costs; collapsing them hides drift in the rare-but-expensive failure modes.
Anti-patterns
Per-year recipe
Where
scripts/sample_year_for_drift.py(new) — emits the per-year sample plan, two-bucket structure preserved.core/page_layout.py— reuse the projection-profile detector for both form-era clustering and layout-outlier anomaly detection.data/verifier-pulled-refresh/— the in-progress 60-page 1990 baseline this calibrates against.Constraints
corrected_artistsemantics + design rule revisited #62) are not blocked by drift testing — the drift gate is separate and runs against the row-text metric set.Acceptance criteria
scripts/sample_year_for_drift.pyemits a per-page list of(page, bucket=random|anomaly, stratification keys, anomaly reasons)for a given year without any hand verification.page_layout), row-count outliers, oddities-tagged, type-vocab outliers, lowest-decile confidence.page_layoutcluster analysis on the full extracted corpus).Out of scope for this ticket
Notes for implementer
Related
project_empirical_failure_modes_n19.md— the 1990 baseline this calibrates againstproject_bbox_sweep_result.md— the 5 named bundles + 42 pathology candidates that seed the 1990 anomaly bucketproject_oddity_overlay_pattern.md,project_oddity_doodle_entry.md— known oddity patternscorrected_artistsemantics + design rule revisited #62 — Phase 2 reconciliation (parallel calibration, different axis)