Phase 2 calibration: per-year drift sampling design (stratified + anomaly buckets)

## Context

The corpus spans 1990–2001, but every hand-verified page (n=19 measured per `project_empirical_failure_modes_n19`, with a ~60-page set in progress) is from April or August 1990. Several rates measured against 1990 — per-row edit rate (~76%), type_raw blank rate, confidence anti-calibration — are load-bearing for Phase 2 decisions: re-OCR queue threshold, reconciliation auto-accept thresholds (issue #62), and prompt-iteration eval gating. Carrying those numbers forward to later years implicitly assumes the model fails the same way on 1995 or 2001 handwriting and printed forms. We don't know that yet.

A second full 60-page review per year is unaffordable (11 post-1990 years × 60 ≈ 660 pages of hand work). The design here is a per-year **drift spot check** that's cheap enough to run every year, with a clear gate that triggers a full 60-page recalibration only when drift is detected.

## Sources of drift across the corpus

Four axes that drive drift and are not aligned with calendar years:

- **Jock cohort.** DJs cycle every few semesters. The handwriting clusters that dominate 1990 are not the ones that dominate 1995 or 2001.
- **Printed form drift.** The flowsheet template changed at least once in this window — different column widths, different printed labels. `core/page_layout.py`'s projection-profile detector is the right tool to identify template breakpoints from extracted layouts alone.
- **Type-column vocabulary.** The H/M/L + Std/O/R rotation conventions evolved. Letters that look like "doodles" in 1990 may be valid in 2001.
- **Paper / scan quality.** Earlier pages tend to be photocopied originals; later pages were sometimes faxed or re-scanned. Scan quality changes the model's failure-mode mix.

## Design: two buckets per year

Splitting the sample into two buckets with different jobs is the single most important move. Conflating them invalidates the metrics.

**Random-stratified bucket** is for unbiased measurement against the 1990 baseline. Anomaly-weighted sampling overstates the error rate by construction — anomalies are the hard cases — so the drift test needs honest, distribution-matched estimates.

**Anomaly-targeted bucket** is for failure-mode discovery — finding patterns the prompt doesn't handle yet. Mining anomalies does not tell us "is this year worse than 1990," but it does tell us "what's broken that we haven't seen before."

Recommended split: ~70% random / ~30% anomaly. For a 13-page spot check that's 9 random + 4 anomaly. Track metrics separately for each bucket; never collapse them into a single number.

## Random-stratified bucket: what to stratify on

Pick axes that drive within-year variance in handwriting and layout. Calendar months are a proxy for what we actually care about and miss the real signal.

| Axis | Why | How |
|---|---|---|
| Jock identity | Each jock has a distinct hand; cohort is the dominant within-year variance | Sample to match the year's airtime distribution by `quadrants[*].jock_raw` |
| Page position in file | Page 1 vs page 30: handwriting fatigues over a multi-week file | Stratify into early / mid / late thirds |
| Day-of-week + overnight vs daytime | Overnight slots are faster/sloppier; weekend jocks differ from weekday regulars | Bucket from `page_date_raw` + `hour_raw` |
| Form-era | Template revisions are a categorical variance axis | Cluster pages by `page_layout` outputs (`header_bottom_y`, `body_mid_y`, `column_mid_x`) on the extracted corpus |

## Anomaly bucket: what counts

Each anomaly type should be represented at least once per spot check:

- **Lowest-decile model confidence.** Weak signal (confidence is anti-calibrated per n=19), but still picks up some real hard cases.
- **`oddities` non-empty.** Two patterns are already documented: overlay annotations (`project_oddity_overlay_pattern`) and doodle-in-type-column (`project_oddity_doodle_entry`). Any year showing new oddities tags is worth flagging immediately.
- **Layout outliers.** Pages where `core/page_layout.py` reports values unusual relative to the year median — template-drift candidates.
- **Row-count outliers.** Pages emitting <p10 or >p90 rows for their hour-block. Both ends matter: too few = under-emit (the rare-but-costly failure mode per n=19); too many = spurious splits.
- **`type_raw` vocabulary outliers.** Pages where the type column emits letters outside the year's modal alphabet. Useful precisely because the rotation conventions change across years — this fires on legitimate drift, not just errors, and either case is worth verifying.
- **Cross-prompt disagreement.** Re-extract a page with two prompt variants; pages where they disagree on row count or content are the highest-leverage verification (active-learning principle).

## Free starting anomalies (1990)

The bbox sweep (`project_bbox_sweep_result`) named **42 strict-detector pathology candidates across 5 named bundles**. These should be in the 1990 verification set automatically — they're known unresolved anomalies the existing detector already surfaces. Re-run the same detector on each subsequent year; equivalent pathology lists become the anomaly bucket's seed for that year.

## Sample size and detectable drift

13 pages × ~30 rows ≈ 400 verified rows per spot-check year. Two-proportion z-test against the 1990 baseline (~1800 rows from 60 pages) can detect a row-edit-rate shift of roughly ±10 percentage points (76% → 66% or → 86%) at p<0.05. Smaller drifts are invisible — fine for a screen, but if we want to detect ±5pp shifts the spot check needs to double to ~25 pages.

Track each failure-mode metric separately rather than rolling them into one accuracy number. Per n=19, under-emit (3%) and row-edit (76%) have wildly different rates and wildly different fix costs; collapsing them hides drift in the rare-but-expensive failure modes.

## Anti-patterns

- Sampling only low-confidence pages. Biases all metrics; signal is weak.
- Sampling uniformly across months instead of across jocks. Misses the actual variance axis.
- Sampling only "interesting-looking" pages by eye. Confirmation bias.
- Mixing anomaly and random into one bucket. Destroys unbiased metrics.
- Sampling the same page position (first page, last page) across files. Page-position correlates with handwriting fatigue.

## Per-year recipe

1. Extract the year's pages with the current pipeline (no hand verification yet).
2. Compute stratification keys per page: jock, day-of-week, hour-block, page-position-third, form-era cluster.
3. Run anomaly detectors: layout outliers, row-count outliers, oddities-tagged, type-vocab outliers, bottom-decile confidence.
4. Draw 9 random-stratified pages reflecting the year's distribution.
5. Draw 4 anomaly pages — at least one from each detector category, no overlap with the random bucket.
6. Hand-verify all 13; compute metrics separately for random vs anomaly.
7. Random-bucket metrics outside the 1990 confidence interval → year fails drift; queue a full 60-page calibration for that year.
8. Anomaly-bucket findings feed prompt iteration regardless of drift verdict.

## Where

- `scripts/sample_year_for_drift.py` (new) — emits the per-year sample plan, two-bucket structure preserved.
- `core/page_layout.py` — reuse the projection-profile detector for both form-era clustering and layout-outlier anomaly detection.
- `data/verifier-pulled-refresh/` — the in-progress 60-page 1990 baseline this calibrates against.

## Constraints

- The 95% auto-accept precision targets used elsewhere (issue #62) are not blocked by drift testing — the drift gate is separate and runs against the row-text metric set.
- Hand-verification effort budget is the binding constraint: spot checks should be designed to fit in a single Alex session per year (~13 pages plausible; 25 pages stretch). Anything more triggers the "actually do the full 60" branch.
- Anomaly-bucket size scales with the failure-mode discovery rate, not coverage. Once a year's anomaly bucket stops surfacing new patterns, downgrade — don't grow it indefinitely.

## Acceptance criteria

- [ ] `scripts/sample_year_for_drift.py` emits a per-page list of `(page, bucket=random|anomaly, stratification keys, anomaly reasons)` for a given year without any hand verification.
- [ ] Stratification axes implemented: jock, day-of-week, page-position-third, form-era cluster.
- [ ] Anomaly detectors wired: layout outliers (via `page_layout`), row-count outliers, oddities-tagged, type-vocab outliers, lowest-decile confidence.
- [ ] Form-era cluster boundaries documented for 1990–2001 (output of `page_layout` cluster analysis on the full extracted corpus).
- [ ] Drift-test metric definitions match the 1990 baseline numerator/denominator (per-row edit rate, per-page under-emit rate, type_raw blank rate, F1 for continuation / double_height / crossed_out).
- [ ] Unit tests demonstrate the sampler stays within strata proportions on synthetic input and labels each output page with its bucket and anomaly reason.

## Out of scope for this ticket

- The full 60-page 1990 calibration itself (in progress separately).
- Sequential-testing / multiple-comparison correction across years. Add once we have ≥3 years of data and false-positive drift detection becomes a real risk.
- Track-grain reconciliation drift. Reconciliation calibration is keyed off the artist axis, which is decoupled from handwriting drift.
- Acting on the drift verdict (queueing a year's full 60-page calibration). The sampler emits the plan; running it is a separate workflow.

## Notes for implementer

- Treat the random and anomaly buckets as having different downstream consumers. The random bucket feeds the drift gate (statistical test). The anomaly bucket feeds prompt iteration (qualitative review). Keeping the labels distinct in the output file is non-negotiable.
- The form-era clustering should be done once across the full extracted corpus, not re-derived per year. A page's era is a fixed property of its scan; the per-year sampler reads the precomputed cluster id.
- The sampler is deterministic given a seed. Same year + same seed → same sample. This matters for reproducibility when the same year is sampled multiple times (e.g., after a prompt change shifts row-count outliers).
- The anomaly bucket should de-duplicate against the random bucket. If a randomly chosen page also matches an anomaly criterion, leave it in the random bucket and draw a different anomaly.

## Related

- `project_empirical_failure_modes_n19.md` — the 1990 baseline this calibrates against
- `project_bbox_sweep_result.md` — the 5 named bundles + 42 pathology candidates that seed the 1990 anomaly bucket
- `project_oddity_overlay_pattern.md`, `project_oddity_doodle_entry.md` — known oddity patterns
- #62 — Phase 2 reconciliation (parallel calibration, different axis)
- #43 — Gemini Flash calibration spike (separate calibration; cross-linked in case the Flash decision ships first and shifts the model under us)
- #39 — Sprint 3 quality long-tail recovery (this work fits inside that scope)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 2 calibration: per-year drift sampling design (stratified + anomaly buckets) #66

Context

Sources of drift across the corpus

Design: two buckets per year

Random-stratified bucket: what to stratify on

Anomaly bucket: what counts

Free starting anomalies (1990)

Sample size and detectable drift

Anti-patterns

Per-year recipe

Where

Constraints

Acceptance criteria

Out of scope for this ticket

Notes for implementer

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Axis	Why	How
Jock identity	Each jock has a distinct hand; cohort is the dominant within-year variance	Sample to match the year's airtime distribution by `quadrants[*].jock_raw`
Page position in file	Page 1 vs page 30: handwriting fatigues over a multi-week file	Stratify into early / mid / late thirds
Day-of-week + overnight vs daytime	Overnight slots are faster/sloppier; weekend jocks differ from weekday regulars	Bucket from `page_date_raw` + `hour_raw`
Form-era	Template revisions are a categorical variance axis	Cluster pages by `page_layout` outputs (`header_bottom_y`, `body_mid_y`, `column_mid_x`) on the extracted corpus

Phase 2 calibration: per-year drift sampling design (stratified + anomaly buckets) #66

Description

Context

Sources of drift across the corpus

Design: two buckets per year

Random-stratified bucket: what to stratify on

Anomaly bucket: what counts

Free starting anomalies (1990)

Sample size and detectable drift

Anti-patterns

Per-year recipe

Where

Constraints

Acceptance criteria

Out of scope for this ticket

Notes for implementer

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions