WXYC · jakebromberg · May 11, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -71,6 +71,12 @@ cli.py                           Typer entrypoint: `flowsheets <subcommand>`.
 | Reconciliation against `@wxyc/shared` canonical artists | — | fuzzy-match + auto-correct |
 | Bulk full-corpus run | not in this PR | calibrate first, then schedule |
 
+## Research adapters
+
+Production extraction goes through Gemini in `core/gemini.py`. The non-Gemini adapters in `scripts/calibrate_models.py` (`churro`, `qwen-vl`, `modal-churro`, `modal-qwen-vl`, `modal-qwen-vl-quad`, `local-quadrant-smoke`) are research-only: they exist as calibration regression alarms so a Gemini-side regression has something to compare against, and as a record of what we've tried. None of them are production candidates.
+
+Quality on the 5-golden set (matched rows out of 76): Gemini 3 Pro 68/76; modal-qwen-vl-quad 50/76; modal-qwen-vl 50/76 (with grammar-constrained decoding, 24/76 before); modal-churro 36/76. The best Modal adapter sits ~24% behind Gemini on quality while costing ~6× more per page (`modal-qwen-vl-quad` ≈ $0.06-0.12/page). Any dollar figure quoted in `scripts/calibrate_models.py` for a Modal adapter at corpus scale (e.g. "~$1200-1800") is exploration spend if we calibrated it on the full corpus — it is not a planned spend.
+
 ## Marker scheme
 
 Mirroring `request-o-matic` and `library-metadata-lookup`:

diff --git a/scripts/calibrate_models.py b/scripts/calibrate_models.py
@@ -8,43 +8,60 @@
 The pure scoring logic lives in `core/calibration.py` and is unit-tested.
 This file is the CLI + adapter wiring.
 
+Production vs. research adapters:
+
+  Production extraction goes through Gemini (`core/gemini.py`). The
+  Modal adapters below (`modal-churro`, `modal-qwen-vl`,
+  `modal-qwen-vl-quad`) and the local Qwen/Churro adapters are kept
+  here as RESEARCH adapters — they're calibration regression alarms,
+  not a candidate production path. The best Modal adapter scored
+  50/76 matched rows on the 5-golden set; Gemini 3 Pro scored 68/76 on
+  the same set. Any cost band quoted below for a Modal adapter is
+  research/exploration spend if calibrated at corpus scale, not a
+  planned production spend.
+
 Models:
 
   gemini-stored    Read previously-saved Gemini results from `data/results/`.
                    No API call. Use this as the baseline to compare against.
                    Maps `<stem>.png` to `data/results/**/<stem>.json` when
                    the stem encodes `<pdf-stem>-page<NN>` (the convention
-                   the golden README documents).
+                   the golden README documents). Production parity check.
 
-  churro           Run stanford-oval/churro-3B locally via transformers.
-                   Requires: `pip install transformers torch pillow`.
-                   ~3B params, fp16 ≈ 6-8GB RAM/VRAM.
+  churro           [research] Run stanford-oval/churro-3B locally via
+                   transformers. Requires: `pip install transformers torch
+                   pillow`. ~3B params, fp16 ≈ 6-8GB RAM/VRAM.
 
-  qwen-vl          Run Qwen2.5-VL-7B-Instruct locally via transformers.
-                   Requires: `pip install transformers torch pillow`.
-                   For 4-bit quantized variants, use `--qwen-model` to
-                   point at an Unsloth checkpoint.
+  qwen-vl          [research] Run Qwen2.5-VL-7B-Instruct locally via
+                   transformers. Requires: `pip install transformers torch
+                   pillow`. For 4-bit quantized variants, use `--qwen-model`
+                   to point at an Unsloth checkpoint.
 
-  modal-churro     Same as `churro` but the inference runs on a remote
-                   A100 via Modal. ~30-60s per page, no local RAM use.
-                   Requires: `pip install modal && modal token new`.
+  modal-churro     [research] Same as `churro` but the inference runs on a
+                   remote A100 via Modal. ~30-60s per page, no local RAM use.
+                   Requires: `pip install modal && modal token new`. Scored
+                   36/76 on the 5 goldens vs Gemini's 68/76.
 
-  modal-qwen-vl    Same as `qwen-vl`, on Modal A100. Pricing reference:
-                   one golden page ~$0.01-0.02; full 18K-page corpus
-                   run ~$100-200.
+  modal-qwen-vl    [research] Same as `qwen-vl`, on Modal A100. Per-page
+                   spend reference: ~$0.01-0.02. Quality: scored 24/76
+                   pre-grammar, 50/76 with grammar-constrained decoding
+                   (still below Gemini's 68/76).
 
   modal-qwen-vl-quad
-                   Per-quadrant Qwen-VL on Modal: crops the page into
-                   4 sub-images plus a header strip and a footer strip,
+                   [research] Per-quadrant Qwen-VL on Modal: crops the page
+                   into 4 sub-images plus a header strip and a footer strip,
                    calls the model 6x per page, assembles a PageResult
                    locally. Eliminates cross-quadrant content placement
                    errors that the single-shot `modal-qwen-vl` adapter
-                   still suffers. ~6x cost (~$0.06-0.12/page); full
-                   corpus ~$1200-1800.
+                   still suffers. ~6x cost (~$0.06-0.12/page). Quality:
+                   50/76 on the 5 goldens, still below Gemini's 68/76 —
+                   the structural fix didn't close the gap. An 18K-page
+                   corpus run would cost ~$1200-1800 of *exploration*
+                   spend; this is not a planned spend.
 
   local-quadrant-smoke
-                   Local-only crop-quality smoke check: runs Churro
-                   (already on disk) per quadrant on the SAME crops
+                   [research] Local-only crop-quality smoke check: runs
+                   Churro (already on disk) per quadrant on the SAME crops
                    `modal-qwen-vl-quad` produces, then wraps each crop's
                    raw OCR text into the matching `Quadrant`. Pair with
                    --check-row-counts to fail fast on broken cropping