Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,12 @@ cli.py Typer entrypoint: `flowsheets <subcommand>`.
| Reconciliation against `@wxyc/shared` canonical artists | — | fuzzy-match + auto-correct |
| Bulk full-corpus run | not in this PR | calibrate first, then schedule |

## Research adapters

Production extraction goes through Gemini in `core/gemini.py`. The non-Gemini adapters in `scripts/calibrate_models.py` (`churro`, `qwen-vl`, `modal-churro`, `modal-qwen-vl`, `modal-qwen-vl-quad`, `local-quadrant-smoke`) are research-only: they exist as calibration regression alarms so a Gemini-side regression has something to compare against, and as a record of what we've tried. None of them are production candidates.

Quality on the 5-golden set (matched rows out of 76): Gemini 3 Pro 68/76; modal-qwen-vl-quad 50/76; modal-qwen-vl 50/76 (with grammar-constrained decoding, 24/76 before); modal-churro 36/76. The best Modal adapter sits ~24% behind Gemini on quality while costing ~6× more per page (`modal-qwen-vl-quad` ≈ $0.06-0.12/page). Any dollar figure quoted in `scripts/calibrate_models.py` for a Modal adapter at corpus scale (e.g. "~$1200-1800") is exploration spend if we calibrated it on the full corpus — it is not a planned spend.

## Marker scheme

Mirroring `request-o-matic` and `library-metadata-lookup`:
Expand Down
57 changes: 37 additions & 20 deletions scripts/calibrate_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,43 +8,60 @@
The pure scoring logic lives in `core/calibration.py` and is unit-tested.
This file is the CLI + adapter wiring.

Production vs. research adapters:

Production extraction goes through Gemini (`core/gemini.py`). The
Modal adapters below (`modal-churro`, `modal-qwen-vl`,
`modal-qwen-vl-quad`) and the local Qwen/Churro adapters are kept
here as RESEARCH adapters — they're calibration regression alarms,
not a candidate production path. The best Modal adapter scored
50/76 matched rows on the 5-golden set; Gemini 3 Pro scored 68/76 on
the same set. Any cost band quoted below for a Modal adapter is
research/exploration spend if calibrated at corpus scale, not a
planned production spend.

Models:

gemini-stored Read previously-saved Gemini results from `data/results/`.
No API call. Use this as the baseline to compare against.
Maps `<stem>.png` to `data/results/**/<stem>.json` when
the stem encodes `<pdf-stem>-page<NN>` (the convention
the golden README documents).
the golden README documents). Production parity check.

churro Run stanford-oval/churro-3B locally via transformers.
Requires: `pip install transformers torch pillow`.
~3B params, fp16 ≈ 6-8GB RAM/VRAM.
churro [research] Run stanford-oval/churro-3B locally via
transformers. Requires: `pip install transformers torch
pillow`. ~3B params, fp16 ≈ 6-8GB RAM/VRAM.

qwen-vl Run Qwen2.5-VL-7B-Instruct locally via transformers.
Requires: `pip install transformers torch pillow`.
For 4-bit quantized variants, use `--qwen-model` to
point at an Unsloth checkpoint.
qwen-vl [research] Run Qwen2.5-VL-7B-Instruct locally via
transformers. Requires: `pip install transformers torch
pillow`. For 4-bit quantized variants, use `--qwen-model`
to point at an Unsloth checkpoint.

modal-churro Same as `churro` but the inference runs on a remote
A100 via Modal. ~30-60s per page, no local RAM use.
Requires: `pip install modal && modal token new`.
modal-churro [research] Same as `churro` but the inference runs on a
remote A100 via Modal. ~30-60s per page, no local RAM use.
Requires: `pip install modal && modal token new`. Scored
36/76 on the 5 goldens vs Gemini's 68/76.

modal-qwen-vl Same as `qwen-vl`, on Modal A100. Pricing reference:
one golden page ~$0.01-0.02; full 18K-page corpus
run ~$100-200.
modal-qwen-vl [research] Same as `qwen-vl`, on Modal A100. Per-page
spend reference: ~$0.01-0.02. Quality: scored 24/76
pre-grammar, 50/76 with grammar-constrained decoding
(still below Gemini's 68/76).

modal-qwen-vl-quad
Per-quadrant Qwen-VL on Modal: crops the page into
4 sub-images plus a header strip and a footer strip,
[research] Per-quadrant Qwen-VL on Modal: crops the page
into 4 sub-images plus a header strip and a footer strip,
calls the model 6x per page, assembles a PageResult
locally. Eliminates cross-quadrant content placement
errors that the single-shot `modal-qwen-vl` adapter
still suffers. ~6x cost (~$0.06-0.12/page); full
corpus ~$1200-1800.
still suffers. ~6x cost (~$0.06-0.12/page). Quality:
50/76 on the 5 goldens, still below Gemini's 68/76 —
the structural fix didn't close the gap. An 18K-page
corpus run would cost ~$1200-1800 of *exploration*
spend; this is not a planned spend.

local-quadrant-smoke
Local-only crop-quality smoke check: runs Churro
(already on disk) per quadrant on the SAME crops
[research] Local-only crop-quality smoke check: runs
Churro (already on disk) per quadrant on the SAME crops
`modal-qwen-vl-quad` produces, then wraps each crop's
raw OCR text into the matching `Quadrant`. Pair with
--check-row-counts to fail fast on broken cropping
Expand Down
Loading