Problem
When Gemini returns page_date_raw=null or comments_raw=null but core.page_layout.detect_page_layout shows the corresponding band (header strip / footer strip) has content, the model has demonstrably missed something specific. Re-running the full page is overkill; re-calling Gemini against just the cropped strip is sub-cent and targeted.
End state
A post-extraction step in core/pipeline.py that, for each completed page:
- checks whether page_date_raw is null AND the header strip has detected content, and/or
- checks whether comments_raw is null AND the footer strip has detected content,
then, for each missing-but-likely-present field, calls Gemini against just the cropped strip with the existing HEADER_EXTRACTION_PROMPT / FOOTER_EXTRACTION_PROMPT. The recovered value is merged into the PageResult and persisted.
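The decision described above could be sketched roughly as follows. PageFields and BandContent are hypothetical stand-ins: the real PageResult and the real return shape of detect_page_layout are not shown in this issue, so the attribute names header_has_content and footer_has_content are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageFields:
    """Hypothetical stand-in for the two PageResult fields that matter here."""
    page_date_raw: Optional[str] = None
    comments_raw: Optional[str] = None


@dataclass
class BandContent:
    """Hypothetical summary of what detect_page_layout reports per band."""
    header_has_content: bool = False
    footer_has_content: bool = False


def strips_to_recall(page: PageFields, layout: BandContent) -> List[str]:
    """Return the strips worth a targeted recall: at most one header, one footer."""
    strips: List[str] = []
    # Recall only when the field is null AND the band shows content.
    if page.page_date_raw is None and layout.header_has_content:
        strips.append("header")
    if page.comments_raw is None and layout.footer_has_content:
        strips.append("footer")
    return strips
```

Because each strip is checked exactly once, the result can never contain more than two entries, which keeps the per-page recall count bounded by construction.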
Where
core/pipeline.py — the recall step after the main extract.
core/gemini.py — a header-only / footer-only extract method (or a generic extract_strip(image_bytes, prompt, wire_schema) that takes the cropped strip and the prompt).
_crop_header_strip / _crop_footer_strip currently live in scripts/calibrate_models.py. This PR likely moves them to core/ (e.g. core/crops.py) since they'd now be used in production, not just calibration. Update Modal adapter imports accordingly.
core/prompts.py — HEADER_EXTRACTION_PROMPT and FOOTER_EXTRACTION_PROMPT already exist (built for modal-qwen-vl-quad); reuse as-is.
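One possible shape for the generic extract_strip mentioned above, as a sketch only. The field and call_model parameters are additions for illustration and are not part of the signature proposed in this issue; call_model is a hypothetical injected transport standing in for whatever core/gemini.py actually uses to reach Gemini, so no real client API is assumed here.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: takes (image_bytes, prompt, wire_schema) and returns
# the model's raw JSON text. The real Gemini client would be wired in here.
ModelCall = Callable[[bytes, str, Dict[str, Any]], str]


def extract_strip(
    image_bytes: bytes,
    prompt: str,
    wire_schema: Dict[str, Any],
    field: str,
    call_model: ModelCall,
) -> Optional[str]:
    """Run one strip-only extraction and return the requested field, or None."""
    try:
        raw = call_model(image_bytes, prompt, wire_schema)
        payload = json.loads(raw)
    except Exception:
        # Single attempt only: a failed call is treated the same as a null.
        return None
    if not isinstance(payload, dict):
        return None
    value = payload.get(field)
    # Treat empty or whitespace-only strings as "nothing recovered".
    if isinstance(value, str) and value.strip():
        return value
    return None
```

Injecting the transport keeps the function testable without network access; the header and footer paths would differ only in the prompt, schema, and field name passed in.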
Constraints
- Recall fires only when BOTH conditions hold: field is null AND layout shows content in that band. False positives are expensive (extra API calls); err on the side of not recalling.
- Cap recalls at 2 per page (one header + one footer). No retry on recall failures — if the strip-only call also returns null, accept it.
- Persist the recovered field by writing back to the on-disk JSON. The existing 34 corpus JSONs predate this change and are not candidates for the recall pass.
- Add a recovered_from_strip_recall flag (or similar metadata) on the page so downstream consumers can distinguish "Gemini got this on the first try" from "we had to recall." Useful for quality analysis.
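A minimal sketch of the merge-and-persist constraints above, operating on plain dicts rather than the real PageResult. Storing recovered_from_strip_recall as a list of field names is an assumption (a boolean would also satisfy the constraint), and merge_recall and persist_page are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def merge_recall(
    page: Dict[str, Any], recovered: Dict[str, Optional[str]]
) -> Dict[str, Any]:
    """Return a copy of the page with recovered fields merged in and flagged."""
    merged = dict(page)
    recalled = list(merged.get("recovered_from_strip_recall", []))
    for field, value in recovered.items():
        # Never overwrite a first-pass value; a null recall is accepted as final.
        if value is not None and merged.get(field) is None:
            merged[field] = value
            recalled.append(field)
    merged["recovered_from_strip_recall"] = recalled
    return merged


def persist_page(path: Path, page: Dict[str, Any]) -> None:
    """Write the page back to its on-disk JSON file."""
    path.write_text(
        json.dumps(page, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

An empty list then means "Gemini got everything on the first try", while a non-empty one names exactly which fields came from a strip recall.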
Acceptance criteria
Notes for implementer
The Modal-quad work built HEADER_EXTRACTION_PROMPT and FOOTER_EXTRACTION_PROMPT and a parallel set of wire schemas (HEADER_WIRE_SCHEMA, FOOTER_WIRE_SCHEMA). For Gemini, you can either reuse the same prompts and rely on Gemini's response_schema (the natural Gemini path) or define a tiny header/footer-only Pydantic model just for these recalls. The latter mirrors GeminiPageResult/PageResult discipline.
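If the second option is chosen, the recall-only models might look something like the sketch below. HeaderRecall, FooterRecall, and parse_header are hypothetical names, and the fields are assumed to be optional strings matching page_date_raw and comments_raw; the actual GeminiPageResult definitions are not shown in this issue and may differ.

```python
import json
from typing import Optional

from pydantic import BaseModel


class HeaderRecall(BaseModel):
    """Hypothetical header-only result: just the field the recall targets."""
    page_date_raw: Optional[str] = None


class FooterRecall(BaseModel):
    """Hypothetical footer-only result."""
    comments_raw: Optional[str] = None


def parse_header(raw_json: str) -> HeaderRecall:
    """Validate the strip-only response text into the header model."""
    # Constructing from a dict works under both Pydantic v1 and v2.
    return HeaderRecall(**json.loads(raw_json))
```

Keeping each model to a single optional field means a null from the strip-only call validates cleanly and can be accepted as the final answer.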
Related