PR F: Header/footer-only selective recall when those fields come back null #45

@jakebromberg

Problem

When Gemini returns page_date_raw=null or comments_raw=null but core.page_layout.detect_page_layout shows that the corresponding band (header strip / footer strip) has content, the model has most likely missed that specific field. Re-running the full page is overkill; re-calling Gemini against just the cropped strip is targeted and costs less than a cent.

End state

A post-extraction step in core/pipeline.py that, for each completed page:

  1. checks whether page_date_raw is null AND the header strip has detected content, and/or
  2. checks whether comments_raw is null AND the footer strip has detected content,

then, for each missing-but-likely-present field, calls Gemini against just the cropped strip with the existing HEADER_EXTRACTION_PROMPT / FOOTER_EXTRACTION_PROMPT. Each recovered value is merged into the PageResult and persisted.
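The trigger rule above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `LayoutBands`, its two boolean attributes, and `strips_to_recall` are hypothetical names, and the real return type of `detect_page_layout` may look different.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class LayoutBands:
    """Hypothetical stand-in for whatever detect_page_layout returns."""
    header_has_content: bool
    footer_has_content: bool


def strips_to_recall(
    page_date_raw: Optional[str],
    comments_raw: Optional[str],
    layout: Optional[LayoutBands],
) -> list[str]:
    """Return which strips deserve a recall: at most one header and one footer."""
    # Assumption: a failed layout detection is represented as None.
    # Without layout evidence we never recall.
    if layout is None:
        return []
    strips: list[str] = []
    # Both conditions must hold: the field is null AND the band has content.
    if page_date_raw is None and layout.header_has_content:
        strips.append("header")
    if comments_raw is None and layout.footer_has_content:
        strips.append("footer")
    return strips
```

Because the function can return each strip name only once, the cap of two recalls per page falls out of its shape rather than needing a separate counter.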

Where

  • core/pipeline.py — the recall step after the main extract.
  • core/gemini.py — a header-only / footer-only extract method (or a generic extract_strip(image_bytes, prompt, wire_schema) that takes the cropped strip and the prompt).
  • _crop_header_strip / _crop_footer_strip currently live in scripts/calibrate_models.py. This PR likely moves them to core/ (e.g. core/crops.py) since they'd now be used in production, not just calibration. Update Modal adapter imports accordingly.
  • core/prompts.py: HEADER_EXTRACTION_PROMPT and FOOTER_EXTRACTION_PROMPT already exist (built for modal-qwen-vl-quad); reuse as-is.
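One possible shape for the generic strip extractor is sketched below. The issue only proposes `extract_strip(image_bytes, prompt, wire_schema)`; the extra `call_model` and `field` parameters are assumptions added here so the sketch stays self-contained and testable without the Gemini SDK. `call_model` stands in for whatever core/gemini.py uses to send an image and prompt and get JSON text back.

```python
import json
from typing import Callable, Optional

# Hypothetical transport signature: (image_bytes, prompt, wire_schema) -> raw JSON text.
ModelCall = Callable[[bytes, str, dict], str]


def extract_strip(
    image_bytes: bytes,
    prompt: str,
    wire_schema: dict,
    call_model: ModelCall,
    field: str,
) -> Optional[str]:
    """Run one strip-only extraction and return the named field, or None."""
    try:
        raw = call_model(image_bytes, prompt, wire_schema)
        payload = json.loads(raw)
    except Exception:
        # Single attempt only: a failed recall is accepted as "still null".
        return None
    value = payload.get(field) if isinstance(payload, dict) else None
    # Assumption: an empty or whitespace-only string counts as null.
    if isinstance(value, str) and value.strip():
        return value
    return None
```

A usage example with a fake transport, which is also how a unit test could exercise it:

```python
fake = lambda image, prompt, schema: '{"page_date_raw": "May 3, 1987"}'
extract_strip(b"...", "header prompt", {}, fake, "page_date_raw")
```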

Constraints

  • Recall fires only when BOTH conditions hold: field is null AND layout shows content in that band. False positives are expensive (extra API calls); err on the side of not recalling.
  • Cap recalls at 2 per page (one header + one footer). No retry on recall failures — if the strip-only call also returns null, accept it.
  • Persist the recovered field by writing back to the on-disk JSON. The 34 existing corpus JSONs predate this step, so they aren't candidates for the recall pass.
  • Add a recovered_from_strip_recall flag (or similar metadata) on the page so downstream consumers can distinguish "Gemini got this on the first try" from "we had to recall." Useful for quality analysis.
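The write-back and flagging constraints might look something like the sketch below. It assumes the on-disk page is a flat JSON object keyed by field name, and it stores recovered_from_strip_recall as a list of recovered field names rather than a boolean; both are illustrative choices, and `merge_recalled_fields` is a hypothetical helper name.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def merge_recalled_fields(page_path: Path, recovered: dict[str, Optional[str]]) -> dict:
    """Merge strip-recall results into the page JSON and persist it."""
    page = json.loads(page_path.read_text(encoding="utf-8"))
    applied = []
    for field, value in recovered.items():
        # Only fill fields that are still null; never overwrite a first-pass value.
        if value is not None and page.get(field) is None:
            page[field] = value
            applied.append(field)
    if not applied:
        # Nothing recovered: leave the file untouched so non-recall pages do not change.
        return page
    page["recovered_from_strip_recall"] = sorted(applied)
    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=page_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(page, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, page_path)
    return page
```

Recording the field names instead of a bare `True` lets quality analysis separate header recoveries from footer recoveries without re-deriving them.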

Acceptance criteria

  • Recall triggers only on null-field + layout-content-detected pages.
  • Recovered value is merged into the on-disk PageResult.
  • No regression on pages that don't trigger the recall.
  • Unit tests cover: no recall needed, header-only recall, footer-only recall, both recalls, recall returns null again (graceful accept), layout detection fails (don't recall).
  • Integration test confirms the recovered page is queryable just like a normally-extracted one.

Notes for implementer

The Modal-quad work built HEADER_EXTRACTION_PROMPT and FOOTER_EXTRACTION_PROMPT along with a parallel set of wire schemas (HEADER_WIRE_SCHEMA, FOOTER_WIRE_SCHEMA). For Gemini, you can either reuse the same prompts and rely on Gemini's response_schema (the natural Gemini path) or define a tiny header/footer-only Pydantic model just for these recalls. The latter mirrors the GeminiPageResult/PageResult discipline.
