Skip to content

Sprint 3: Quality long-tail recovery (optional) #39

@jakebromberg

Description

@jakebromberg

Goal

Recover quality on the ~5-10% of pages where Gemini undercounts rows or misses page-level fields, using the model-agnostic infrastructure built during the Modal explorations (grid-line detection, per-quadrant cropping, row-count discrepancy, header/footer strip prompts). Optional sprint — gating decision happens after Sprint 1+2 ship and the corpus output has been inspected.

Cost impact

Selective recoveries only: rough estimate is ~20-30% increase on the Sprint 1+2 corpus baseline, since only the flagged subset gets the per-quadrant 5-6× treatment.

Sprint scope

Type Issue Effort
PR D Lift compare_row_counts from calibration to a production-time check ~1 day
PR E gemini-quad adapter for selective re-extraction (depends on PR D) ~1-2 days
PR F Header/footer-only selective recall when those fields come back null ~half day

Success criteria

  • PR D's check fires on every completed page; flagged-pages list is produced as a side product of the corpus run.
  • PR E's gemini-quad adapter runs only on flagged pages and recovers measurably better row coverage on the 5 goldens vs single-shot Gemini.
  • PR F's selective recall fires only when (a) the corresponding field is null AND (b) detect_page_layout shows content in that band.
  • No regression on pages that don't trigger any recovery path.

Sub-issues

Dependencies

Sprints 1 and 2 should ship first. The decision to run Sprint 3 at all should be informed by inspecting post-Sprint-2 corpus output — if the quality is already acceptable, the 20-30% cost premium for long-tail recovery may not be worth it.

Out of scope

  • Default-path per-quadrant cropping. Running every page through 5-6 API calls would multiply corpus cost without commensurate quality gain on Gemini's failure distribution (which isn't primarily layout misplacement, per the 5-golden calibration).
  • Phase 2 schema enhancements (date normalization, type column normalization, library reconciliation).

Context

The Modal explorations produced a lot of model-agnostic page-shape-aware infrastructure: core.page_layout.detect_page_layout, _crop_quadrants / _crop_header_strip / _crop_footer_strip in scripts/calibrate_models.py, the row-count discrepancy logic in core/golden.py, and the HEADER_EXTRACTION_PROMPT / FOOTER_EXTRACTION_PROMPT constants in core/prompts.py. This sprint lifts that infrastructure from "calibration only" to "production tools," reusing it with Gemini as the transcriber.

Related

  • Sprint 1 — Cost wins (upstream baseline).
  • Sprint 2 — Flash tier decision (upstream baseline).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions