
PR A: Batch mode + context caching in core/gemini.py #40

@jakebromberg

Problem

core/gemini.py calls Gemini one page at a time at real-time pricing with no context caching. The prompt prefix (~2-3K tokens of PAGE_EXTRACTION_PROMPT plus the JSON response schema) is identical across the entire ~16K-page corpus — paying full input rate on every call wastes ~75% of that portion. Gemini also exposes a batch mode at ~50% of real-time pricing with a 24h SLA, perfectly suited to a one-shot corpus extraction.
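As a rough check on the ~75% figure, the prefix cost can be worked through in full-rate token equivalents. This is a back-of-the-envelope sketch, not a billing model: it assumes a 2,500-token prefix (the midpoint of the 2-3K estimate), a 75% discount on cached input tokens, and a first call that pays full rate. It ignores any cache storage charge. The function names are hypothetical.

```python
def prefix_cost_units(pages: int, prefix_tokens: int,
                      cached_discount: float, use_cache: bool) -> float:
    """Input cost of the shared prefix, in full-rate token equivalents."""
    if not use_cache or pages == 0:
        return float(pages * prefix_tokens)
    # Assumption: the first call pays full rate, later calls pay the cached rate.
    first_call = prefix_tokens
    later_calls = (pages - 1) * prefix_tokens * (1.0 - cached_discount)
    return first_call + later_calls


def prefix_savings_fraction(pages: int, prefix_tokens: int,
                            cached_discount: float) -> float:
    """Fraction of prefix cost saved by caching, relative to no caching."""
    full = prefix_cost_units(pages, prefix_tokens, cached_discount, use_cache=False)
    if full == 0:
        return 0.0
    cached = prefix_cost_units(pages, prefix_tokens, cached_discount, use_cache=True)
    return 1.0 - cached / full


# Illustrative inputs taken from the estimates above (assumed, not measured).
PAGES = 16_000
PREFIX_TOKENS = 2_500
CACHED_DISCOUNT = 0.75

savings = prefix_savings_fraction(PAGES, PREFIX_TOKENS, CACHED_DISCOUNT)
print(f"prefix savings: {savings:.2%}")
```

At corpus scale the single full-rate call is negligible, so the saving on the prefix portion approaches the discount itself. On a 10-page run the first call weighs more, which is why the sanity check is phrased "after the first call".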

End state

GeminiClient supports both optimizations:

  1. Context caching. A cachedContent resource is created at the start of a corpus run from PAGE_EXTRACTION_PROMPT + the response schema. Subsequent extract_page calls reference the cached prefix and only pay full rate for the image plus their own request specifics.
  2. Batch mode (opt-in). A batch=True path (constructor flag or sibling method) routes through Gemini's batch API and reconciles results back to the existing jobs.db state machine.

The real-time, non-cached path remains the default; both new modes are opt-in so dev iteration isn't affected.
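One possible shape for the opt-in cached path is sketched below. It is not the existing GeminiClient: the class name, the `generate` callable, and the `cache_name` attribute are hypothetical stand-ins, and the SDK call is injected so the branching can be read (and tested) without network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class GeminiClientSketch:
    """Hypothetical stand-in for GeminiClient; not the real wrapper."""

    # Injected callable standing in for the SDK's generate call (assumption).
    generate: Callable[..., Any]
    # The full extraction prompt, sent inline when no cache exists.
    prompt: str
    # Name of the cachedContent resource, or None for the default path.
    cache_name: Optional[str] = None

    def extract_page(self, image: bytes) -> Any:
        if self.cache_name is not None:
            # Cached path: the prefix lives in the cache, only the image is sent.
            return self.generate(contents=[image], cached_content=self.cache_name)
        # Default path: unchanged real-time call with the prompt inline.
        return self.generate(contents=[self.prompt, image])
```

Keeping `cache_name` optional means the default constructor reproduces current behaviour, so dev iteration does not change unless a cache is explicitly supplied.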

Where

  • core/gemini.py — the GeminiClient wrapper. New cache lifecycle methods + a batch submission path.
  • core/pipeline.py — corpus orchestration; creates the cache once at corpus-run start, polls batch jobs.
  • core/jobs.py — if batch mode needs an intermediate "submitted" status, add it without breaking the existing pending → rendered → processing → completed flow.
  • cli.py: --batch flag on the corpus-extract subcommand.
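If core/jobs.py does need a "submitted" status, one way to add it without disturbing the existing flow is an explicit transition table. The sketch below is an assumption about how that could look: the placement of "submitted" (reachable only from "rendered", leading straight to "completed") and the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, not taken from the current code.

```python
from typing import Dict, Set, Tuple

# The existing real-time flow, as listed in this issue.
REALTIME_FLOW: Tuple[str, ...] = ("pending", "rendered", "processing", "completed")

# Hypothetical transition table: "submitted" is an extra branch for batch mode.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"rendered"},
    "rendered": {"processing", "submitted"},  # batch path branches here
    "submitted": {"completed"},               # assumed: results land directly
    "processing": {"completed"},
    "completed": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True when moving from `current` to `new` is permitted."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def realtime_flow_is_intact() -> bool:
    """Check that every step of the existing flow is still allowed."""
    return all(
        can_transition(a, b) for a, b in zip(REALTIME_FLOW, REALTIME_FLOW[1:])
    )
```

A check like `realtime_flow_is_intact()` could serve as a regression test that the new status is purely additive.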

Constraints

  • Context caching has a minimum-size threshold (currently ~1024 tokens). Confirm the prompt+schema combined comfortably exceeds it. If not, skip caching gracefully.
  • Batch mode is async: submit → poll → fetch. The existing pipeline is sync per page; the batch path needs a polling loop bounded by SLA. Don't block dev iteration on the 24h SLA — keep real-time as the default.
  • Cache TTL is configurable. For a one-shot corpus run the default is fine; if/when we re-run periodically, surface the TTL knob.
  • Existing 34 corpus JSONs must remain valid (no schema mutation that would invalidate them).
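The first two constraints can be sketched as a threshold guard and a deadline-bounded polling loop. This is a minimal illustration under stated assumptions: the terminal state names are placeholders rather than the batch API's actual values, `get_state` stands in for whatever call fetches the job status, and the clock and sleep functions are injected so the loop can be tested without waiting.

```python
import time
from typing import Callable, FrozenSet

# Threshold quoted in the constraint above; confirm against current docs.
MIN_CACHE_TOKENS = 1024

# Placeholder terminal states (assumption, not the real API's enum values).
TERMINAL_STATES: FrozenSet[str] = frozenset({"succeeded", "failed", "cancelled"})


def should_use_cache(prefix_token_count: int,
                     min_tokens: int = MIN_CACHE_TOKENS) -> bool:
    """Skip caching gracefully when the prefix is below the minimum size."""
    return prefix_token_count >= min_tokens


def poll_until_done(get_state: Callable[[], str],
                    timeout_s: float,
                    interval_s: float,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Poll `get_state` until a terminal state or until the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"batch job still '{state}' after {timeout_s}s")
        sleep(interval_s)
```

For a real run, `timeout_s` would be set from the 24h SLA plus some margin, while tests can pass a fake clock and a no-wait sleep.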

Acceptance criteria

  • GeminiClient.extract_page uses cachedContent when a cache exists; falls back to the un-cached path cleanly when it doesn't.
  • Pipeline creates the cache once at corpus-run start; subsequent calls reuse it.
  • Batch mode is wired via opt-in flag; submission and polling complete end-to-end against a small (≤10-page) test set.
  • Sanity check: billed input-text tokens drop ~75% on a 10-page run (after the first call); batch run total cost is ~50% of the real-time baseline.
  • All existing tests pass; new tests cover the cache lifecycle and the batch submit/poll flow (mocked).
  • external_api-marker tests exercise both paths against the real API in the scheduled workflow.
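The "create once, reuse everywhere" criterion lends itself to a mocked test. The sketch below assumes a hypothetical `run_corpus` helper that takes the cache factory and the per-page extractor as arguments; neither that name nor its signature comes from core/pipeline.py.

```python
from typing import Any, Callable, Iterable, List, Optional
from unittest.mock import MagicMock


def run_corpus(pages: Iterable[str],
               create_cache: Callable[[], Optional[str]],
               extract_page: Callable[[str, Optional[str]], Any]) -> List[Any]:
    """Hypothetical orchestration: one cache per run, shared by every page."""
    cache_name = create_cache()  # called exactly once per corpus run
    return [extract_page(page, cache_name) for page in pages]


def test_cache_created_once_and_reused() -> None:
    create_cache = MagicMock(return_value="cachedContents/hypothetical")
    extract_page = MagicMock(side_effect=lambda page, cache: f"{page}:{cache}")

    results = run_corpus(["p1", "p2", "p3"], create_cache, extract_page)

    # The cache is created a single time, however many pages are processed.
    create_cache.assert_called_once()
    assert extract_page.call_count == 3
    # Every extraction call receives the same cache name.
    assert all(
        call.args[1] == "cachedContents/hypothetical"
        for call in extract_page.call_args_list
    )
    assert results[0] == "p1:cachedContents/hypothetical"
```

Because `create_cache` may return None, the same helper also covers the un-cached fallback without a separate code path.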

Notes for implementer

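  • The Gemini SDK exposes client.caches.create(model=..., system_instruction=..., contents=..., ttl=...) and the corresponding cached_content parameter on generate calls. Depending on the SDK version, some of these arguments may need to be passed through a config object rather than as direct keyword arguments, so check the installed version's signature. Response schema can stay attached to the cache.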
  • Batch mode uses a different endpoint (client.batches.create / equivalent). Responses come back as a single payload that you iterate over against your submission order; keep the order aligned with jobs.db rows.
  • Don't conflate the two features in tests — they're independent and one should be debuggable without the other.
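Keeping batch responses aligned with jobs.db rows can be sketched as a strict positional pairing. This assumes, as the note above says, that responses come back in submission order; `reconcile_batch` and its arguments are hypothetical names, and the length check is there so a short or padded payload fails loudly instead of silently shifting results onto the wrong rows.

```python
from typing import Any, Dict, Sequence


def reconcile_batch(job_ids: Sequence[int],
                    responses: Sequence[Any]) -> Dict[int, Any]:
    """Pair each submitted job id with the response at the same position."""
    if len(job_ids) != len(responses):
        raise ValueError(
            f"submitted {len(job_ids)} jobs but received {len(responses)} responses"
        )
    if len(set(job_ids)) != len(job_ids):
        raise ValueError("duplicate job ids in submission order")
    return dict(zip(job_ids, responses))
```

The resulting mapping can then drive the status updates row by row, independent of whether caching is enabled.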

Labels: enhancement
