Problem
core/gemini.py calls Gemini one page at a time at real-time pricing with no context caching. The prompt prefix (~2-3K tokens of PAGE_EXTRACTION_PROMPT plus the JSON response schema) is identical across the entire ~16K-page corpus, so paying the full input rate for it on every call wastes ~75% of the spend on that portion. Gemini also exposes a batch mode at ~50% of real-time pricing with a 24h SLA, which suits a one-shot corpus extraction well.
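To make the arithmetic concrete, here is a rough sketch of the prefix-token volume involved. The 2,500-token prefix is an assumed midpoint of the ~2-3K estimate, and the 0.25 multiplier is an assumption inferred from the ~75% figure rather than a quoted price. Cache storage cost and the interaction between caching and batch discounts are not modeled.

```python
def billed_prefix_tokens(pages: int, prefix_tokens: int, rate_multiplier: float = 1.0) -> float:
    """Prefix tokens billed across the corpus, in full-rate-equivalent tokens."""
    return pages * prefix_tokens * rate_multiplier


PAGES = 16_000            # approximate corpus size from the issue
PREFIX_TOKENS = 2_500     # assumed midpoint of the ~2-3K estimate
CACHED_MULTIPLIER = 0.25  # assumption: cached tokens bill at ~25% of the input rate

uncached = billed_prefix_tokens(PAGES, PREFIX_TOKENS)
cached = billed_prefix_tokens(PAGES, PREFIX_TOKENS, CACHED_MULTIPLIER)
saved_fraction = 1 - cached / uncached

print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}  saved: {saved_fraction:.0%}")
```

Under these assumptions the prefix alone accounts for about 40M input tokens per corpus run, of which roughly 30M full-rate-equivalent tokens are avoidable.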
End state
GeminiClient supports both optimizations:
- Context caching. A cachedContent resource is created at the start of a corpus run from PAGE_EXTRACTION_PROMPT + the response schema. Subsequent extract_page calls reference the cached prefix and only pay full rate for the image plus their own request specifics.
- Batch mode (opt-in). A batch=True path (constructor flag or sibling method) routes through Gemini's batch API and reconciles results back to the existing jobs.db state machine.
The real-time, non-cached path remains the default; both new modes are opt-in so dev iteration isn't affected.
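One possible shape for the opt-in caching path is sketched below. It deliberately avoids the real SDK: create_cache and generate are hypothetical injected callables standing in for the SDK calls, and GeminiClientSketch, use_cache, and start_corpus_run are illustrative names, not the existing GeminiClient API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-ins for the real SDK calls (illustrative signatures only).
CreateCache = Callable[[str], str]  # prompt prefix -> cache name
Generate = Callable[[bytes, Optional[str], Optional[str]], dict]  # (image, prompt, cache_name)


@dataclass
class GeminiClientSketch:
    prompt_prefix: str
    create_cache: CreateCache
    generate: Generate
    use_cache: bool = False           # opt-in; real-time, un-cached stays the default
    cache_name: Optional[str] = None

    def start_corpus_run(self) -> None:
        """Create the cache once per run; on failure, stay on the un-cached path."""
        if not self.use_cache:
            return
        try:
            self.cache_name = self.create_cache(self.prompt_prefix)
        except Exception:
            # Broad catch is a sketch-level simplification: any failure means no cache.
            self.cache_name = None

    def extract_page(self, image: bytes) -> dict:
        if self.cache_name is not None:
            # Cached path: the prefix is referenced by name, not re-sent.
            return self.generate(image, None, self.cache_name)
        # Default path: send the full prompt with every request.
        return self.generate(image, self.prompt_prefix, None)


# Usage with a fake backend that records what each call would send.
calls = []


def fake_generate(image, prompt, cache_name):
    calls.append((prompt, cache_name))
    return {"ok": True}


client = GeminiClientSketch("PROMPT", lambda prefix: "caches/demo", fake_generate, use_cache=True)
client.start_corpus_run()
client.extract_page(b"png-bytes")
```

Injecting the backend keeps the cached and un-cached paths testable without network access, which also fits the goal of debugging each feature independently.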
Where
- core/gemini.py: the GeminiClient wrapper. New cache lifecycle methods + a batch submission path.
- core/pipeline.py: corpus orchestration; creates the cache once at corpus-run start, polls batch jobs.
- core/jobs.py: if batch mode needs an intermediate "submitted" status, add it without breaking the existing pending → rendered → processing → completed flow.
- cli.py: --batch flag on the corpus-extract subcommand.
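As an illustration of the jobs.py item above, the sketch below adds a "submitted" status as a batch-only detour while leaving the existing flow intact. The enum, the transition table, and the placement of "submitted" between rendered and processing are all assumptions; the actual jobs.db schema and any failure states are not shown in this issue.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    RENDERED = "rendered"
    SUBMITTED = "submitted"    # assumed new status, used only by the batch path
    PROCESSING = "processing"
    COMPLETED = "completed"


# The real-time flow is unchanged; batch mode adds one optional hop via SUBMITTED.
ALLOWED = {
    JobStatus.PENDING: {JobStatus.RENDERED},
    JobStatus.RENDERED: {JobStatus.PROCESSING, JobStatus.SUBMITTED},
    JobStatus.SUBMITTED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED},
    JobStatus.COMPLETED: set(),
}


def can_transition(current: JobStatus, new: JobStatus) -> bool:
    """Return True if moving a job from `current` to `new` is permitted."""
    return new in ALLOWED[current]
```

Because the enum values are plain strings, rows already stored with the four existing statuses would still parse under this scheme.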
Constraints
- Context caching has a minimum-size threshold (currently ~1024 tokens). Confirm the prompt+schema combined comfortably exceeds it. If not, skip caching gracefully.
- Batch mode is async: submit → poll → fetch. The existing pipeline is sync per page; the batch path needs a polling loop bounded by SLA. Don't block dev iteration on the 24h SLA — keep real-time as the default.
- Cache TTL is configurable. For a one-shot corpus run the default is fine; if/when we re-run periodically, surface the TTL knob.
- Existing 34 corpus JSONs must remain valid (no schema mutation that would invalidate them).
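The first two constraints above could be handled roughly as follows. This is a stdlib-only sketch: get_state is a hypothetical callable wrapping whatever the batch API returns, the terminal state names are placeholders rather than the API's real values, and the 1024-token threshold is the approximate figure quoted above, which should be verified for the model in use.

```python
import time
from typing import Callable, Sequence

MIN_CACHE_TOKENS = 1024  # approximate threshold from the constraint above


def should_cache(prefix_token_count: int, min_tokens: int = MIN_CACHE_TOKENS) -> bool:
    """Skip caching when the prompt+schema prefix is too small to be cacheable."""
    return prefix_token_count >= min_tokens


def poll_batch(
    get_state: Callable[[], str],
    deadline_s: float,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until a terminal state is seen, or return "timed_out" at the deadline."""
    terminal = {"succeeded", "failed", "cancelled"}  # hypothetical state names
    start = clock()
    while True:
        state = get_state()
        if state in terminal:
            return state
        if clock() - start >= deadline_s:
            return "timed_out"
        sleep(interval_s)


def reconcile(job_ids: Sequence[str], responses: Sequence[dict]) -> dict:
    """Pair responses with jobs by submission order; refuse mismatched lengths."""
    if len(job_ids) != len(responses):
        raise ValueError("response count does not match submitted jobs")
    return dict(zip(job_ids, responses))
```

Passing sleep and clock as parameters lets tests drive the loop with a fake clock instead of waiting on the real SLA window.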
Acceptance criteria
- GeminiClient.extract_page uses cachedContent when a cache exists; falls back to the un-cached path cleanly when it doesn't.
- external_api-marker tests exercise both paths against the real API in the scheduled workflow.
Notes for implementer
- The Gemini SDK exposes client.caches.create(model=..., system_instruction=..., contents=..., ttl=...) and the corresponding cached_content parameter on generate calls. Response schema can stay attached to the cache.
- Batch mode uses a different endpoint (client.batches.create / equivalent). Responses come back as a single payload that you iterate over against your submission order; keep the order aligned with jobs.db rows.
- Don't conflate the two features in tests: they're independent and one should be debuggable without the other.
Related