Skip to content

infer.py: CLI flags for ngram params, --resume, --results_jsonl, --max_pages/images, mode validation, tempdir cleanup#21

Open
shreyasmene06 wants to merge 1 commit into
baidu:mainfrom
shreyasmene06:fix/infer-py-ergonomics
Open

infer.py: CLI flags for ngram params, --resume, --results_jsonl, --max_pages/images, mode validation, tempdir cleanup#21
shreyasmene06 wants to merge 1 commit into
baidu:mainfrom
shreyasmene06:fix/infer-py-ergonomics

Conversation

@shreyasmene06

Copy link
Copy Markdown

Fixes #17.

Summary

One focused PR covering all 6 bugs + 3 features from the issue. All changes are scoped to infer.py and README.md; no model or upstream dependency touched.

Bug fixes

  • Hardcoded ngram params (infer.py:197-202 in the old code): ngram_size and ngram_window were both module-level constants (35 and 128). The README's infer_multi example uses ngram_window=1024 for multi-page. Now both are CLI flags; default is 1024 to match the multi-page recommendation, and --ngram_size 0 disables the custom logit processor entirely.
  • --image_mode gundam silently accepted in PDF mode: build_jobs now raises ValueError with a clear message. Also changed the default from gundam to base, since multi-page / dataset use is more common in batch runs and gundam was the silent footgun.
  • pdf_to_images leaked /tmp/pdf_ocr_*/: now uses tempfile.TemporaryDirectory, returned to the caller so run() can cleanup() in a finally block (so cleanup happens even on crash).
  • Failed requests wrote empty .md files: collect_stream_silent now takes a write_output flag; only successful requests get an output file. Downstream consumers can now tell tokens > 0 from failed.
  • Dataset images sorted by descending file size: changed to alphabetical. The previous ordering had no semantic justification and made logs confusing.
  • README kernels version mismatch: line 129 said 0.9.0, line 135 installed 0.11.7. Updated prose to match the install line.

New features

  • --resume: skip images / pages whose .md already exists and is non-empty. Empty files are retried (so a previous crash mid-write doesn't get stuck).
  • --results_jsonl <path>: one structured record per request with index, name, status, tokens, decode_time_s, wall_time_s, output_file, error. Flushed after every record so a crash doesn't lose results.
  • --max_pages N (PDF) and --max_images N (image_dir): subset caps for quick smoke tests.
  • run() exits with status 1 when every request failed, so pipelines / CI can detect batch failures. Previously a total-failure run exited 0.
  • Run summary now reports ok=N, failed=M, skipped=K instead of just N/M.

Behavior changes worth noting

  • Default --image_mode is now base, not gundam. This matches the README's stronger recommendation (multi-page / dataset use case) and prevents the silent PDF+gundam footgun. Single-image users running gundam explicitly via the README code will be unaffected; the change only affects the default for infer.py.
  • Default --ngram_window is now 1024, not 128. This matches the README's infer_multi setting. The old 128 was appropriate for single image but produced worse multi-page output.

If either of these default flips is unwanted, both are easy to swap back — happy to change either before merge.

Testing

  • All new CLI flags are accepted by argparse (verified via --help).
  • uv tool run --from ruff ruff check infer.py → all checks passed.
  • python -m py_compile infer.py → OK.
  • 7 offline tests via mocked fitz/sglang:
    • tempdir lifecycle (cleanup works on .cleanup())
    • max_pages cap
    • tempdir returned to caller, kept alive until run() finishes
    • gundam + pdf raises ValueError
    • alphabetical sort, not by size
    • max_images cap
    • _already_done semantics (empty file → retry)
    • ResultsWriter writes valid JSONL; None path is a no-op
  • Could not run live inference here (no GPU / SGLang wheel installed). The SGLang server interface (/v1/chat/completions with custom_logit_processor) is unchanged; only the wrapper script and the payload's custom_params values are touched.

…x_pages/images, mode validation, tempdir cleanup

Fixes baidu#17.

Bugs:
- ngram_size and ngram_window were hardcoded to 35/128 even though the
  README's multi-page example uses 1024. Now both are CLI flags; default
  is 1024 to match the multi-page recommended setting, with --ngram_size 0
  to disable the custom logit processor entirely.
- --image_mode gundam was silently accepted with --pdf, producing poor
  multi-page output. build_jobs now raises ValueError.
- pdf_to_images leaked /tmp/pdf_ocr_* directories forever (mkdtemp without
  cleanup). Now uses tempfile.TemporaryDirectory and run() cleans up in a
  finally block.
- collect_stream_silent unconditionally truncated/created the output .md
  even on failed requests, so downstream consumers could not distinguish
  'succeeded with no text' from 'failed after 5 retries'. Now controlled
  by a write_output flag; run() does not pass the output file for
  skipped/failed requests.
- Dataset image list was sorted by descending file size, largest first,
  with no semantic reason. Now sorted alphabetically.
- README line 129 said kernels==0.9.0 but line 135 installed 0.11.7.
  Updated prose to match the install line.

Features:
- --resume: skip images/pages whose .md already exists and is non-empty
  (empty files are retried).
- --results_jsonl <path>: write one structured record per request
  (index, name, status, tokens, decode_time_s, wall_time_s, output_file,
  error) for pipeline integration.
- --max_pages N (PDF mode) and --max_images N (image_dir mode) for
  quick smoke tests.
- run() exits with status 1 when every request failed, so CI / pipelines
  can detect batch failures.
- New 'skipped' status included in the run summary alongside ok/failed.

README: documented the new flags under 'Useful options' and added a note
that --pdf requires --image_mode base.

@rajpratham1 rajpratham1 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing the changes, this looks like a well-rounded usability improvement rather than just a feature addition. The PR adds several practical CLI options (--resume, --results_jsonl, --max_pages, --max_images, configurable n-gram parameters), improves temporary directory cleanup, updates the README accordingly, and preserves backward compatibility for the main inference flow. The changes are cohesive and the documentation has been updated alongside the implementation.

@kushdab

kushdab commented Jun 26, 2026

Copy link
Copy Markdown

Very thorough PR — this addresses the full scope of #17 in one shot. The implementation quality is high. A few things worth discussing before merge:

Substantive concern: DEFAULT_NGRAM_WINDOW = 1024 as the universal default

DEFAULT_NGRAM_SIZE = 35
DEFAULT_NGRAM_WINDOW = 1024   # was 128

The README explicitly documents two recommended settings:

Mode ngram_size ngram_window
Single image (gundam / base) 35 128
Multi-page / PDF 35 1024

Setting --ngram_window to 1024 as the CLI default means any user running single-image mode without explicitly passing --ngram_window 128 gets a wider deduplication window than the model was tuned for. For most documents this is fine, but on short dense images (business cards, labels, single-paragraph scans) the wider window can suppress legitimate near-repetitions in the output.

Suggestion: either keep DEFAULT_NGRAM_WINDOW = 128 (the conservative single-image default) and let users override for multi-page, or make the default conditional on mode:

if args.ngram_window is None:
    args.ngram_window = 1024 if (args.pdf or args.image_dir) else 128

PDF tmpdir cleanup — TemporaryDirectory vs atexit

Returning the TemporaryDirectory object from pdf_to_images() and keeping it alive in the caller is a clean approach. One note: the caller must hold the reference alive for the full inference duration. If someone copies pdf_to_images into their own script and writes paths, _ = pdf_to_images(...), the _ is discarded immediately and the finalizer can run before inference finishes (immediate in CPython, but not guaranteed in PyPy). A docstring warning on this would help.

Overlap with existing PRs

This PR overlaps with two open PRs targeting the same bugs:

This PR is a strict superset of both.

What's excellent

  • The --resume flag is genuinely missing and high-value for long batch runs.
  • --results_jsonl makes the output machine-readable — great for pipeline integration.
  • The structured {status, tokens, decode_time, error} return from infer_one is much cleaner than the old tuple.
  • write_output=False on the resume-skip path is correct (don't clobber an existing .md).
  • The --max_pages / --max_images safety valves are useful.
  • The --image_mode base enforcement for PDF mode is the right guard (PR fix: clean up PDF tmpdir and validate --image_mode in PDF mode #26 also adds this).

Overall this is the strongest single fix for #17 and should supersede the narrower PRs. The ngram_window default is the main thing I'd resolve before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug+feature] infer.py: hardcoded ngram params, no --resume, no structured results, image-mode validation, tmpdir leak

3 participants