infer.py: CLI flags for ngram params, --resume, --results_jsonl, --max_pages/images, mode validation, tempdir cleanup by shreyasmene06 · Pull Request #21 · baidu/Unlimited-OCR

shreyasmene06 · 2026-06-24T15:51:49Z

Fixes #17.

Summary

One focused PR covering all 6 bugs + 3 features from the issue. All changes are scoped to infer.py and README.md; no model or upstream dependency touched.

Bug fixes

Hardcoded ngram params (infer.py:197-202 in the old code): ngram_size and ngram_window were both module-level constants (35 and 128). The README's infer_multi example uses ngram_window=1024 for multi-page. Now both are CLI flags; default is 1024 to match the multi-page recommendation, and --ngram_size 0 disables the custom logit processor entirely.
--image_mode gundam silently accepted in PDF mode: build_jobs now raises ValueError with a clear message. Also changed the default from gundam to base, since multi-page / dataset use is more common in batch runs and gundam was the silent footgun.
pdf_to_images leaked /tmp/pdf_ocr_*/: now uses tempfile.TemporaryDirectory, returned to the caller so run() can cleanup() in a finally block (so cleanup happens even on crash).
Failed requests wrote empty .md files: collect_stream_silent now takes a write_output flag; only successful requests get an output file. Downstream consumers can now tell tokens > 0 from failed.
Dataset images sorted by descending file size: changed to alphabetical. The previous ordering had no semantic justification and made logs confusing.
README kernels version mismatch: line 129 said 0.9.0, line 135 installed 0.11.7. Updated prose to match the install line.

New features

--resume: skip images / pages whose .md already exists and is non-empty. Empty files are retried (so a previous crash mid-write doesn't get stuck).
--results_jsonl <path>: one structured record per request with index, name, status, tokens, decode_time_s, wall_time_s, output_file, error. Flushed after every record so a crash doesn't lose results.
--max_pages N (PDF) and --max_images N (image_dir): subset caps for quick smoke tests.
run() exits with status 1 when every request failed, so pipelines / CI can detect batch failures. Previously a total-failure run exited 0.
Run summary now reports ok=N, failed=M, skipped=K instead of just N/M.

Behavior changes worth noting

Default --image_mode is now base, not gundam. This matches the README's stronger recommendation (multi-page / dataset use case) and prevents the silent PDF+gundam footgun. Single-image users running gundam explicitly via the README code will be unaffected; the change only affects the default for infer.py.
Default --ngram_window is now 1024, not 128. This matches the README's infer_multi setting. The old 128 was appropriate for single image but produced worse multi-page output.

If either of these default flips is unwanted, both are easy to swap back — happy to change either before merge.

Testing

All new CLI flags are accepted by argparse (verified via --help).
uv tool run --from ruff ruff check infer.py → all checks passed.
python -m py_compile infer.py → OK.
7 offline tests via mocked fitz/sglang:
- tempdir lifecycle (cleanup works on .cleanup())
- max_pages cap
- tempdir returned to caller, kept alive until run() finishes
- gundam + pdf raises ValueError
- alphabetical sort, not by size
- max_images cap
- _already_done semantics (empty file → retry)
- ResultsWriter writes valid JSONL; None path is a no-op
Could not run live inference here (no GPU / SGLang wheel installed). The SGLang server interface (/v1/chat/completions with custom_logit_processor) is unchanged; only the wrapper script and the payload's custom_params values are touched.

…x_pages/images, mode validation, tempdir cleanup Fixes baidu#17. Bugs: - ngram_size and ngram_window were hardcoded to 35/128 even though the README's multi-page example uses 1024. Now both are CLI flags; default is 1024 to match the multi-page recommended setting, with --ngram_size 0 to disable the custom logit processor entirely. - --image_mode gundam was silently accepted with --pdf, producing poor multi-page output. build_jobs now raises ValueError. - pdf_to_images leaked /tmp/pdf_ocr_* directories forever (mkdtemp without cleanup). Now uses tempfile.TemporaryDirectory and run() cleans up in a finally block. - collect_stream_silent unconditionally truncated/created the output .md even on failed requests, so downstream consumers could not distinguish 'succeeded with no text' from 'failed after 5 retries'. Now controlled by a write_output flag; run() does not pass the output file for skipped/failed requests. - Dataset image list was sorted by descending file size, largest first, with no semantic reason. Now sorted alphabetically. - README line 129 said kernels==0.9.0 but line 135 installed 0.11.7. Updated prose to match the install line. Features: - --resume: skip images/pages whose .md already exists and is non-empty (empty files are retried). - --results_jsonl <path>: write one structured record per request (index, name, status, tokens, decode_time_s, wall_time_s, output_file, error) for pipeline integration. - --max_pages N (PDF mode) and --max_images N (image_dir mode) for quick smoke tests. - run() exits with status 1 when every request failed, so CI / pipelines can detect batch failures. - New 'skipped' status included in the run summary alongside ok/failed. README: documented the new flags under 'Useful options' and added a note that --pdf requires --image_mode base.

rajpratham1

After reviewing the changes, this looks like a well-rounded usability improvement rather than just a feature addition. The PR adds several practical CLI options (--resume, --results_jsonl, --max_pages, --max_images, configurable n-gram parameters), improves temporary directory cleanup, updates the README accordingly, and preserves backward compatibility for the main inference flow. The changes are cohesive and the documentation has been updated alongside the implementation.

kushdab · 2026-06-26T00:04:20Z

Very thorough PR — this addresses the full scope of #17 in one shot. The implementation quality is high. A few things worth discussing before merge:

Substantive concern: `DEFAULT_NGRAM_WINDOW = 1024` as the universal default

DEFAULT_NGRAM_SIZE = 35
DEFAULT_NGRAM_WINDOW = 1024   # was 128

The README explicitly documents two recommended settings:

Mode	`ngram_size`	`ngram_window`
Single image (`gundam` / `base`)	35	128
Multi-page / PDF	35	1024

Setting --ngram_window to 1024 as the CLI default means any user running single-image mode without explicitly passing --ngram_window 128 gets a wider deduplication window than the model was tuned for. For most documents this is fine, but on short dense images (business cards, labels, single-paragraph scans) the wider window can suppress legitimate near-repetitions in the output.

Suggestion: either keep DEFAULT_NGRAM_WINDOW = 128 (the conservative single-image default) and let users override for multi-page, or make the default conditional on mode:

if args.ngram_window is None:
    args.ngram_window = 1024 if (args.pdf or args.image_dir) else 128

PDF tmpdir cleanup — `TemporaryDirectory` vs `atexit`

Returning the TemporaryDirectory object from pdf_to_images() and keeping it alive in the caller is a clean approach. One note: the caller must hold the reference alive for the full inference duration. If someone copies pdf_to_images into their own script and writes paths, _ = pdf_to_images(...), the _ is discarded immediately and the finalizer can run before inference finishes (immediate in CPython, but not guaranteed in PyPy). A docstring warning on this would help.

Overlap with existing PRs

This PR overlaps with two open PRs targeting the same bugs:

PR fix(infer): add max_tokens guard + --trust-remote-code for SGLang #29 (fix/infer-max-tokens-and-trust-remote-code) — fixes the max_tokens headroom bug and --trust-remote-code for SGLang. Both are covered here with your more thorough structured-result approach. If this PR lands, fix(infer): add max_tokens guard + --trust-remote-code for SGLang #29 can be closed.
PR fix(infer+docs): PDF tmpdir cleanup + remove broken kernels install from README #34 (fix/pdf-tmpdir-cleanup-and-readme-kernels) — fixes PDF tmpdir and the kernels==0.9.0 README typo. Both covered here too. If this PR lands, fix(infer+docs): PDF tmpdir cleanup + remove broken kernels install from README #34 can be closed.

This PR is a strict superset of both.

What's excellent

The --resume flag is genuinely missing and high-value for long batch runs.
--results_jsonl makes the output machine-readable — great for pipeline integration.
The structured {status, tokens, decode_time, error} return from infer_one is much cleaner than the old tuple.
write_output=False on the resume-skip path is correct (don't clobber an existing .md).
The --max_pages / --max_images safety valves are useful.
The --image_mode base enforcement for PDF mode is the right guard (PR fix: clean up PDF tmpdir and validate --image_mode in PDF mode #26 also adds this).

Overall this is the strongest single fix for #17 and should supersede the narrower PRs. The ngram_window default is the main thing I'd resolve before merge.

aqilaziz mentioned this pull request Jun 24, 2026

feat: add HF Spaces demo (app.py), requirements.txt, and unit tests #23

Open

rajpratham1 approved these changes Jun 25, 2026

View reviewed changes

kushdab mentioned this pull request Jun 25, 2026

[bug+feature] infer.py: hardcoded ngram params, no --resume, no structured results, image-mode validation, tmpdir leak #17

Open

This was referenced Jun 26, 2026

fix: clean up PDF tmpdir and validate --image_mode in PDF mode #26

Open

Make batch inference resumable and leak-free #36

Open

Add Docker image publishing and API runtime #39

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

infer.py: CLI flags for ngram params, --resume, --results_jsonl, --max_pages/images, mode validation, tempdir cleanup#21

infer.py: CLI flags for ngram params, --resume, --results_jsonl, --max_pages/images, mode validation, tempdir cleanup#21
shreyasmene06 wants to merge 1 commit into
baidu:mainfrom
shreyasmene06:fix/infer-py-ergonomics

shreyasmene06 commented Jun 24, 2026

Uh oh!

rajpratham1 left a comment

Uh oh!

kushdab commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

shreyasmene06 commented Jun 24, 2026

Summary

Bug fixes

New features

Behavior changes worth noting

Testing

Uh oh!

rajpratham1 left a comment

Choose a reason for hiding this comment

Uh oh!

kushdab commented Jun 26, 2026

Substantive concern: DEFAULT_NGRAM_WINDOW = 1024 as the universal default

PDF tmpdir cleanup — TemporaryDirectory vs atexit

Overlap with existing PRs

What's excellent

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Substantive concern: `DEFAULT_NGRAM_WINDOW = 1024` as the universal default

PDF tmpdir cleanup — `TemporaryDirectory` vs `atexit`