infer.py: CLI flags for ngram params, --resume, --results_jsonl, --max_pages/images, mode validation, tempdir cleanup #21
…x_pages/images, mode validation, tempdir cleanup

Fixes baidu#17.

Bugs:
- ngram_size and ngram_window were hardcoded to 35/128 even though the README's multi-page example uses 1024. Now both are CLI flags; the --ngram_window default is 1024 to match the multi-page recommended setting, with --ngram_size 0 to disable the custom logit processor entirely.
- --image_mode gundam was silently accepted with --pdf, producing poor multi-page output. build_jobs now raises ValueError.
- pdf_to_images leaked /tmp/pdf_ocr_* directories forever (mkdtemp without cleanup). Now uses tempfile.TemporaryDirectory and run() cleans up in a finally block.
- collect_stream_silent unconditionally truncated/created the output .md even on failed requests, so downstream consumers could not distinguish 'succeeded with no text' from 'failed after 5 retries'. Now controlled by a write_output flag; run() does not pass the output file for skipped/failed requests.
- Dataset image list was sorted by descending file size, largest first, with no semantic reason. Now sorted alphabetically.
- README line 129 said kernels==0.9.0 but line 135 installed 0.11.7. Updated prose to match the install line.

Features:
- --resume: skip images/pages whose .md already exists and is non-empty (empty files are retried).
- --results_jsonl <path>: write one structured record per request (index, name, status, tokens, decode_time_s, wall_time_s, output_file, error) for pipeline integration.
- --max_pages N (PDF mode) and --max_images N (image_dir mode) for quick smoke tests.
- run() exits with status 1 when every request failed, so CI / pipelines can detect batch failures.
- New 'skipped' status included in the run summary alongside ok/failed.

README: documented the new flags under 'Useful options' and added a note that --pdf requires --image_mode base.
rajpratham1
left a comment
After reviewing the changes, this looks like a well-rounded usability improvement rather than just a feature addition. The PR adds several practical CLI options (--resume, --results_jsonl, --max_pages, --max_images, configurable n-gram parameters), improves temporary directory cleanup, updates the README accordingly, and preserves backward compatibility for the main inference flow. The changes are cohesive and the documentation has been updated alongside the implementation.
Very thorough PR. This addresses the full scope of #17 in one shot. The implementation quality is high. A few things worth discussing before merge:

Substantive concern:

| Mode | ngram_size | ngram_window |
|---|---|---|
| Single image (gundam / base) | 35 | 128 |
| Multi-page / PDF | 35 | 1024 |
Setting --ngram_window to 1024 as the CLI default means any user running single-image mode without explicitly passing --ngram_window 128 gets a wider deduplication window than the model was tuned for. For most documents this is fine, but on short dense images (business cards, labels, single-paragraph scans) the wider window can suppress legitimate near-repetitions in the output.
Suggestion: either keep DEFAULT_NGRAM_WINDOW = 128 (the conservative single-image default) and let users override for multi-page, or make the default conditional on mode:
    if args.ngram_window is None:
        args.ngram_window = 1024 if (args.pdf or args.image_dir) else 128

PDF tmpdir cleanup: TemporaryDirectory vs atexit
Returning the TemporaryDirectory object from pdf_to_images() and keeping it alive in the caller is a clean approach. One note: the caller must hold the reference alive for the full inference duration. If someone copies pdf_to_images into their own script and drops the returned object (for example paths = pdf_to_images(...)[0], or paths, _ = pdf_to_images(...) followed by a later rebinding of _), the directory can be removed before inference finishes: CPython finalizes the object as soon as its last reference goes away, while PyPy does so at some later garbage collection. A docstring warning on this would help.
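A minimal sketch of the lifetime rule described above. It assumes pdf_to_images() returns a (paths, TemporaryDirectory) pair; pdf_to_images_stub and run_stub are hypothetical stand-ins that write placeholder files instead of rendering a PDF, so this is not the code in the PR.

```python
import tempfile
from pathlib import Path


def pdf_to_images_stub(num_pages: int):
    """Hypothetical stand-in for pdf_to_images(): writes placeholder page files.

    WARNING: the caller must keep the returned TemporaryDirectory referenced
    until inference is done; the directory is deleted once it is finalized.
    """
    tmpdir = tempfile.TemporaryDirectory(prefix="pdf_ocr_")
    paths = []
    for i in range(num_pages):
        page = Path(tmpdir.name) / f"page_{i:04d}.png"
        page.write_bytes(b"")  # placeholder; real code would render the page
        paths.append(page)
    return paths, tmpdir


def run_stub(num_pages: int) -> bool:
    """Keep the handle bound to a name for the whole run, clean up in finally."""
    paths, tmpdir = pdf_to_images_stub(num_pages)
    try:
        # Inference would happen here; the page files must still exist.
        return all(p.exists() for p in paths)
    finally:
        tmpdir.cleanup()  # runs on success and on exceptions alike
```

Binding the handle to a real name (tmpdir rather than a throwaway) makes the dependency visible to anyone copying the function.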
Overlap with existing PRs
This PR overlaps with two open PRs targeting the same bugs:
- PR #29, "fix(infer): add max_tokens guard + --trust-remote-code for SGLang" (fix/infer-max-tokens-and-trust-remote-code): fixes the max_tokens headroom bug and --trust-remote-code for SGLang. Both are covered here with your more thorough structured-result approach. If this PR lands, #29 can be closed.
- PR #34, "fix(infer+docs): PDF tmpdir cleanup + remove broken kernels install from README" (fix/pdf-tmpdir-cleanup-and-readme-kernels): fixes PDF tmpdir and the kernels==0.9.0 README typo. Both covered here too. If this PR lands, #34 can be closed.
This PR is a strict superset of both.
What's excellent
- The --resume flag is genuinely missing and high-value for long batch runs.
- --results_jsonl makes the output machine-readable, which is great for pipeline integration.
- The structured {status, tokens, decode_time, error} return from infer_one is much cleaner than the old tuple.
- write_output=False on the resume-skip path is correct (don't clobber an existing .md).
- The --max_pages / --max_images safety valves are useful.
- The --image_mode base enforcement for PDF mode is the right guard (PR #26, "fix: clean up PDF tmpdir and validate --image_mode in PDF mode", also adds this).
Overall this is the strongest single fix for #17 and should supersede the narrower PRs. The ngram_window default is the main thing I'd resolve before merge.
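For reference, the mode-conditional default suggested earlier in this review could be wired into argparse roughly as follows. This is only a sketch: the 128 / 1024 values come from the table in this comment, while parse_args, the constant names, and the --image flag are hypothetical and may not match infer.py.

```python
import argparse

# Values from the table in this review; the names are illustrative only.
SINGLE_IMAGE_NGRAM_WINDOW = 128
MULTI_PAGE_NGRAM_WINDOW = 1024


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--image")  # hypothetical single-image flag
    parser.add_argument("--pdf")
    parser.add_argument("--image_dir")
    # None means "not set by the user", so the default can depend on the mode.
    parser.add_argument("--ngram_window", type=int, default=None)
    args = parser.parse_args(argv)
    if args.ngram_window is None:
        multi = bool(args.pdf or args.image_dir)
        args.ngram_window = (
            MULTI_PAGE_NGRAM_WINDOW if multi else SINGLE_IMAGE_NGRAM_WINDOW
        )
    return args
```

An explicit --ngram_window always wins, so multi-page users who want the narrower window can still pass 128.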
Fixes #17.
Summary
One focused PR covering all 6 bugs + 3 features from the issue. All changes are scoped to infer.py and README.md; no model or upstream dependency touched.

Bug fixes
- infer.py:197-202 (in the old code): ngram_size and ngram_window were both module-level constants (35 and 128). The README's infer_multi example uses ngram_window=1024 for multi-page. Now both are CLI flags; the --ngram_window default is 1024 to match the multi-page recommendation, and --ngram_size 0 disables the custom logit processor entirely.
- --image_mode gundam silently accepted in PDF mode: build_jobs now raises ValueError with a clear message. Also changed the default from gundam to base, since multi-page / dataset use is more common in batch runs and gundam was the silent footgun.
- pdf_to_images leaked /tmp/pdf_ocr_*/: now uses tempfile.TemporaryDirectory, returned to the caller so run() can cleanup() in a finally block (so cleanup happens even on crash).
- Empty .md files on failed requests: collect_stream_silent now takes a write_output flag; only successful requests get an output file. Downstream consumers can now tell tokens > 0 from failed.
- README kernels version mismatch: line 129 said 0.9.0, line 135 installed 0.11.7. Updated prose to match the install line.

New features
- --resume: skip images / pages whose .md already exists and is non-empty. Empty files are retried (so a previous crash mid-write doesn't get stuck).
- --results_jsonl <path>: one structured record per request with index, name, status, tokens, decode_time_s, wall_time_s, output_file, error. Flushed after every record so a crash doesn't lose results.
- --max_pages N (PDF) and --max_images N (image_dir): subset caps for quick smoke tests.
- run() exits with status 1 when every request failed, so pipelines / CI can detect batch failures. Previously a total-failure run exited 0.
- The run summary reports ok=N, failed=M, skipped=K instead of just N/M.

Behavior changes worth noting
- Default --image_mode is now base, not gundam. This matches the README's stronger recommendation (multi-page / dataset use case) and prevents the silent PDF + gundam footgun. Single-image users running gundam explicitly via the README code will be unaffected; the change only affects the default for infer.py.
- Default --ngram_window is now 1024, not 128. This matches the README's infer_multi setting. The old 128 was appropriate for single image but produced worse multi-page output.

If either of these default flips is unwanted, both are easy to swap back; happy to change either before merge.
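The --resume and --results_jsonl behavior listed under New features can be sketched as below. The names already_done and ResultsWriter echo helpers mentioned in this PR, but the bodies here are assumptions rather than the actual implementation; in particular, opening the file in append mode is a guess.

```python
import json
from pathlib import Path


def already_done(md_path: Path) -> bool:
    """True only if the output exists and is non-empty (empty files are retried)."""
    return md_path.is_file() and md_path.stat().st_size > 0


class ResultsWriter:
    """Writes one JSON object per line; a None path makes every call a no-op."""

    def __init__(self, path=None):
        # Append mode is an assumption; it keeps records from a resumed run.
        self._fh = open(path, "a", encoding="utf-8") if path else None

    def write(self, record: dict) -> None:
        if self._fh is None:
            return
        self._fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        self._fh.flush()  # flushed per record so a crash keeps earlier results

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
```

A consumer can then read the file line by line with json.loads and filter on the status field.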
Testing
- … --help).
- uv tool run --from ruff ruff check infer.py → all checks passed.
- python -m py_compile infer.py → OK.
- … fitz / sglang:
  - … .cleanup())
  - max_pages cap
  - … run() finishes
  - gundam + pdf raises ValueError
  - max_images cap
  - _already_done semantics (empty file → retry)
  - ResultsWriter writes valid JSONL; None path is a no-op
- … /v1/chat/completions with custom_logit_processor) is unchanged; only the wrapper script and the payload's custom_params values are touched.
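Two of the behaviors covered above, the gundam + pdf rejection and the all-failed exit status, could be exercised along these lines. validate_mode and summarize are hypothetical helpers written for illustration; the real checks live inside build_jobs and run(), whose signatures are not shown here. Treating a run with no requests as exit code 0 is also an assumption.

```python
from collections import Counter


def validate_mode(pdf, image_mode: str) -> None:
    """Reject the PDF + non-base image mode combination up front."""
    if pdf and image_mode != "base":
        raise ValueError(f"--pdf requires --image_mode base (got {image_mode!r})")


def summarize(statuses):
    """Return (summary line, exit code); exit code is 1 only if every request failed."""
    counts = Counter(statuses)
    summary = (
        f"ok={counts['ok']}, failed={counts['failed']}, skipped={counts['skipped']}"
    )
    all_failed = bool(statuses) and counts["failed"] == len(statuses)
    return summary, (1 if all_failed else 0)
```

A run that mixes skipped and failed requests exits 0 under this reading, since not every request failed.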