fix(infer+docs): PDF tmpdir cleanup + remove broken kernels install from README #34
kushdab wants to merge 2 commits
pdf_to_images() creates a tempfile.mkdtemp() directory but never removes it. At 300 DPI a 100-page PDF generates 400-600 MB of PNG files under /tmp/pdf_ocr_*; repeated runs fill disk silently. Fix: register atexit.register(shutil.rmtree, tmp_dir, ignore_errors=True) immediately after creating the directory. The cleanup fires on normal exit and on unhandled exceptions alike. ignore_errors=True means a partial cleanup never masks the real error. The image paths remain valid for the entire run since the directory is only removed after main() returns. Fixes baidu#20 (disk exhaustion on repeated PDF runs).
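A minimal sketch of the cleanup registration described in this commit message. The helper name `make_render_dir` is hypothetical and stands in for the directory-creation step inside `pdf_to_images()`; it is not the code from the PR.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def make_render_dir(prefix: str = "pdf_ocr_") -> Path:
    """Create a scratch directory and schedule its removal at interpreter exit.

    Hypothetical helper: it only illustrates the mkdtemp + atexit pairing.
    """
    tmp_dir = Path(tempfile.mkdtemp(prefix=prefix))
    # Register right after creation so a later failure cannot skip the cleanup.
    # ignore_errors=True keeps a failed removal from raising during shutdown.
    atexit.register(shutil.rmtree, tmp_dir, ignore_errors=True)
    return tmp_dir


if __name__ == "__main__":
    render_dir = make_render_dir()
    page = render_dir / "page_0001.png"
    page.write_bytes(b"placeholder")  # stands in for a rendered page image
    # The file stays on disk until the interpreter shuts down.
    print(page.exists())
```

Note that `atexit` handlers run at normal interpreter termination, which includes exit after an unhandled exception. They do not run if the process is killed by a signal Python does not handle or if it exits through `os._exit()`, so a hard kill would still leave the directory behind.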
…11.7 inconsistency
The installation section had two problems:
1. The prose said 'pin kernels==0.9.0' but the shell command used 'kernels==0.11.7'.
2. Either version installs a standalone sgl_kernel package that can downgrade the version bundled in the custom SGLang wheel, causing 'Ignore import error when loading sglang.srt.models.unlimited_ocr' at startup. That silently breaks the unlimited_ocr model registration and produces the ValueError described in baidu#12 even when --trust-remote-code is passed.
Fix: remove the 'uv pip install kernels' step entirely and leave an explanatory comment in its place. The custom wheel manages its own sgl_kernel dependency; no separate install is needed.
Fixes root cause documented in baidu#12.
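One way to spot the downgrade described above before launching the server is to compare the installed package version against the version the wheel expects. This is a rough sketch only: the distribution name `sgl-kernel` is an assumption (the source only names the `sgl_kernel` package), the helper names are hypothetical, and the expected version 0.4.4 is taken from the PR description.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.4.4' into (0, 4, 4), stopping at the first non-numeric piece."""
    numbers = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def installed_is_older(dist_name: str, expected: str) -> Optional[bool]:
    """Return True if the installed distribution is older than `expected`,
    False if it is not, and None if the distribution is not installed."""
    try:
        found = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(found) < parse_version(expected)


if __name__ == "__main__":
    # "sgl-kernel" is an assumed distribution name; adjust to the real one.
    print(installed_is_older("sgl-kernel", "0.4.4"))
```

A result of `True` would suggest that a later install replaced the bundled version with an older one, which matches the failure mode reported in baidu#12.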
rajpratham1 left a comment
This is a small but worthwhile maintenance PR. It fixes a real resource management issue by ensuring temporary PDF rendering directories are cleaned up automatically on process exit, and it corrects the README to avoid an installation step that could introduce dependency conflicts. Both changes improve the developer experience without affecting the inference logic.
Good find on the kernels/sgl_kernel version conflict; that's a more …
Two things still open after this merges: …
One correction to my earlier comment on #27: the PDF page order issue …
Summary
Two fixes in one PR — both are straightforward, low-risk changes.
Fix 1: PDF temp directory cleanup (fixes disk exhaustion on repeated runs)
Problem: `pdf_to_images()` creates a `tempfile.mkdtemp(prefix="pdf_ocr_")` directory but never removes it. At 300 DPI, a 100-page PDF generates 400–600 MB of PNGs under `/tmp/pdf_ocr_*/`. Running `infer.py` on multiple PDFs silently fills the disk.
Fix: Register cleanup with `atexit` immediately after creating the directory. `ignore_errors=True` means a partial cleanup (e.g. an NFS hiccup) never masks the actual run error. `atexit` fires on both normal exit and unhandled exceptions.
Fix 2: README: remove the separate `kernels` install; fix the 0.9.0 vs 0.11.7 inconsistency (fixes #12 root cause)
Problem: The installation section contradicted itself:
1. The prose said to pin `kernels==0.9.0`.
2. The shell command ran `uv pip install kernels==0.11.7`.
Either version is wrong. Installing `kernels` at all after the custom SGLang wheel is the root cause of #12:
The custom wheel bundles `sgl_kernel` at the version it was compiled against. Installing a standalone `kernels` package afterward calls `pip` to install `sgl_kernel` as a dependency, and the version pulled by `kernels==0.11.7` (0.4.1) is older than what the wheel expects (0.4.4), which breaks the C-extension imports inside the wheel's `unlimited_ocr.py` module. SGLang swallows the error, logging 'Ignore import error when loading sglang.srt.models.unlimited_ocr' at startup, and then crashes with `ValueError: UnlimitedOCRForCausalLM not supported`, even when `--trust-remote-code` is passed.
Fix: Remove the `uv pip install kernels` line and add a comment explaining why.
Note on PDF page order
@emanthen also flagged PDF pages being scrambled by a size sort in `build_jobs()`. This does not appear in the current codebase: the PDF path in `build_jobs()` uses `pdf_to_images()` directly and iterates in document order without passing through `collect_dataset_images()`. The size sort only applies to `--image_dir` mode, where it correctly front-loads slow work. No change needed for page order.
Closes #12 (root cause) · Addresses disk exhaustion on PDF runs
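The distinction drawn in the page-order note can be illustrated with a small sketch. Both function names are hypothetical and do not come from the repository, and sorting largest-first is an assumption about what "front-loads slow work" means in `--image_dir` mode.

```python
from pathlib import Path
from typing import List


def order_pdf_pages(page_paths: List[Path]) -> List[Path]:
    """PDF mode: keep rendered pages exactly in the order they were produced."""
    return list(page_paths)


def order_dataset_images(image_paths: List[Path]) -> List[Path]:
    """Image-directory mode: largest files first.

    Assumption: bigger images take longer, so scheduling them early
    front-loads the slow work.
    """
    return sorted(image_paths, key=lambda p: p.stat().st_size, reverse=True)
```

Keeping the two paths separate means a size sort can never reorder PDF pages, which is the behavior the note describes for the current codebase.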