Skip to content

fix(infer+docs): PDF tmpdir cleanup + remove broken kernels install from README#34

Open
kushdab wants to merge 2 commits into
baidu:mainfrom
kushdab:fix/pdf-tmpdir-cleanup-and-readme-kernels
Open

fix(infer+docs): PDF tmpdir cleanup + remove broken kernels install from README#34
kushdab wants to merge 2 commits into
baidu:mainfrom
kushdab:fix/pdf-tmpdir-cleanup-and-readme-kernels

Conversation

@kushdab

@kushdab kushdab commented Jun 25, 2026

Copy link
Copy Markdown

Summary

Two fixes in one PR — both are straightforward, low-risk changes.


Fix 1: PDF temp directory cleanup (fixes disk exhaustion on repeated runs)

Problem: pdf_to_images() creates a tempfile.mkdtemp(prefix="pdf_ocr_") directory but never removes it. At 300 DPI, a 100-page PDF generates 400–600 MB of PNGs under /tmp/pdf_ocr_*/. Running infer.py on multiple PDFs silently fills the disk.

Fix: Register cleanup with atexit immediately after creating the directory.

import atexit, shutil

tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
# cleanup on exit — runs after main() returns, so image paths stay valid the whole run
atexit.register(shutil.rmtree, tmp_dir, ignore_errors=True)

ignore_errors=True means a partial cleanup (e.g. NFS hiccup) never masks the actual run error. atexit fires on both normal exit and unhandled exceptions.


Fix 2: README — remove separate kernels install; fix 0.9.0 vs 0.11.7 inconsistency (fixes #12 root cause)

Problem: The installation section had two contradictions:

  • Prose: "pin kernels==0.9.0"
  • Shell command: uv pip install kernels==0.11.7

Either version is wrong. Installing kernels at all after the custom SGLang wheel is the root cause of #12:

The custom wheel bundles sgl_kernel at the version it was compiled against. Installing a standalone kernels package afterward calls pip to install sgl_kernel as a dependency — and the version pulled by kernels==0.11.7 (0.4.1) is older than what the wheel expects (0.4.4), which breaks the C-extension imports inside the wheel's unlimited_ocr.py module. SGLang swallows the error:

Ignore import error when loading sglang.srt.models.unlimited_ocr

…and then crashes with ValueError: UnlimitedOCRForCausalLM not supported, even when --trust-remote-code is passed.

Fix: Remove the uv pip install kernels line and add a comment explaining why.

- uv pip install kernels==0.11.7
+ # Note: do NOT install 'kernels' separately — the wheel bundles sgl_kernel at the
+ # correct version. A separate install will downgrade sgl_kernel and break unlimited_ocr.

Note on PDF page order

@emanthen also flagged PDF pages being scrambled by a size sort in build_jobs(). This does not appear in the current codebase — the PDF path in build_jobs() uses pdf_to_images() directly and iterates in document order without passing through collect_dataset_images(). The size sort only applies to --image_dir mode, where it correctly front-loads slow work. No change needed for page order.

Closes #12 (root cause) · Addresses disk exhaustion on PDF runs

kushdab added 2 commits June 25, 2026 12:40
pdf_to_images() creates a tempfile.mkdtemp() directory but never removes it.
At 300 DPI a 100-page PDF generates 400-600 MB of PNG files under /tmp/pdf_ocr_*;
repeated runs fill disk silently.

Fix: register atexit.register(shutil.rmtree, tmp_dir, ignore_errors=True) immediately
after creating the directory.  The cleanup fires on normal exit and on unhandled
exceptions alike.  ignore_errors=True means a partial cleanup never masks the real
error.  The image paths remain valid for the entire run since the directory is only
removed after main() returns.

Fixes baidu#20 (disk exhaustion on repeated PDF runs).
…11.7 inconsistency

The installation section had two problems:
1. The prose said 'pin kernels==0.9.0' but the shell command used 'kernels==0.11.7'.
2. Either version installs a standalone sgl_kernel package that can downgrade the
   version bundled in the custom SGLang wheel, causing
   'Ignore import error when loading sglang.srt.models.unlimited_ocr' at startup —
   silently breaking the unlimited_ocr model registration and producing the ValueError
   described in baidu#12 even when --trust-remote-code is passed.

Fix: remove the 'uv pip install kernels' step entirely with an explanatory comment.
The custom wheel manages its own sgl_kernel dependency; no separate install is needed.

Fixes root cause documented in baidu#12.

@rajpratham1 rajpratham1 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small but worthwhile maintenance PR. It fixes a real resource management issue by ensuring temporary PDF rendering directories are cleaned up automatically on process exit, and it corrects the README to avoid an installation step that could introduce dependency conflicts. Both changes improve the developer experience without affecting the inference logic.

@emanthen

Copy link
Copy Markdown

Good find on the kernels/sgl_kernel version conflict — that's a more
precise root cause than what was diagnosed in #12 before. The atexit
cleanup is also the right approach.

Two things still open after this merges:

  1. --trust-remote-code still missing from start_server() cmd in infer.py.
    PR fix(infer): add max_tokens guard + --trust-remote-code for SGLang #29 had this but isn't merged. Worth including here or tracking
    separately so it doesn't get lost between PRs.

  2. max_tokens still absent from the payload in infer_one(). Without it,
    pathological inputs (dense tables, repetitive structure) will run until
    REQUEST_TIMEOUT (20 min) rather than hard-stopping at the context limit.
    Suggested fix:

    "max_tokens": CONTEXT_LENGTH - 4096, # headroom for image + prompt tokens

One correction to my earlier comment on #27: the PDF page order issue
does NOT exist in the current codebase — the PDF branch in build_jobs()
iterates pdf_to_images() directly without going through collect_dataset_images().
That was wrong on my end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UnlimitedOCRForCausalLM not supported by SGLang (ValueError)

3 participants