Skip to content

fix(infer): add max_tokens guard + --trust-remote-code for SGLang#29

Open
kushdab wants to merge 2 commits into
baidu:mainfrom
kushdab:fix/infer-max-tokens-and-trust-remote-code
Open

fix(infer): add max_tokens guard + --trust-remote-code for SGLang#29
kushdab wants to merge 2 commits into
baidu:mainfrom
kushdab:fix/infer-max-tokens-and-trust-remote-code

Conversation

@kushdab

@kushdab kushdab commented Jun 25, 2026

Copy link
Copy Markdown

Summary

Two targeted fixes for infer.py to address hanging inference and SGLang startup errors.


Fix 1: max_tokens guard in request payload (fixes #24)

Problem: When inference is run on certain images (sparse tables, high-contrast layouts), the model enters a detection-token repetition loop and never emits EOS. Because infer.py passes no token limit to the SGLang server, generation hangs indefinitely.

Fix: Add "max_tokens": CONTEXT_LENGTH to the request payload. This re-uses the existing CONTEXT_LENGTH constant (default 32768) so the guard is already configurable via --context_length. A natural EOS before the limit is completely unaffected; only runaway generation is stopped.

payload = {
    ...
    "max_tokens": CONTEXT_LENGTH,  # prevents infinite generation loops
}

Fix 2: --trust-remote-code in SGLang server launch (fixes #12, part of #27)

Problem: SGLang raises ValueError: Model architecture UnlimitedOCRForCausalLM is not supported on startup because the custom model class is not registered in the standard SGLang model registry.

Fix: Pass --trust-remote-code when launching the server, which tells SGLang to load the model class from the repo's modeling_unlimitedocr.py (same mechanism as Transformers).

python -m sglang.launch_server \
  --model baidu/Unlimited-OCR \
  --trust-remote-code \   # <-- added
  --port 30000

Testing

  • Tested the hang fix by passing an image that previously caused the loop: generation now terminates at the token limit with a partial but useful output, and the next file proceeds normally.
  • SGLang server starts without ValueError with --trust-remote-code.

Closes #24 · Part of #27 · Ref #12

- Add `max_tokens: CONTEXT_LENGTH` to SGLang request payload so
  pathological inputs (repetitive table detection loops) can never
  cause infer.py to hang indefinitely. The partial output is still
  captured and saved; a natural EOS before the limit is unaffected.
- Pass `--trust-remote-code` when launching the SGLang server so
  the custom UnlimitedOCRForCausalLM architecture is recognized
  without a ValueError on startup (fixes baidu#12, baidu#27).

Fixes baidu#24
Closes part of baidu#27 (item 2)
Ref baidu#12
max_tokens in the OpenAI-compat API is output-only; image + prompt tokens consume
part of the 32768 context window before generation begins. A dense 1024px document
image uses 1500-4000 image tokens depending on resolution and crop_mode, leaving
potentially less than 32768 - image_tokens tokens available for output.

Setting max_tokens = CONTEXT_LENGTH (32768) can silently truncate large-image inputs
because total_tokens = image_tokens + prompt_tokens + output_tokens <= context_length.

Fix: subtract 4096 as a conservative image-token headroom (CONTEXT_LENGTH - 4096 = 28672).
Natural EOS before the limit is unaffected; only runaway generation is stopped.

Addresses review feedback from @emanthen on baidu#27.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UnlimitedOCRForCausalLM not supported by SGLang (ValueError) Inference hangs on specific table image

1 participant