fix(infer): add max_tokens guard + --trust-remote-code for SGLang#29
Open
kushdab wants to merge 2 commits into
Open
fix(infer): add max_tokens guard + --trust-remote-code for SGLang#29kushdab wants to merge 2 commits into
kushdab wants to merge 2 commits into
Conversation
- Add `max_tokens: CONTEXT_LENGTH` to SGLang request payload so pathological inputs (repetitive table detection loops) can never cause infer.py to hang indefinitely. The partial output is still captured and saved; a natural EOS before the limit is unaffected. - Pass `--trust-remote-code` when launching the SGLang server so the custom UnlimitedOCRForCausalLM architecture is recognized without a ValueError on startup (fixes baidu#12, baidu#27). Fixes baidu#24 Closes part of baidu#27 (item 2) Ref baidu#12
max_tokens in the OpenAI-compat API is output-only; image + prompt tokens consume part of the 32768 context window before generation begins. A dense 1024px document image uses 1500-4000 image tokens depending on resolution and crop_mode, leaving potentially less than 32768 - image_tokens tokens available for output. Setting max_tokens = CONTEXT_LENGTH (32768) can silently truncate large-image inputs because total_tokens = image_tokens + prompt_tokens + output_tokens <= context_length. Fix: subtract 4096 as a conservative image-token headroom (CONTEXT_LENGTH - 4096 = 28672). Natural EOS before the limit is unaffected; only runaway generation is stopped. Addresses review feedback from @emanthen on baidu#27.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two targeted fixes for
infer.pyto address hanging inference and SGLang startup errors.Fix 1:
max_tokensguard in request payload (fixes #24)Problem: When inference is run on certain images (sparse tables, high-contrast layouts), the model enters a detection-token repetition loop and never emits EOS. Because
infer.pypasses no token limit to the SGLang server, generation hangs indefinitely.Fix: Add
"max_tokens": CONTEXT_LENGTHto the request payload. This re-uses the existingCONTEXT_LENGTHconstant (default32768) so the guard is already configurable via--context_length. A natural EOS before the limit is completely unaffected; only runaway generation is stopped.Fix 2:
--trust-remote-codein SGLang server launch (fixes #12, part of #27)Problem: SGLang raises
ValueError: Model architecture UnlimitedOCRForCausalLM is not supportedon startup because the custom model class is not registered in the standard SGLang model registry.Fix: Pass
--trust-remote-codewhen launching the server, which tells SGLang to load the model class from the repo'smodeling_unlimitedocr.py(same mechanism as Transformers).Testing
ValueErrorwith--trust-remote-code.Closes #24 · Part of #27 · Ref #12