Add a CUDA-based Docker image that starts the OpenAI-compatible SGLang API server by default
Use a CUDA devel image only for the build stage and a smaller CUDA runtime image for the final stage
Build a wheelhouse in the build stage and install from it in the runtime stage to keep build dependencies out of the final image
Add --trust-remote-code to the SGLang server command for the custom Unlimited-OCR model code
Remove the redundant kernels install so the bundled SGLang wheel keeps its matching kernel package
Keep batch inference available by overriding the container command with python infer.py ...
Document Docker usage, PDF batch mode with --image_mode base, and the publishing secrets maintainers can configure
Publishing
The workflow publishes to GHCR with the built-in GITHUB_TOKEN. To also publish to Docker Hub, repository maintainers only need to add these repository secrets:
DOCKERHUB_USERNAME
DOCKERHUB_TOKEN
Validation
git diff --check
docker compose config
YAML parsed for .github/workflows/docker.yml and docker-compose.yml
Verified Docker Hub contains the selected nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 base image tag
Well-structured PR: the multi-stage build, non-root user, named volumes, and the GHCR CI workflow are all done correctly. A few issues are worth fixing before this lands:
🐛 Runtime stage still uses the devel image, which doubles image size for no benefit
ARG CUDA_IMAGE=nvidia/cuda:12.9.1-cudnn-devel-ubuntu24.04
FROM ${CUDA_IMAGE} AS build # fine: needs nvcc and headers
...
FROM ${CUDA_IMAGE} AS runtime # ← wrong: devel is ~7–8 GB
The runtime stage only needs the CUDA runtime libraries, not the compiler, headers, or cuDNN development files. Replace the second FROM:
FROM nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 AS runtime
This cuts the final image size by roughly half (devel ≈ 7.5 GB → runtime ≈ 3.5 GB), which matters for first-pull time and registry storage.
🐛 --trust-remote-code is missing from the Docker CMD
baidu/Unlimited-OCR ships custom model code in modeling_unlimitedocr.py. Without --trust-remote-code, SGLang will refuse to load it and exit immediately.
🐛 kernels==0.11.7 in requirements-sglang.txt causes silent model registration failure
The custom SGLang wheel bundles its own sgl_kernel. Installing kernels==0.11.7 alongside it downgrades sgl_kernel, which breaks SGLang's unlimited_ocr.py model registration at import time. SGLang swallows the import error and starts anyway, then crashes at inference with ValueError: UnlimitedOCRForCausalLM is not supported by SGLang (see issue #12). PR #34 removes this line from the README; the same fix should apply to requirements-sglang.txt by dropping the kernels==0.11.7 line.
If kernels genuinely needs to be pinned for the bundled wheel, the correct version is 0.9.0 (matching the existing wheel manifest), but in practice the wheel already bundles the right sgl_kernel and the separate kernels install is redundant.
📖 PDF batch inference example uses --image_mode gundam; it should be base
python infer.py \
--pdf /data/document.pdf \
--image_mode gundam # ← wrong for PDF
gundam uses per-tile crop inference via infer(), which is for single images only. For multi-page PDF mode, base calls infer_multi() which processes the full page context. Using gundam with --pdf silently produces wrong output or empty results. Change to --image_mode base in both the README examples and the compose file defaults.
(This same issue was flagged in the reviews for PRs #21 and #36, so it is worth making sure it doesn't sneak into the documentation.)
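Because the failure described above is silent, a small argument check could turn it into an explicit error. The following is only a sketch: `check_image_mode` is a hypothetical helper, not something taken from infer.py, and it assumes the script sees the `--pdf` value and the `--image_mode` value as plain strings.

```python
from typing import Optional

# Mode the review recommends for multi-page PDF input.
PDF_IMAGE_MODE = "base"


def check_image_mode(image_mode: str, pdf: Optional[str]) -> str:
    """Return image_mode unchanged, or raise if it is unsuitable for PDF input.

    Hypothetical guard: `pdf` is the value of --pdf (None when a single
    image is being processed), `image_mode` is the value of --image_mode.
    """
    if pdf is not None and image_mode != PDF_IMAGE_MODE:
        raise ValueError(
            f"--image_mode {image_mode} is not suitable with --pdf; "
            f"use --image_mode {PDF_IMAGE_MODE}"
        )
    return image_mode


if __name__ == "__main__":
    # Single image: any mode passes through.
    print(check_image_mode("gundam", None))
    # PDF input with the recommended mode.
    print(check_image_mode("base", "/data/document.pdf"))
```

Raising instead of quietly switching the mode keeps the behavior visible to anyone who copies an outdated example.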
Multi-GPU note (informational, not blocking)
docker run --gpus all exposes all GPU devices to the container, but SGLang uses only one unless --tensor-parallel-size N is passed to launch_server. Worth a one-liner in the docs:
# Multi-GPU: add --tensor-parallel-size to match the number of GPUs
docker run ... unlimited-ocr:local # default: single GPU
docker run ... -e TP_SIZE=4 ... # or pass --tensor-parallel-size 4 in CMD
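For the `-e TP_SIZE=4` form to have any effect, something inside the container has to translate the variable into a server flag. The sketch below shows one way an entrypoint script might do that. It is an assumption, not the PR's actual entrypoint: `build_server_cmd` is a hypothetical helper, and `TP_SIZE` is simply the variable name used in the comment above.

```python
import os
from typing import List, Mapping


def build_server_cmd(model_path: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the SGLang launch command, adding tensor parallelism if TP_SIZE > 1.

    Hypothetical entrypoint helper; TP_SIZE is an assumed variable name.
    """
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--trust-remote-code",  # required for the custom model code
    ]
    tp_size = int(env.get("TP_SIZE", "1"))
    if tp_size < 1:
        raise ValueError("TP_SIZE must be a positive integer")
    if tp_size > 1:
        # Only pass the flag when more than one GPU is requested.
        cmd += ["--tensor-parallel-size", str(tp_size)]
    return cmd


if __name__ == "__main__":
    # Print the command that would be executed for the current environment.
    print(" ".join(build_server_cmd("baidu/Unlimited-OCR")))
```

Leaving the flag off when `TP_SIZE` is unset keeps the single-GPU default unchanged.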
What's done well
Multi-stage build is clean: the venv copy keeps heavy build deps out of the runtime layer
Non-root unlimited user with explicit UID/GID args is the right pattern
Volume layout (/data, /app/outputs, /app/log, HF cache) is sensible and matches infer.py conventions
GHCR CI with GITHUB_TOKEN + optional Docker Hub secrets is the standard approach
cancel-in-progress: true on the concurrency group prevents stale image builds from racing
paths: filter on the workflow trigger avoids unnecessary Docker builds on unrelated commits
Fix the four items above (runtime image, --trust-remote-code, kernels removal, --image_mode base in docs) and this is ready to merge.