Skip to content

Add Docker image publishing and API runtime#39

Draft
marcelMaier wants to merge 1 commit into
baidu:mainfrom
marcelMaier:docker-image-publishing
Draft

Add Docker image publishing and API runtime#39
marcelMaier wants to merge 1 commit into
baidu:mainfrom
marcelMaier:docker-image-publishing

Conversation

@marcelMaier

@marcelMaier marcelMaier commented Jun 26, 2026

Copy link
Copy Markdown

Summary

  • Add a CUDA-based Docker image that starts the OpenAI-compatible SGLang API server by default
  • Use a CUDA devel image only for the build stage and a smaller CUDA runtime image for the final stage
  • Build a wheelhouse in the build stage and install from it in the runtime stage to keep build dependencies out of the final image
  • Add --trust-remote-code to the SGLang server command for the custom Unlimited-OCR model code
  • Remove the redundant kernels install so the bundled SGLang wheel keeps its matching kernel package
  • Keep batch inference available by overriding the container command with python infer.py ...
  • Document Docker usage, PDF batch mode with --image_mode base, and the publishing secrets maintainers can configure

Publishing

The workflow publishes to GHCR with the built-in GITHUB_TOKEN. To also publish to Docker Hub, repository maintainers only need to add these repository secrets:

  • DOCKERHUB_USERNAME
  • DOCKERHUB_TOKEN

Validation

  • git diff --check
  • docker compose config
  • YAML parsed for .github/workflows/docker.yml and docker-compose.yml
  • Verified Docker Hub contains the selected nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 base image tag

@marcelMaier

Copy link
Copy Markdown
Author

This is building on my other PR and activates an API Server so theres a ready to use inference.

@kushdab

kushdab commented Jun 26, 2026

Copy link
Copy Markdown

Well-structured PR the multi-stage build, non-root user, named volumes, and the GHCR CI workflow are all done correctly. A few issues worth fixing before this lands:

🐛 Runtime stage still uses the devel image doubles image size for no benefit

ARG CUDA_IMAGE=nvidia/cuda:12.9.1-cudnn-devel-ubuntu24.04
FROM ${CUDA_IMAGE} AS build   # fine: needs nvcc and headers
...
FROM ${CUDA_IMAGE} AS runtime  # ← wrong: devel is ~7–8 GB

The runtime stage only needs the CUDA runtime libraries not the compiler, headers, or cuDNN development files. Replace the second FROM:

FROM nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 AS runtime

This cuts the final image size by roughly half (devel ≈ 7.5 GB → runtime ≈ 3.5 GB), which matters for first-pull time and registry storage.

🐛 --trust-remote-code is missing from the Docker CMD

baidu/Unlimited-OCR ships custom model code in modeling_unlimitedocr.py. Without --trust-remote-code, SGLang will refuse to load it and exit immediately:

Error: Loading baidu/Unlimited-OCR requires --trust-remote-code

Add the flag to the CMD:

CMD ["python", "-m", "sglang.launch_server", \
    "--model", "baidu/Unlimited-OCR", \
    "--trust-remote-code", \          # ← add this
    "--served-model-name", "Unlimited-OCR", \
    ...

🐛 kernels==0.11.7 in requirements-sglang.txt causes silent model registration failure

The custom SGLang wheel bundles its own sgl_kernel. Installing kernels==0.11.7 alongside it downgrades sgl_kernel, which breaks SGLang's unlimited_ocr.py model registration at import time. SGLang swallows the import error and starts anyway, then crashes at inference with ValueError: UnlimitedOCRForCausalLM is not supported by SGLang (see issue #12). PR #34 removes this line from the README; the same fix should apply to this file:

-./wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
-kernels==0.11.7
-pymupdf==1.27.2.2
+./wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
+pymupdf==1.27.2.2

If kernels genuinely needs to be pinned for the bundled wheel, the correct version is 0.9.0 (matching the existing wheel manifest) — but in practice the wheel already bundles the right sgl_kernel and the separate kernels install is redundant.

📖 PDF batch inference example uses --image_mode gundam — should be base

python infer.py \
    --pdf /data/document.pdf \
    --image_mode gundam    # ← wrong for PDF

gundam uses per-tile crop inference via infer(), which is for single images only. For multi-page PDF mode, base calls infer_multi() which processes the full page context. Using gundam with --pdf silently produces wrong output or empty results. Change to --image_mode base in both the README examples and the compose file defaults.

(This same issue was flagged in the reviews for PRs #21 and #36 — worth making sure it doesn't sneak into documentation.)

Multi-GPU note (informational, not blocking)

docker run --gpus all exposes all GPU devices to the container, but SGLang uses only one unless --tensor-parallel-size N is passed to launch_server. Worth a one-liner in the docs:

# Multi-GPU: add --tensor-parallel-size to match the number of GPUs
docker run ... unlimited-ocr:local  # default: single GPU
docker run ... -e TP_SIZE=4 ...     # or pass --tensor-parallel-size 4 in CMD

What's done well

  • Multi-stage build is clean venv copy eliminates heavy build deps from runtime layer
  • Non-root unlimited user with explicit UID/GID args is the right pattern
  • Volume layout (/data, /app/outputs, /app/log, HF cache) is sensible and matches infer.py conventions
  • GHCR CI with GITHUB_TOKEN + optional Docker Hub secrets is the standard approach
  • cancel-in-progress: true on the concurrency group prevents stale image builds from racing
  • paths: filter on the workflow trigger avoids unnecessary Docker builds on unrelated commits

Fix the four items above (runtime image, --trust-remote-code, kernels removal, --image_mode base in docs) and this is ready to merge.

@marcelMaier marcelMaier force-pushed the docker-image-publishing branch from b79f954 to 6712ac3 Compare June 26, 2026 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants