feat: parallel OCR with transform_markdown_parallel#356
Conversation
Add pdf_craft.parallel (partition_pages, run_parallel_ocr) to run OCR over disjoint page ranges in separate processes, then reuse the existing markdown transform. Export transform_markdown_parallel from the package. Ignore /.hf-cache for Hugging Face cache when colocated with the repo. Tests cover page partitioning and orchestration with mocked OCR. Made-with: Cursor
Summary by CodeRabbit
WalkthroughThe pull request introduces parallel OCR processing functionality to the pdf_craft package. A new Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Key observations
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
pdf_craft/parallel.py (3)
153-156: Typo:asserts_pathshould beassets_path.The variable name
asserts_pathappears to be a typo forassets_path, which would be more consistent with theasset_pathparameter name used in_WorkerConfigandOCR.recognize().✏️ Suggested fix
- asserts_path = analysing_path / "assets" + assets_path = analysing_path / "assets" pages_path = analysing_path / "ocr" - asserts_path.mkdir(parents=True, exist_ok=True) + assets_path.mkdir(parents=True, exist_ok=True) pages_path.mkdir(parents=True, exist_ok=True)Also update references on lines 178 and 200 from
asserts_pathtoassets_path.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pdf_craft/parallel.py` around lines 153 - 156, Rename the misspelled variable asserts_path to assets_path in the block that creates asset directories and update all references to it (including later uses in the same function/class), ensuring consistency with the asset_path parameter used in _WorkerConfig and OCR.recognize(); change the variable name where it's declared and where it's referenced (previously asserts_path) so the directory creation and subsequent code use assets_path.
221-227: Consider adding a timeout top.join()to prevent indefinite blocking.If a worker process hangs (e.g., GPU deadlock, infinite loop in OCR), the parent process will block forever on
join(). Adding a timeout with periodic checks would improve resilience.♻️ Suggested approach
failed: list[tuple[int, int | None]] = [] for idx, p in enumerate(processes): - p.join() + p.join(timeout=3600) # 1 hour timeout per worker + if p.is_alive(): + p.terminate() + p.join(timeout=10) + failed.append((idx, None)) # None indicates timeout + continue if p.exitcode != 0: failed.append((idx, p.exitcode))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pdf_craft/parallel.py` around lines 221 - 227, The current loop joins each worker with p.join() and can block forever; change it to use p.join(timeout=...) in a short loop per process (checking p.is_alive()) and track elapsed time per process (identify by processes and idx), then if the timeout is exceeded call p.terminate() / p.kill() and record the exit as failed.append((idx, p.exitcode or None)) so the parent doesn't hang; ensure you also flush/join again after termination to reap the process and keep existing logic that appends failures based on p.exitcode.
93-95: Avoid catchingBaseException; preferException.Catching
BaseExceptioninterceptsKeyboardInterruptandSystemExit, which can mask legitimate termination signals and prevent proper cleanup. Since traceback is printed anyway, catchingExceptionis sufficient for error reporting while allowing system signals to propagate naturally.♻️ Suggested fix
- except BaseException: + except Exception: traceback.print_exc() sys.exit(1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pdf_craft/parallel.py` around lines 93 - 95, Replace the overly broad except BaseException: handler in pdf_craft/parallel.py with except Exception (e.g., change the clause that currently calls traceback.print_exc() and sys.exit(1)); this preserves the error reporting via traceback.print_exc() and exit via sys.exit(1) while allowing KeyboardInterrupt and SystemExit to propagate normally. Ensure the handler references Exception (optionally capture it as e) instead of BaseException and leave the existing traceback.print_exc() and sys.exit(1) behavior intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pdf_craft/functions.py`:
- Around line 79-103: The transform_markdown_parallel function currently accepts
aborted and on_ocr_event but doesn't propagate them into the parallel OCR phase
(run_parallel_ocr), so abort signals and OCR events are lost; fix this by
updating run_parallel_ocr's signature to accept aborted: AbortedCheck and
on_ocr_event: Callable[[OCREvent], None] (or optional equivalents),
thread-safely pass those parameters from transform_markdown_parallel into
run_parallel_ocr, and ensure the worker/task code inside run_parallel_ocr
invokes aborted to stop work early and calls on_ocr_event for OCR
progress/events; alternatively, if you prefer not to support them in parallel
mode, remove aborted and on_ocr_event from transform_markdown_parallel's
signature or add a clear docstring note explaining they are ignored in parallel
mode (reference symbols: transform_markdown_parallel, run_parallel_ocr, aborted,
on_ocr_event).
---
Nitpick comments:
In `@pdf_craft/parallel.py`:
- Around line 153-156: Rename the misspelled variable asserts_path to
assets_path in the block that creates asset directories and update all
references to it (including later uses in the same function/class), ensuring
consistency with the asset_path parameter used in _WorkerConfig and
OCR.recognize(); change the variable name where it's declared and where it's
referenced (previously asserts_path) so the directory creation and subsequent
code use assets_path.
- Around line 221-227: The current loop joins each worker with p.join() and can
block forever; change it to use p.join(timeout=...) in a short loop per process
(checking p.is_alive()) and track elapsed time per process (identify by
processes and idx), then if the timeout is exceeded call p.terminate() /
p.kill() and record the exit as failed.append((idx, p.exitcode or None)) so the
parent doesn't hang; ensure you also flush/join again after termination to reap
the process and keep existing logic that appends failures based on p.exitcode.
- Around line 93-95: Replace the overly broad except BaseException: handler in
pdf_craft/parallel.py with except Exception (e.g., change the clause that
currently calls traceback.print_exc() and sys.exit(1)); this preserves the error
reporting via traceback.print_exc() and exit via sys.exit(1) while allowing
KeyboardInterrupt and SystemExit to propagate normally. Ensure the handler
references Exception (optionally capture it as e) instead of BaseException and
leave the existing traceback.print_exc() and sys.exit(1) behavior intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9de47ddf-00d9-4f88-9554-3f4f0da0d4f3
📒 Files selected for processing (5)
.gitignorepdf_craft/__init__.pypdf_craft/functions.pypdf_craft/parallel.pytests/test_parallel.py
| def transform_markdown_parallel( | ||
| pdf_path: PathLike | str, | ||
| markdown_path: PathLike | str, | ||
| workers: int, | ||
| pdf_handler: PDFHandler | None = None, | ||
| markdown_assets_path: PathLike | str | None = None, | ||
| analysing_path: PathLike | str | None = None, | ||
| ocr_size: DeepSeekOCRSize = "gundam", | ||
| models_cache_path: PathLike | str | None = None, | ||
| local_only: bool = False, | ||
| gpu_ids: Sequence[int] | None = None, | ||
| dpi: int | None = None, | ||
| max_page_image_file_size: int | None = None, | ||
| includes_cover: bool = False, | ||
| includes_footnotes: bool = False, | ||
| ignore_pdf_errors: IgnorePDFErrorsChecker = False, | ||
| ignore_ocr_errors: IgnoreOCRErrorsChecker = False, | ||
| generate_plot: bool = False, | ||
| toc_llm: LLM | None = None, | ||
| toc_assumed: bool = False, | ||
| aborted: AbortedCheck = lambda: False, | ||
| max_ocr_tokens: int | None = None, | ||
| max_ocr_output_tokens: int | None = None, | ||
| on_ocr_event: Callable[[OCREvent], None] = lambda _: None, | ||
| ) -> OCRTokensMetering: |
There was a problem hiding this comment.
aborted and on_ocr_event are not propagated to the parallel OCR phase.
The function signature accepts aborted and on_ocr_event parameters, but run_parallel_ocr doesn't support them. This means:
abortedwon't interrupt the parallel OCR workers—only the subsequenttransform_markdownstep can be abortedon_ocr_eventwon't receive any events during OCR processing (events are discarded in workers)
Consider either:
- Documenting this limitation in the docstring
- Removing these parameters from the parallel variant if they're misleading
📝 Suggested docstring clarification
"""Like :func:`transform_markdown`, but runs OCR over page ranges in parallel processes first.
Use ``gpu_ids`` with length ``workers`` to pin each worker to a GPU (e.g. ``[0, 1, 2, 3]``).
With a single GPU, prefer ``workers=1`` or expect high VRAM use. Error checkers must be
picklable when ``workers > 1`` (booleans are fine; lambdas are not).
+
+ Note: The ``aborted`` callback only takes effect after parallel OCR completes (during the
+ transform phase). Similarly, ``on_ocr_event`` only receives events from the final transform
+ step, not from the parallel OCR workers.
"""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pdf_craft/functions.py` around lines 79 - 103, The
transform_markdown_parallel function currently accepts aborted and on_ocr_event
but doesn't propagate them into the parallel OCR phase (run_parallel_ocr), so
abort signals and OCR events are lost; fix this by updating run_parallel_ocr's
signature to accept aborted: AbortedCheck and on_ocr_event: Callable[[OCREvent],
None] (or optional equivalents), thread-safely pass those parameters from
transform_markdown_parallel into run_parallel_ocr, and ensure the worker/task
code inside run_parallel_ocr invokes aborted to stop work early and calls
on_ocr_event for OCR progress/events; alternatively, if you prefer not to
support them in parallel mode, remove aborted and on_ocr_event from
transform_markdown_parallel's signature or add a clear docstring note explaining
they are ignored in parallel mode (reference symbols:
transform_markdown_parallel, run_parallel_ocr, aborted, on_ocr_event).
Summary
pdf_craft.parallel(partition_pages,run_parallel_ocr) to run OCR over disjoint page ranges in separate processes, then reuse the existing markdown pipeline.transform_markdown_parallelfrom the package./.hf-cachefor Hugging Face cache when colocated with the repo.Made with Cursor