feat: parallel OCR with transform_markdown_parallel#356

Open
vv4alekseev wants to merge 1 commit into
oomol-lab:mainfrom
vv4alekseev:feature/parallel-ocr

@vv4alekseev

Summary

  • Add pdf_craft.parallel (partition_pages, run_parallel_ocr) to run OCR over disjoint page ranges in separate processes, then reuse the existing markdown pipeline.
  • Export transform_markdown_parallel from the package.
  • Ignore /.hf-cache for Hugging Face cache when colocated with the repo.
  • Tests for partitioning and orchestration (mocked OCR).
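
For illustration only, here is a minimal sketch of splitting a document into disjoint page ranges. The name partition_pages_sketch, the 1-based inclusive ranges, and the "sizes differ by at most one" policy are assumptions; the PR's partition_pages may use a different signature or policy.

```python
def partition_pages_sketch(total_pages: int, workers: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into contiguous, disjoint (first, last) ranges.

    Hypothetical helper, not the PR's partition_pages. Ranges are inclusive and
    their sizes differ by at most one page. Fewer ranges than `workers` are
    returned when there are fewer pages than workers.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if workers < 1:
        raise ValueError("workers must be at least 1")

    parts = min(workers, total_pages)
    if parts == 0:
        return []

    # The first `extra` ranges get one additional page each.
    base, extra = divmod(total_pages, parts)
    ranges: list[tuple[int, int]] = []
    start = 1
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Under these assumptions, a 10-page PDF with 3 workers yields the ranges (1, 4), (5, 7), and (8, 10).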

Made with Cursor

Add pdf_craft.parallel (partition_pages, run_parallel_ocr) to run OCR over
disjoint page ranges in separate processes, then reuse the existing markdown
transform. Export transform_markdown_parallel from the package.

Ignore /.hf-cache for Hugging Face cache when colocated with the repo.

Tests cover page partitioning and orchestration with mocked OCR.

Made-with: Cursor
@coderabbitai

coderabbitai Bot commented Apr 6, 2026

Summary by CodeRabbit

  • New Features

    • Added parallel markdown transformation with configurable worker processes and GPU assignment options.
  • Tests

    • Added test coverage for parallel processing page partitioning, orchestration, and error handling.
  • Chores

    • Updated module exports and project configuration files.

Walkthrough

The pull request introduces parallel OCR processing functionality to the pdf_craft package. A new pdf_craft/parallel.py module implements page partitioning and parallel OCR orchestration via the partition_pages() and run_parallel_ocr() functions. A new public function transform_markdown_parallel() is added to pdf_craft/functions.py that leverages this parallel infrastructure before invoking the standard markdown transformation. The function is re-exported through pdf_craft/__init__.py. Corresponding test coverage is added in tests/test_parallel.py to validate page partitioning logic and parallel OCR orchestration, including GPU configuration and error handling. The .gitignore file is updated to exclude the /.hf-cache directory.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Key observations

  • New module with process orchestration: The pdf_craft/parallel.py module (228 lines) introduces multi-process PDF handling with worker spawning, GPU device assignment, process completion checking, and error propagation logic.
  • Extended API surface: The new transform_markdown_parallel() function exposes parallel-specific parameters (workers, gpu_ids) alongside existing OCR/transformation parameters, requiring careful parameter coordination across function calls.
  • Process synchronization complexity: The implementation manages per-worker token budget division, subprocess spawning with the spawn method, collective error handling across workers, and shared output directory management.
  • Comprehensive test coverage: The test suite validates page partitioning edge cases, parallel OCR invocation, GPU configuration constraints, and the requirement that pdf_handler must be None when using multiple workers.
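
To make the GPU assignment and token budget observations concrete, here is a hedged sketch of how per-worker settings could be derived. WorkerPlan and plan_workers are hypothetical names, and the floor-division split of the token budget is an assumption; the PR's _WorkerConfig and its division rule are not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WorkerPlan:
    """Hypothetical per-worker settings (not the PR's _WorkerConfig)."""
    index: int
    gpu_id: int | None
    max_ocr_tokens: int | None


def plan_workers(
    workers: int,
    gpu_ids: Sequence[int] | None = None,
    max_ocr_tokens: int | None = None,
) -> list[WorkerPlan]:
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The docstring in this PR asks for gpu_ids of length `workers`.
    if gpu_ids is not None and len(gpu_ids) != workers:
        raise ValueError("gpu_ids must have exactly one entry per worker")

    # Assumption: the overall budget is split evenly, rounding down.
    per_worker = None if max_ocr_tokens is None else max_ocr_tokens // workers

    return [
        WorkerPlan(
            index=i,
            gpu_id=None if gpu_ids is None else gpu_ids[i],
            max_ocr_tokens=per_worker,
        )
        for i in range(workers)
    ]
```

One common way to act on gpu_id inside a worker is to set the CUDA_VISIBLE_DEVICES environment variable before any CUDA library initializes; whether the PR does this is not stated here.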
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title follows the required format and clearly describes the main feature being added: parallel OCR functionality with a new transform_markdown_parallel function. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing a clear summary of all major changes including the new parallel module, export additions, and gitignore updates. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
pdf_craft/parallel.py (3)

153-156: Typo: asserts_path should be assets_path.

The variable name asserts_path appears to be a typo for assets_path, which would be more consistent with the asset_path parameter name used in _WorkerConfig and OCR.recognize().

✏️ Suggested fix
-    asserts_path = analysing_path / "assets"
+    assets_path = analysing_path / "assets"
     pages_path = analysing_path / "ocr"
-    asserts_path.mkdir(parents=True, exist_ok=True)
+    assets_path.mkdir(parents=True, exist_ok=True)
     pages_path.mkdir(parents=True, exist_ok=True)

Also update references on lines 178 and 200 from asserts_path to assets_path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pdf_craft/parallel.py` around lines 153 - 156, Rename the misspelled variable
asserts_path to assets_path in the block that creates asset directories and
update all references to it (including later uses in the same function/class),
ensuring consistency with the asset_path parameter used in _WorkerConfig and
OCR.recognize(); change the variable name where it's declared and where it's
referenced (previously asserts_path) so the directory creation and subsequent
code use assets_path.

221-227: Consider adding a timeout to p.join() to prevent indefinite blocking.

If a worker process hangs (e.g., GPU deadlock, infinite loop in OCR), the parent process will block forever on join(). Adding a timeout with periodic checks would improve resilience.

♻️ Suggested approach
     failed: list[tuple[int, int | None]] = []
     for idx, p in enumerate(processes):
-        p.join()
+        p.join(timeout=3600)  # 1 hour timeout per worker
+        if p.is_alive():
+            p.terminate()
+            p.join(timeout=10)
+            failed.append((idx, None))  # None indicates timeout
+            continue
         if p.exitcode != 0:
             failed.append((idx, p.exitcode))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pdf_craft/parallel.py` around lines 221 - 227, The current loop joins each
worker with p.join() and can block forever; change it to use p.join(timeout=...)
in a short loop per process (checking p.is_alive()) and track elapsed time per
process (identify by processes and idx), then if the timeout is exceeded call
p.terminate() / p.kill() and record the exit as failed.append((idx, p.exitcode
or None)) so the parent doesn't hang; ensure you also flush/join again after
termination to reap the process and keep existing logic that appends failures
based on p.exitcode.
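
As a variation on the suggestion above, the sketch below shares one overall deadline across all workers, so several hung workers cannot add up to several full timeouts. join_all and ProcessLike are hypothetical names; the function relies only on join, is_alive, terminate, and exitcode, which multiprocessing.Process provides.

```python
from __future__ import annotations

import time
from typing import Protocol, Sequence


class ProcessLike(Protocol):
    """The subset of multiprocessing.Process used below."""
    exitcode: int | None

    def join(self, timeout: float | None = None) -> None: ...
    def is_alive(self) -> bool: ...
    def terminate(self) -> None: ...


def join_all(
    processes: Sequence[ProcessLike],
    total_timeout: float,
) -> list[tuple[int, int | None]]:
    """Join every worker within one shared deadline.

    Returns (index, exitcode) for failed workers; exitcode is None when the
    worker had to be terminated because the deadline passed.
    """
    deadline = time.monotonic() + total_timeout
    failed: list[tuple[int, int | None]] = []
    for idx, p in enumerate(processes):
        remaining = max(0.0, deadline - time.monotonic())
        p.join(timeout=remaining)
        if p.is_alive():
            p.terminate()
            p.join(timeout=10)  # reap the terminated process
            failed.append((idx, None))
        elif p.exitcode != 0:
            failed.append((idx, p.exitcode))
    return failed
```

Because the workers run concurrently, time spent waiting on the first worker also counts toward the others, which is why a single deadline is a reasonable fit here.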

93-95: Avoid catching BaseException; prefer Exception.

Catching BaseException intercepts KeyboardInterrupt and SystemExit, which can mask legitimate termination signals and prevent proper cleanup. Since traceback is printed anyway, catching Exception is sufficient for error reporting while allowing system signals to propagate naturally.

♻️ Suggested fix
-    except BaseException:
+    except Exception:
         traceback.print_exc()
         sys.exit(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pdf_craft/parallel.py` around lines 93 - 95, Replace the overly broad except
BaseException: handler in pdf_craft/parallel.py with except Exception (e.g.,
change the clause that currently calls traceback.print_exc() and sys.exit(1));
this preserves the error reporting via traceback.print_exc() and exit via
sys.exit(1) while allowing KeyboardInterrupt and SystemExit to propagate
normally. Ensure the handler references Exception (optionally capture it as e)
instead of BaseException and leave the existing traceback.print_exc() and
sys.exit(1) behavior intact.
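
The difference is easy to check in isolation. In the sketch below, run_guarded is a hypothetical stand-in for the worker entry point: it reports ordinary errors and returns a non-zero status, while KeyboardInterrupt and SystemExit pass through because they derive from BaseException rather than Exception.

```python
import traceback


def run_guarded(task) -> int:
    """Run `task`; report ordinary errors, let termination signals propagate.

    Hypothetical stand-in for a worker entry point. Returns 0 on success and
    1 when `task` raises an Exception subclass.
    """
    try:
        task()
    except Exception:
        # KeyboardInterrupt and SystemExit are not Exception subclasses,
        # so they are never caught here.
        traceback.print_exc()
        return 1
    return 0
```

A real worker would pass the returned status to sys.exit, as the existing handler does.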
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9de47ddf-00d9-4f88-9554-3f4f0da0d4f3

📥 Commits

Reviewing files that changed from the base of the PR and between 516e5f1 and f80ba91.

📒 Files selected for processing (5)
  • .gitignore
  • pdf_craft/__init__.py
  • pdf_craft/functions.py
  • pdf_craft/parallel.py
  • tests/test_parallel.py

Comment thread pdf_craft/functions.py
Comment on lines +79 to +103
def transform_markdown_parallel(
    pdf_path: PathLike | str,
    markdown_path: PathLike | str,
    workers: int,
    pdf_handler: PDFHandler | None = None,
    markdown_assets_path: PathLike | str | None = None,
    analysing_path: PathLike | str | None = None,
    ocr_size: DeepSeekOCRSize = "gundam",
    models_cache_path: PathLike | str | None = None,
    local_only: bool = False,
    gpu_ids: Sequence[int] | None = None,
    dpi: int | None = None,
    max_page_image_file_size: int | None = None,
    includes_cover: bool = False,
    includes_footnotes: bool = False,
    ignore_pdf_errors: IgnorePDFErrorsChecker = False,
    ignore_ocr_errors: IgnoreOCRErrorsChecker = False,
    generate_plot: bool = False,
    toc_llm: LLM | None = None,
    toc_assumed: bool = False,
    aborted: AbortedCheck = lambda: False,
    max_ocr_tokens: int | None = None,
    max_ocr_output_tokens: int | None = None,
    on_ocr_event: Callable[[OCREvent], None] = lambda _: None,
) -> OCRTokensMetering:

⚠️ Potential issue | 🟠 Major

aborted and on_ocr_event are not propagated to the parallel OCR phase.

The function signature accepts aborted and on_ocr_event parameters, but run_parallel_ocr doesn't support them. This means:

  1. aborted cannot interrupt the parallel OCR workers; only the subsequent transform_markdown step can be aborted.
  2. on_ocr_event receives no events during parallel OCR, because events raised inside the workers are discarded.

Consider either:

  • Documenting this limitation in the docstring
  • Removing these parameters from the parallel variant if they're misleading
📝 Suggested docstring clarification
     """Like :func:`transform_markdown`, but runs OCR over page ranges in parallel processes first.
 
     Use ``gpu_ids`` with length ``workers`` to pin each worker to a GPU (e.g. ``[0, 1, 2, 3]``).
     With a single GPU, prefer ``workers=1`` or expect high VRAM use. Error checkers must be
     picklable when ``workers > 1`` (booleans are fine; lambdas are not).
+
+    Note: The ``aborted`` callback only takes effect after parallel OCR completes (during the
+    transform phase). Similarly, ``on_ocr_event`` only receives events from the final transform
+    step, not from the parallel OCR workers.
     """
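
The picklability requirement in the docstring above can be checked before any worker is spawned. This is a small sketch; is_picklable is a hypothetical helper and not part of pdf_craft.

```python
import pickle


def is_picklable(obj: object) -> bool:
    """Return True if `obj` survives pickle.dumps.

    Arguments passed to a process started with the spawn method must be
    picklable, so this is a cheap pre-flight check.
    """
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # Lambdas and locally defined functions fail with one of these,
        # depending on the Python version.
        return False
    return True
```

Booleans pass this check, while a lambda such as the default value of aborted does not, which matches the docstring's warning.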
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pdf_craft/functions.py` around lines 79 - 103, The
transform_markdown_parallel function currently accepts aborted and on_ocr_event
but doesn't propagate them into the parallel OCR phase (run_parallel_ocr), so
abort signals and OCR events are lost; fix this by updating run_parallel_ocr's
signature to accept aborted: AbortedCheck and on_ocr_event: Callable[[OCREvent],
None] (or optional equivalents), thread-safely pass those parameters from
transform_markdown_parallel into run_parallel_ocr, and ensure the worker/task
code inside run_parallel_ocr invokes aborted to stop work early and calls
on_ocr_event for OCR progress/events; alternatively, if you prefer not to
support them in parallel mode, remove aborted and on_ocr_event from
transform_markdown_parallel's signature or add a clear docstring note explaining
they are ignored in parallel mode (reference symbols:
transform_markdown_parallel, run_parallel_ocr, aborted, on_ocr_event).
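
If the propagation route is chosen, one possible shape is a shared stop flag plus an event queue. The sketch below is an assumption-heavy illustration, not the PR's design: make_channels, worker_loop, supervise_once, and the (kind, page) tuples are all hypothetical, and a real version would carry OCREvent values instead.

```python
from __future__ import annotations

import multiprocessing
import queue
from typing import Callable, Iterable


def make_channels():
    """Create a stop flag and an event queue that spawned workers can share."""
    ctx = multiprocessing.get_context("spawn")
    return ctx.Event(), ctx.Queue()


def worker_loop(pages: Iterable[int], stop_event, events) -> None:
    """Hypothetical worker body: check the stop flag before each page."""
    for page in pages:
        if stop_event.is_set():
            events.put(("aborted", page))
            return
        # OCR for `page` would run here (omitted in this sketch).
        events.put(("page_done", page))


def supervise_once(
    aborted: Callable[[], bool],
    stop_event,
    events,
    on_event: Callable[[tuple[str, int]], None],
) -> int:
    """Parent side: relay the abort request, then forward queued events.

    Returns the number of events forwarded. Callbacks stay in the parent, so
    they do not need to be picklable.
    """
    if aborted():
        stop_event.set()
    forwarded = 0
    while True:
        try:
            item = events.get(timeout=0.1)
        except queue.Empty:
            return forwarded
        on_event(item)
        forwarded += 1
```

The parent would call supervise_once repeatedly while the workers are alive. Keeping aborted and on_ocr_event in the parent sidesteps the picklability constraint, since only the flag and the queue cross the process boundary.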
