fix: add safe_root path traversal guard for analysing_path by shamario13 · Pull Request #361 · oomol-lab/pdf-craft

shamario13 · 2026-05-19T23:01:00Z

Summary

Comprehensive security hardening based on a full audit of the codebase. Fixes span four severity levels: HIGH, MEDIUM, LOW, and INFORMATIONAL.

HIGH

Path traversal via user-supplied `analysing_path`

Files: pdf_craft/common/folder.py, pdf_craft/common/asset.py, pdf_craft/transform.py

When pdf-craft is wrapped in a service that accepts a user-controlled analysing_path, an attacker could point it at an arbitrary writable location (e.g. /var/www/html/), causing the OCR pipeline to write files outside the intended working directory.

Added assert_within_root(path, root) helper that resolves both paths before comparing (symlinks and .. components cannot bypass it)
Extended EnsureFolder with an optional safe_root parameter that validates analysing_path before any directory is created
Added safe_root parameter to transform_markdown and transform_epub (backward-compatible, defaults to None)
Resolved AssetHub._asset_path to an absolute canonical path at construction time

# Usage in a service
transform.transform_markdown(
    pdf_path=user_pdf,
    markdown_path=out_md,
    analysing_path=user_supplied_path,  # untrusted
    safe_root=/var/app/workdir,       # enforced boundary
)
# raises ValueError if user_supplied_path resolves outside /var/app/workdir

MEDIUM

Unpinned GitHub Actions (supply chain)

Files: .github/workflows/merge-build.yml, .github/workflows/pr-check.yml

Floating version tags (@v4, @v5, @v3) allow a tag-hijack on any of these actions to run attacker-controlled code in CI with full GITHUB_TOKEN access. Pinned all three to full commit SHAs:

actions/checkout → 34e114876b0b11c390a56381ad16ebd13914f8d5 (v4)
actions/setup-python → a26af69be951a213d495a4c3e4e4022e16d87065 (v5)
abatilo/actions-poetry → fd0e6716a0de25ef6ade151b8b53190b0376acfd (v3)

LLM log files world-readable

File: pdf_craft/llm/core.py

Log files contain verbatim PDF text and were created with the process umask (typically world-readable on multi-user systems). Added os.chmod(file_path, 0o600) after FileHandler creation on POSIX systems.

API key in plaintext config file

Files: scripts/gen_md.py, scripts/gen_epub.py

Both scripts now check the PDF_CRAFT_API_KEY environment variable first and fall back to format.json only if it is not set, providing a safer alternative to storing credentials on disk.

LLM prompt injection via OCR-extracted text

File: pdf_craft/toc/llm_analyser.py

PDF-extracted heading and TOC text was embedded verbatim into LLM prompts. Wrapped all PDF-derived text in <text>...</text> delimiters and updated both system prompts to instruct the model that only content inside those tags is PDF data.

XML entity expansion (billion laughs)

Files: pdf_craft/common/xml.py, pyproject.toml

Replaced xml.etree.ElementTree.fromstring with defusedxml.ElementTree.fromstring throughout. Added defusedxml>=0.7.1 as a dependency.

LOW

Unvalidated det coordinates reach `Image.crop()`

File: pdf_craft/sequence/chapter.py

_parse_det() now rejects negative values and degenerate bounding boxes (x1 >= x2 or y1 >= y2) before they reach Pillow's Image.crop().

DPI amplification from crafted page dimensions

File: pdf_craft/pdf/page_ref.py

A PDF with unrealistically large declared page dimensions could produce an unbounded DPI value, causing Poppler to attempt excessive memory allocation. Added _MAX_DPI = 600 cap in _dpi_with_size().

Fixed-name temp file race in `save_xml()`

File: pdf_craft/common/xml.py

Replaced the fixed .xml.tmp suffix with tempfile.mkstemp(dir=file_path.parent), giving each writer a unique random name while keeping the write-then-rename atomic (same directory = same filesystem).

Truncated UUID context ID

File: pdf_craft/llm/context.py

Changed uuid4().hex[:12] (48-bit) to uuid4().hex (128-bit). No reason to truncate.

INFORMATIONAL

O(n³) worst-case in `has_repetitive_ngrams()`

File: pdf_craft/pdf/ngrams.py

Replaced per-comparison tuple(chars[i:i+n]) creation (O(n) per comparison) with a Rabin-Karp rolling hash. N-gram comparisons are now O(1) (hash check) with O(n) string verification only on hash matches, eliminating the CPU spike risk from crafted OCR output.

No size limit before XML parsing

File: pdf_craft/common/xml.py

Added a _MAX_XML_BYTES = 100 MB gate in read_xml() that checks file_path.stat().st_size before reading the file into memory, preventing OOM from unexpectedly large intermediate files.

`json-repair` silently reconstructed malformed LLM output

File: pdf_craft/toc/llm_analyser.py

Removed the json-repair dependency entirely. Both _validate_title_response and _validate_toc_response now use strict json.loads() — malformed JSON in the RESULT section triggers a model retry with a descriptive error message instead of being silently reconstructed (which could mask prompt-injection payloads). Also removed json-repair from pyproject.toml.

Test plan

transform_markdown / transform_epub with analysing_path inside safe_root — works normally
Same with analysing_path outside safe_root — raises ValueError before any filesystem writes
Same without safe_root — no behavioural change (backward-compatible)
PDF with unrealistically large page dimensions — DPI capped at 600, no OOM
XML det attribute with negative or degenerate values — ValueError raised at parse time
LLM response with malformed JSON in RESULT — triggers retry with clear error, not silent repair
Log files created with log_dir_path set — chmod 600 on POSIX, owner-only readable
PDF_CRAFT_API_KEY env var set — scripts use it instead of format.json key

🤖 Generated with Claude Code

Introduces assert_within_root() in EnsureFolder and a safe_root parameter on transform_markdown/transform_epub so callers wrapping the library in a service can enforce that the working directory stays inside an expected filesystem boundary. Also resolves AssetHub._asset_path to an absolute canonical path at construction time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-19T23:01:15Z

Walkthrough

This PR adds multiple safety and behavior changes across the codebase: it enforces path containment via a new assert_within_root and EnsureFolder.safe_root (wired through Transform.transform_markdown/transform_epub), normalizes AssetHub paths with Path.resolve(), tightens XML parsing and atomic save behavior, implements rolling-hash n-gram repetition detection, caps computed DPI, validates rectangle coordinates, updates LLM prompts and strict JSON parsing, allows runtime LLM key overrides, tightens per-request logfile permissions, and pins CI action versions.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Transform
  participant EnsureFolder
  participant assert_within_root
  Client->>Transform: transform_markdown(..., safe_root)
  Transform->>EnsureFolder: EnsureFolder(path=..., safe_root=to_path(safe_root))
  Client->>EnsureFolder: __enter__()
  EnsureFolder->>assert_within_root: assert_within_root(path, safe_root)
  assert_within_root->>assert_within_root: resolve & check containment
  alt contained
    assert_within_root-->>EnsureFolder: validation passes
    EnsureFolder-->>Transform: context ready
  else not contained
    assert_within_root-->>EnsureFolder: raise ValueError
    EnsureFolder-->>Transform: propagate error
  end

Possibly related PRs

oomol-lab/pdf-craft#274: Also modifies Transform.transform_epub signature and plumbing (overlaps at the same method level).
oomol-lab/pdf-craft#266: Changes the EPUB transformation flow including Transform.transform_epub, intersecting at the same function.
oomol-lab/pdf-craft#339: Related changes to pdf_craft/toc/llm_analyser prompt construction and response handling.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title follows the required format with type and subject, and is closely related to the main security improvement: adding a safe_root path traversal guard.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description comprehensively relates to the changeset, detailing security hardening across multiple severity levels with clear file-by-file mapping.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

✨ Simplify code

Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pdf_craft/common/folder.py`:
- Around line 5-12: In assert_within_root, capture the caught ValueError and
explicitly chain the new ValueError to clarify intent: change the except block
to "except ValueError as err:" and re-raise the new ValueError using "raise
ValueError(f\"Path '{path}' must be within the safe root '{root}'\") from err"
so the original exception is preserved and Ruff B904 is satisfied.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 118f3d30-e5d3-4da0-bacc-315a610f96b3

📥 Commits

Reviewing files that changed from the base of the PR and between d593ad6 and bd8d0de.

📒 Files selected for processing (3)

pdf_craft/common/asset.py
pdf_craft/common/folder.py
pdf_craft/transform.py

coderabbitai · 2026-05-19T23:03:09Z

+def assert_within_root(path: Path, root: Path) -> None:
+    """Raise ValueError if path does not resolve to a location inside root."""
+    try:
+        path.resolve().relative_to(root.resolve())
+    except ValueError:
+        raise ValueError(
+            f"Path '{path}' must be within the safe root '{root}'"
+        )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Chain the exception to clarify intent.

The Ruff B904 warning is valid. When re-raising in an except block, use from None to indicate intentional replacement (suppressing the original traceback) or from err to preserve the chain.

Proposed fix

def assert_within_root(path: Path, root: Path) -> None: """Raise ValueError if path does not resolve to a location inside root.""" try: path.resolve().relative_to(root.resolve()) - except ValueError: + except ValueError as err: raise ValueError( f"Path '{path}' must be within the safe root '{root}'" - ) + ) from err

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def assert_within_root(path: Path, root: Path) -> None:

"""Raise ValueError if path does not resolve to a location inside root."""

try:

path.resolve().relative_to(root.resolve())

except ValueError:

raise ValueError(

f"Path '{path}' must be within the safe root '{root}'"

)

def assert_within_root(path: Path, root: Path) -> None:

"""Raise ValueError if path does not resolve to a location inside root."""

try:

path.resolve().relative_to(root.resolve())

except ValueError as err:

raise ValueError(

f"Path '{path}' must be within the safe root '{root}'"

) from err

🧰 Tools

🪛 Ruff (0.15.13)

[warning] 10-12: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pdf_craft/common/folder.py` around lines 5 - 12, In assert_within_root, capture the caught ValueError and explicitly chain the new ValueError to clarify intent: change the except block to "except ValueError as err:" and re-raise the new ValueError using "raise ValueError(f\"Path '{path}' must be within the safe root '{root}'\") from err" so the original exception is preserved and Ruff B904 is satisfied.

- Pin GitHub Actions to commit SHAs (actions/checkout, setup-python, abatilo/actions-poetry) to prevent tag-hijack supply chain attacks - Restrict LLM log files to owner-only (chmod 600) on POSIX; logs contain verbatim PDF text and must not be world-readable - Support PDF_CRAFT_API_KEY env var in scripts as a safer alternative to storing the key in format.json - Wrap PDF-derived text in <text>...</text> delimiters in LLM prompts and instruct the model to treat tagged content as data, reducing prompt injection impact from crafted PDFs - Replace xml.etree.ElementTree.fromstring with defusedxml to prevent XML entity expansion (billion-laughs) attacks on intermediate files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Validate det coordinates in _parse_det(): reject negative values and degenerate boxes (x1>=x2 or y1>=y2) before they reach Image.crop() - Cap computed DPI at 600 in _dpi_with_size() to prevent excessive Poppler memory allocation from PDFs with crafted page dimensions - Replace fixed .xml.tmp temp file with mkstemp() so concurrent writers get unique names and the write-then-rename stays atomic - Use full uuid4().hex (128-bit) as LLM context ID instead of the truncated 12-char (48-bit) variant Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pdf_craft/llm/core.py`:
- Around line 128-130: The os.chmod call that tightens log file permissions
(sys.platform != "win32" and os.chmod(file_path, 0o600)) should be made
best-effort so it cannot raise and break request handling during logger
creation: wrap the os.chmod(file_path, 0o600) call in a try/except (catch
OSError/PermissionError or Exception) and swallow the error (optionally log a
debug/warn via the module logger) so failures to change permissions do not
propagate out of the logger creation path in pdf_craft/llm/core.py.

In `@pdf_craft/toc/llm_analyser.py`:
- Around line 232-233: Escape PDF-extracted text before embedding inside the
"<text>...</text>" f-strings to prevent injection/termination (e.g., replace &,
<, > and specifically "</text>"). Create or use an escape helper (e.g.,
xml.sax.saxutils.escape or html.escape) and apply it to title.text (and the
other occurrences) in the f-strings: replace title.text with
escape_for_xml(title.text) in the f-string shown and in the similar
constructions around lines 401-402 and 412-413; also add the appropriate import
for the escape utility.

In `@scripts/gen_epub.py`:
- Line 28: The current call uses key=os.environ.get("PDF_CRAFT_API_KEY",
llm_config["key"]) which treats an empty string as a present value; change it to
use a truthy fallback so empty env vars don't override llm_config. Replace that
expression with key=(os.environ.get("PDF_CRAFT_API_KEY") or llm_config["key"])
so PDF_CRAFT_API_KEY is only used when non-empty; reference the env var name
PDF_CRAFT_API_KEY and the llm_config mapping to locate the change.

In `@scripts/gen_md.py`:
- Line 22: The current assignment of the PDF Craft API key uses
os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty env
var as a valid override; change it to use a truthy fallback so empty strings
don't override the config (e.g., use the env value only if truthy, otherwise
fallback to llm_config["key"]); update the expression around the key parameter
where key=... is set and ensure llm_config["key"] remains the default when the
environment variable is empty or missing.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bed78abe-bf64-4d31-b742-fbb52904f50f

📥 Commits

Reviewing files that changed from the base of the PR and between bd8d0de and f6e2249.

📒 Files selected for processing (8)

.github/workflows/merge-build.yml
.github/workflows/pr-check.yml
pdf_craft/common/xml.py
pdf_craft/llm/core.py
pdf_craft/toc/llm_analyser.py
pyproject.toml
scripts/gen_epub.py
scripts/gen_md.py

✅ Files skipped from review due to trivial changes (2)

.github/workflows/merge-build.yml
.github/workflows/pr-check.yml

coderabbitai · 2026-05-19T23:13:27Z

+        # Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.
+        if sys.platform != "win32":
+            os.chmod(file_path, 0o600)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make log permission tightening best-effort.

At Lines 129-130, os.chmod exceptions can bubble up and break request handling during logger creation. Permission hardening should not take down the LLM request path.

Proposed fix

# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text. if sys.platform != "win32": - os.chmod(file_path, 0o600) + try: + os.chmod(file_path, 0o600) + except OSError: + # Keep request path available even when chmod is unsupported. + pass

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.

if sys.platform != "win32":

os.chmod(file_path, 0o600)

# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.

if sys.platform != "win32":

try:

os.chmod(file_path, 0o600)

except OSError:

# Keep request path available even when chmod is unsupported.

pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pdf_craft/llm/core.py` around lines 128 - 130, The os.chmod call that tightens log file permissions (sys.platform != "win32" and os.chmod(file_path, 0o600)) should be made best-effort so it cannot raise and break request handling during logger creation: wrap the os.chmod(file_path, 0o600) call in a try/except (catch OSError/PermissionError or Exception) and swallow the error (optionally log a debug/warn via the module logger) so failures to change permissions do not propagate out of the logger creation path in pdf_craft/llm/core.py.

coderabbitai · 2026-05-19T23:13:28Z

+            f"  {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>"
        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Escape PDF text before wrapping with <text> delimiters.

Lines 232/401/412 inject raw source text inside <text>...</text>. If extracted text contains </text>, it can terminate the boundary and bypass the delimiter-based protection.

Proposed fix

+import html @@ +def _wrap_pdf_text(text: str) -> str: + return f"<text>{html.escape(text, quote=False)}</text>" + @@ - f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>" + f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] {_wrap_pdf_text(title.text)}" @@ - f" {idx}: [Indent:{toc_entry.indent:.1f}, Size:{toc_entry.font_size:.1f}] <text>{toc_entry.text}</text>" + f" {idx}: [Indent:{toc_entry.indent:.1f}, Size:{toc_entry.font_size:.1f}] {_wrap_pdf_text(toc_entry.text)}" @@ - prompt_lines.append(f" {letter_id}: <text>{title}</text>") + prompt_lines.append(f" {letter_id}: {_wrap_pdf_text(title)}")

Also applies to: 401-402, 412-413

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pdf_craft/toc/llm_analyser.py` around lines 232 - 233, Escape PDF-extracted text before embedding inside the "<text>...</text>" f-strings to prevent injection/termination (e.g., replace &, <, > and specifically "</text>"). Create or use an escape helper (e.g., xml.sax.saxutils.escape or html.escape) and apply it to title.text (and the other occurrences) in the f-strings: replace title.text with escape_for_xml(title.text) in the f-string shown and in the similar constructions around lines 401-402 and 412-413; also add the appropriate import for the escape utility.

coderabbitai · 2026-05-19T23:13:28Z

            llm_config = json.load(f)
        toc_llm = LLM(
-            key=llm_config["key"],
+            key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle empty PDF_CRAFT_API_KEY values safely.

At Line 28, PDF_CRAFT_API_KEY="" still wins over llm_config["key"], causing avoidable auth failures. Prefer a truthy fallback.

Proposed fix

- key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]), + key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),

key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/gen_epub.py` at line 28, The current call uses key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty string as a present value; change it to use a truthy fallback so empty env vars don't override llm_config. Replace that expression with key=(os.environ.get("PDF_CRAFT_API_KEY") or llm_config["key"]) so PDF_CRAFT_API_KEY is only used when non-empty; reference the env var name PDF_CRAFT_API_KEY and the llm_config mapping to locate the change.

coderabbitai · 2026-05-19T23:13:28Z

            llm_config = json.load(f)
        toc_llm = LLM(
-            key=llm_config["key"],
+            key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle empty PDF_CRAFT_API_KEY values safely.

At Line 22, an empty env var still overrides format.json, which can silently pass an invalid key downstream. Use a truthy fallback instead.

Proposed fix

- key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]), + key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),

key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/gen_md.py` at line 22, The current assignment of the PDF Craft API key uses os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty env var as a valid override; change it to use a truthy fallback so empty strings don't override the config (e.g., use the env value only if truthy, otherwise fallback to llm_config["key"]); update the expression around the key parameter where key=... is set and ensure llm_config["key"] remains the default when the environment variable is empty or missing.

- Replace O(n) tuple-per-comparison in has_repetitive_ngrams() with a Rabin-Karp rolling hash; ngram comparisons are now O(1) (hash) with O(n) string verification only on hash matches, eliminating the CPU spike risk from crafted OCR output - Add 100 MB size gate in read_xml() before reading the file into memory to prevent OOM from multi-thousand-page PDF intermediate files - Remove json-repair dependency from both LLM validators; malformed JSON in the RESULT section now triggers a retry with a descriptive error instead of being silently reconstructed, closing the prompt-injection masking vector Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pdf_craft/common/xml.py`:
- Around line 29-34: The size check using file_path.stat() has a TOCTOU gap;
instead open the file and read at most _MAX_XML_BYTES + 1 bytes (e.g. use
file_path.open(...) or open(file_path, "r", encoding="utf-8") and read with a
size limit) then validate the buffer length and raise the same ValueError if it
exceeds _MAX_XML_BYTES, and finally call fromstring on the read buffer; update
the logic that currently uses file_path.stat(), file_path.read_text(), and
fromstring to use the buffered content and _MAX_XML_BYTES for the enforcement.

In `@pdf_craft/pdf/page_ref.py`:
- Around line 119-123: The DPI computation must reject or clamp non-positive
inputs: before computing `computed` in the DPI calculation (where `file_size`,
`width_inch`, `height_inch` and constants `_BYTES_PER_PIXEL`,
`_PNG_COMPRESSION_RATIO`, `_MAX_DPI` are used), validate that `file_size > 0`
and `width_inch > 0` and `height_inch > 0`; if any are non-positive, either
raise a clear ValueError or return a clamped minimum DPI (e.g., 1) so `render()`
never receives 0 or triggers a divide-by-zero — then proceed to compute
`computed` and return `min(max(computed, 1), _MAX_DPI)` (or raise) to ensure a
valid DPI is always produced.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b326c661-e1fe-49c3-b02c-281815e02a9c

📥 Commits

Reviewing files that changed from the base of the PR and between f6e2249 and d98411f.

📒 Files selected for processing (7)

pdf_craft/common/xml.py
pdf_craft/llm/context.py
pdf_craft/pdf/ngrams.py
pdf_craft/pdf/page_ref.py
pdf_craft/sequence/chapter.py
pdf_craft/toc/llm_analyser.py
pyproject.toml

🚧 Files skipped from review as they are similar to previous changes (1)

pyproject.toml

coderabbitai · 2026-05-19T23:21:33Z

+        size = file_path.stat().st_size
+        if size > _MAX_XML_BYTES:
+            raise ValueError(
+                f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}"
+            )
        return fromstring(file_path.read_text(encoding="utf-8"))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the XML byte cap on the bytes actually read.

Line 29 checks st_size before the file is opened, then Line 34 does a separate full read_text(). That leaves a TOCTOU gap where the file can be swapped or grown after the size check, defeating the new 100 MB guard and still pulling an oversized payload into memory. Read at most _MAX_XML_BYTES + 1 from the opened file and validate that buffer instead.

Proposed fix

def read_xml(file_path: Path) -> Element: try: - size = file_path.stat().st_size - if size > _MAX_XML_BYTES: + with file_path.open("rb") as f: + xml_bytes = f.read(_MAX_XML_BYTES + 1) + if len(xml_bytes) > _MAX_XML_BYTES: raise ValueError( - f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}" + f"XML file exceeds size limit (> {_MAX_XML_BYTES} bytes): {file_path}" ) - return fromstring(file_path.read_text(encoding="utf-8")) + return fromstring(xml_bytes) except ValueError: raise

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

size = file_path.stat().st_size

if size > _MAX_XML_BYTES:

raise ValueError(

f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}"

)

return fromstring(file_path.read_text(encoding="utf-8"))

def read_xml(file_path: Path) -> Element:

try:

with file_path.open("rb") as f:

xml_bytes = f.read(_MAX_XML_BYTES + 1)

if len(xml_bytes) > _MAX_XML_BYTES:

raise ValueError(

f"XML file exceeds size limit (> {_MAX_XML_BYTES} bytes): {file_path}"

)

return fromstring(xml_bytes)

except ValueError:

raise

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pdf_craft/common/xml.py` around lines 29 - 34, The size check using file_path.stat() has a TOCTOU gap; instead open the file and read at most _MAX_XML_BYTES + 1 bytes (e.g. use file_path.open(...) or open(file_path, "r", encoding="utf-8") and read with a size limit) then validate the buffer length and raise the same ValueError if it exceeds _MAX_XML_BYTES, and finally call fromstring on the read buffer; update the logic that currently uses file_path.stat(), file_path.read_text(), and fromstring to use the buffered content and _MAX_XML_BYTES for the enforcement.

coderabbitai · 2026-05-19T23:21:33Z

+        computed = (
            file_size
            / (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)
        ) ** 0.5
+        return min(computed, _MAX_DPI)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-positive inputs before computing the capped DPI.

This only caps the upper bound. If file_size <= 0 or either page dimension is non-positive, the method can still divide by zero or return a value that rounds down to 0, which render() then forwards as an invalid DPI. Fail fast here, or clamp to a minimum of 1, so malformed PDFs/config don’t surface as opaque render errors.

Proposed fix

def _dpi_with_size( self, file_size: int, width_inch: float, height_inch: float ) -> float: # Formula: file_size = width_px * height_px * bytes_per_pixel * compression_ratio # where width_px = width_inch * dpi, height_px = height_inch * dpi + if file_size <= 0: + raise ValueError(f"file_size must be positive, got {file_size}") + if width_inch <= 0 or height_inch <= 0: + raise ValueError( + f"page dimensions must be positive, got {width_inch}x{height_inch}" + ) computed = ( file_size / (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO) ) ** 0.5 - return min(computed, _MAX_DPI) + return max(1.0, min(computed, _MAX_DPI))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

computed = (

file_size

/ (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)

) ** 0.5

return min(computed, _MAX_DPI)

if file_size <= 0:

raise ValueError(f"file_size must be positive, got {file_size}")

if width_inch <= 0 or height_inch <= 0:

raise ValueError(

f"page dimensions must be positive, got {width_inch}x{height_inch}"

)

computed = (

file_size

/ (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)

) ** 0.5

return max(1.0, min(computed, _MAX_DPI))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pdf_craft/pdf/page_ref.py` around lines 119 - 123, The DPI computation must reject or clamp non-positive inputs: before computing `computed` in the DPI calculation (where `file_size`, `width_inch`, `height_inch` and constants `_BYTES_PER_PIXEL`, `_PNG_COMPRESSION_RATIO`, `_MAX_DPI` are used), validate that `file_size > 0` and `width_inch > 0` and `height_inch > 0`; if any are non-positive, either raise a clear ValueError or return a clamped minimum DPI (e.g., 1) so `render()` never receives 0 or triggers a divide-by-zero — then proceed to compute `computed` and return `min(max(computed, 1), _MAX_DPI)` (or raise) to ensure a valid DPI is always produced.

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Mistermopilas and others added 2 commits May 19, 2026 18:08

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add safe_root path traversal guard for analysing_path#361

fix: add safe_root path traversal guard for analysing_path#361
shamario13 wants to merge 4 commits into
oomol-lab:mainfrom
shamario13:fix/path-traversal-safe-root

shamario13 commented May 19, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>"
		)

	key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
	key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],

Conversation

shamario13 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

HIGH

Path traversal via user-supplied analysing_path

MEDIUM

Unpinned GitHub Actions (supply chain)

LLM log files world-readable

API key in plaintext config file

LLM prompt injection via OCR-extracted text

XML entity expansion (billion laughs)

LOW

Unvalidated det coordinates reach Image.crop()

DPI amplification from crafted page dimensions

Fixed-name temp file race in save_xml()

Truncated UUID context ID

INFORMATIONAL

O(n³) worst-case in has_repetitive_ngrams()

No size limit before XML parsing

json-repair silently reconstructed malformed LLM output

Test plan

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Sequence Diagram(s)

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shamario13 commented May 19, 2026 •

edited

Loading

Path traversal via user-supplied `analysing_path`

Unvalidated det coordinates reach `Image.crop()`

Fixed-name temp file race in `save_xml()`

O(n³) worst-case in `has_repetitive_ngrams()`

`json-repair` silently reconstructed malformed LLM output

coderabbitai Bot commented May 19, 2026 •

edited

Loading