Skip to content

fix: add safe_root path traversal guard for analysing_path#361

Open
shamario13 wants to merge 4 commits into
oomol-lab:mainfrom
shamario13:fix/path-traversal-safe-root
Open

fix: add safe_root path traversal guard for analysing_path#361
shamario13 wants to merge 4 commits into
oomol-lab:mainfrom
shamario13:fix/path-traversal-safe-root

Conversation

@shamario13

@shamario13 shamario13 commented May 19, 2026

Copy link
Copy Markdown

Summary

Comprehensive security hardening based on a full audit of the codebase. Fixes span four severity levels: HIGH, MEDIUM, LOW, and INFORMATIONAL.


HIGH

Path traversal via user-supplied analysing_path

Files: pdf_craft/common/folder.py, pdf_craft/common/asset.py, pdf_craft/transform.py

When pdf-craft is wrapped in a service that accepts a user-controlled analysing_path, an attacker could point it at an arbitrary writable location (e.g. /var/www/html/), causing the OCR pipeline to write files outside the intended working directory.

  • Added assert_within_root(path, root) helper that resolves both paths before comparing (symlinks and .. components cannot bypass it)
  • Extended EnsureFolder with an optional safe_root parameter that validates analysing_path before any directory is created
  • Added safe_root parameter to transform_markdown and transform_epub (backward-compatible, defaults to None)
  • Resolved AssetHub._asset_path to an absolute canonical path at construction time
# Usage in a service
transform.transform_markdown(
    pdf_path=user_pdf,
    markdown_path=out_md,
    analysing_path=user_supplied_path,  # untrusted
    safe_root=/var/app/workdir,       # enforced boundary
)
# raises ValueError if user_supplied_path resolves outside /var/app/workdir

MEDIUM

Unpinned GitHub Actions (supply chain)

Files: .github/workflows/merge-build.yml, .github/workflows/pr-check.yml

Floating version tags (@v4, @v5, @v3) allow a tag-hijack on any of these actions to run attacker-controlled code in CI with full GITHUB_TOKEN access. Pinned all three to full commit SHAs:

  • actions/checkout34e114876b0b11c390a56381ad16ebd13914f8d5 (v4)
  • actions/setup-pythona26af69be951a213d495a4c3e4e4022e16d87065 (v5)
  • abatilo/actions-poetryfd0e6716a0de25ef6ade151b8b53190b0376acfd (v3)

LLM log files world-readable

File: pdf_craft/llm/core.py

Log files contain verbatim PDF text and were created with the process umask (typically world-readable on multi-user systems). Added os.chmod(file_path, 0o600) after FileHandler creation on POSIX systems.

API key in plaintext config file

Files: scripts/gen_md.py, scripts/gen_epub.py

Both scripts now check the PDF_CRAFT_API_KEY environment variable first and fall back to format.json only if it is not set, providing a safer alternative to storing credentials on disk.

LLM prompt injection via OCR-extracted text

File: pdf_craft/toc/llm_analyser.py

PDF-extracted heading and TOC text was embedded verbatim into LLM prompts. Wrapped all PDF-derived text in <text>...</text> delimiters and updated both system prompts to instruct the model that only content inside those tags is PDF data.

XML entity expansion (billion laughs)

Files: pdf_craft/common/xml.py, pyproject.toml

Replaced xml.etree.ElementTree.fromstring with defusedxml.ElementTree.fromstring throughout. Added defusedxml>=0.7.1 as a dependency.


LOW

Unvalidated det coordinates reach Image.crop()

File: pdf_craft/sequence/chapter.py

_parse_det() now rejects negative values and degenerate bounding boxes (x1 >= x2 or y1 >= y2) before they reach Pillow's Image.crop().

DPI amplification from crafted page dimensions

File: pdf_craft/pdf/page_ref.py

A PDF with unrealistically large declared page dimensions could produce an unbounded DPI value, causing Poppler to attempt excessive memory allocation. Added _MAX_DPI = 600 cap in _dpi_with_size().

Fixed-name temp file race in save_xml()

File: pdf_craft/common/xml.py

Replaced the fixed .xml.tmp suffix with tempfile.mkstemp(dir=file_path.parent), giving each writer a unique random name while keeping the write-then-rename atomic (same directory = same filesystem).

Truncated UUID context ID

File: pdf_craft/llm/context.py

Changed uuid4().hex[:12] (48-bit) to uuid4().hex (128-bit). No reason to truncate.


INFORMATIONAL

O(n³) worst-case in has_repetitive_ngrams()

File: pdf_craft/pdf/ngrams.py

Replaced per-comparison tuple(chars[i:i+n]) creation (O(n) per comparison) with a Rabin-Karp rolling hash. N-gram comparisons are now O(1) (hash check) with O(n) string verification only on hash matches, eliminating the CPU spike risk from crafted OCR output.

No size limit before XML parsing

File: pdf_craft/common/xml.py

Added a _MAX_XML_BYTES = 100 MB gate in read_xml() that checks file_path.stat().st_size before reading the file into memory, preventing OOM from unexpectedly large intermediate files.

json-repair silently reconstructed malformed LLM output

File: pdf_craft/toc/llm_analyser.py

Removed the json-repair dependency entirely. Both _validate_title_response and _validate_toc_response now use strict json.loads() — malformed JSON in the RESULT section triggers a model retry with a descriptive error message instead of being silently reconstructed (which could mask prompt-injection payloads). Also removed json-repair from pyproject.toml.


Test plan

  • transform_markdown / transform_epub with analysing_path inside safe_root — works normally
  • Same with analysing_path outside safe_root — raises ValueError before any filesystem writes
  • Same without safe_root — no behavioural change (backward-compatible)
  • PDF with unrealistically large page dimensions — DPI capped at 600, no OOM
  • XML det attribute with negative or degenerate values — ValueError raised at parse time
  • LLM response with malformed JSON in RESULT — triggers retry with clear error, not silent repair
  • Log files created with log_dir_path set — chmod 600 on POSIX, owner-only readable
  • PDF_CRAFT_API_KEY env var set — scripts use it instead of format.json key

🤖 Generated with Claude Code

Introduces assert_within_root() in EnsureFolder and a safe_root
parameter on transform_markdown/transform_epub so callers wrapping
the library in a service can enforce that the working directory stays
inside an expected filesystem boundary. Also resolves AssetHub._asset_path
to an absolute canonical path at construction time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 19, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

This PR adds multiple safety and behavior changes across the codebase: it enforces path containment via a new assert_within_root and EnsureFolder.safe_root (wired through Transform.transform_markdown/transform_epub), normalizes AssetHub paths with Path.resolve(), tightens XML parsing and atomic save behavior, implements rolling-hash n-gram repetition detection, caps computed DPI, validates rectangle coordinates, updates LLM prompts and strict JSON parsing, allows runtime LLM key overrides, tightens per-request logfile permissions, and pins CI action versions.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Transform
  participant EnsureFolder
  participant assert_within_root
  Client->>Transform: transform_markdown(..., safe_root)
  Transform->>EnsureFolder: EnsureFolder(path=..., safe_root=to_path(safe_root))
  Client->>EnsureFolder: __enter__()
  EnsureFolder->>assert_within_root: assert_within_root(path, safe_root)
  assert_within_root->>assert_within_root: resolve & check containment
  alt contained
    assert_within_root-->>EnsureFolder: validation passes
    EnsureFolder-->>Transform: context ready
  else not contained
    assert_within_root-->>EnsureFolder: raise ValueError
    EnsureFolder-->>Transform: propagate error
  end
Loading

Possibly related PRs

  • oomol-lab/pdf-craft#274: Also modifies Transform.transform_epub signature and plumbing (overlaps at the same method level).
  • oomol-lab/pdf-craft#266: Changes the EPUB transformation flow including Transform.transform_epub, intersecting at the same function.
  • oomol-lab/pdf-craft#339: Related changes to pdf_craft/toc/llm_analyser prompt construction and response handling.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title follows the required format with type and subject, and is closely related to the main security improvement: adding a safe_root path traversal guard.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description comprehensively relates to the changeset, detailing security hardening across multiple severity levels with clear file-by-file mapping.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pdf_craft/common/folder.py`:
- Around line 5-12: In assert_within_root, capture the caught ValueError and
explicitly chain the new ValueError to clarify intent: change the except block
to "except ValueError as err:" and re-raise the new ValueError using "raise
ValueError(f\"Path '{path}' must be within the safe root '{root}'\") from err"
so the original exception is preserved and Ruff B904 is satisfied.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 118f3d30-e5d3-4da0-bacc-315a610f96b3

📥 Commits

Reviewing files that changed from the base of the PR and between d593ad6 and bd8d0de.

📒 Files selected for processing (3)
  • pdf_craft/common/asset.py
  • pdf_craft/common/folder.py
  • pdf_craft/transform.py

Comment on lines +5 to +12
def assert_within_root(path: Path, root: Path) -> None:
"""Raise ValueError if path does not resolve to a location inside root."""
try:
path.resolve().relative_to(root.resolve())
except ValueError:
raise ValueError(
f"Path '{path}' must be within the safe root '{root}'"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Chain the exception to clarify intent.

The Ruff B904 warning is valid. When re-raising in an except block, use from None to indicate intentional replacement (suppressing the original traceback) or from err to preserve the chain.

Proposed fix
 def assert_within_root(path: Path, root: Path) -> None:
     """Raise ValueError if path does not resolve to a location inside root."""
     try:
         path.resolve().relative_to(root.resolve())
-    except ValueError:
+    except ValueError as err:
         raise ValueError(
             f"Path '{path}' must be within the safe root '{root}'"
-        )
+        ) from err
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def assert_within_root(path: Path, root: Path) -> None:
"""Raise ValueError if path does not resolve to a location inside root."""
try:
path.resolve().relative_to(root.resolve())
except ValueError:
raise ValueError(
f"Path '{path}' must be within the safe root '{root}'"
)
def assert_within_root(path: Path, root: Path) -> None:
"""Raise ValueError if path does not resolve to a location inside root."""
try:
path.resolve().relative_to(root.resolve())
except ValueError as err:
raise ValueError(
f"Path '{path}' must be within the safe root '{root}'"
) from err
🧰 Tools
🪛 Ruff (0.15.13)

[warning] 10-12: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pdf_craft/common/folder.py` around lines 5 - 12, In assert_within_root,
capture the caught ValueError and explicitly chain the new ValueError to clarify
intent: change the except block to "except ValueError as err:" and re-raise the
new ValueError using "raise ValueError(f\"Path '{path}' must be within the safe
root '{root}'\") from err" so the original exception is preserved and Ruff B904
is satisfied.

Mistermopilas and others added 2 commits May 19, 2026 18:08
- Pin GitHub Actions to commit SHAs (actions/checkout, setup-python,
  abatilo/actions-poetry) to prevent tag-hijack supply chain attacks
- Restrict LLM log files to owner-only (chmod 600) on POSIX; logs
  contain verbatim PDF text and must not be world-readable
- Support PDF_CRAFT_API_KEY env var in scripts as a safer alternative
  to storing the key in format.json
- Wrap PDF-derived text in <text>...</text> delimiters in LLM prompts
  and instruct the model to treat tagged content as data, reducing
  prompt injection impact from crafted PDFs
- Replace xml.etree.ElementTree.fromstring with defusedxml to prevent
  XML entity expansion (billion-laughs) attacks on intermediate files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Validate det coordinates in _parse_det(): reject negative values and
  degenerate boxes (x1>=x2 or y1>=y2) before they reach Image.crop()
- Cap computed DPI at 600 in _dpi_with_size() to prevent excessive
  Poppler memory allocation from PDFs with crafted page dimensions
- Replace fixed .xml.tmp temp file with mkstemp() so concurrent writers
  get unique names and the write-then-rename stays atomic
- Use full uuid4().hex (128-bit) as LLM context ID instead of the
  truncated 12-char (48-bit) variant

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pdf_craft/llm/core.py`:
- Around line 128-130: The os.chmod call that tightens log file permissions
(sys.platform != "win32" and os.chmod(file_path, 0o600)) should be made
best-effort so it cannot raise and break request handling during logger
creation: wrap the os.chmod(file_path, 0o600) call in a try/except (catch
OSError/PermissionError or Exception) and swallow the error (optionally log a
debug/warn via the module logger) so failures to change permissions do not
propagate out of the logger creation path in pdf_craft/llm/core.py.

In `@pdf_craft/toc/llm_analyser.py`:
- Around line 232-233: Escape PDF-extracted text before embedding inside the
"<text>...</text>" f-strings to prevent injection/termination (e.g., replace &,
<, > and specifically "</text>"). Create or use an escape helper (e.g.,
xml.sax.saxutils.escape or html.escape) and apply it to title.text (and the
other occurrences) in the f-strings: replace title.text with
escape_for_xml(title.text) in the f-string shown and in the similar
constructions around lines 401-402 and 412-413; also add the appropriate import
for the escape utility.

In `@scripts/gen_epub.py`:
- Line 28: The current call uses key=os.environ.get("PDF_CRAFT_API_KEY",
llm_config["key"]) which treats an empty string as a present value; change it to
use a truthy fallback so empty env vars don't override llm_config. Replace that
expression with key=(os.environ.get("PDF_CRAFT_API_KEY") or llm_config["key"])
so PDF_CRAFT_API_KEY is only used when non-empty; reference the env var name
PDF_CRAFT_API_KEY and the llm_config mapping to locate the change.

In `@scripts/gen_md.py`:
- Line 22: The current assignment of the PDF Craft API key uses
os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty env
var as a valid override; change it to use a truthy fallback so empty strings
don't override the config (e.g., use the env value only if truthy, otherwise
fallback to llm_config["key"]); update the expression around the key parameter
where key=... is set and ensure llm_config["key"] remains the default when the
environment variable is empty or missing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bed78abe-bf64-4d31-b742-fbb52904f50f

📥 Commits

Reviewing files that changed from the base of the PR and between bd8d0de and f6e2249.

📒 Files selected for processing (8)
  • .github/workflows/merge-build.yml
  • .github/workflows/pr-check.yml
  • pdf_craft/common/xml.py
  • pdf_craft/llm/core.py
  • pdf_craft/toc/llm_analyser.py
  • pyproject.toml
  • scripts/gen_epub.py
  • scripts/gen_md.py
✅ Files skipped from review due to trivial changes (2)
  • .github/workflows/merge-build.yml
  • .github/workflows/pr-check.yml

Comment thread pdf_craft/llm/core.py
Comment on lines +128 to +130
# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.
if sys.platform != "win32":
os.chmod(file_path, 0o600)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make log permission tightening best-effort.

At Lines 129-130, os.chmod exceptions can bubble up and break request handling during logger creation. Permission hardening should not take down the LLM request path.

Proposed fix
         # Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.
         if sys.platform != "win32":
-            os.chmod(file_path, 0o600)
+            try:
+                os.chmod(file_path, 0o600)
+            except OSError:
+                # Keep request path available even when chmod is unsupported.
+                pass
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.
if sys.platform != "win32":
os.chmod(file_path, 0o600)
# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.
if sys.platform != "win32":
try:
os.chmod(file_path, 0o600)
except OSError:
# Keep request path available even when chmod is unsupported.
pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pdf_craft/llm/core.py` around lines 128 - 130, The os.chmod call that
tightens log file permissions (sys.platform != "win32" and os.chmod(file_path,
0o600)) should be made best-effort so it cannot raise and break request handling
during logger creation: wrap the os.chmod(file_path, 0o600) call in a try/except
(catch OSError/PermissionError or Exception) and swallow the error (optionally
log a debug/warn via the module logger) so failures to change permissions do not
propagate out of the logger creation path in pdf_craft/llm/core.py.

Comment on lines +232 to 233
f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Escape PDF text before wrapping with <text> delimiters.

Lines 232/401/412 inject raw source text inside <text>...</text>. If extracted text contains </text>, it can terminate the boundary and bypass the delimiter-based protection.

Proposed fix
+import html
@@
+def _wrap_pdf_text(text: str) -> str:
+    return f"<text>{html.escape(text, quote=False)}</text>"
+
@@
-            f"  {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>"
+            f"  {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] {_wrap_pdf_text(title.text)}"
@@
-            f"  {idx}: [Indent:{toc_entry.indent:.1f}, Size:{toc_entry.font_size:.1f}] <text>{toc_entry.text}</text>"
+            f"  {idx}: [Indent:{toc_entry.indent:.1f}, Size:{toc_entry.font_size:.1f}] {_wrap_pdf_text(toc_entry.text)}"
@@
-        prompt_lines.append(f"  {letter_id}: <text>{title}</text>")
+        prompt_lines.append(f"  {letter_id}: {_wrap_pdf_text(title)}")

Also applies to: 401-402, 412-413

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pdf_craft/toc/llm_analyser.py` around lines 232 - 233, Escape PDF-extracted
text before embedding inside the "<text>...</text>" f-strings to prevent
injection/termination (e.g., replace &, <, > and specifically "</text>"). Create
or use an escape helper (e.g., xml.sax.saxutils.escape or html.escape) and apply
it to title.text (and the other occurrences) in the f-strings: replace
title.text with escape_for_xml(title.text) in the f-string shown and in the
similar constructions around lines 401-402 and 412-413; also add the appropriate
import for the escape utility.

Comment thread scripts/gen_epub.py
llm_config = json.load(f)
toc_llm = LLM(
key=llm_config["key"],
key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle empty PDF_CRAFT_API_KEY values safely.

At Line 28, PDF_CRAFT_API_KEY="" still wins over llm_config["key"], causing avoidable auth failures. Prefer a truthy fallback.

Proposed fix
-            key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
+            key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/gen_epub.py` at line 28, The current call uses
key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty
string as a present value; change it to use a truthy fallback so empty env vars
don't override llm_config. Replace that expression with
key=(os.environ.get("PDF_CRAFT_API_KEY") or llm_config["key"]) so
PDF_CRAFT_API_KEY is only used when non-empty; reference the env var name
PDF_CRAFT_API_KEY and the llm_config mapping to locate the change.

Comment thread scripts/gen_md.py
llm_config = json.load(f)
toc_llm = LLM(
key=llm_config["key"],
key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle empty PDF_CRAFT_API_KEY values safely.

At Line 22, an empty env var still overrides format.json, which can silently pass an invalid key downstream. Use a truthy fallback instead.

Proposed fix
-            key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
+            key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/gen_md.py` at line 22, The current assignment of the PDF Craft API
key uses os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an
empty env var as a valid override; change it to use a truthy fallback so empty
strings don't override the config (e.g., use the env value only if truthy,
otherwise fallback to llm_config["key"]); update the expression around the key
parameter where key=... is set and ensure llm_config["key"] remains the default
when the environment variable is empty or missing.

- Replace O(n) tuple-per-comparison in has_repetitive_ngrams() with a
  Rabin-Karp rolling hash; ngram comparisons are now O(1) (hash) with
  O(n) string verification only on hash matches, eliminating the CPU
  spike risk from crafted OCR output
- Add 100 MB size gate in read_xml() before reading the file into memory
  to prevent OOM from multi-thousand-page PDF intermediate files
- Remove json-repair dependency from both LLM validators; malformed JSON
  in the RESULT section now triggers a retry with a descriptive error
  instead of being silently reconstructed, closing the prompt-injection
  masking vector

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pdf_craft/common/xml.py`:
- Around line 29-34: The size check using file_path.stat() has a TOCTOU gap;
instead open the file and read at most _MAX_XML_BYTES + 1 bytes (e.g. use
file_path.open(...) or open(file_path, "r", encoding="utf-8") and read with a
size limit) then validate the buffer length and raise the same ValueError if it
exceeds _MAX_XML_BYTES, and finally call fromstring on the read buffer; update
the logic that currently uses file_path.stat(), file_path.read_text(), and
fromstring to use the buffered content and _MAX_XML_BYTES for the enforcement.

In `@pdf_craft/pdf/page_ref.py`:
- Around line 119-123: The DPI computation must reject or clamp non-positive
inputs: before computing `computed` in the DPI calculation (where `file_size`,
`width_inch`, `height_inch` and constants `_BYTES_PER_PIXEL`,
`_PNG_COMPRESSION_RATIO`, `_MAX_DPI` are used), validate that `file_size > 0`
and `width_inch > 0` and `height_inch > 0`; if any are non-positive, either
raise a clear ValueError or return a clamped minimum DPI (e.g., 1) so `render()`
never receives 0 or triggers a divide-by-zero — then proceed to compute
`computed` and return `min(max(computed, 1), _MAX_DPI)` (or raise) to ensure a
valid DPI is always produced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b326c661-e1fe-49c3-b02c-281815e02a9c

📥 Commits

Reviewing files that changed from the base of the PR and between f6e2249 and d98411f.

📒 Files selected for processing (7)
  • pdf_craft/common/xml.py
  • pdf_craft/llm/context.py
  • pdf_craft/pdf/ngrams.py
  • pdf_craft/pdf/page_ref.py
  • pdf_craft/sequence/chapter.py
  • pdf_craft/toc/llm_analyser.py
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml

Comment thread pdf_craft/common/xml.py
Comment on lines +29 to 34
size = file_path.stat().st_size
if size > _MAX_XML_BYTES:
raise ValueError(
f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}"
)
return fromstring(file_path.read_text(encoding="utf-8"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the XML byte cap on the bytes actually read.

Line 29 checks st_size before the file is opened, then Line 34 does a separate full read_text(). That leaves a TOCTOU gap where the file can be swapped or grown after the size check, defeating the new 100 MB guard and still pulling an oversized payload into memory. Read at most _MAX_XML_BYTES + 1 from the opened file and validate that buffer instead.

Proposed fix
 def read_xml(file_path: Path) -> Element:
     try:
-        size = file_path.stat().st_size
-        if size > _MAX_XML_BYTES:
+        with file_path.open("rb") as f:
+            xml_bytes = f.read(_MAX_XML_BYTES + 1)
+        if len(xml_bytes) > _MAX_XML_BYTES:
             raise ValueError(
-                f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}"
+                f"XML file exceeds size limit (> {_MAX_XML_BYTES} bytes): {file_path}"
             )
-        return fromstring(file_path.read_text(encoding="utf-8"))
+        return fromstring(xml_bytes)
     except ValueError:
         raise
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
size = file_path.stat().st_size
if size > _MAX_XML_BYTES:
raise ValueError(
f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}"
)
return fromstring(file_path.read_text(encoding="utf-8"))
def read_xml(file_path: Path) -> Element:
try:
with file_path.open("rb") as f:
xml_bytes = f.read(_MAX_XML_BYTES + 1)
if len(xml_bytes) > _MAX_XML_BYTES:
raise ValueError(
f"XML file exceeds size limit (> {_MAX_XML_BYTES} bytes): {file_path}"
)
return fromstring(xml_bytes)
except ValueError:
raise
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pdf_craft/common/xml.py` around lines 29 - 34, The size check using
file_path.stat() has a TOCTOU gap; instead open the file and read at most
_MAX_XML_BYTES + 1 bytes (e.g. use file_path.open(...) or open(file_path, "r",
encoding="utf-8") and read with a size limit) then validate the buffer length
and raise the same ValueError if it exceeds _MAX_XML_BYTES, and finally call
fromstring on the read buffer; update the logic that currently uses
file_path.stat(), file_path.read_text(), and fromstring to use the buffered
content and _MAX_XML_BYTES for the enforcement.

Comment thread pdf_craft/pdf/page_ref.py
Comment on lines +119 to +123
computed = (
file_size
/ (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)
) ** 0.5
return min(computed, _MAX_DPI)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-positive inputs before computing the capped DPI.

This only caps the upper bound. If file_size <= 0 or either page dimension is non-positive, the method can still divide by zero or return a value that rounds down to 0, which render() then forwards as an invalid DPI. Fail fast here, or clamp to a minimum of 1, so malformed PDFs/config don’t surface as opaque render errors.

Proposed fix
     def _dpi_with_size(
         self, file_size: int, width_inch: float, height_inch: float
     ) -> float:
         # Formula: file_size = width_px * height_px * bytes_per_pixel * compression_ratio
         # where width_px = width_inch * dpi, height_px = height_inch * dpi
+        if file_size <= 0:
+            raise ValueError(f"file_size must be positive, got {file_size}")
+        if width_inch <= 0 or height_inch <= 0:
+            raise ValueError(
+                f"page dimensions must be positive, got {width_inch}x{height_inch}"
+            )
         computed = (
             file_size
             / (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)
         ) ** 0.5
-        return min(computed, _MAX_DPI)
+        return max(1.0, min(computed, _MAX_DPI))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
computed = (
file_size
/ (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)
) ** 0.5
return min(computed, _MAX_DPI)
if file_size <= 0:
raise ValueError(f"file_size must be positive, got {file_size}")
if width_inch <= 0 or height_inch <= 0:
raise ValueError(
f"page dimensions must be positive, got {width_inch}x{height_inch}"
)
computed = (
file_size
/ (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)
) ** 0.5
return max(1.0, min(computed, _MAX_DPI))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pdf_craft/pdf/page_ref.py` around lines 119 - 123, The DPI computation must
reject or clamp non-positive inputs: before computing `computed` in the DPI
calculation (where `file_size`, `width_inch`, `height_inch` and constants
`_BYTES_PER_PIXEL`, `_PNG_COMPRESSION_RATIO`, `_MAX_DPI` are used), validate
that `file_size > 0` and `width_inch > 0` and `height_inch > 0`; if any are
non-positive, either raise a clear ValueError or return a clamped minimum DPI
(e.g., 1) so `render()` never receives 0 or triggers a divide-by-zero — then
proceed to compute `computed` and return `min(max(computed, 1), _MAX_DPI)` (or
raise) to ensure a valid DPI is always produced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants