fix: add safe_root path traversal guard for analysing_path#361
fix: add safe_root path traversal guard for analysing_path#361shamario13 wants to merge 4 commits into
Conversation
Introduces assert_within_root() in EnsureFolder and a safe_root parameter on transform_markdown/transform_epub so callers wrapping the library in a service can enforce that the working directory stays inside an expected filesystem boundary. Also resolves AssetHub._asset_path to an absolute canonical path at construction time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WalkthroughThis PR adds multiple safety and behavior changes across the codebase: it enforces path containment via a new assert_within_root and EnsureFolder.safe_root (wired through Transform.transform_markdown/transform_epub), normalizes AssetHub paths with Path.resolve(), tightens XML parsing and atomic save behavior, implements rolling-hash n-gram repetition detection, caps computed DPI, validates rectangle coordinates, updates LLM prompts and strict JSON parsing, allows runtime LLM key overrides, tightens per-request logfile permissions, and pins CI action versions. Sequence Diagram(s)sequenceDiagram
participant Client
participant Transform
participant EnsureFolder
participant assert_within_root
Client->>Transform: transform_markdown(..., safe_root)
Transform->>EnsureFolder: EnsureFolder(path=..., safe_root=to_path(safe_root))
Client->>EnsureFolder: __enter__()
EnsureFolder->>assert_within_root: assert_within_root(path, safe_root)
assert_within_root->>assert_within_root: resolve & check containment
alt contained
assert_within_root-->>EnsureFolder: validation passes
EnsureFolder-->>Transform: context ready
else not contained
assert_within_root-->>EnsureFolder: raise ValueError
EnsureFolder-->>Transform: propagate error
end
Possibly related PRs
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pdf_craft/common/folder.py`:
- Around line 5-12: In assert_within_root, capture the caught ValueError and
explicitly chain the new ValueError to clarify intent: change the except block
to "except ValueError as err:" and re-raise the new ValueError using "raise
ValueError(f\"Path '{path}' must be within the safe root '{root}'\") from err"
so the original exception is preserved and Ruff B904 is satisfied.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 118f3d30-e5d3-4da0-bacc-315a610f96b3
📒 Files selected for processing (3)
pdf_craft/common/asset.pypdf_craft/common/folder.pypdf_craft/transform.py
| def assert_within_root(path: Path, root: Path) -> None: | ||
| """Raise ValueError if path does not resolve to a location inside root.""" | ||
| try: | ||
| path.resolve().relative_to(root.resolve()) | ||
| except ValueError: | ||
| raise ValueError( | ||
| f"Path '{path}' must be within the safe root '{root}'" | ||
| ) |
There was a problem hiding this comment.
Chain the exception to clarify intent.
The Ruff B904 warning is valid. When re-raising in an except block, use from None to indicate intentional replacement (suppressing the original traceback) or from err to preserve the chain.
Proposed fix
def assert_within_root(path: Path, root: Path) -> None:
"""Raise ValueError if path does not resolve to a location inside root."""
try:
path.resolve().relative_to(root.resolve())
- except ValueError:
+ except ValueError as err:
raise ValueError(
f"Path '{path}' must be within the safe root '{root}'"
- )
+ ) from err📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def assert_within_root(path: Path, root: Path) -> None: | |
| """Raise ValueError if path does not resolve to a location inside root.""" | |
| try: | |
| path.resolve().relative_to(root.resolve()) | |
| except ValueError: | |
| raise ValueError( | |
| f"Path '{path}' must be within the safe root '{root}'" | |
| ) | |
| def assert_within_root(path: Path, root: Path) -> None: | |
| """Raise ValueError if path does not resolve to a location inside root.""" | |
| try: | |
| path.resolve().relative_to(root.resolve()) | |
| except ValueError as err: | |
| raise ValueError( | |
| f"Path '{path}' must be within the safe root '{root}'" | |
| ) from err |
🧰 Tools
🪛 Ruff (0.15.13)
[warning] 10-12: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pdf_craft/common/folder.py` around lines 5 - 12, In assert_within_root,
capture the caught ValueError and explicitly chain the new ValueError to clarify
intent: change the except block to "except ValueError as err:" and re-raise the
new ValueError using "raise ValueError(f\"Path '{path}' must be within the safe
root '{root}'\") from err" so the original exception is preserved and Ruff B904
is satisfied.
- Pin GitHub Actions to commit SHAs (actions/checkout, setup-python, abatilo/actions-poetry) to prevent tag-hijack supply chain attacks - Restrict LLM log files to owner-only (chmod 600) on POSIX; logs contain verbatim PDF text and must not be world-readable - Support PDF_CRAFT_API_KEY env var in scripts as a safer alternative to storing the key in format.json - Wrap PDF-derived text in <text>...</text> delimiters in LLM prompts and instruct the model to treat tagged content as data, reducing prompt injection impact from crafted PDFs - Replace xml.etree.ElementTree.fromstring with defusedxml to prevent XML entity expansion (billion-laughs) attacks on intermediate files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Validate det coordinates in _parse_det(): reject negative values and degenerate boxes (x1>=x2 or y1>=y2) before they reach Image.crop() - Cap computed DPI at 600 in _dpi_with_size() to prevent excessive Poppler memory allocation from PDFs with crafted page dimensions - Replace fixed .xml.tmp temp file with mkstemp() so concurrent writers get unique names and the write-then-rename stays atomic - Use full uuid4().hex (128-bit) as LLM context ID instead of the truncated 12-char (48-bit) variant Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pdf_craft/llm/core.py`:
- Around line 128-130: The os.chmod call that tightens log file permissions
(sys.platform != "win32" and os.chmod(file_path, 0o600)) should be made
best-effort so it cannot raise and break request handling during logger
creation: wrap the os.chmod(file_path, 0o600) call in a try/except (catch
OSError/PermissionError or Exception) and swallow the error (optionally log a
debug/warn via the module logger) so failures to change permissions do not
propagate out of the logger creation path in pdf_craft/llm/core.py.
In `@pdf_craft/toc/llm_analyser.py`:
- Around line 232-233: Escape PDF-extracted text before embedding inside the
"<text>...</text>" f-strings to prevent injection/termination (e.g., replace &,
<, > and specifically "</text>"). Create or use an escape helper (e.g.,
xml.sax.saxutils.escape or html.escape) and apply it to title.text (and the
other occurrences) in the f-strings: replace title.text with
escape_for_xml(title.text) in the f-string shown and in the similar
constructions around lines 401-402 and 412-413; also add the appropriate import
for the escape utility.
In `@scripts/gen_epub.py`:
- Line 28: The current call uses key=os.environ.get("PDF_CRAFT_API_KEY",
llm_config["key"]) which treats an empty string as a present value; change it to
use a truthy fallback so empty env vars don't override llm_config. Replace that
expression with key=(os.environ.get("PDF_CRAFT_API_KEY") or llm_config["key"])
so PDF_CRAFT_API_KEY is only used when non-empty; reference the env var name
PDF_CRAFT_API_KEY and the llm_config mapping to locate the change.
In `@scripts/gen_md.py`:
- Line 22: The current assignment of the PDF Craft API key uses
os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty env
var as a valid override; change it to use a truthy fallback so empty strings
don't override the config (e.g., use the env value only if truthy, otherwise
fallback to llm_config["key"]); update the expression around the key parameter
where key=... is set and ensure llm_config["key"] remains the default when the
environment variable is empty or missing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bed78abe-bf64-4d31-b742-fbb52904f50f
📒 Files selected for processing (8)
.github/workflows/merge-build.yml.github/workflows/pr-check.ymlpdf_craft/common/xml.pypdf_craft/llm/core.pypdf_craft/toc/llm_analyser.pypyproject.tomlscripts/gen_epub.pyscripts/gen_md.py
✅ Files skipped from review due to trivial changes (2)
- .github/workflows/merge-build.yml
- .github/workflows/pr-check.yml
| # Restrict log file to owner-only on POSIX; logs contain verbatim PDF text. | ||
| if sys.platform != "win32": | ||
| os.chmod(file_path, 0o600) |
There was a problem hiding this comment.
Make log permission tightening best-effort.
At Lines 129-130, os.chmod exceptions can bubble up and break request handling during logger creation. Permission hardening should not take down the LLM request path.
Proposed fix
# Restrict log file to owner-only on POSIX; logs contain verbatim PDF text.
if sys.platform != "win32":
- os.chmod(file_path, 0o600)
+ try:
+ os.chmod(file_path, 0o600)
+ except OSError:
+ # Keep request path available even when chmod is unsupported.
+ pass📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Restrict log file to owner-only on POSIX; logs contain verbatim PDF text. | |
| if sys.platform != "win32": | |
| os.chmod(file_path, 0o600) | |
| # Restrict log file to owner-only on POSIX; logs contain verbatim PDF text. | |
| if sys.platform != "win32": | |
| try: | |
| os.chmod(file_path, 0o600) | |
| except OSError: | |
| # Keep request path available even when chmod is unsupported. | |
| pass |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pdf_craft/llm/core.py` around lines 128 - 130, The os.chmod call that
tightens log file permissions (sys.platform != "win32" and os.chmod(file_path,
0o600)) should be made best-effort so it cannot raise and break request handling
during logger creation: wrap the os.chmod(file_path, 0o600) call in a try/except
(catch OSError/PermissionError or Exception) and swallow the error (optionally
log a debug/warn via the module logger) so failures to change permissions do not
propagate out of the logger creation path in pdf_craft/llm/core.py.
| f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>" | ||
| ) |
There was a problem hiding this comment.
Escape PDF text before wrapping with <text> delimiters.
Lines 232/401/412 inject raw source text inside <text>...</text>. If extracted text contains </text>, it can terminate the boundary and bypass the delimiter-based protection.
Proposed fix
+import html
@@
+def _wrap_pdf_text(text: str) -> str:
+ return f"<text>{html.escape(text, quote=False)}</text>"
+
@@
- f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] <text>{title.text}</text>"
+ f" {idx}: [Group:{group_num}, Page:{title.ref[0]}, Size:{title.height:.1f}] {_wrap_pdf_text(title.text)}"
@@
- f" {idx}: [Indent:{toc_entry.indent:.1f}, Size:{toc_entry.font_size:.1f}] <text>{toc_entry.text}</text>"
+ f" {idx}: [Indent:{toc_entry.indent:.1f}, Size:{toc_entry.font_size:.1f}] {_wrap_pdf_text(toc_entry.text)}"
@@
- prompt_lines.append(f" {letter_id}: <text>{title}</text>")
+ prompt_lines.append(f" {letter_id}: {_wrap_pdf_text(title)}")Also applies to: 401-402, 412-413
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pdf_craft/toc/llm_analyser.py` around lines 232 - 233, Escape PDF-extracted
text before embedding inside the "<text>...</text>" f-strings to prevent
injection/termination (e.g., replace &, <, > and specifically "</text>"). Create
or use an escape helper (e.g., xml.sax.saxutils.escape or html.escape) and apply
it to title.text (and the other occurrences) in the f-strings: replace
title.text with escape_for_xml(title.text) in the f-string shown and in the
similar constructions around lines 401-402 and 412-413; also add the appropriate
import for the escape utility.
| llm_config = json.load(f) | ||
| toc_llm = LLM( | ||
| key=llm_config["key"], | ||
| key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]), |
There was a problem hiding this comment.
Handle empty PDF_CRAFT_API_KEY values safely.
At Line 28, PDF_CRAFT_API_KEY="" still wins over llm_config["key"], causing avoidable auth failures. Prefer a truthy fallback.
Proposed fix
- key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
+ key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]), | |
| key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"], |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/gen_epub.py` at line 28, The current call uses
key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an empty
string as a present value; change it to use a truthy fallback so empty env vars
don't override llm_config. Replace that expression with
key=(os.environ.get("PDF_CRAFT_API_KEY") or llm_config["key"]) so
PDF_CRAFT_API_KEY is only used when non-empty; reference the env var name
PDF_CRAFT_API_KEY and the llm_config mapping to locate the change.
| llm_config = json.load(f) | ||
| toc_llm = LLM( | ||
| key=llm_config["key"], | ||
| key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]), |
There was a problem hiding this comment.
Handle empty PDF_CRAFT_API_KEY values safely.
At Line 22, an empty env var still overrides format.json, which can silently pass an invalid key downstream. Use a truthy fallback instead.
Proposed fix
- key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]),
+ key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"],📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| key=os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]), | |
| key=os.getenv("PDF_CRAFT_API_KEY") or llm_config["key"], |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/gen_md.py` at line 22, The current assignment of the PDF Craft API
key uses os.environ.get("PDF_CRAFT_API_KEY", llm_config["key"]) which treats an
empty env var as a valid override; change it to use a truthy fallback so empty
strings don't override the config (e.g., use the env value only if truthy,
otherwise fallback to llm_config["key"]); update the expression around the key
parameter where key=... is set and ensure llm_config["key"] remains the default
when the environment variable is empty or missing.
- Replace O(n) tuple-per-comparison in has_repetitive_ngrams() with a Rabin-Karp rolling hash; ngram comparisons are now O(1) (hash) with O(n) string verification only on hash matches, eliminating the CPU spike risk from crafted OCR output - Add 100 MB size gate in read_xml() before reading the file into memory to prevent OOM from multi-thousand-page PDF intermediate files - Remove json-repair dependency from both LLM validators; malformed JSON in the RESULT section now triggers a retry with a descriptive error instead of being silently reconstructed, closing the prompt-injection masking vector Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pdf_craft/common/xml.py`:
- Around line 29-34: The size check using file_path.stat() has a TOCTOU gap;
instead open the file and read at most _MAX_XML_BYTES + 1 bytes (e.g. use
file_path.open(...) or open(file_path, "r", encoding="utf-8") and read with a
size limit) then validate the buffer length and raise the same ValueError if it
exceeds _MAX_XML_BYTES, and finally call fromstring on the read buffer; update
the logic that currently uses file_path.stat(), file_path.read_text(), and
fromstring to use the buffered content and _MAX_XML_BYTES for the enforcement.
In `@pdf_craft/pdf/page_ref.py`:
- Around line 119-123: The DPI computation must reject or clamp non-positive
inputs: before computing `computed` in the DPI calculation (where `file_size`,
`width_inch`, `height_inch` and constants `_BYTES_PER_PIXEL`,
`_PNG_COMPRESSION_RATIO`, `_MAX_DPI` are used), validate that `file_size > 0`
and `width_inch > 0` and `height_inch > 0`; if any are non-positive, either
raise a clear ValueError or return a clamped minimum DPI (e.g., 1) so `render()`
never receives 0 or triggers a divide-by-zero — then proceed to compute
`computed` and return `min(max(computed, 1), _MAX_DPI)` (or raise) to ensure a
valid DPI is always produced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b326c661-e1fe-49c3-b02c-281815e02a9c
📒 Files selected for processing (7)
pdf_craft/common/xml.pypdf_craft/llm/context.pypdf_craft/pdf/ngrams.pypdf_craft/pdf/page_ref.pypdf_craft/sequence/chapter.pypdf_craft/toc/llm_analyser.pypyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
| size = file_path.stat().st_size | ||
| if size > _MAX_XML_BYTES: | ||
| raise ValueError( | ||
| f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}" | ||
| ) | ||
| return fromstring(file_path.read_text(encoding="utf-8")) |
There was a problem hiding this comment.
Enforce the XML byte cap on the bytes actually read.
Line 29 checks st_size before the file is opened, then Line 34 does a separate full read_text(). That leaves a TOCTOU gap where the file can be swapped or grown after the size check, defeating the new 100 MB guard and still pulling an oversized payload into memory. Read at most _MAX_XML_BYTES + 1 from the opened file and validate that buffer instead.
Proposed fix
def read_xml(file_path: Path) -> Element:
try:
- size = file_path.stat().st_size
- if size > _MAX_XML_BYTES:
+ with file_path.open("rb") as f:
+ xml_bytes = f.read(_MAX_XML_BYTES + 1)
+ if len(xml_bytes) > _MAX_XML_BYTES:
raise ValueError(
- f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}"
+ f"XML file exceeds size limit (> {_MAX_XML_BYTES} bytes): {file_path}"
)
- return fromstring(file_path.read_text(encoding="utf-8"))
+ return fromstring(xml_bytes)
except ValueError:
raise📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| size = file_path.stat().st_size | |
| if size > _MAX_XML_BYTES: | |
| raise ValueError( | |
| f"XML file exceeds size limit ({size} > {_MAX_XML_BYTES} bytes): {file_path}" | |
| ) | |
| return fromstring(file_path.read_text(encoding="utf-8")) | |
| def read_xml(file_path: Path) -> Element: | |
| try: | |
| with file_path.open("rb") as f: | |
| xml_bytes = f.read(_MAX_XML_BYTES + 1) | |
| if len(xml_bytes) > _MAX_XML_BYTES: | |
| raise ValueError( | |
| f"XML file exceeds size limit (> {_MAX_XML_BYTES} bytes): {file_path}" | |
| ) | |
| return fromstring(xml_bytes) | |
| except ValueError: | |
| raise |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pdf_craft/common/xml.py` around lines 29 - 34, The size check using
file_path.stat() has a TOCTOU gap; instead open the file and read at most
_MAX_XML_BYTES + 1 bytes (e.g. use file_path.open(...) or open(file_path, "r",
encoding="utf-8") and read with a size limit) then validate the buffer length
and raise the same ValueError if it exceeds _MAX_XML_BYTES, and finally call
fromstring on the read buffer; update the logic that currently uses
file_path.stat(), file_path.read_text(), and fromstring to use the buffered
content and _MAX_XML_BYTES for the enforcement.
| computed = ( | ||
| file_size | ||
| / (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO) | ||
| ) ** 0.5 | ||
| return min(computed, _MAX_DPI) |
There was a problem hiding this comment.
Reject non-positive inputs before computing the capped DPI.
This only caps the upper bound. If file_size <= 0 or either page dimension is non-positive, the method can still divide by zero or return a value that rounds down to 0, which render() then forwards as an invalid DPI. Fail fast here, or clamp to a minimum of 1, so malformed PDFs/config don’t surface as opaque render errors.
Proposed fix
def _dpi_with_size(
self, file_size: int, width_inch: float, height_inch: float
) -> float:
# Formula: file_size = width_px * height_px * bytes_per_pixel * compression_ratio
# where width_px = width_inch * dpi, height_px = height_inch * dpi
+ if file_size <= 0:
+ raise ValueError(f"file_size must be positive, got {file_size}")
+ if width_inch <= 0 or height_inch <= 0:
+ raise ValueError(
+ f"page dimensions must be positive, got {width_inch}x{height_inch}"
+ )
computed = (
file_size
/ (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO)
) ** 0.5
- return min(computed, _MAX_DPI)
+ return max(1.0, min(computed, _MAX_DPI))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| computed = ( | |
| file_size | |
| / (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO) | |
| ) ** 0.5 | |
| return min(computed, _MAX_DPI) | |
| if file_size <= 0: | |
| raise ValueError(f"file_size must be positive, got {file_size}") | |
| if width_inch <= 0 or height_inch <= 0: | |
| raise ValueError( | |
| f"page dimensions must be positive, got {width_inch}x{height_inch}" | |
| ) | |
| computed = ( | |
| file_size | |
| / (width_inch * height_inch * _BYTES_PER_PIXEL * _PNG_COMPRESSION_RATIO) | |
| ) ** 0.5 | |
| return max(1.0, min(computed, _MAX_DPI)) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pdf_craft/pdf/page_ref.py` around lines 119 - 123, The DPI computation must
reject or clamp non-positive inputs: before computing `computed` in the DPI
calculation (where `file_size`, `width_inch`, `height_inch` and constants
`_BYTES_PER_PIXEL`, `_PNG_COMPRESSION_RATIO`, `_MAX_DPI` are used), validate
that `file_size > 0` and `width_inch > 0` and `height_inch > 0`; if any are
non-positive, either raise a clear ValueError or return a clamped minimum DPI
(e.g., 1) so `render()` never receives 0 or triggers a divide-by-zero — then
proceed to compute `computed` and return `min(max(computed, 1), _MAX_DPI)` (or
raise) to ensure a valid DPI is always produced.
Summary
Comprehensive security hardening based on a full audit of the codebase. Fixes span four severity levels: HIGH, MEDIUM, LOW, and INFORMATIONAL.
HIGH
Path traversal via user-supplied
analysing_pathFiles:
pdf_craft/common/folder.py,pdf_craft/common/asset.py,pdf_craft/transform.pyWhen pdf-craft is wrapped in a service that accepts a user-controlled
analysing_path, an attacker could point it at an arbitrary writable location (e.g./var/www/html/), causing the OCR pipeline to write files outside the intended working directory.assert_within_root(path, root)helper that resolves both paths before comparing (symlinks and..components cannot bypass it)EnsureFolderwith an optionalsafe_rootparameter that validatesanalysing_pathbefore any directory is createdsafe_rootparameter totransform_markdownandtransform_epub(backward-compatible, defaults toNone)AssetHub._asset_pathto an absolute canonical path at construction timeMEDIUM
Unpinned GitHub Actions (supply chain)
Files:
.github/workflows/merge-build.yml,.github/workflows/pr-check.ymlFloating version tags (
@v4,@v5,@v3) allow a tag-hijack on any of these actions to run attacker-controlled code in CI with fullGITHUB_TOKENaccess. Pinned all three to full commit SHAs:actions/checkout→34e114876b0b11c390a56381ad16ebd13914f8d5(v4)actions/setup-python→a26af69be951a213d495a4c3e4e4022e16d87065(v5)abatilo/actions-poetry→fd0e6716a0de25ef6ade151b8b53190b0376acfd(v3)LLM log files world-readable
File:
pdf_craft/llm/core.pyLog files contain verbatim PDF text and were created with the process umask (typically world-readable on multi-user systems). Added
os.chmod(file_path, 0o600)afterFileHandlercreation on POSIX systems.API key in plaintext config file
Files:
scripts/gen_md.py,scripts/gen_epub.pyBoth scripts now check the
PDF_CRAFT_API_KEYenvironment variable first and fall back toformat.jsononly if it is not set, providing a safer alternative to storing credentials on disk.LLM prompt injection via OCR-extracted text
File:
pdf_craft/toc/llm_analyser.pyPDF-extracted heading and TOC text was embedded verbatim into LLM prompts. Wrapped all PDF-derived text in
<text>...</text>delimiters and updated both system prompts to instruct the model that only content inside those tags is PDF data.XML entity expansion (billion laughs)
Files:
pdf_craft/common/xml.py,pyproject.tomlReplaced
xml.etree.ElementTree.fromstringwithdefusedxml.ElementTree.fromstringthroughout. Addeddefusedxml>=0.7.1as a dependency.LOW
Unvalidated det coordinates reach
Image.crop()File:
pdf_craft/sequence/chapter.py_parse_det()now rejects negative values and degenerate bounding boxes (x1 >= x2ory1 >= y2) before they reach Pillow'sImage.crop().DPI amplification from crafted page dimensions
File:
pdf_craft/pdf/page_ref.pyA PDF with unrealistically large declared page dimensions could produce an unbounded DPI value, causing Poppler to attempt excessive memory allocation. Added
_MAX_DPI = 600cap in_dpi_with_size().Fixed-name temp file race in
save_xml()File:
pdf_craft/common/xml.pyReplaced the fixed
.xml.tmpsuffix withtempfile.mkstemp(dir=file_path.parent), giving each writer a unique random name while keeping the write-then-rename atomic (same directory = same filesystem).Truncated UUID context ID
File:
pdf_craft/llm/context.pyChanged
uuid4().hex[:12](48-bit) touuid4().hex(128-bit). No reason to truncate.INFORMATIONAL
O(n³) worst-case in
has_repetitive_ngrams()File:
pdf_craft/pdf/ngrams.pyReplaced per-comparison
tuple(chars[i:i+n])creation (O(n) per comparison) with a Rabin-Karp rolling hash. N-gram comparisons are now O(1) (hash check) with O(n) string verification only on hash matches, eliminating the CPU spike risk from crafted OCR output.No size limit before XML parsing
File:
pdf_craft/common/xml.pyAdded a
_MAX_XML_BYTES = 100 MBgate inread_xml()that checksfile_path.stat().st_sizebefore reading the file into memory, preventing OOM from unexpectedly large intermediate files.json-repairsilently reconstructed malformed LLM outputFile:
pdf_craft/toc/llm_analyser.pyRemoved the
json-repairdependency entirely. Both_validate_title_responseand_validate_toc_responsenow use strictjson.loads()— malformed JSON in the RESULT section triggers a model retry with a descriptive error message instead of being silently reconstructed (which could mask prompt-injection payloads). Also removedjson-repairfrompyproject.toml.Test plan
transform_markdown/transform_epubwithanalysing_pathinsidesafe_root— works normallyanalysing_pathoutsidesafe_root— raisesValueErrorbefore any filesystem writessafe_root— no behavioural change (backward-compatible)detattribute with negative or degenerate values —ValueErrorraised at parse timelog_dir_pathset —chmod 600on POSIX, owner-only readablePDF_CRAFT_API_KEYenv var set — scripts use it instead offormat.jsonkey🤖 Generated with Claude Code