Skip to content

Handle .puz files with malformed header data#75

Merged
orta merged 1 commit into
mainfrom
handle-malformed-puz
May 29, 2026
Merged

Handle .puz files with malformed header data#75
orta merged 1 commit into
mainfrom
handle-malformed-puz

Conversation

@samanpwbb

Copy link
Copy Markdown
Contributor

This PR adds a fixture that has a mismatch between header data and content, which, before this PR, would cause an infinite loop

Context for LLM reviewers

Worth a close look

  • packages/xd-crossword-tools/src/vendor/puzjs.ts readString() — the infinite loop was here: bytes[ibyte] past EOF returned undefined, undefined !== 0 is true, and String.fromCharCode(undefined) is a NUL char, so the loop appended NULs to result forever until Node OOM'd at ~4 GB. The new EOF check throws with the grid-scan counts so future malformed files surface a diagnosable error instead of a hung process.
  • The earlier minBytes = 52 + 2*ncol*nrow precheck is the cheap structural guard; the readString guard is the backstop for files that pass the size check but still have inconsistent header vs. body data (which is exactly what the fixture does — it has the right number of bytes, but the header's grid size disagrees with the actual content).

Decisions

  • Threw from the decoder rather than adding a separate validator pass. The decoder already had enough context (grid-scan clue counts, declared dimensions) to give a useful error message, and any external validator would either need to re-implement the parse or run after the decoder has already crashed.
  • Did not try to recover by guessing the real grid dimensions. The fixture's header says 10x10 but the content looks 15-wide; rather than heuristically re-deriving width, surfacing the corruption is the safer call.

Out of scope

  • puzjs.ts is vendored from downforacross/puzjs and has other latent issues (e.g. getExtension does an unbounded substring scan that can match inside binary payloads). Not touched here — only the specific OOM path is fixed.
  • The fixture's underlying header bug (likely an exporter writing the wrong width) is not investigated upstream.

Already verified

  • Reproduced the OOM on main with the fixture (Node grew to ~4 GB over ~45s before crashing).
  • After the fix, puzToXD throws synchronously with the descriptive message.
  • yarn test puz2XD — 3/3 pass including the new regression test.
  • yarn type-check errors that surface are pre-existing in website/ and unrelated to this change.

…ying to parse.

readString() in the .puz decoder had no bounds check: once ibyte walked
past bytes.length, bytes[ibyte] was undefined, the `b !== 0` loop
condition stayed true, and the loop appended String.fromCharCode of
undefined (a NUL char) forever — Node OOM'd at ~4 GB instead of
reporting a corrupt file.

PUZtoJSON now rejects files where the header reports a zero-sized grid
or where the file is shorter than 52 + 2*ncol*nrow (header + solution +
progress), and readString throws a descriptive error with the grid-scan
clue counts when it hits EOF mid-string. Adds a fixture
(truncated-strings.puz — header says 10x10 / 34 clues but only 38
strings are present) and a regression test that asserts puzToXD throws
instead of looping.
@orta

orta commented May 28, 2026

Copy link
Copy Markdown
Member

👍🏻

@orta orta merged commit fb07de2 into main May 29, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants