fix: avoid stack overflows on deeply nested HTML#421
Open
noahskelton wants to merge 4 commits into
Open
Conversation
The Subcommand import is only used by the mcp-gated Commands enum, so building without the mcp feature failed under -D warnings with an unused import error.
a1c7ebc to
422e1c6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
AI Disclosure
Hello - I made this with the full help of Claude Opus 4.8 and then finished, improved and reviewed with Codex GPT 5.5 - I do not know Rust at all, but i've made every attempt to get this pull request in clean, good working order with robust fixes and regression tests. These bugs were encountered in the wild and I'd like to contribute a fix.
Summary
This fixes native stack overflows when converting pathologically deep or malformed HTML. In particular, pages with very deep DOM chains, including malformed table markup with many unclosed cells, could abort the process while the converter was doing recursive auxiliary walks before or during conversion.
The fix makes the affected whole-subtree traversals iterative and adds a native stack safety limit for the remaining recursive conversion walks.
Original Reproducers
This was originally found from
@kreuzberg/html-to-markdown-nodeaborting insideconvert()on production workers. The affected checkout wasdd59eaf, corresponding to the publishedv3.7.2release that production was running. The crash was a native Rust stack overflow, not a catchable panic:The real pages used to reproduce the issue locally were:
https://authenticator.2stable.com/services/— local saved sample was about 2.45 MB.http://infolab.stanford.edu/pub/movies/actors.html— local saved sample was about 811 KB and contains tens of thousands of unclosed<td>elements.I have attached both saved HTML files to this PR as reference material. The regression test in this PR uses synthetic deeply nested/unclosed markup so the suite does not depend on external websites or checked-in large HTML fixtures.
The failing production options were equivalent to
extract_metadata=true,skip_images=true,strip_tags=["script", "style"], andmax_depth=Some(200).The investigation used a temporary example harness, not included in this PR, that reads a saved HTML file and runs
convert()inside a thread with a configurable stack size. That made the overflow deterministic on macOS, where the default stack is otherwise larger than the stack seen in the failing Linux pods.Exact local repro commands from the investigation:
Expected behavior on the unfixed baseline:
Expected behavior with this fix:
Observed fixed output lengths from the original local repro were:
2stable.html:OK content_len=721actors.html:OK content_len=2486344Locally the unfixed repro aborted with exit 134; in the production pod it surfaced as exit 139.
What Changed
<head>metadata search with an iterative traversal.Behavior Note
Previously,
max_depth: Nonemeant unlimited traversal. This change makes the default use an internal native-stack safety limit instead, and explicitmax_depthvalues above that limit are clamped.That changes the behavior for extremely deep DOMs, but prevents hostile or malformed input from aborting the process. Ordinary document nesting below the safety limit is still converted normally.
The native stack safety limit is currently
64. This is intentionally conservative: the remaining recursive paths carry non-trivial conversion state, and the originalmax_depth=200production setting was still high enough to expose stack-overflow risk on constrained stacks. A cap of64keeps ordinary document nesting intact while leaving headroom for Rust frame size, platform stack differences, and language-binding hosts such as Node. This value is a guardrail rather than a content-model limit; if the main traversal is made fully iterative in a future change, the cap can likely be revisited or removed.Testing
Validated locally with:
Notes for Reviewers
This PR intentionally does not add a public option to disable the native stack guard. That kind of escape hatch would be a broader API decision, especially for Node/WASM bindings where disabling the guard can abort the host process. The safer default is to truncate pathological depth rather than expose process-fatal behavior through language bindings.
The most robust longer-term fix would be to remove native recursion from the main DOM conversion walk as well, using an explicit traversal stack/state machine throughout the converter. This PR does not attempt that larger rewrite because
walk_nodethreads conversion depth, context, visitor hooks, and many tag-specific handlers, so changing it wholesale would carry a much larger regression surface. The current patch hardens the known unbounded auxiliary traversals and adds a safety limit around the remaining recursive paths; a future follow-up could make the traversal architecture fully iterative and then relax the need for an internal native-stack cap.