fix: avoid stack overflows on deeply nested HTML #421

Open
noahskelton wants to merge 4 commits into xberg-io:main from noahskelton:codex/fix-deep-html-stack-overflow

@noahskelton noahskelton commented Jun 24, 2026

AI Disclosure

Hello. I made this with the full help of Claude Opus 4.8, then finished, improved, and reviewed it with Codex GPT 5.5. I do not know Rust at all, but I've made every attempt to get this pull request into clean, working order with robust fixes and regression tests. These bugs were encountered in the wild and I'd like to contribute a fix.

Summary

This fixes native stack overflows when converting pathologically deep or malformed HTML. In particular, pages with very deep DOM chains, including malformed table markup with many unclosed cells, could abort the process while the converter was doing recursive auxiliary walks before or during conversion.

The fix makes the affected whole-subtree traversals iterative and adds a native stack safety limit for the remaining recursive conversion walks.

Original Reproducers

This was originally found from @kreuzberg/html-to-markdown-node aborting inside convert() on production workers. The affected checkout was dd59eaf, corresponding to the published v3.7.2 release that production was running. The crash was a native Rust stack overflow, not a catchable panic:

fatal runtime error: stack overflow, aborting

The real pages used to reproduce the issue locally were:

  • https://authenticator.2stable.com/services/ (local saved sample was about 2.45 MB)
  • http://infolab.stanford.edu/pub/movies/actors.html (local saved sample was about 811 KB and contains tens of thousands of unclosed <td> elements)

I have attached both saved HTML files to this PR as reference material. The regression test in this PR uses synthetic deeply nested/unclosed markup so the suite does not depend on external websites or checked-in large HTML fixtures.
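For readers who want to build similar inputs without the saved pages, here is a minimal sketch of how synthetic deep or unclosed markup can be generated. The helper names `nested_divs` and `unclosed_cells` are hypothetical and are not taken from this PR's test suite. Whether unclosed `<td>` elements end up nested depends on the parser in use.

```rust
/// Builds `depth` nested `<div>` elements around a single text node.
/// Hypothetical helper, not part of the PR.
fn nested_divs(depth: usize) -> String {
    // "<div>" is 5 bytes and "</div>" is 6 bytes, plus 4 bytes for "leaf".
    let mut html = String::with_capacity(depth * 11 + 4);
    html.push_str(&"<div>".repeat(depth));
    html.push_str("leaf");
    html.push_str(&"</div>".repeat(depth));
    html
}

/// Builds one table row with `cells` unclosed `<td>` elements, mimicking
/// the malformed table markup described above.
/// Hypothetical helper, not part of the PR.
fn unclosed_cells(cells: usize) -> String {
    let mut html = String::from("<table><tr>");
    for i in 0..cells {
        html.push_str("<td>cell ");
        html.push_str(&i.to_string());
    }
    html.push_str("</table>");
    html
}
```

Either string can then be passed to the converter on a thread with a small stack, so the test needs no network access and no large checked-in fixtures.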

The failing production options were equivalent to extract_metadata=true, skip_images=true, strip_tags=["script", "style"], and max_depth=Some(200).

The investigation used a temporary example harness, not included in this PR, that reads a saved HTML file and runs convert() inside a thread with a configurable stack size. That made the overflow deterministic on macOS, where the default stack is otherwise larger than the stack seen in the failing Linux pods.

Exact local repro commands from the investigation:

cd ~/dev/cloned_repo
mkdir -p repro_samples
curl -L --compressed "https://authenticator.2stable.com/services/" -o repro_samples/2stable.html
curl -L --compressed "http://infolab.stanford.edu/pub/movies/actors.html" -o repro_samples/actors.html

cat > crates/html-to-markdown/examples/repro.rs <<'RS'
use html_to_markdown_rs::convert;
use html_to_markdown_rs::options::ConversionOptions;
use std::{env, fs, thread};

fn main() {
    let path = env::args().nth(1).expect("usage: repro <html-file>");
    let html = fs::read_to_string(path).expect("read html");
    let stack_kb = env::var("STACK_KB")
        .ok()
        .and_then(|value| value.parse::<usize>().ok())
        .unwrap_or(8192);
    let max_depth = env::var("MAXDEPTH")
        .ok()
        .and_then(|value| value.parse::<usize>().ok());

    let options = ConversionOptions::builder()
        .extract_metadata(true)
        .skip_images(true)
        .strip_tags(vec!["script".into(), "style".into()])
        .max_depth(max_depth)
        .build();

    thread::Builder::new()
        .stack_size(stack_kb * 1024)
        .spawn(move || {
            let result = convert(&html, Some(options)).expect("convert html");
            println!("OK content_len={}", result.content.unwrap_or_default().len());
        })
        .expect("spawn conversion thread")
        .join()
        // A stack overflow aborts the whole process, so an Err here means a panic.
        .expect("conversion thread panicked");
}
RS

cargo build --example repro --release

BIN=target/release/examples/repro
STACK_KB=256 MAXDEPTH=200 "$BIN" repro_samples/2stable.html
STACK_KB=256 MAXDEPTH=200 "$BIN" repro_samples/actors.html

rm crates/html-to-markdown/examples/repro.rs

Expected behavior on the unfixed baseline:

fatal runtime error: stack overflow, aborting

Expected behavior with this fix:

OK content_len=...

Observed fixed output lengths from the original local repro were:

  • 2stable.html: OK content_len=721
  • actors.html: OK content_len=2486344

Locally the unfixed repro aborted with exit 134 (128 + 6, SIGABRT); in the production pod it surfaced as exit 139 (128 + 11, SIGSEGV).

What Changed

  • Replaced recursive DOM hierarchy caching with an explicit stack.
  • Replaced recursive <head> metadata search with an iterative traversal.
  • Replaced recursive table scanning with an iterative traversal.
  • Made descendant text extraction iterative.
  • Made plain-text descendant collection iterative and bounded the plain-text walk.
  • Made document structure text, annotation, and table-row collection iterative, and bounded the remaining structure walk.
  • Applied an internal native stack safety depth to normal DOM conversion traversal.
  • Added regression tests that run conversions on constrained thread stacks to catch stack-overflow regressions.
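As an illustration of the general technique behind these changes, the sketch below collects descendant text with an explicit stack instead of recursion and runs the work on a constrained thread stack. The `Dom` and `Node` types and the `run_on_stack` helper are hypothetical stand-ins, not the crate's real types or the PR's actual code.

```rust
use std::thread;

/// Hypothetical arena-backed DOM: nodes refer to their children by index.
struct Dom {
    nodes: Vec<Node>,
}

/// Hypothetical node: a text leaf when `text` is set, otherwise an element.
struct Node {
    text: Option<String>,
    children: Vec<usize>,
}

impl Dom {
    /// Builds a chain of `depth` elements ending in one text node.
    fn deep_chain(depth: usize, leaf: &str) -> Dom {
        let mut nodes = Vec::with_capacity(depth + 1);
        for i in 0..depth {
            nodes.push(Node { text: None, children: vec![i + 1] });
        }
        nodes.push(Node { text: Some(leaf.to_string()), children: Vec::new() });
        Dom { nodes }
    }

    /// Collects descendant text in document order. The pending work lives
    /// in a heap-allocated Vec, so DOM depth does not grow the native stack.
    fn collect_text(&self, root: usize) -> String {
        let mut out = String::new();
        let mut stack = vec![root];
        while let Some(id) = stack.pop() {
            let node = &self.nodes[id];
            if let Some(text) = &node.text {
                out.push_str(text);
            }
            // Push children in reverse so the first child is visited first.
            stack.extend(node.children.iter().rev().copied());
        }
        out
    }
}

/// Runs `work` on a thread with the given stack size and returns its result.
fn run_on_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

A recursive version of `collect_text` would use one native frame per DOM level. The explicit stack moves that growth to the heap, which is why a small thread stack is a useful regression check.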

Behavior Note

Previously, max_depth: None meant unlimited traversal. This change makes the default use an internal native-stack safety limit instead, and explicit max_depth values above that limit are clamped.

That changes the behavior for extremely deep DOMs, but prevents hostile or malformed input from aborting the process. Ordinary document nesting below the safety limit is still converted normally.

The native stack safety limit is currently 64. This is intentionally conservative: the remaining recursive paths carry non-trivial conversion state, and the original max_depth=200 production setting was still high enough to expose stack-overflow risk on constrained stacks. A cap of 64 keeps ordinary document nesting intact while leaving headroom for Rust frame size, platform stack differences, and language-binding hosts such as Node. This value is a guardrail rather than a content-model limit; if the main traversal is made fully iterative in a future change, the cap can likely be revisited or removed.
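The default and clamping rule described above can be summarised in a few lines. This is a sketch only: the constant and function names are hypothetical, and the value 64 is taken from this description rather than from the crate's source.

```rust
/// Assumed internal cap, using the value 64 stated in this description.
/// The name is hypothetical.
const NATIVE_STACK_SAFETY_DEPTH: usize = 64;

/// Resolves the user-facing `max_depth` option into the depth that is
/// actually enforced: `None` falls back to the cap, and explicit values
/// above the cap are clamped down to it.
fn effective_max_depth(max_depth: Option<usize>) -> usize {
    max_depth.map_or(NATIVE_STACK_SAFETY_DEPTH, |depth| {
        depth.min(NATIVE_STACK_SAFETY_DEPTH)
    })
}
```

Under this rule the production setting of `max_depth=Some(200)` resolves to 64, while a smaller explicit value such as `Some(10)` is left unchanged.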

Testing

Validated locally with:

cargo fmt --all
git diff --check
cargo clippy -p html-to-markdown-rs --all-features -- -D warnings
cargo check -p html-to-markdown-rs --no-default-features
cargo check -p html-to-markdown-rs --no-default-features --features visitor
cargo check -p html-to-markdown-rs --no-default-features --features metadata
cargo check -p html-to-markdown-rs --no-default-features --features inline-images
cargo test -p html-to-markdown-rs --all-features --test deep_nesting_overflow --test test_max_depth
cargo test -p html-to-markdown-rs --all-features
task rust:lint:check
task rust:test:ci
task rust:e2e:test
task lint:check
task format:check

Notes for Reviewers

This PR intentionally does not add a public option to disable the native stack guard. That kind of escape hatch would be a broader API decision, especially for Node/WASM bindings where disabling the guard can abort the host process. The safer default is to truncate pathological depth rather than expose process-fatal behavior through language bindings.

The most robust longer-term fix would be to remove native recursion from the main DOM conversion walk as well, using an explicit traversal stack/state machine throughout the converter. This PR does not attempt that larger rewrite because walk_node threads conversion depth, context, visitor hooks, and many tag-specific handlers, so changing it wholesale would carry a much larger regression surface. The current patch hardens the known unbounded auxiliary traversals and adds a safety limit around the remaining recursive paths; a future follow-up could make the traversal architecture fully iterative and then relax the need for an internal native-stack cap.
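To make the follow-up idea concrete, here is a rough sketch of an explicit traversal stack with separate enter and exit steps, which is the usual way to give each element work both before and after its children without recursion. Every name here (`Node`, `Frame`, `delimiter`, `render`) is hypothetical, and the real `walk_node` carries far more state than this toy handles.

```rust
/// Hypothetical arena node: an element with a tag name, or a text leaf.
enum Node {
    Element { tag: &'static str, children: Vec<usize> },
    Text(&'static str),
}

/// Work items on the explicit traversal stack.
enum Frame {
    /// Visit the node: emit its opening output and schedule its children.
    Enter(usize),
    /// All children are done: emit the node's closing output.
    Exit(usize),
}

/// Toy tag handler: the Markdown delimiter written around an element.
fn delimiter(tag: &str) -> &'static str {
    match tag {
        "strong" => "**",
        "em" => "*",
        _ => "",
    }
}

/// Renders the subtree at `root` without native recursion.
fn render(nodes: &[Node], root: usize) -> String {
    let mut out = String::new();
    let mut stack = vec![Frame::Enter(root)];
    while let Some(frame) = stack.pop() {
        match frame {
            Frame::Enter(id) => match &nodes[id] {
                Node::Text(text) => out.push_str(text),
                Node::Element { tag, children } => {
                    out.push_str(delimiter(tag));
                    // Push the exit frame first so it pops after all children.
                    stack.push(Frame::Exit(id));
                    // Reverse so the first child is popped first.
                    stack.extend(children.iter().rev().map(|&child| Frame::Enter(child)));
                }
            },
            Frame::Exit(id) => {
                if let Node::Element { tag, .. } = &nodes[id] {
                    out.push_str(delimiter(tag));
                }
            }
        }
    }
    out
}
```

Per-node context such as depth or list state would have to travel inside the frames rather than in function arguments, which is part of why the full rewrite carries a larger regression surface.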

@noahskelton noahskelton marked this pull request as ready for review June 24, 2026 20:09
@noahskelton noahskelton marked this pull request as draft June 24, 2026 20:16
@noahskelton noahskelton marked this pull request as ready for review June 24, 2026 20:17
@noahskelton noahskelton marked this pull request as draft June 24, 2026 20:30
@noahskelton noahskelton marked this pull request as ready for review June 24, 2026 20:31
The Subcommand import is only used by the mcp-gated Commands enum, so
building without the mcp feature failed under -D warnings with an unused
import error.
@noahskelton noahskelton force-pushed the codex/fix-deep-html-stack-overflow branch from a1c7ebc to 422e1c6 Compare June 25, 2026 14:10