I tested Defuddle against a large Angular SPA page source and observed extraction failure when the HTML primarily contained framework boilerplate and massive inline CSS/font definitions.
Observed Behavior
Defuddle returned empty or invalid Markdown output. The likely cause is that the HTML contained very little readable semantic content relative to the amount of DOM noise.
The HTML included:
large inline <style> blocks
thousands of @font-face declarations
bootstrap/material CSS
Angular app shell markup
tracking scripts and metadata
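To put a rough number on how noisy a saved payload is, a check along the following lines could be run against the fixture. This is only a sketch: `noiseRatio` is a hypothetical helper, not part of Defuddle, and it uses a regular expression instead of a real HTML parser, so it will miscount nested or malformed elements.

```typescript
// Hypothetical diagnostic helper (not part of Defuddle).
// Estimates the share of the raw HTML taken up by <script>, <style>,
// <noscript> and <svg> elements, including their contents.
function noiseRatio(html: string): number {
  if (html.length === 0) return 0;

  // \1 forces the closing tag to match the opening tag name.
  const noise = /<(script|style|noscript|svg)\b[^>]*>[\s\S]*?<\/\1\s*>/gi;

  let noiseChars = 0;
  let match: RegExpExecArray | null;
  while ((match = noise.exec(html)) !== null) {
    noiseChars += match[0].length;
  }
  return noiseChars / html.length;
}
```

A value close to 1 for the fixture would support the assumption that the page source is almost entirely boilerplate.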
Expected Behavior
Defuddle should ideally:
ignore noisy/non-semantic nodes during preprocessing
or provide a preprocessing option for frontend-heavy SPA HTML
Suggested Improvement
A preprocessing step that runs before readability extraction could help significantly, for example by removing these nodes:
script
style
noscript
svg
stylesheet-related nodes (for example, <link rel="stylesheet">)
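The suggested step could look roughly like the sketch below. It is an illustration only: `stripNoise` and `NOISE_TAGS` are hypothetical names, nothing here is Defuddle's API, and reading "stylesheet-related nodes" as `<link rel="stylesheet">` is an assumption. It works on the raw HTML string with regular expressions, so it runs without a DOM. A real implementation would more likely remove the nodes from the parsed document.

```typescript
// Hypothetical preprocessing helper (not part of Defuddle).
// Tags whose elements are dropped together with their contents.
const NOISE_TAGS = ["script", "style", "noscript", "svg"] as const;

function stripNoise(html: string): string {
  let out = html;

  for (const tag of NOISE_TAGS) {
    // Match the opening tag, everything inside it, and the closing tag.
    // Limitation: nested elements of the same tag and ">" inside
    // attribute values are not handled.
    const paired = new RegExp(
      `<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}\\s*>`,
      "gi"
    );
    out = out.replace(paired, "");
  }

  // Assumption: "stylesheet-related nodes" means <link rel="stylesheet">.
  out = out.replace(
    /<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi,
    ""
  );

  return out;
}
```

The cleaned string would then be parsed and passed to extraction as usual. Exposing this as an opt-in option would leave the default behavior unchanged for pages that already extract correctly.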
Additional Context
I reproduced the issue consistently in fixture-based tests using a saved HTML payload from an Angular application's page source.