Use case
When redacting a PDF, removing the visible content is not enough — the same sensitive information often persists in hidden parts of the file and is trivially recoverable. A defensible redaction has to strip those hidden vectors too. EmbedPDF can already clear the /Info dictionary (EPDF_SetMetaText) and remove attachments (removeAttachment), but three common vectors have no removal path:
- XMP metadata: the catalog /Metadata stream. It is stored separately from /Info, so clearing Info leaves author, title, and edit history intact in XMP. This is the single most common real-world sanitization miss.
- Document JavaScript: the catalog /Names /JavaScript name tree, a JavaScript /OpenAction, and the catalog /AA additional-actions.
- Embedded thumbnails: per-page /Thumb images, which can retain a pre-redaction snapshot of the page.
Who it benefits
Anyone using EmbedPDF to redact, export, or share sensitive documents — legal, healthcare, finance, FOIA/disclosure workflows. Today a developer building redaction on EmbedPDF has no way to guarantee these vectors are gone, so a "redacted" file can still leak privileged data through metadata, scripts, or a thumbnail. Giving the engine first-class removal makes complete, defensible sanitization possible without bolting on a second PDF library.
Proposed implementation
Three granular extension functions, in the style of the existing EPDF_* API, plus an engine method that composes them:
- EPDF_RemoveXMPMetadata(doc): remove the catalog /Metadata stream
- EPDF_RemoveEmbeddedThumbnails(doc): remove every page /Thumb
- EPDF_RemoveAllJavaScript(doc): remove /Names /JavaScript, a JavaScript /OpenAction (a plain GoTo OpenAction is preserved), and catalog /AA
- sanitizeDocument(doc, options) on PdfEngine: composes the three with the existing removeAttachment loop and a non-incremental save; each vector is opt-out via options
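A minimal sketch of how that composition might look, for discussion only and not the actual implementation. The SanitizeBackend interface, its method names, the option names, and the return types are all hypothetical stand-ins for the EPDF_* exports and engine calls listed above; only the overall shape (three removals, the attachment loop, a non-incremental save, opt-out options) comes from the proposal.

```typescript
// Hypothetical option names: every vector is removed unless explicitly set to false.
interface SanitizeOptions {
  xmpMetadata?: boolean;
  thumbnails?: boolean;
  javaScript?: boolean;
  attachments?: boolean;
}

// Hypothetical wrapper around the native calls; not the real EmbedPDF surface.
interface SanitizeBackend<Doc> {
  removeXmpMetadata(doc: Doc): void;        // stands in for EPDF_RemoveXMPMetadata
  removeEmbeddedThumbnails(doc: Doc): void; // stands in for EPDF_RemoveEmbeddedThumbnails
  removeAllJavaScript(doc: Doc): void;      // stands in for EPDF_RemoveAllJavaScript
  listAttachments(doc: Doc): string[];      // assumed helper for the attachment loop
  removeAttachment(doc: Doc, name: string): void;
  saveNonIncremental(doc: Doc): Uint8Array; // assumed to return the rewritten file bytes
}

function sanitizeDocument<Doc>(
  backend: SanitizeBackend<Doc>,
  doc: Doc,
  options: SanitizeOptions = {},
): Uint8Array {
  // Opt-out semantics: only an explicit `false` skips a vector.
  if (options.xmpMetadata !== false) backend.removeXmpMetadata(doc);
  if (options.thumbnails !== false) backend.removeEmbeddedThumbnails(doc);
  if (options.javaScript !== false) backend.removeAllJavaScript(doc);

  if (options.attachments !== false) {
    // Snapshot the names first so removal cannot disturb the iteration.
    for (const name of [...backend.listAttachments(doc)]) {
      backend.removeAttachment(doc, name);
    }
  }

  // A full rewrite matters here: an incremental save appends a new revision
  // and leaves the removed objects recoverable in the earlier bytes.
  return backend.saveNonIncremental(doc);
}
```

Under these assumptions, a caller who wants to keep document scripts would pass { javaScript: false } and still get the other vectors stripped.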
I have this implemented against the PDFium fork and the engine, with Node tests asserting each vector is removed (and that unrelated content is preserved) on a crafted fixture. Happy to open the PRs (the EPDF_* exports in embedpdf/pdfium, then the engine method + tests here).
A couple of questions before I do:
- Do you want the granular exports above, or a single EPDF_SanitizeDocument(doc, flags)?
- A fourth vector, content hidden behind OFF optional-content groups (layers), needs more care (it has to excise the marked content, not just drop /OCProperties). I would propose that as a separate follow-up PR to keep this one focused. Does that split work for you?
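To make the single-function alternative concrete, here is a rough sketch of how opt-out options could map onto a flags bitmask. The flag names and bit values are hypothetical; nothing in this proposal fixes them yet.

```typescript
// Hypothetical bit assignments for an EPDF_SanitizeDocument(doc, flags) style call.
const SanitizeFlag = {
  XmpMetadata: 1 << 0,
  Thumbnails: 1 << 1,
  JavaScript: 1 << 2,
} as const;

const SANITIZE_ALL =
  SanitizeFlag.XmpMetadata | SanitizeFlag.Thumbnails | SanitizeFlag.JavaScript;

// Hypothetical option names; each vector is on unless explicitly set to false.
interface VectorOptions {
  xmpMetadata?: boolean;
  thumbnails?: boolean;
  javaScript?: boolean;
}

function flagsFromOptions(options: VectorOptions = {}): number {
  let flags: number = SANITIZE_ALL;
  // Clear the matching bit for every vector the caller opted out of.
  if (options.xmpMetadata === false) flags &= ~SanitizeFlag.XmpMetadata;
  if (options.thumbnails === false) flags &= ~SanitizeFlag.Thumbnails;
  if (options.javaScript === false) flags &= ~SanitizeFlag.JavaScript;
  return flags;
}
```

Either shape can express the same choices; the granular exports keep each vector independently testable, while the bitmask keeps the native surface to one function.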