
Document sanitization: remove XMP metadata, JavaScript, and embedded thumbnails (redaction defensibility) #673

Description

@Phauks

Use case

When redacting a PDF, removing the visible content is not enough — the same sensitive information often persists in hidden parts of the file and is trivially recoverable. A defensible redaction has to strip those hidden vectors too. EmbedPDF can already clear the /Info dictionary (EPDF_SetMetaText) and remove attachments (removeAttachment), but three common vectors have no removal path:

  • XMP metadata — the catalog /Metadata stream. It is stored separately from /Info, so clearing Info leaves author, title, and edit history intact in XMP. This is the single most common real-world sanitization miss.
  • Document JavaScript — the catalog /Names /JavaScript name tree, a JavaScript /OpenAction, and the catalog /AA additional-actions.
  • Embedded thumbnails — per-page /Thumb images, which can retain a pre-redaction snapshot of the page.
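To make the three locations concrete, here is a minimal sketch that checks a toy, in-memory model of the catalog and page dictionaries for each vector. The `ToyDocument` type and `findHiddenVectors` helper are hypothetical and exist only for illustration; they are not part of EmbedPDF or PDFium, and a real check would walk the actual PDF object graph.

```typescript
// Toy model of the relevant PDF objects. This is NOT EmbedPDF's or PDFium's API.
type PdfDict = { [key: string]: unknown };

interface ToyDocument {
  catalog: PdfDict; // the document catalog (/Root)
  pages: PdfDict[]; // one dictionary per page
}

type HiddenVector = "xmp" | "javascript" | "thumbnails";

function isDict(value: unknown): value is PdfDict {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Report which of the three vectors are still present in the document.
function findHiddenVectors(doc: ToyDocument): HiddenVector[] {
  const found: HiddenVector[] = [];
  const { catalog, pages } = doc;

  // XMP lives in the catalog /Metadata stream, independent of /Info.
  if (catalog["Metadata"] !== undefined) found.push("xmp");

  // JavaScript can sit in three places on the catalog.
  const names = catalog["Names"];
  const openAction = catalog["OpenAction"];
  const hasNameTree = isDict(names) && names["JavaScript"] !== undefined;
  // An /OpenAction can also be a destination array or a GoTo action;
  // only an action dictionary with /S /JavaScript counts here.
  const hasJsOpenAction = isDict(openAction) && openAction["S"] === "JavaScript";
  // The catalog /AA entry is treated as part of this vector, as in the list above.
  const hasAdditionalActions = catalog["AA"] !== undefined;
  if (hasNameTree || hasJsOpenAction || hasAdditionalActions) found.push("javascript");

  // Thumbnails are per-page /Thumb entries.
  if (pages.some((page) => page["Thumb"] !== undefined)) found.push("thumbnails");

  return found;
}

// Example: /Info was cleared, but XMP and one page thumbnail remain.
const leaky: ToyDocument = {
  catalog: { Metadata: "<xmp stream>", OpenAction: { S: "GoTo" } },
  pages: [{ Thumb: "<image stream>" }, {}],
};
const remaining = findHiddenVectors(leaky);
```

In this example the GoTo /OpenAction is not reported, while the /Metadata stream and the first page's /Thumb are.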

Who it benefits

Anyone using EmbedPDF to redact, export, or share sensitive documents — legal, healthcare, finance, FOIA/disclosure workflows. Today a developer building redaction on EmbedPDF has no way to guarantee these vectors are gone, so a "redacted" file can still leak privileged data through metadata, scripts, or a thumbnail. Giving the engine first-class removal makes complete, defensible sanitization possible without bolting on a second PDF library.

Proposed implementation

Three granular extension functions, in the style of the existing EPDF_* API, plus an engine method that composes them:

  • EPDF_RemoveXMPMetadata(doc) — remove the catalog /Metadata stream
  • EPDF_RemoveEmbeddedThumbnails(doc) — remove every page /Thumb
  • EPDF_RemoveAllJavaScript(doc) — remove /Names /JavaScript, a JavaScript /OpenAction (a plain GoTo OpenAction is preserved), and catalog /AA
  • sanitizeDocument(doc, options) on PdfEngine — composes the three with the existing removeAttachment loop and a non-incremental save; each vector is opt-out via options
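As a rough sketch of how the composing method could behave, the snippet below runs every step by default and lets each one be switched off through options. The `SanitizeBackend` interface, its method names, and the `SanitizeOptions` keys are assumptions made for this example; the actual engine signatures may differ, and the backend here stands in for the EPDF_* calls and the existing attachment removal.

```typescript
// Hypothetical low-level operations. Names mirror the proposal; signatures are assumed.
interface SanitizeBackend {
  removeXmpMetadata(): void;
  removeAllJavaScript(): void;
  removeEmbeddedThumbnails(): void;
  listAttachments(): string[];
  removeAttachment(name: string): void;
  saveNonIncremental(): Uint8Array;
}

// Every vector is removed unless the caller explicitly sets it to false.
interface SanitizeOptions {
  xmpMetadata?: boolean;
  javaScript?: boolean;
  thumbnails?: boolean;
  attachments?: boolean;
}

function sanitizeDocument(
  backend: SanitizeBackend,
  options: SanitizeOptions = {},
): Uint8Array {
  const {
    xmpMetadata = true,
    javaScript = true,
    thumbnails = true,
    attachments = true,
  } = options;

  if (xmpMetadata) backend.removeXmpMetadata();
  if (javaScript) backend.removeAllJavaScript();
  if (thumbnails) backend.removeEmbeddedThumbnails();
  if (attachments) {
    // Copy the list first so removal cannot disturb the iteration.
    for (const name of [...backend.listAttachments()]) {
      backend.removeAttachment(name);
    }
  }

  // A full rewrite: an incremental save appends changes to the file and
  // leaves the earlier bytes, including removed objects, in place.
  return backend.saveNonIncremental();
}
```

A caller that wants to keep thumbnails would pass `{ thumbnails: false }`; omitting the options object removes everything.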

I have this implemented against the PDFium fork and the engine, with Node tests asserting each vector is removed (and that unrelated content is preserved) on a crafted fixture. Happy to open the PRs (the EPDF_* exports in embedpdf/pdfium, then the engine method + tests here).

A couple of questions before I do:

  1. Do you want the granular exports above, or a single EPDF_SanitizeDocument(doc, flags)?
  2. A fourth vector — content hidden behind OFF optional-content groups (layers) — needs more care (it has to excise the marked content, not just drop /OCProperties). I would propose that as a separate follow-up PR to keep this one focused. Does that split work for you?
