Document sanitization: remove XMP metadata, JavaScript, and embedded thumbnails (redaction defensibility) #673

### Use case

When redacting a PDF, removing the visible content is not enough — the same sensitive information often persists in hidden parts of the file and is trivially recoverable. A defensible redaction has to strip those hidden vectors too. EmbedPDF can already clear the `/Info` dictionary (`EPDF_SetMetaText`) and remove attachments (`removeAttachment`), but three common vectors have no removal path:

- **XMP metadata**: the catalog `/Metadata` stream. It is stored separately from `/Info`, so clearing `/Info` leaves author, title, and edit history intact in XMP. In practice this is one of the easiest sanitization steps to miss, because the `/Info` fields look clean while the XMP copy is untouched.
- **Document JavaScript**: the catalog `/Names /JavaScript` name tree, a JavaScript `/OpenAction`, and the catalog `/AA` additional-actions dictionary.
- **Embedded thumbnails**: per-page `/Thumb` images, which can retain a pre-redaction snapshot of the page.
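
To make the three vectors concrete, here is a minimal sketch of where each one lives, using a toy in-memory model. The `ToyDocument`, `ToyCatalog`, `ToyPage`, and `findHiddenVectors` names are hypothetical and exist only for illustration; they are not EmbedPDF or PDFium types, and the model mirrors only the dictionary keys named above.

```typescript
// Toy model of the relevant PDF dictionaries. Hypothetical types for
// illustration only; not part of EmbedPDF or PDFium.
type ToyAction = { S: string; JS?: string };

interface ToyCatalog {
  Metadata?: string; // XMP packet held in the catalog /Metadata stream
  Names?: { JavaScript?: Record<string, ToyAction> };
  OpenAction?: ToyAction;
  AA?: Record<string, ToyAction>;
}

interface ToyPage {
  Thumb?: Uint8Array; // per-page /Thumb image
}

interface ToyDocument {
  catalog: ToyCatalog;
  pages: ToyPage[];
}

type HiddenVector = "xmp" | "javascript" | "thumbnails";

// Report which of the three vectors are still present in the document.
function findHiddenVectors(doc: ToyDocument): HiddenVector[] {
  const found: HiddenVector[] = [];
  const { catalog, pages } = doc;

  if (catalog.Metadata !== undefined) found.push("xmp");

  const hasNamedScripts = catalog.Names?.JavaScript !== undefined;
  // Only a JavaScript OpenAction counts; a plain GoTo OpenAction is harmless.
  const hasScriptedOpenAction = catalog.OpenAction?.S === "JavaScript";
  const hasCatalogActions = catalog.AA !== undefined;
  if (hasNamedScripts || hasScriptedOpenAction || hasCatalogActions) {
    found.push("javascript");
  }

  if (pages.some((page) => page.Thumb !== undefined)) found.push("thumbnails");
  return found;
}

// A document whose /Info was cleared but which still leaks through all three.
const leakyDoc: ToyDocument = {
  catalog: {
    Metadata: "<x:xmpmeta>...</x:xmpmeta>",
    OpenAction: { S: "JavaScript", JS: "app.alert('hi')" },
  },
  pages: [{ Thumb: new Uint8Array([1, 2, 3]) }, {}],
};

console.log(findHiddenVectors(leakyDoc));
```

Note that nothing in this model refers to `/Info`: a document can have an empty `/Info` dictionary and still report all three vectors.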

### Who it benefits

Anyone using EmbedPDF to redact, export, or share sensitive documents — legal, healthcare, finance, FOIA/disclosure workflows. Today a developer building redaction on EmbedPDF has no way to guarantee these vectors are gone, so a "redacted" file can still leak privileged data through metadata, scripts, or a thumbnail. Giving the engine first-class removal makes complete, defensible sanitization possible without bolting on a second PDF library.

### Proposed implementation

Three granular extension functions, in the style of the existing `EPDF_*` API, plus an engine method that composes them:

- `EPDF_RemoveXMPMetadata(doc)` — remove the catalog `/Metadata` stream
- `EPDF_RemoveEmbeddedThumbnails(doc)` — remove every page `/Thumb`
- `EPDF_RemoveAllJavaScript(doc)` — remove `/Names /JavaScript`, a JavaScript `/OpenAction` (a plain GoTo OpenAction is preserved), and catalog `/AA`
- `sanitizeDocument(doc, options)` on `PdfEngine` — composes the three with the existing `removeAttachment` loop and a non-incremental save; each vector is opt-out via `options`
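
As a rough sketch of how the composition could look, the snippet below wires the three proposed functions together with an attachment loop and a save. Everything beyond the three `EPDF_*` names and `removeAttachment` is an assumption made for illustration: the option names (`xmp`, `javascript`, `thumbnails`, `attachments`), the `listAttachments` and `saveNonIncremental` helpers, the `void` return types, and the synchronous style (the real engine method would presumably be asynchronous). An in-memory fake stands in for the bindings so the example runs on its own.

```typescript
// Hypothetical option names; the proposal only says each vector is opt-out.
interface SanitizeOptions {
  xmp?: boolean;
  javascript?: boolean;
  thumbnails?: boolean;
  attachments?: boolean;
}

// Hypothetical, synchronous stand-in for the low-level bindings.
// listAttachments and saveNonIncremental are assumed helper names.
interface SanitizeBindings<Doc> {
  EPDF_RemoveXMPMetadata(doc: Doc): void;
  EPDF_RemoveEmbeddedThumbnails(doc: Doc): void;
  EPDF_RemoveAllJavaScript(doc: Doc): void;
  listAttachments(doc: Doc): string[];
  removeAttachment(doc: Doc, name: string): void;
  saveNonIncremental(doc: Doc): Uint8Array;
}

function sanitizeDocument<Doc>(
  api: SanitizeBindings<Doc>,
  doc: Doc,
  options: SanitizeOptions = {},
): Uint8Array {
  // Every vector defaults to "remove"; callers opt out by passing false.
  if (options.xmp ?? true) api.EPDF_RemoveXMPMetadata(doc);
  if (options.thumbnails ?? true) api.EPDF_RemoveEmbeddedThumbnails(doc);
  if (options.javascript ?? true) api.EPDF_RemoveAllJavaScript(doc);
  if (options.attachments ?? true) {
    // Snapshot the names first so removal does not disturb the iteration.
    for (const name of api.listAttachments(doc)) {
      api.removeAttachment(doc, name);
    }
  }
  // A full rewrite, so removed objects are not left behind in the old
  // body of the file as they would be after an incremental update.
  return api.saveNonIncremental(doc);
}

// In-memory fake that only records which calls were made.
type FakeDoc = { calls: string[]; attachments: string[] };

const fakeBindings: SanitizeBindings<FakeDoc> = {
  EPDF_RemoveXMPMetadata: (d) => { d.calls.push("xmp"); },
  EPDF_RemoveEmbeddedThumbnails: (d) => { d.calls.push("thumbnails"); },
  EPDF_RemoveAllJavaScript: (d) => { d.calls.push("javascript"); },
  listAttachments: (d) => [...d.attachments],
  removeAttachment: (d, name) => {
    d.attachments = d.attachments.filter((a) => a !== name);
    d.calls.push(`attachment:${name}`);
  },
  saveNonIncremental: (d) => { d.calls.push("save"); return new Uint8Array(); },
};

const fakeDoc: FakeDoc = { calls: [], attachments: ["notes.txt"] };
sanitizeDocument(fakeBindings, fakeDoc, { thumbnails: false });
console.log(fakeDoc.calls);
```

Defaulting each flag with `?? true` rather than spreading `options` over a defaults object means an explicit `undefined` still removes the vector, which seems like the safer failure mode for a sanitizer.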

I have this implemented against the PDFium fork and the engine, with Node tests asserting each vector is removed (and that unrelated content is preserved) on a crafted fixture. Happy to open the PRs (the `EPDF_*` exports in `embedpdf/pdfium`, then the engine method + tests here).
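
For a sense of what a removal assertion can look like, here is a coarse smoke check that scans saved bytes for the relevant name tokens. This is not the test code from the implementation mentioned above; `findResidualKeys` and `toBytes` are hypothetical helpers, and the fixture strings are hand-written fragments rather than valid PDFs. A raw scan like this can miss keys stored inside compressed object streams or written with `#xx` name escapes, and it can flag a `/Metadata` entry that belongs to some other stream, so a real test should inspect the parsed document instead.

```typescript
// Name tokens for the three vectors. /JavaScript covers both the name-tree
// key under /Names and the /S /JavaScript action type.
const RESIDUAL_KEYS = ["/Metadata", "/Thumb", "/JavaScript"];

// Return the keys that still appear as complete PDF name tokens.
function findResidualKeys(pdfBytes: Uint8Array): string[] {
  // Map each byte to one character so the regex sees the raw content.
  let text = "";
  for (let i = 0; i < pdfBytes.length; i++) {
    text += String.fromCharCode(pdfBytes[i]);
  }
  return RESIDUAL_KEYS.filter((key) => {
    // A name ends at whitespace, a delimiter, or end of data, so that
    // "/Thumb" does not match inside a longer name.
    const token = new RegExp(key + "(?=[\\s\\0()<>\\[\\]{}/%]|$)");
    return token.test(text);
  });
}

// Helper for building ASCII-only fixtures in this sketch.
function toBytes(ascii: string): Uint8Array {
  return Uint8Array.from(ascii.split("").map((c) => c.charCodeAt(0)));
}

const unsanitized = toBytes(
  "<< /Type /Catalog /Metadata 5 0 R /Names << /JavaScript 6 0 R >> >>\n" +
    "<< /Type /Page /Thumb 7 0 R >>",
);
const sanitized = toBytes("<< /Type /Catalog >>\n<< /Type /Page >>");

console.log(findResidualKeys(unsanitized));
console.log(findResidualKeys(sanitized));
```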

A couple of questions before I do:

1. Do you want the **granular exports** above, or a single `EPDF_SanitizeDocument(doc, flags)`?
2. A fourth vector — content hidden behind **OFF optional-content groups (layers)** — needs more care (it has to excise the marked content, not just drop `/OCProperties`). I would propose that as a **separate** follow-up PR to keep this one focused. Does that split work for you?

