Sanitize database dumps before they go anywhere.
Turn huge pg_dump / SQLite / SQL Server exports into shareable, test-friendly snapshots — no DB connection, no secrets left by accident.
Disclaimer: This project is entirely vibe-coded, but with strong human guidance, review, and attention to quality and safety.
Dumpling reads plain-text SQL dumps (PostgreSQL pg_dump, SQLite .dump, SQL Server / MSSQL scripts) and rewrites sensitive columns using rules you define in TOML. Everything runs offline on files — ideal for CI, staging share-outs, and compliance-minded workflows.
- Rich built-in strategies — from fast clears (`null`, `redact`, `blank`, `empty_array`/`empty_object`) and bounded fakes (`int_range`, `decimal`, `string`) to realistic stand-ins (`email`, `name`, `payment_card`, `faker`, date/time fuzz), with an optional `domain` so the same source value stays consistent across tables.
- JSON inside columns — target paths inside `json`/`jsonb` text with the same dot or `__` syntax you use elsewhere; pair with row filters on nested fields.
- Row-level control — `retain` and `delete` predicates (including nested JSON paths) drop or keep whole rows before transforms run.
- Offline by design — works on dump files only; nothing connects to your database.
- Streams giant files — line-by-line processing keeps multi-GB dumps reasonable on modest hardware.
- Fails loud, not silent — missing config exits non-zero and lists where Dumpling looked; use `--allow-noop` only when you mean it.
- Pipeline-ready — `--check`, strict coverage, JSON reports, and residual PII scans fit pre-merge gates and release automation.
- Configure once — `.dumplingconf` or `[tool.dumpling]` in `pyproject.toml`; install via Rust (`cargo`) or `pip install dumpling-cli`.
```sh
cargo build --release
./target/release/dumpling --help
```

Dumpling is also published as a pip-installable CLI package:

```sh
pip install dumpling-cli
```

Or install from local source (requires maturin as the PEP 517 backend):

```sh
pip install .
```

After install the CLI command is the same:

```sh
dumpling --help
```

Follow these steps once; you will have a working path from “raw dump” to “first sanitized output,” then you can deepen coverage using the rest of this README and the documentation site.
- Generate a draft policy (recommended) — Run `dumpling scaffold-config -i dump.sql -o .dumplingconf` to emit a beta starter TOML with inferred `[rules]` from column names in `CREATE TABLE`, `INSERT`, and (PostgreSQL) `COPY` headers. Heuristics are English-oriented; treat the file as a draft only—review every rule before production or compliance workflows. Add a global `salt` (for example `salt = "${DUMPLING_SALT}"`) and resolve `${…}` references before anonymizing. Optionally pass `--infer-json-paths` to sample up to five rows per table (reservoir) and suggest nested JSON keys as `column.path.to.leaf`; use `--max-json-depth` if you need a different walk depth (default 24). For PostgreSQL custom-format or directory-format archives, pass `--input` pointing at the archive with `--format postgres` (the default); Dumpling auto-detects and runs `pg_restore` (optional `--pg-restore-path`/`--pg-restore-arg`). See `dumpling scaffold-config --help`.
- Or start from the example policy — Copy `.dumplingconf.example` to `.dumplingconf` (or merge under `[tool.dumpling]` in `pyproject.toml`) and edit `[rules]` by hand. Set environment variables for `salt` and any `${…}` references so Dumpling can resolve secrets at startup.
- Align rules with your dump — If you did not use `scaffold-config`, open the dump beside the config: `CREATE TABLE`, `COPY … (…)`, and `INSERT INTO … (…)` lines list identifiers for `[rules."table"]` or `[rules."schema.table"]` (see Configuration (TOML)). Trim rules to the tables you care about first, then extend columns and strategies as you go.
- Run Dumpling — `dumpling -i dump.sql -o sanitized.sql` (add `-c path` if the config is not in the default search path). Use `dumpling --check -i dump.sql` when you only want to know whether anything would change.
- Tighten the policy — Run `dumpling lint-policy` on your config. When you are ready for stricter gates, add `[sensitive_columns]` and use `--strict-coverage`/`--report`/`--scan-output` as described under Usage.
The same flow is spelled out in the docs: Getting started.
```sh
dumpling -i dump.sql -o sanitized.sql # read from file, write to file
dumpling -i dump.sql --in-place # overwrite the input file (atomic swap)
cat dump.sql | dumpling > sanitized.sql # stream from stdin to stdout
dumpling -i dump.sql -c .dumplingconf # use explicit config path
dumpling --check -i dump.sql # exit 1 if changes would occur, no output
dumpling --stats -i dump.sql -o out.sql # print summary to stderr
dumpling --report report.json -i dump.sql # write detailed JSON report of changes/drops
dumpling --strict-coverage --report report.json -i dump.sql --check # fail on uncovered sensitive columns
dumpling --scan-output --report report.json -i dump.sql # scan transformed output for residual PII-like patterns
dumpling --scan-output --fail-on-findings --report report.json -i dump.sql --check # fail if scan thresholds are exceeded
dumpling --allow-ext dmp -i data.dmp # restrict processing to specific extensions
dumpling --allow-noop -i dump.sql -o out.sql # explicitly allow no-op when config is missing
dumpling --format sqlite -i data.db.sql -o out.sql # process a SQLite .dump file
dumpling --format mssql -i backup.sql -o out.sql # process a SQL Server plain-SQL dump
dumpling lint-policy # lint the anonymization policy config
dumpling lint-policy --config .dumplingconf # lint with explicit config path
dumpling scaffold-config -i dump.sql -o .dumplingconf # draft [rules] from column names (beta)
dumpling scaffold-config -i dump.sql -o draft.toml --infer-json-paths # include JSON path hints (beta)
```

Configuration is loaded in this order:

1. `--config <path>` if provided
2. `.dumplingconf` in the current directory
3. `pyproject.toml` `[tool.dumpling]` section
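The `pyproject.toml` form nests the same keys under the `tool.dumpling` table. A minimal sketch — the direct nesting shown here is inferred from “same schema” and is an assumption, not a confirmed layout:

```toml
# pyproject.toml — equivalent of a top-level .dumplingconf (nesting assumed)
[tool.dumpling]
salt = "${DUMPLING_GLOBAL_SALT}"

[tool.dumpling.rules."public.users"]
email = { strategy = "email" }
```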
If no configuration is found, Dumpling fails closed by default and exits non-zero. The error output lists every checked location. Use `--allow-noop` to explicitly permit no-op behavior.
The dump seal comment prefixed to successful output and `--security-profile hardened` are documented in the configuration guide (see Dump seal and Hardened security profile).
Column rules live under `[rules."schema.table"]` (or `[rules."table"]`) as inline tables: `{ strategy = "<name>", ... }`. Strategy-specific keys are documented next to the strategy that accepts them. A few keys apply across many strategies; see Cross-cutting options below.
Prefer lightweight strategies when nothing downstream requires lifelike values: `null`, `redact`, `blank`, `empty_array`, `empty_object`, `string`, `int_range`, and `decimal` are cheap to generate (simple constants, random digits/alnum, or bounded numeric shapes). Use `blank` for `NOT NULL` text where you must clear content without SQL `NULL`; use `empty_array` / `empty_object` on JSON path rules (or text columns holding JSON) when the document must keep `[]` / `{}` instead of `null` or `""`.

Reach for richer strategies when realism matters for restores, demos, or tests that exercise parsers and validators: `email`, `name`, `first_name`, `last_name`, `phone`, `faker`, `uuid`, `hash`, `payment_card`, and the `date_fuzz` / `time_fuzz` / `datetime_fuzz` family do more work (formatting, parsing, digest, or upstream generators). If a cheap strategy would break `CHECK` constraints, `NOT NULL`, foreign-key shape, or import tooling that validates formats, switch to a strategy that emits compatible values—or keep `domain` on the heavier strategy so referential consistency is preserved where you need it.
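A hedged sketch of that trade-off; the table and column names here are invented for illustration:

```toml
# Cheap: fine when nothing downstream parses these values
[rules."public.audit_log"]
note = { strategy = "blank" }                      # clear NOT NULL text without SQL NULL
request_id = { strategy = "string", length = 12 }  # random alphanumeric stand-in

# Richer: survives format validators and keeps joins stable
[rules."public.customers"]
email = { strategy = "email", domain = "customer_email" }  # same source value, same fake everywhere
card = { strategy = "payment_card", length = 16 }          # Luhn-valid, passes PAN detection
```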
**`null`**

- Behavior: emit SQL `NULL` for the cell.
- Options: none. (`domain` is rejected.)
**`redact`**

- Behavior: replace with the literal `REDACTED`.
- `as_string`: if `true`, the replacement is always a single-quoted SQL string; if `false`, it is emitted without quotes (still valid as an identifier-like token in many dumps). When the original cell was already a quoted string, Dumpling quotes the output even when `as_string` is omitted—see Cross-cutting options.
**`blank`**

- Behavior: replace with an empty string (`''` in SQL when quoted). If the source cell is SQL `NULL`, the cell stays `NULL` (same as `null`/`redact` semantics for missing values).
- Options: none. (`domain` is rejected.) `as_string` is ignored; output is always the empty-string literal when non-NULL.
**`empty_array` / `empty_object`**

- Behavior: replace with the JSON tokens `[]` and `{}` as unquoted SQL/COPY tokens (so they parse as JSON when the column holds JSON). If the source cell is SQL `NULL`, the cell stays `NULL`.
- JSON path rules: use these on leaves that are JSON arrays or objects when you need a typed empty container instead of `null` or `""`.
- Options: none. (`domain` is rejected.)
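For instance, a minimal sketch; the table name and paths are hypothetical:

```toml
# Keep typed empty containers inside a JSON text column
[rules."public.events"]
"payload.tags" = { strategy = "empty_array" }       # [] instead of null or ""
"payload.metadata" = { strategy = "empty_object" }  # {} instead of null or ""
```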
**`uuid`**

- Behavior: random UUIDv4-like hyphenated hex string.
- `as_string`: same meaning as for `redact`/`hash` (force quoted literal vs. unquoted token).
**`hash`**

- Behavior: salted digest of the original cell value (SHA-256 by default; see the configuration guide — Hardened security profile — for HMAC mode).
- `salt`: optional per-column salt; otherwise the top-level `salt` or registry default applies.
- `as_string`: if `true`, force a quoted string literal; if `false`, unquoted hex. Quoted source cells are still written quoted when `as_string` is omitted.
**`email` / `name` / `first_name` / `last_name` / `phone`**

- Behavior: locale-aware fake values (same underlying generators as the matching `faker` targets).
- `locale`: optional; one of `en`, `fr_fr`, `de_de`, `it_it`, `pt_br`, `pt_pt`, `ar_sa`, `zh_cn`, `zh_tw`, `ja_jp`, `cy_gb` (default `en`).
- Output: always emitted as a quoted string replacement.
**`int_range`**

- Behavior: random integer in the inclusive range `[min, max]` (defaults `min = 0`, `max = 1_000_000`).
- `min`/`max`: inclusive bounds; `min` must be ≤ `max`.
- Output: always unquoted digits (suitable for integer / JSON number columns).
**`decimal`**

- Behavior: random decimal with the integer part in `[min, max]` and a fractional part of `scale` digits (defaults `min = 0`, `max = 1_000_000`, `scale = 2`). Use `scale = 0` for a plain integer string in the same range.
- `min`/`max`: inclusive integer-part bounds.
- `scale`: number of digits after `.` (0–38).
- `as_string`: same as `hash`/`redact` for quoting the full literal.
**`payment_card`**

- Behavior: random digit string of length `length` (default 16) with a valid Luhn check digit, so `--scan-output` PAN detection treats synthetic values like test cards, not arbitrary digit runs.
- `length`: total digit count including the check digit; must be 13–19 (PAN lengths).
- Output: always a quoted string of digits (no separators).
**`string`**

- Behavior: random alphanumeric string.
- `length`: character count (default 12); must be ≥ 1 when set.
**`faker`**

- Behavior: values from the Rust `fake` crate (`faker` modules), selected only by the string `faker = "module::Type"` (e.g. `internet::SafeEmail`). Config is data only—nothing from TOML is compiled as Rust. Unsupported pairs fail at config load; new generators require a new Dumpling release (or a fork), not config-side code.
- `faker`: required; maps to a built-in allowlist in `src/faker_dispatch.rs`.
- `locale`: optional; same set as the built-in PII strategies when the upstream generator is locale-aware.
- `min`/`max`/`length`/`format`: only for `faker` combinations that upstream supports (e.g. `number::NumberWithFormat` uses `format`: `#` = any digit, `^` = 1–9, per the `fake` docs).
Upstream reference: docs.rs — fake, docs.rs — fake::faker, GitHub — cksac/fake-rs.
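A minimal sketch of `faker` rules; the table and column names are hypothetical, and the `module::Type` strings follow the upstream docs referenced above:

```toml
[rules."public.accounts"]
contact = { strategy = "faker", faker = "internet::SafeEmail" }
ref_code = { strategy = "faker", faker = "number::NumberWithFormat", format = "REF-^###" }
```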
**`date_fuzz` / `time_fuzz` / `datetime_fuzz`**

- Behavior: parse the existing value when possible and shift it by a random offset; on parse failure the original string is kept.
- `date_fuzz`: `min_days`/`max_days` (defaults `-30…30`).
- `time_fuzz`/`datetime_fuzz`: `min_seconds`/`max_seconds` (`time_fuzz` defaults `-300…300`; `datetime_fuzz` defaults `-86400…86400`).
- `as_string`: force quoted literal vs. unquoted token for the emitted date/time/timestamp string.
These keys are valid on multiple strategies (unless validation says otherwise):
- `domain`: deterministic mapping bucket. The same non-NULL source value maps to the same pseudonym for that strategy inside the domain (across tables/columns). SQL `NULL` is always preserved—no fabricated FK targets.
- `unique_within_domain`: when `true` (requires `domain`), different source values are assigned distinct pseudonyms within the domain.
- `as_string`: when `true`, force the replacement to render as a single-quoted SQL string literal. When `false` or omitted, Dumpling still quotes the output if the original cell was quoted (`render_cell` uses `force_quoted || original.was_quoted`). Set `as_string = true` when the source may be unquoted (numeric-looking literals, some `COPY` shapes) but you need a string literal in the dump.
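For example, a sketch of a shared domain keeping a join stable; the table and column names are invented:

```toml
# Rows in users and orders that share a source email receive the same pseudonym
[rules."public.users"]
email = { strategy = "email", domain = "customer_email", unique_within_domain = true }

[rules."public.orders"]
buyer_email = { strategy = "email", domain = "customer_email" }
```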
Define default strategies in `rules."<table>"` and add ordered per-column cases in `column_cases."<table>"."<column>"`. For each row and column, Dumpling applies the first matching case; if none match, it falls back to the default from `rules`.
[rules."public.users"]
email = { strategy = "hash", as_string = true } # default
name = { strategy = "name" }
[[column_cases."public.users".email]]
when.any = [{ column = "is_admin", op = "eq", value = "true" }]
strategy = { strategy = "redact", as_string = true }
[[column_cases."public.users".email]]
when.any = [{ column = "country", op = "in", values = ["DE","FR","GB"] }]
strategy = { strategy = "hash", salt = "eu-salt", as_string = true }when.anyis OR,when.allis AND; you can use either or both. If both are empty, the case matches unconditionally.- First-match-wins per column; there is no merge or fallthrough.
- Row filtering (
row_filters) is evaluated before cases; deleted rows are not transformed.
When a column stores JSON as text (`json` / `jsonb` dumped as a string), you can target fields inside the document with the same path syntax as row filters — but as keys under `[rules."<table>"]`. Use quoted TOML keys when the path contains dots.
- Dot notation: `"payload.profile.email" = { strategy = "email", domain = "orders_email", as_string = true }`
- Django-style: `"payload__profile__email" = { strategy = "hash", salt = "${env:ORDER_SECRET_SALT}", as_string = true }`
The segment before the first `.` or `__` is the SQL column name; the rest is the path inside the parsed JSON. For a given column you can use either path-level rules or one whole-column rule on its base name, not both (Dumpling rejects the conflict at startup). If a path is missing in a row, that rule is skipped for that row. When only path rules apply, the rest of the JSON is left unchanged. Path rules run in longest-path-first order. `column_cases` still match the SQL column name only; use `when` predicates with nested column paths to branch on JSON content.
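A small sketch, with a hypothetical `payload` document:

```toml
# Only the listed paths are rewritten; sibling JSON fields pass through unchanged
[rules."public.orders"]
"payload.profile.email" = { strategy = "email", as_string = true }
"payload.profile.phone" = { strategy = "phone", as_string = true }
# Adding payload = { strategy = "null" } alongside these would be rejected at startup
```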
You can retain or delete rows for specific tables using explicit predicate lists.
- If `retain` is non-empty, a row is kept only if it matches at least one predicate.
- Regardless of `retain`, a row is dropped if it matches any predicate in `delete`.
Supported predicate operators:
| Operator | Description |
|---|---|
| `eq` / `neq` | String compare (case-insensitive if `case_insensitive = true`) |
| `in` / `not_in` | List of values (string compare) |
| `like` / `ilike` | SQL-like patterns (`%` and `_`) |
| `regex` / `iregex` | Rust regex (`iregex` is case-insensitive) |
| `lt` / `lte` / `gt` / `gte` | Numeric compare (values parsed as numbers) |
| `is_null` / `not_null` | No value needed |
Predicates can target nested JSON values using dot notation (`payload.profile.tier`) or Django-style notation (`payload__profile__tier`). For JSON arrays, path segments are evaluated against each element, so list-of-dicts structures can be matched naturally.
JSON list/array traversal is automatic once a path segment resolves to an array.
- All elements in an array: use the next field name directly. `payload.items.kind` or `payload__items__kind` matches/rewrites `kind` for every object in `items`.
- A specific array index: use a numeric segment. `payload.items.0.kind` or `payload__items__0__kind` targets only the first element.
- Nested arrays: combine field and index segments as needed: `payload.groups.members.email`, `payload.groups.1.members.0.email`.
This path behavior is shared by both `row_filters` predicates and JSON-path anonymization rules in `[rules]`.
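On the rules side, a hedged sketch (table and paths invented):

```toml
# Same traversal as row filters, applied to anonymization rules
[rules."public.events"]
"payload.items.kind" = { strategy = "redact", as_string = true }  # every element of items
"payload.items.0.sku" = { strategy = "hash", as_string = true }   # first element only
```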
[row_filters."public.users"]
retain = [
{ column = "country", op = "eq", value = "US" },
{ column = "email", op = "ilike", value = "%@myco.com" },
{ column = "profile.flags.plan", op = "eq", value = "gold" }
]
delete = [
{ column = "is_admin", op = "eq", value = "true" },
{ column = "email", op = "ilike", value = "%@example.com" },
{ column = "devices__platform", op = "eq", value = "android" }
]Row filtering works for both INSERT ... VALUES (...) and COPY ... FROM stdin rows.
Both `.dumplingconf` and `[tool.dumpling]` inside `pyproject.toml` use the same schema:
```toml
# Optional global salt for strategies that support it (e.g. hash)
# Prefer env-backed secret references over plaintext.
salt = "${DUMPLING_GLOBAL_SALT}"

# Rules are keyed by either "table" or "schema.table"
[rules."public.users"]
email = { strategy = "email", domain = "customer_identity", unique_within_domain = true }
name = { strategy = "name", locale = "de_de" } # German-locale name
ssn = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}", as_string = true } # salted SHA-256 of the original
age = { strategy = "int_range", min = 18, max = 90 }

[rules."orders"]
credit_card = { strategy = "payment_card", length = 16, domain = "order_pan" }
amount = { strategy = "decimal", min = 0, max = 9999, scale = 2, domain = "order_amount" }

# Optional explicit sensitive-columns policy list (for strict coverage)
[sensitive_columns]
"public.users" = ["employee_number", "tax_id"]

[output_scan]
# optional allowlist; if omitted, all built-in categories are enabled
enabled_categories = ["email", "ssn", "pan", "token"]
default_threshold = 0
default_severity = "high"
fail_on_severity = "low"
sample_limit_per_category = 5

[output_scan.thresholds]
email = 0
ssn = 0
pan = 0
token = 0

[output_scan.severities]
email = "medium"
ssn = "high"
pan = "critical"
token = "high"
```

Dumpling resolves secret references in string config fields so plaintext salts/keys never need to be committed to version control.
| Syntax | Description |
|---|---|
| `${ENV_VAR}` | Value of environment variable `ENV_VAR` |
| `${env:ENV_VAR}` | Value of environment variable `ENV_VAR` (explicit provider prefix) |
| `${file:/path/to/secret}` | Contents of a file (trailing newlines stripped); works with Docker Swarm secrets, Kubernetes mounted secrets, and Vault Agent injected files |
- Missing env references and unreadable/empty files fail fast with a non-zero startup error that includes the config path.
- Plaintext `salt` values still work for backwards compatibility, but Dumpling prints a startup warning because plaintext secrets are insecure.
```toml
# .dumplingconf — keep salts out of source control
salt = "${DUMPLING_GLOBAL_SALT}"

[rules."public.users"]
ssn = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}" }
email = { strategy = "hash", salt = "${file:/run/secrets/dumpling_email_salt}" }
```

```sh
# Local dev
export DUMPLING_GLOBAL_SALT='local-dev-salt'
export DUMPLING_USERS_SSN_SALT='users-ssn-salt'
dumpling --input dump.sql --check

# CI (injected from your secret store)
export DUMPLING_GLOBAL_SALT="$CI_DUMPLING_GLOBAL_SALT"
export DUMPLING_USERS_SSN_SALT="$CI_DUMPLING_USERS_SSN_SALT"
dumpling --input dump.sql --check --strict-coverage --report coverage.json

# Docker / Kubernetes (file-mounted secrets)
# salt = "${file:/run/secrets/dumpling_hmac_key}" in .dumplingconf
# secret mounted at /run/secrets/dumpling_hmac_key by the orchestrator
dumpling --security-profile hardened --input dump.sql --check
```

Note: `table_options` is no longer supported; use explicit `rules` and optional `column_cases`.
`--strict-coverage` enforces that all detected sensitive columns have an explicit anonymization rule.
Sensitive columns are detected via:
- Built-in column-name heuristics (the same patterns used by auto-detection).
- Explicit lists under `[sensitive_columns]`.
A column is considered covered only when it has an explicit `rules` entry or at least one `column_cases` entry. When strict coverage fails, Dumpling exits non-zero and reports the uncovered columns.
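For instance, a hedged sketch of a listed column and the rule that covers it (names invented):

```toml
# Declared sensitive...
[sensitive_columns]
"public.users" = ["tax_id"]

# ...and covered by an explicit rule, so --strict-coverage passes
[rules."public.users"]
tax_id = { strategy = "hash", salt = "${DUMPLING_GLOBAL_SALT}", as_string = true }
```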
When `--report <file>` is used, the JSON output includes:

- `sensitive_columns_detected`
- `sensitive_columns_covered`
- `sensitive_columns_uncovered`
- `deterministic_mapping_domains` (columns configured with deterministic domain mapping)
- `output_scan` (when `--scan-output` is enabled), including category counts and sample locations

```sh
dumpling --input dump.sql --check --strict-coverage --report coverage.json
```

This command exits non-zero if:
- Data changes/drops are detected (`--check` semantics), or
- Strict coverage finds uncovered sensitive columns.
```sh
dumpling \
  --input dump.sql \
  --check \
  --scan-output \
  --fail-on-findings \
  --report scan-report.json
```

`--scan-output` scans the transformed output for built-in detector categories:
- `email`: email-address-like strings
- `ssn`: U.S. SSN-like values
- `pan`: payment-card-like numbers (Luhn validated)
- `token`: common secret/token formats (JWT, AWS access key IDs, GitHub PAT prefixes, etc.)
When `--fail-on-findings` is set, Dumpling exits non-zero if any configured category exceeds its threshold and meets the configured severity gate.
Dumpling processes plain-text SQL dump files from multiple sources. Use `--format` to select the dialect (default: `postgres`).
**PostgreSQL**

Produced by `pg_dump --format=plain`. Handles:

- `INSERT INTO schema.table (col1, col2, ...) VALUES (...), (...), ...;`
- `COPY schema.table (col1, col2, ...) FROM stdin; ... \.` (tab-delimited with `\N` as NULL)
- `"double-quoted"` identifiers
- `''`-escaped string literals
Binary, custom, and directory formats from `pg_dump` are not parsed directly — Dumpling’s SQL pipeline expects plain text. Use either:
- `pg_dump --format=plain` when you control capture, or
- Auto-detected PostgreSQL archives with `--format postgres` (the default): if `--input` is a custom-format file (begins with `PGDMP`) or a directory-format dump (a folder containing `toc.dat`), Dumpling runs `pg_restore -f -` and streams the resulting SQL (the same as manual `pg_restore` “script” output; no database required). This requires PostgreSQL client tools on `PATH` (`pg_restore`) or `--pg-restore-path`. Extra flags: `--pg-restore-arg` (repeatable), or defaults from `[pg_restore]` in `.dumplingconf`/`pyproject.toml` (the CLI wins when set).
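A sketch of those config-side defaults; the exact key names under `[pg_restore]` are assumptions here, so check the configuration reference for the real schema:

```toml
[pg_restore]
# Hypothetical keys mirroring --pg-restore-path / --pg-restore-arg
path = "/usr/lib/postgresql/16/bin/pg_restore"
args = ["--no-owner"]
```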
Compressed inputs: `.gz` files whose payload is plain SQL are decompressed in-process (no temporary file). ZIP archives (and gzip wrapping `PGDMP` or an inner ZIP) are expanded under the system temp directory; those paths are removed when the run finishes. `--in-place` is rejected when Dumpling had to materialize a temp file for decompression or when the input is a PostgreSQL archive path that must go through `pg_restore` (use `--output` or stdout instead).

Keeping archives: by default the `--input` archive path (file or directory-format folder) is removed after a fully successful run. Pass `--keep-original` or set `keep_original = true` in config to retain it. `--check` against an archive requires an effective keep-original (CLI or config); `--keep-original` cannot be combined with `--in-place`.
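The config-side equivalent is a single key; a minimal sketch, assuming it sits at the top level of the config:

```toml
# Retain the source archive after a successful run (same effect as --keep-original)
keep_original = true
```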
Example (e.g. after `heroku pg:backups:download`):

```sh
dumpling -i latest.dump -c .dumplingconf -o anonymized.sql
```

**SQLite**

Produced by the SQLite CLI `.dump` command or equivalent. Handles:
- Standard `INSERT INTO table (col1, ...) VALUES (...);`
- `INSERT OR REPLACE INTO table (...) VALUES (...);`
- `INSERT OR IGNORE INTO table (...) VALUES (...);`
- `"double-quoted"` identifiers
- `''`-escaped string literals
The `OR REPLACE` / `OR IGNORE` variant keyword is preserved verbatim in the output.
**SQL Server / MSSQL**

Produced by SSMS "Script Table as → INSERT To", `mssql-scripter`, or similar tools. Handles:

- `INSERT INTO [schema].[table] ([col1], [col2], ...) VALUES (...), ...;`
- `[bracket]`-quoted identifiers (stripped to unquoted names in output)
- `N'...'` Unicode string literals (the `N` prefix is transparently discarded; the value is preserved)
- `nvarchar(n)` and `nchar(n)` column-length declarations (used to truncate generated values)
- `''`-escaped string literals
The `lint-policy` subcommand statically analyses your configuration and flags common issues before they affect a production pipeline.
```sh
dumpling lint-policy                         # auto-discover config
dumpling lint-policy --config .dumplingconf  # lint with explicit config path
```

| Check | Severity | Description |
|---|---|---|
| `empty-rules-table` | warning | A `[rules]` entry has no column rules |
| `empty-column-cases-table` | warning | A `[column_cases]` entry has no column cases |
| `unsalted-hash` | warning | `hash` strategy used without any salt — reversible for low-entropy inputs |
| `inconsistent-domain-strategy` | error | Same `domain` name used with different strategies — breaks referential integrity |
| `uncovered-sensitive-column` | error | A column in `[sensitive_columns]` has no matching rule or case |
Exits 0 if no violations are found, 1 if any violations exist. Plug it into CI as a pre-merge gate:
```yaml
- run: ./target/release/dumpling lint-policy
```

See the CI guardrails documentation for full pipeline recipes including strict-coverage enforcement, residual PII scan gating, and report diffing.
- This is a streaming transformer; memory usage stays small even for large dumps.
- For CI/CD and production-like workflows, prefer the default fail-closed mode and avoid `--allow-noop` unless a no-op run is intentional.
- For best results, configure strategies compatible with column data types. If you hash an integer column, Dumpling will render a string; most databases can coerce this, but an explicit `as_string = false` may help in some cases.
- For length-restricted text columns (`varchar(n)`, `character varying(n)`, `char(n)`, `character(n)`), Dumpling reads `CREATE TABLE` definitions and truncates generated text values to fit within the declared limit.
- Deterministic anonymization for tests: pass `--seed <u64>` or set the env var `DUMPLING_SEED` to make fuzz strategies reproducible across runs. In the hardened security profile, seeds are ignored; see the configuration guide.
- Domain mappings (`domain = "..."`) are deterministic by source value + domain (+ optional salt), so referential joins stay stable across tables within the same dump.
Detailed docs, including the configuration reference and release process, are available at the project's GitHub Pages site (built from docs/src/).