
fix: improve extraction engine, validation mechanisms and security#7

Merged
nabroleonx merged 1 commit into main from fix/v0.5.1-improvements-and-bug-fixes
Apr 6, 2026


@nabroleonx

Owner
  • #5 (feat: add support for unsafe WHERE subqueries in seed specifications) added --allow-unsafe-where with proper validation in config.py, but input_validators.py kept its own weaker regex copy, so patching one would leave the other vulnerable. input_validators.py now delegates to config.py.
  • #6 (feat: add compliance profiles, PII scanning, and column mapping UI) added custom transformers (hipaa_zip3, year_only, etc.), but typos like hipaa_zip_3 silently fell back to generic Faker output. Provider names are now validated at configure() time.
  • #6 also added audit manifests that log what got anonymized but never verified completeness. Added a post-extraction check that compares extracted columns against compliance profile patterns and flags what slipped through.
  • #3 (chore: add config file merging, wildcard anonymization, and secure file output) added passthrough tables but quietly dropped any without a PK. You'd ask for a table, it wouldn't appear, no error. Raises ExtractionError now. (breaking)
  • fetch_referencing_pks wasn't filtering NULL FK values while fetch_fk_values was: same adapter, two behaviors. Aligned them.
  • The PostgreSQL connection had no query timeout. Added the --statement-timeout flag and the statement_timeout_ms config key; both default to 0 (in PostgreSQL, a statement_timeout of 0 disables the timeout).
  • JSON serialized Decimal with float(), which rounds large financial values. Switched to str(). Downstream gets "99.99" not 99.99. (breaking)
  • CSV used empty string for NULL so you couldn't tell it apart from actual empty string on reimport. Uses \N now per PostgreSQL COPY convention. (breaking)
  • Seed queries had no cardinality limit. Something like status='pending' on a big table would load the whole thing. Added --max-seed-rows, 10K default, warns at 1K.
  • Streaming mode called list(fetch_by_pk(...)) for broken FK tables, which defeats the point. Uses fetch_by_pk_chunked now.
  • _extract_fk_values used row.get(col), quietly turning missing columns into None. Uses row[col] now, so a missing column raises instead of passing silently.
  • ExtractionResult had list[Any] for broken_fks, deferred_updates, cycle_infos, profiler. Typed properly.
  • 702 tests, ruff, mypy all clean.
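
The NULL-filtering and row[col] items above can be illustrated with a minimal sketch. This is not the project's code: the function name, signature, and row shape (dict-like mappings) are assumptions made for illustration, as is the choice to skip a composite key when any part of it is NULL.

```python
from typing import Any, Iterable, Mapping


def extract_fk_values(
    rows: Iterable[Mapping[str, Any]], fk_columns: tuple[str, ...]
) -> set[tuple[Any, ...]]:
    """Collect distinct FK tuples, skipping rows whose FK contains a NULL.

    Hypothetical helper: the name and signature are illustrative only.
    """
    values: set[tuple[Any, ...]] = set()
    for row in rows:
        # row[col] raises KeyError when the column is missing. row.get(col)
        # would return None and make the row look like a NULL foreign key.
        key = tuple(row[col] for col in fk_columns)
        if any(part is None for part in key):
            continue  # NULL FK: there is no parent row to follow
        values.add(key)
    return values
```

With this shape, a misspelled or missing column fails on the first row instead of producing an empty result that looks like "no references".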

Breaking changes:

  • JSON Decimal values are now strings, not floats. Do Decimal(row["amount"]) instead of assuming a number.
  • CSV NULL is \N instead of empty string. Set your importer to treat \N as NULL.
  • Passthrough tables without a PK now fail instead of vanishing silently.
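
For downstream consumers, the first two breaking changes could be handled roughly as follows. This is a sketch using only the standard library; the helper names, the line-per-row JSON shape, and the field names are assumptions, not part of the tool's API.

```python
import csv
import io
import json
from decimal import Decimal

CSV_NULL = r"\N"  # NULL sentinel, per the PostgreSQL COPY text convention


def parse_json_row(line: str, decimal_fields: set[str]) -> dict:
    """Parse one JSON row and turn string-encoded decimals back into Decimal.

    Hypothetical helper: the caller has to say which fields are decimals,
    because the JSON itself only carries strings.
    """
    row = json.loads(line)
    for name in decimal_fields:
        if row.get(name) is not None:
            row[name] = Decimal(row[name])
    return row


def read_csv_rows(text: str) -> list[dict]:
    """Read CSV text, mapping the \\N sentinel to None and keeping '' as ''."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value == CSV_NULL else value) for key, value in rec.items()}
        for rec in reader
    ]
```

The string encoding matters because a value such as 12345678901234567890.12 does not survive a round trip through float(), while Decimal(str(value)) reproduces it exactly.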

Consolidate duplicate WHERE clause validation, fix NULL filtering inconsistency in FK traversal, add seed cardinality limits, fix Decimal precision loss in JSON output, type ExtractionResult fields, and refactor streaming deferred updates to use chunked fetching.

Additionally addresses five previously unimplemented review findings:

- Use \N sentinel for NULL in CSV output to distinguish from empty string
- Raise error when passthrough table has no primary key (was silent skip)
- Add --statement-timeout CLI flag for PostgreSQL query timeout
- Validate anonymizer provider names at configure time (catch typos)
- Add post-extraction compliance manifest validation

BREAKING CHANGE: JSON output now serializes Decimal values as strings
(e.g., "99.99") instead of floats to preserve exact precision. CSV
output now uses \N for NULL values instead of empty string. Passthrough
tables without a primary key now raise an error instead of being
silently skipped.
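
The configure-time provider check mentioned in the commit message could look something like the sketch below. The registry contents, the exception class, and the function name are hypothetical; only hipaa_zip3, year_only, and the hipaa_zip_3 typo come from the PR text.

```python
import difflib

# Hypothetical registry. "hipaa_zip3" and "year_only" are named in the PR;
# the other entries are placeholders for Faker-style providers.
KNOWN_PROVIDERS = frozenset({"hipaa_zip3", "year_only", "email", "name"})


class UnknownProviderError(ValueError):
    """Raised when a column maps to a provider name that is not registered."""


def validate_providers(
    column_providers: dict[str, str],
    known: frozenset[str] = KNOWN_PROVIDERS,
) -> None:
    """Fail fast on unknown provider names instead of falling back silently."""
    errors = []
    for column, provider in column_providers.items():
        if provider in known:
            continue
        # Suggest the closest registered name so typos are easy to fix.
        close = difflib.get_close_matches(provider, sorted(known), n=1)
        hint = f" (did you mean {close[0]!r}?)" if close else ""
        errors.append(f"{column}: unknown provider {provider!r}{hint}")
    if errors:
        raise UnknownProviderError("; ".join(errors))
```

Collecting every unknown name before raising reports all typos in one pass rather than one per run.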
@nabroleonx merged commit 18c1545 into main on Apr 6, 2026
1 check passed