fix(pii): remove overly broad context keywords from PII recognizers#29050
edg956 wants to merge 8 commits
Broad keywords like "code", "security", "address", "name", "social", "check", "save", and "call" caused false-positive PII classification. Example: ACADEMIC_YEAR_CODE was tagged PII.Sensitive because CvvRecognizer has "code" as a context keyword and 1999/2000 match the CVV digit pattern. Removes the broad terms from 6 recognizers in the seed data and adds idempotent data migrations for 1.13.1 and 1.12.13. Migrations skip recognizers the user deleted and are no-ops if keywords are already absent. Fixes #29049 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move the broad-keyword removal logic into a single shared class so v1131 and v11213 don't duplicate it. Versioned MigrationUtils become one-line delegates; future migrations reuse the same entry point. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… test Reverts 031424c (fix: remove overly broad context keywords) and 16fc529 (refactor: extract PiiRecognizerMigrationUtil). The integration test was testing PIIProcessor behaviour (hardcoded recognizers → hardcoded tags) rather than the TagProcessor, which fetches its tags and recognizers from the server. Remove the explicit PII classification/tag creation fixtures from conftest.py; the test now relies on the server's seeded PII tags, which is what the TagProcessor actually uses at runtime. Add an academic_year_code INTEGER column (values 1999–2006) to the test table and assert that it receives no tags, covering the false-positive case where CvvRecognizer's broad context keywords ("code") caused year-valued columns to be labelled PII.Sensitive. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…heck hardening Reapplies 68480d9/88d6b192f3 after they were accidentally reverted alongside the TagProcessor test refactor. Addresses prior review feedback: PiiRecognizerMigrationUtil.removeBroadPiiContextKeywords now takes a version label so v1131/v11213 log lines stay attributable, and migrateTag guards against a null tag.json column before toString(). Updates the academic_year_code_column assertion to match actual behavior: the CVV/broad-keyword false positive is fixed, but a separate SpacyRecognizer DATE_TIME false positive on year-like integers remains (tracked in #29083). Fixes #29049 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or commit The version-prefixed logging and null-json guard in PiiRecognizerMigrationUtil were written and spotless-formatted before d78dfeb but never staged — that commit captured the pre-edit content checked out from 16fc529. This adds the actual diff. Also drops the unused pii_classification/sensitive_pii_tag/ non_sensitive_pii_tag parameters from test_global_sample_data_config.py; the databases/ conftest.py no longer defines them (server-seeded tags replaced the fixture-created ones), so collection failed with "fixture 'pii_classification' not found". The fixtures were never referenced in the test bodies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
….sql Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Code Review 👍 Approved with suggestions
6 resolved / 7 findings

Reduces false-positive PII classification by removing overly broad context keywords from recognizers and adds idempotent migrations for existing installations. Note that the current regression test suite requires minor updates to resolve fixture collection failures in

💡 Quality: Regression test codifies an incomplete fix for the CVV false positive
📄 ingestion/tests/integration/auto_classification/databases/test_tag_processor.py:184-192
The stated goal of this PR (issue #29049) is that a column like Additionally, the seed

✅ 6 resolved
✅ Quality: Unused constant TAG_TABLE in MigrationUtil
✅ Quality: No automated test for migration / idempotency
✅ Quality: Migration silently swallows all exceptions
✅ Edge Case: Possible NPE if tag json column is null
✅ Quality: Migration logs lost version prefix, ambiguous which migration ran
...and 1 more resolved from earlier reviews
Fixes #29049
Summary
piiTagsWithRecognizers.json) for fresh installations

Root cause
PII auto-classification boosted scores when a context keyword appeared in a column name. Keywords like "code", "security", "address", and "name" are far too generic: they matched completely unrelated columns (e.g. ACADEMIC_YEAR_CODE triggered CvvRecognizer because values 1999/2000 look like 4-digit CVV numbers and the column name contains code).

Keywords removed per recognizer
| Recognizer | Keywords removed |
| --- | --- |
| CvvRecognizer | "code", "security", "verification", "card" |
| UsBankRecognizer | "check", "save" |
| UsSsnRecognizer | "social", "security", "id_number" |
| CryptoRecognizer | "address" |
| SpacyRecognizer (PERSON) | "name" |
| PhoneRecognizer | "call" |

Migration safety
The Java migrations (v1131/v11213) query the tag table by FQN hash, walk the recognizers array, and remove only the listed keywords from each recognizer's context array. If a recognizer was deleted by the user or the keyword was already removed, the migration is a no-op for that entry. Everything else the user configured is untouched.

Known limitation
This PR fixes the keyword-driven false positive described in #29049: ACADEMIC_YEAR_CODE is no longer tagged PII.Sensitive via CvvRecognizer.

It does not fully eliminate the false positive end-to-end. The same column is still tagged PII.NonSensitive, via a separate, unrelated recognizer (SpacyRecognizer's DATE_TIME entity, which flags any 4-digit year-like integer independent of column type or context keywords). This is a different bug (model/threshold behavior rather than an overly broad keyword list) and is tracked separately in #29083. The regression test (test_tag_processor.py) asserts the actual current behavior (tagged NonSensitive) with an inline comment pointing to #29083, so the partial nature of the fix is explicit rather than silently masked.

Test plan
- CvvRecognizer context no longer includes "code", "security", "verification", "card"
- ACADEMIC_YEAR_CODE with values 1999/2000 is no longer classified as PII.Sensitive (still classified PII.NonSensitive via the unrelated issue tracked in #29083)
- CVV, CVC_CODE, cvv2 are still correctly classified as PII.Sensitive (specific keywords remain)

🤖 Generated with Claude Code
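The root cause and the test-plan checks above can be illustrated with a toy scorer. This is only a sketch: the base score, boost, threshold, substring matching, and the "specific" keywords kept in `NEW_CONTEXT` are assumptions made for illustration, not the values or logic the real CvvRecognizer uses.

```python
import re

# Hypothetical numbers, chosen only to make the effect visible.
BASE_SCORE = 0.3      # assumed score for a bare 3-4 digit match
CONTEXT_BOOST = 0.4   # assumed boost when a context keyword is in the column name
THRESHOLD = 0.6       # assumed minimum score for tagging PII.Sensitive

CVV_PATTERN = re.compile(r"^\d{3,4}$")

# The four broad keywords are the ones the PR removes; "cvv", "cvc" and
# "cvv2" stand in for the specific keywords that remain (assumed names).
OLD_CONTEXT = ["cvv", "cvc", "cvv2", "code", "security", "verification", "card"]
NEW_CONTEXT = ["cvv", "cvc", "cvv2"]


def cvv_score(column_name, values, context):
    """Score a column: pattern match on values, boosted by name context."""
    if not all(CVV_PATTERN.match(str(v)) for v in values):
        return 0.0
    name = column_name.lower()
    boosted = any(keyword in name for keyword in context)
    return BASE_SCORE + (CONTEXT_BOOST if boosted else 0.0)


def is_sensitive(column_name, values, context):
    """True when the score reaches the (assumed) tagging threshold."""
    return cvv_score(column_name, values, context) >= THRESHOLD


if __name__ == "__main__":
    years = [1999, 2000]
    print(is_sensitive("ACADEMIC_YEAR_CODE", years, OLD_CONTEXT))
    print(is_sensitive("ACADEMIC_YEAR_CODE", years, NEW_CONTEXT))
    print(is_sensitive("CVC_CODE", [123, 4567], NEW_CONTEXT))
```

With the broad list, the year column gets the boost because its name contains "code"; with the narrowed list it falls back to the bare pattern score, while a column such as CVC_CODE still matches a specific keyword.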
Greptile Summary
This PR fixes false-positive PII classification caused by overly broad context keywords in six recognizers, and ships idempotent data migrations for existing 1.12.12 and 1.13.1 installations so that no user customizations are overwritten.
- piiTagsWithRecognizers.json): Removes generic keywords ("code", "security", "address", "name", "call", "check", "save", "social") from CvvRecognizer, UsBankRecognizer, UsSsnRecognizer, CryptoRecognizer, SpacyRecognizer (PERSON), and PhoneRecognizer; specific, high-signal keywords remain intact.
- v11212/v1131): A shared PiiRecognizerMigrationUtil walks each tag's recognizer array, surgically removes only the targeted keywords from recognizerConfig.context, and writes back only if a change was made; the migration is a no-op when keywords are already absent.
- ACADEMIC_YEAR_CODE → PII.NonSensitive false positive (via SpacyRecognizer DATE_TIME) with an explicit reference to #29083.

Confidence Score: 5/5
Safe to merge — the migrations are idempotent, touch only the targeted keyword arrays, and leave all other user configuration untouched.
The change is narrowly scoped: the seed data removes specific strings from specific arrays, the Java migration surgically patches those same arrays in-place and skips cleanly when nothing matches, and the integration tests now exercise the real seeded PII tags rather than synthetic fixtures. No schema changes, no destructive operations, and the remaining known false positive is explicitly documented rather than silently ignored.
No files require special attention.
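The patch-in-place behavior described above can be sketched as follows. The actual migration is Java (PiiRecognizerMigrationUtil); this Python version is only an illustration. The `recognizers` array and `recognizerConfig.context` path come from the summary, while the `name` field used to identify a recognizer and the function name are assumptions. Only two of the six recognizers are listed, for brevity.

```python
import json

# Keyword lists taken from the PR description (subset, for brevity).
BROAD_KEYWORDS = {
    "CvvRecognizer": {"code", "security", "verification", "card"},
    "UsBankRecognizer": {"check", "save"},
}


def remove_broad_keywords(tag_json):
    """Return (json_text, changed); json_text is rewritten only if changed.

    Assumes each recognizer is identified by a "name" field (hypothetical).
    """
    if tag_json is None:  # guard against a null json column
        return None, False

    tag = json.loads(tag_json)
    changed = False
    for recognizer in tag.get("recognizers", []):
        broad = BROAD_KEYWORDS.get(recognizer.get("name"))
        config = recognizer.get("recognizerConfig") or {}
        context = config.get("context")
        if not broad or not context:
            continue  # recognizer not targeted, deleted, or has no context
        kept = [keyword for keyword in context if keyword not in broad]
        if len(kept) != len(context):
            config["context"] = kept
            changed = True

    # Write back only when something was removed, so a second run is a no-op.
    return (json.dumps(tag) if changed else tag_json), changed
```

Running the function twice on the same row returns the first run's output unchanged, which is the idempotency property the summary relies on; recognizers outside the keyword map keep whatever context the user configured.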
Important Files Changed
Sequence Diagram
    %%{init: {'theme': 'neutral'}}%%
    sequenceDiagram
        participant MF as MigrationFile (v11212/v1131)
        participant MU as MigrationUtil (versioned)
        participant PU as PiiRecognizerMigrationUtil
        participant DB as tag table
        MF->>MU: runDataMigration()
        MU->>PU: removeBroadPiiContextKeywords(handle, version)
        PU->>DB: "SELECT json WHERE fqnHash = hash("PII.Sensitive")"
        DB-->>PU: tag JSON
        PU->>PU: processRecognizers() → removeKeywordsFromContext()
        alt keywords removed
            PU->>DB: "UPDATE tag SET json = patched JSON"
        else nothing changed (idempotent)
            PU-->>MU: no-op log
        end
        PU->>DB: "SELECT json WHERE fqnHash = hash("PII.NonSensitive")"
        DB-->>PU: tag JSON
        PU->>PU: processRecognizers() → removeKeywordsFromContext()
        alt keywords removed
            PU->>DB: "UPDATE tag SET json = patched JSON"
        else nothing changed (idempotent)
            PU-->>MU: no-op log
        end

Reviews (6): Last reviewed commit: "Update bootstrap/sql/migrations/native/1..."