Skip to content

fix(pii): remove overly broad context keywords from PII recognizers#29050

Open
edg956 wants to merge 8 commits into
mainfrom
fix-cvv-recognizer-broad-keywords
Open

fix(pii): remove overly broad context keywords from PII recognizers#29050
edg956 wants to merge 8 commits into
mainfrom
fix-cvv-recognizer-broad-keywords

Conversation

@edg956

@edg956 edg956 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Fixes #29049

Summary

  • Removes overly broad context keywords from 6 PII recognizers in the seed data (piiTagsWithRecognizers.json) for fresh installations
  • Adds idempotent data migrations for 1.13.1 and 1.12.13 to fix existing installations without overwriting user customisations

Root cause

PII auto-classification boosted scores when a context keyword appeared in a column name. Keywords like "code", "security", "address", and "name" are far too generic — they matched completely unrelated columns (e.g. ACADEMIC_YEAR_CODE triggered CvvRecognizer because values 1999/2000 look like 4-digit CVV numbers and the column name contains code).

Keywords removed per recognizer

Recognizer Tag Removed
CvvRecognizer Sensitive "code", "security", "verification", "card"
UsBankRecognizer Sensitive "check", "save"
UsSsnRecognizer Sensitive "social", "security", "id_number"
CryptoRecognizer Sensitive "address"
SpacyRecognizer (PERSON) Sensitive "name"
PhoneRecognizer NonSensitive "call"

Migration safety

The Java migrations (v1131 / v11213) query the tag table by FQN hash, walk the recognizers array, and surgically remove only the listed keywords from each recognizer's context array. If a recognizer was deleted by the user or the keyword was already removed, the migration is a no-op for that entry. Everything else the user configured is untouched.

Known limitation

This PR fixes the keyword-driven false positive described in #29049: ACADEMIC_YEAR_CODE is no longer tagged PII.Sensitive via CvvRecognizer.

It does not fully eliminate the false positive end-to-end. The same column is still tagged PII.NonSensitive, via a separate, unrelated recognizer (SpacyRecognizer's DATE_TIME entity, which flags any 4-digit year-like integer independent of column type or context keywords). This is a different bug — model/threshold behavior rather than an overly broad keyword list — and is tracked separately in #29083. The regression test (test_tag_processor.py) asserts the actual current behavior (tagged NonSensitive) with an inline comment pointing to #29083, so the partial nature of the fix is explicit rather than silently masked.

Test plan

  • Fresh install: verify CvvRecognizer context no longer includes "code", "security", "verification", "card"
  • Upgrade path: verify migration runs and removes broad keywords from existing PII.Sensitive and PII.NonSensitive tags
  • Column ACADEMIC_YEAR_CODE with values 1999/2000 is no longer classified as PII.Sensitive (still classified PII.NonSensitive via the unrelated issue tracked in SpacyRecognizer DATE_TIME entity false-positives on year-like integer codes #29083)
  • Columns like CVV, CVC_CODE, cvv2 are still correctly classified as PII.Sensitive (specific keywords remain)
  • Verify migration is idempotent (safe to run twice)

🤖 Generated with Claude Code

Greptile Summary

This PR fixes false-positive PII classification caused by overly broad context keywords in six recognizers, and ships idempotent data migrations for existing 1.12.12 and 1.13.1 installations so that no user customizations are overwritten.

  • Seed data (piiTagsWithRecognizers.json): Removes generic keywords ("code", "security", "address", "name", "call", "check", "save", "social") from CvvRecognizer, UsBankRecognizer, UsSsnRecognizer, CryptoRecognizer, SpacyRecognizer (PERSON), and PhoneRecognizer; specific, high-signal keywords remain intact.
  • Java migrations (v11212 / v1131): A shared PiiRecognizerMigrationUtil walks each tag's recognizer array, surgically removes only the targeted keywords from recognizerConfig.context, and writes back only if a change was made; the migration is a no-op when keywords are already absent.
  • Integration tests: The test suite is refactored to rely on the server-seeded PII tags rather than hand-crafted fixtures, and a new assertion documents the remaining ACADEMIC_YEAR_CODE → PII.NonSensitive false positive (via SpacyRecognizer DATE_TIME) with an explicit reference to SpacyRecognizer DATE_TIME entity false-positives on year-like integer codes #29083.

Confidence Score: 5/5

Safe to merge — the migrations are idempotent, touch only the targeted keyword arrays, and leave all other user configuration untouched.

The change is narrowly scoped: the seed data removes specific strings from specific arrays, the Java migration surgically patches those same arrays in-place and skips cleanly when nothing matches, and the integration tests now exercise the real seeded PII tags rather than synthetic fixtures. No schema changes, no destructive operations, and the remaining known false positive is explicitly documented rather than silently ignored.

No files require special attention.

Important Files Changed

Filename Overview
openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java Core shared migration utility; correctly matches recognizers by top-level name, idempotently removes targeted keywords from context arrays, and handles missing/null data gracefully.
openmetadata-service/src/main/resources/json/data/tags/piiTagsWithRecognizers.json Seed data correctly removes the targeted broad keywords from six recognizers; remaining keywords are appropriately specific (e.g. "cvv", "cvc2", "cvv2" remain for CvvRecognizer).
ingestion/tests/integration/auto_classification/databases/test_tag_processor.py Adds academic_year_code column to the integration test and asserts it gets PII.NonSensitive via SpacyRecognizer; correctly documents the remaining DATE_TIME false-positive with a reference to #29083.
ingestion/tests/integration/auto_classification/databases/conftest.py Removes all custom recognizer/tag fixtures — tests now rely on the server-seeded PII tags from piiTagsWithRecognizers.json, which is the correct approach for integration tests against a real server.
openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v1131/MigrationUtil.java Thin wrapper delegating to the shared PiiRecognizerMigrationUtil with correct version tag "v1131".
openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11212/MigrationUtil.java Thin wrapper delegating to the shared PiiRecognizerMigrationUtil with correct version tag "v11212".
bootstrap/sql/migrations/native/1.12.12/mysql/schemaChanges.sql New SQL migration file with a comment pointing to the Java data migration; references the correct v11212 package.
bootstrap/sql/migrations/native/1.13.1/mysql/schemaChanges.sql Appends a comment block pointing to the v1131 Java data migration; existing schema changes are untouched.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant MF as MigrationFile (v11212/v1131)
    participant MU as MigrationUtil (versioned)
    participant PU as PiiRecognizerMigrationUtil
    participant DB as tag table

    MF->>MU: runDataMigration()
    MU->>PU: removeBroadPiiContextKeywords(handle, version)
    PU->>DB: "SELECT json WHERE fqnHash = hash("PII.Sensitive")"
    DB-->>PU: tag JSON
    PU->>PU: processRecognizers() → removeKeywordsFromContext()
    alt keywords removed
        PU->>DB: "UPDATE tag SET json = patched JSON"
    else nothing changed (idempotent)
        PU-->>MU: no-op log
    end
    PU->>DB: "SELECT json WHERE fqnHash = hash("PII.NonSensitive")"
    DB-->>PU: tag JSON
    PU->>PU: processRecognizers() → removeKeywordsFromContext()
    alt keywords removed
        PU->>DB: "UPDATE tag SET json = patched JSON"
    else nothing changed (idempotent)
        PU-->>MU: no-op log
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant MF as MigrationFile (v11212/v1131)
    participant MU as MigrationUtil (versioned)
    participant PU as PiiRecognizerMigrationUtil
    participant DB as tag table

    MF->>MU: runDataMigration()
    MU->>PU: removeBroadPiiContextKeywords(handle, version)
    PU->>DB: "SELECT json WHERE fqnHash = hash("PII.Sensitive")"
    DB-->>PU: tag JSON
    PU->>PU: processRecognizers() → removeKeywordsFromContext()
    alt keywords removed
        PU->>DB: "UPDATE tag SET json = patched JSON"
    else nothing changed (idempotent)
        PU-->>MU: no-op log
    end
    PU->>DB: "SELECT json WHERE fqnHash = hash("PII.NonSensitive")"
    DB-->>PU: tag JSON
    PU->>PU: processRecognizers() → removeKeywordsFromContext()
    alt keywords removed
        PU->>DB: "UPDATE tag SET json = patched JSON"
    else nothing changed (idempotent)
        PU-->>MU: no-op log
    end
Loading

Reviews (6): Last reviewed commit: "Update bootstrap/sql/migrations/native/1..." | Re-trigger Greptile

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Jun 15, 2026
Comment thread ingestion/tests/integration/orm_profiler/test_pii_processor.py Outdated
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (15 flaky)

✅ 4283 passed · ❌ 0 failed · 🟡 15 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 300 0 2 4
🟡 Shard 2 806 0 3 9
🟡 Shard 3 808 0 3 8
🟡 Shard 4 846 0 3 12
🟡 Shard 5 732 0 1 47
🟡 Shard 6 791 0 3 8
🟡 15 flaky test(s) (passed on retry)
  • Features/EntityRenameConsolidation.spec.ts › Classification - multiple rename + update cycles should preserve tags (shard 1, 1 retry)
  • Pages/Roles.spec.ts › Roles page should work properly (shard 1, 2 retries)
  • Features/AdvancedSearch.spec.ts › Verify Rule functionality for field Display Name with OR operator (shard 2, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values To Not Match Regex (shard 2, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values Sum To Be Between (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Verify Recently Viewed widget (shard 3, 1 retry)
  • Features/KnowledgeCenterRoleBasedAccess.spec.ts › Data Steward can edit content, title, owners, tags, and glossary terms but cannot add article, domain, reviewer, data product, or data assets (shard 3, 1 retry)
  • Flow/FrequentlyJoined.spec.ts › should display frequently joined table (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Entity Reference (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for SearchIndex (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner Add, Update and Remove (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary Bulk Import Export (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Admin user can get all the roles hierarchy and edit roles (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Comment thread bootstrap/sql/migrations/native/1.12.12/mysql/schemaChanges.sql Outdated
Comment thread bootstrap/sql/migrations/native/1.12.12/postgres/schemaChanges.sql Outdated
edg956 and others added 8 commits June 16, 2026 23:01
Broad keywords like "code", "security", "address", "name", "social",
"check", "save", and "call" caused false-positive PII classification.
Example: ACADEMIC_YEAR_CODE was tagged PII.Sensitive because CvvRecognizer
has "code" as a context keyword and 1999/2000 match the CVV digit pattern.

Removes the broad terms from 6 recognizers in the seed data and adds
idempotent data migrations for 1.13.1 and 1.12.13. Migrations skip
recognizers the user deleted and are no-ops if keywords are already absent.

Fixes #29049

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move the broad-keyword removal logic into a single shared class so
v1131 and v11213 don't duplicate it. Versioned MigrationUtils become
one-line delegates; future migrations reuse the same entry point.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… test

Reverts 031424c (fix: remove overly broad context keywords) and
16fc529 (refactor: extract PiiRecognizerMigrationUtil).

The integration test was testing PIIProcessor behaviour (hardcoded
recognizers → hardcoded tags) rather than the TagProcessor, which
fetches its tags and recognizers from the server.

Replace the explicit PII classification/tag creation fixtures in
conftest.py with nothing — the test now relies on the server's seeded
PII tags, which is what the TagProcessor actually uses at runtime.

Add academic_year_code INTEGER column (values 1999–2006) to the test
table. Assert it receives no tags, covering the false-positive case
where CvvRecognizer broad context keywords ("code") caused year-valued
columns to be labelled PII.Sensitive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…heck hardening

Reapplies 68480d9/88d6b192f3 after they were accidentally reverted
alongside the TagProcessor test refactor. Addresses prior review
feedback: PiiRecognizerMigrationUtil.removeBroadPiiContextKeywords now
takes a version label so v1131/v11213 log lines stay attributable, and
migrateTag guards against a null tag.json column before toString().

Updates the academic_year_code_column assertion to match actual
behavior: the CVV/broad-keyword false positive is fixed, but a
separate SpacyRecognizer DATE_TIME false positive on year-like
integers remains (tracked in #29083).

Fixes #29049

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or commit

The version-prefixed logging and null-json guard in
PiiRecognizerMigrationUtil were written and spotless-formatted before
d78dfeb but never staged — that commit captured the pre-edit
content checked out from 16fc529. This adds the actual diff.

Also drops the unused pii_classification/sensitive_pii_tag/
non_sensitive_pii_tag parameters from test_global_sample_data_config.py;
the databases/ conftest.py no longer defines them (server-seeded tags
replaced the fixture-created ones), so collection failed with
"fixture 'pii_classification' not found". The fixtures were never
referenced in the test bodies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
….sql

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@edg956 edg956 force-pushed the fix-cvv-recognizer-broad-keywords branch from cdbe40a to 7b413da Compare June 16, 2026 21:01
@gitar-bot

gitar-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown
Code Review 👍 Approved with suggestions 6 resolved / 7 findings

Reduces false-positive PII classification by removing overly broad context keywords from recognizers and adds idempotent migrations for existing installations. Note that the current regression test suite requires minor updates to resolve fixture collection failures in test_global_sample_data_config.py.

💡 Quality: Regression test codifies an incomplete fix for the CVV false positive

📄 ingestion/tests/integration/auto_classification/databases/test_tag_processor.py:184-192

The stated goal of this PR (issue #29049) is that a column like academic_year_code holding 4-digit year values (1999–2006) should no longer be auto-classified as PII. The new assertion at test_tag_processor.py:186-192 instead asserts the column is STILL tagged — now as PII.NonSensitive via SpacyRecognizer — rather than tags == []. So the user-facing false positive persists (only downgraded from Sensitive to NonSensitive). This is acknowledged via the inline comment referencing #29083, so it is an accepted tradeoff, but reviewers/users should be aware the headline bug is only partially resolved.

Additionally, the seed SpacyRecognizer config only lists supportedEntities: ["PERSON"], yet the comment attributes the tag to spaCy's DATE_TIME entity. If the ingestion-side filtering ever honors supportedEntities strictly, this assertion could become flaky/incorrect. Consider a follow-up that links the test to #29083 in an xfail/skip marker or a clarifying assertion comment so the intent (this is a known-remaining false positive, not desired behavior) is unambiguous.

✅ 6 resolved
Quality: Unused constant TAG_TABLE in MigrationUtil

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11213/MigrationUtil.java:21 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11213/MigrationUtil.java:32-37 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v1131/MigrationUtil.java:21 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v1131/MigrationUtil.java:32-37
Both new MigrationUtil classes declare private static final String TAG_TABLE = "tag"; but never reference it — the table name is hard-coded directly inside the SQL string constants (UPDATE tag ..., SELECT json FROM tag ...). This dead constant will likely trip checkstyle/unused-field rules and is misleading (a reader may assume the SQL is built from it). Either remove the constant or use it when composing the SQL strings.

Quality: No automated test for migration / idempotency

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11213/MigrationUtil.java:91-105
The test plan is a manual checklist; no unit/integration test is added for removeBroadPiiContextKeywords. The logic has several branches worth locking down with a test: keyword removal per recognizer, the SpacyRecognizer PERSON-only path, idempotency (running twice leaves context unchanged the second time), and the no-op path when a user already removed a keyword or deleted a recognizer. A small test feeding a sample PII.Sensitive/NonSensitive JSON through processRecognizers would protect this from regressions, especially since the logic is duplicated across the v1131 and v11213 copies.

Quality: Migration silently swallows all exceptions

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/mysql/v11213/Migration.java:18-24
runDataMigration() wraps the cleanup in a try/catch that logs and discards every Exception. If the UPDATE fails (e.g. DB permissions, JSON parse/serialize problem, connection issue), the migration is still reported as successful and existing installations will continue to misclassify PII columns with no surfaced failure beyond a single ERROR log line. For a best-effort cleanup this is defensible, but consider whether a hard failure (rethrow) is preferable so operators are aware the data fix did not apply, or at minimum confirm this matches the convention used by other data migrations in the repo.

Edge Case: Possible NPE if tag json column is null

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11213/MigrationUtil.java:63-69
migrateTag does rows.getFirst().get(JSON_COLUMN).toString() after only checking that rows is non-empty. If the json column value is null (or the mapped key is absent), get(...) returns null and .toString() throws an NPE, which is then swallowed by the caller's catch and silently aborts the migration for that tag. In practice the tag.json column should never be null, so this is low-likelihood, but a defensive null check (treat as skip with a warning, consistent with the other guarded paths) would make the behavior explicit.

Quality: Migration logs lost version prefix, ambiguous which migration ran

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:57 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:61 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:70 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:78 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:89 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:91 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/PiiRecognizerMigrationUtil.java:167
The two original migration utils logged with distinct prefixes (v1131: and v11213:) so operators could tell which migration produced a given log line during an upgrade. The extracted PiiRecognizerMigrationUtil.removeBroadPiiContextKeywords now logs generic, prefix-less messages (e.g. "Removing overly broad context keywords from PII recognizers"). Since both the 1.13.1 and 1.12.13 migrations call the same shared method, their log output is now indistinguishable, making it harder to diagnose upgrade-path issues or confirm idempotency from logs. Consider passing a version/label argument into the shared method and including it in the log messages, or having each caller log a wrapping start/end line identifying its version.

...and 1 more resolved from earlier reviews

🤖 Prompt for agents
Code Review: Reduces false-positive PII classification by removing overly broad context keywords from recognizers and adds idempotent migrations for existing installations. Note that the current regression test suite requires minor updates to resolve fixture collection failures in `test_global_sample_data_config.py`.

1. 💡 Quality: Regression test codifies an incomplete fix for the CVV false positive
   Files: ingestion/tests/integration/auto_classification/databases/test_tag_processor.py:184-192

   The stated goal of this PR (issue #29049) is that a column like `academic_year_code` holding 4-digit year values (1999–2006) should no longer be auto-classified as PII. The new assertion at test_tag_processor.py:186-192 instead asserts the column is STILL tagged — now as `PII.NonSensitive` via `SpacyRecognizer` — rather than `tags == []`. So the user-facing false positive persists (only downgraded from Sensitive to NonSensitive). This is acknowledged via the inline comment referencing #29083, so it is an accepted tradeoff, but reviewers/users should be aware the headline bug is only partially resolved.
   
   Additionally, the seed `SpacyRecognizer` config only lists `supportedEntities: ["PERSON"]`, yet the comment attributes the tag to spaCy's `DATE_TIME` entity. If the ingestion-side filtering ever honors `supportedEntities` strictly, this assertion could become flaky/incorrect. Consider a follow-up that links the test to #29083 in an `xfail`/`skip` marker or a clarifying assertion comment so the intent (this is a known-remaining false positive, not desired behavior) is unambiguous.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PII auto-classification false positives due to overly broad recognizer context keywords

1 participant