fix(analyzer): stop UsSsnRecognizer from over-blocking 987-65-432X SSNs#2074
Open
AUTHENSOR wants to merge 1 commit into
Open
fix(analyzer): stop UsSsnRecognizer from over-blocking 987-65-432X SSNs#2074AUTHENSOR wants to merge 1 commit into
AUTHENSOR wants to merge 1 commit into
Conversation
The invalidate_result() sample-SSN blocklist was matched with str.startswith(), and the literal "98765432" is only 8 digits (a truncation of the canonical fake "987654320"). The prefix match therefore invalidated the entire 987-65-4320 .. 987-65-4329 family -- ten distinct SSN-shaped values, including the widely-printed 987-65-4321 -- so real SSNs were dropped to no result and left unredacted. Split the two concerns: keep 000/666 as a never-issued area-number (first-group) check, and match the full 9-digit sample SSNs (123456789 / 987654320 / 078051120) by exact equality so they no longer over-block neighbours. NeMo Guardrails' PII rail inherits this recognizer, so the leak reached that deployed guardrail too.
Contributor
Author
|
@microsoft-github-policy-service agree |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a false-negative in the US SSN predefined recognizer where an 8-digit truncated sample value ("98765432") was used with startswith(), unintentionally invalidating the entire 987-65-432X family of SSN-shaped values. The fix separates “never-issued area numbers” (prefix/area check) from “canonical sample SSNs” (exact match), and updates unit tests to prevent regressions.
Changes:
- Refactors
UsSsnRecognizer.invalidate_result()to use an area-number prefix check for000/666and exact equality for canonical 9-digit sample SSNs (including correcting987654320). - Adds unit tests asserting
987-65-4321…987-65-4329are detected while987-65-4320,078-05-1120, and123-45-6789remain invalidated. - Adds explicit test coverage for
000/666area-number invalidation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_ssn_recognizer.py | Splits SSN invalidation logic into area-number prefix checks vs exact-match canonical sample SSNs to avoid over-blocking. |
| presidio-analyzer/tests/test_us_ssn_recognizer.py | Expands test matrix to confirm the 987-65-432X family is no longer over-invalidated while canonical samples and invalid areas are still rejected. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Change Description
UsSsnRecognizer.invalidate_result()drops detected SSN candidates that match a hardcoded "sample SSN" blocklist. The check usesstr.startswith():The mixed list conflates two different intents — never-issued area numbers (
000,666, which are legitimately a 3-digit prefix/area check) and full 9-digit canonical sample SSNs — and matches both with a prefix test. The defect is that one of the 9-digit literals,"98765432", is only 8 digits: an apparent truncation of the canonical fake"987654320". Because the test isstartswith, that truncated literal over-blocks the entire987-65-4320…987-65-4329family — ten distinct SSN-shaped values — instead of the single intended sample.The practical impact is a false negative that leaks PII: real SSN-shaped strings such as
987-65-4321(a canonical example printed on countless real-world forms) are silently dropped to no result and pass through unredacted. This is a confidentiality failure, not a cosmetic one: the recognizer's whole job is to catch SSNs so a downstream anonymizer can redact them, and here it actively suppresses real ones.Fix
Split the two concerns that the single
startswithloop had merged:000or666is invalid — the SSA never issues those. This stays a first-three-digits check:only_digits[:3] in ("000", "666").only_digits in ("123456789", "987654320", "078051120"). The truncated"98765432"is corrected to its intended 9-digit canonical form"987654320", so it now invalidates only itself.The regex and scores are unchanged. SSNs the recognizer is meant to drop are still dropped:
000/666-area SSNs, the all-same-digit case, the all-zero-group case, and the three canonical sample SSNs (987-65-4320,078-05-1120,123-45-6789) all remain invalidated.Verification
A standalone repro instantiates the real
UsSsnRecognizerand runs.analyze(..., ["US_SSN"], nlp_artifacts=None).Before the fix — the whole family leaks:
After the fix:
ruff check/ruff format --checkon the changed source and the targeted SSN test file both pass:The test file gains:
987-65-4321…987-65-4329detected;987-65-4320/078-05-1120/123-45-6789still invalidated;000/666-area SSNs still invalidated; and a plain valid SSN (219-09-9999) still detected.Downstream blast radius: NVIDIA NeMo Guardrails' PII rail inherits this exact Presidio recognizer, so the same SSN leak reached that deployed guardrail too — real SSNs in the
987-65-432Xfamily flowed through a configured PII guardrail unredacted.Issue reference
Note on CHANGELOG
CHANGELOG entry omitted from this patch to avoid merge conflicts with sibling PRs that all insert at the same
#### Fixedanchor. Happy to add the entry once the merge order is known, or maintainers can squash it in.Checklist