fix(analyzer): stop UsSsnRecognizer from over-blocking 987-65-432X SSNs#2074

Open
AUTHENSOR wants to merge 1 commit into data-privacy-stack:main from AUTHENSOR:fix/us-ssn-invalidator-overblock

@AUTHENSOR

Contributor

Change Description

UsSsnRecognizer.invalidate_result() drops detected SSN candidates that match a hardcoded "sample SSN" blocklist. The check uses str.startswith():

for sample_ssn in ("000", "666", "123456789", "98765432", "078051120"):
    if only_digits.startswith(sample_ssn):
        return True

The list conflates two different intents: never-issued area numbers (000, 666), which are legitimately a 3-digit prefix check, and full 9-digit canonical sample SSNs. It matches both with a prefix test. The defect is that one of the intended 9-digit literals, "98765432", has only 8 digits, apparently a truncation of the canonical fake "987654320". Because the test is startswith, the truncated literal over-blocks the entire 987-65-4320 .. 987-65-4329 family (ten distinct SSN-shaped values) instead of the single intended sample.
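The over-match can be reproduced without Presidio. The sketch below is a hypothetical stand-in for the pre-fix loop (the names `OLD_SAMPLES` and `old_invalidates` are illustrative and do not exist in the library); it only shows how the 8-digit literal behaves under a prefix test.

```python
# Hypothetical stand-in for the pre-fix blocklist loop; not the library's code path.
OLD_SAMPLES = ("000", "666", "123456789", "98765432", "078051120")


def old_invalidates(only_digits: str) -> bool:
    """Return True if any blocklist entry is a prefix of the digit string."""
    return any(only_digits.startswith(sample) for sample in OLD_SAMPLES)


# The ten 9-digit strings 987654320 .. 987654329 all share the 8-digit prefix.
family = [f"98765432{d}" for d in range(10)]
blocked = [ssn for ssn in family if old_invalidates(ssn)]

print(len(blocked))                  # 10
print(old_invalidates("219099999"))  # False
```

Any 9-digit literal in the list matches only itself under startswith, which is why the other sample entries did not show the same symptom.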

The practical impact is a false negative that leaks PII: real SSN-shaped strings such as 987-65-4321 (a canonical example printed on countless real-world forms) are silently dropped to no result and pass through unredacted. This is a confidentiality failure, not a cosmetic one: the recognizer's whole job is to catch SSNs so a downstream anonymizer can redact them, and here it actively suppresses real ones.

Fix

Split the two concerns that the single startswith loop had merged:

  • Area-number check (kept as a prefix/area check): an SSN whose area number (first group) is 000 or 666 is invalid — the SSA never issues those. This stays a first-three-digits check: only_digits[:3] in ("000", "666").
  • Sample-SSN check (now exact equality): the full 9-digit canonical samples are matched by exact equality instead of prefix: only_digits in ("123456789", "987654320", "078051120"). The truncated "98765432" is corrected to its intended 9-digit canonical form "987654320", so it now invalidates only itself.
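The two checks above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the patched method itself: `INVALID_AREAS`, `SAMPLE_SSNS`, and `new_invalidates` are hypothetical names, and the all-same-digit and all-zero-group checks that the recognizer also applies are omitted here.

```python
# Hypothetical names for illustration; the real method also runs other checks.
INVALID_AREAS = ("000", "666")                           # never-issued area numbers
SAMPLE_SSNS = ("123456789", "987654320", "078051120")    # full 9-digit samples


def new_invalidates(only_digits: str) -> bool:
    """Return True if the digit string should be dropped by these two checks."""
    # Area-number check: compares only the first group (first three digits).
    if only_digits[:3] in INVALID_AREAS:
        return True
    # Sample-SSN check: exact equality, so neighbouring values are unaffected.
    return only_digits in SAMPLE_SSNS


print(new_invalidates("987654320"))  # True
print(new_invalidates("987654321"))  # False
print(new_invalidates("666123456"))  # True
```

Keeping the two tuples separate makes the matching rule for each entry explicit, so a future truncated literal in the sample list would simply never match rather than widen into a prefix.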

The regex and scores are unchanged. SSNs the recognizer is meant to drop are still dropped: 000/666-area SSNs, the all-same-digit case, the all-zero-group case, and the three canonical sample SSNs (987-65-4320, 078-05-1120, 123-45-6789) all remain invalidated.

Verification

A standalone repro instantiates the real UsSsnRecognizer and runs .analyze(..., ["US_SSN"], nlp_artifacts=None).

Before the fix — the whole family leaks:

987-65-4320 .. 987-65-4329: LEAKED (no result)   # all ten
219-09-9999 (control):       DETECTED

After the fix:

987-65-4321 .. 987-65-4329:  DETECTED            # nine real SSNs recovered
987-65-4320:                 invalidated         # the actual canonical sample, now by exact match
078-05-1120 / 123-45-6789:   invalidated         # other canonical samples, still dropped
000-12-3456 / 666-12-3456:   invalidated         # never-issued area numbers, still dropped
219-09-9999 (control):       DETECTED

ruff check / ruff format --check on the changed source and the targeted SSN test file both pass:

$ ruff check --force-exclude .../us_ssn_recognizer.py
All checks passed!
$ ruff format --check --force-exclude .../us_ssn_recognizer.py
1 file already formatted

$ PYTHONPATH=presidio-analyzer python -m pytest presidio-analyzer/tests/test_us_ssn_recognizer.py -q
............................                                             [100%]
28 passed in 0.03s

The test file gains cases for: 987-65-4321 .. 987-65-4329 detected; 987-65-4320 / 078-05-1120 / 123-45-6789 still invalidated; 000/666-area SSNs still invalidated; and a plain valid SSN (219-09-9999) still detected.
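For readers who want to see the shape of that matrix without installing Presidio, here is a table-driven sketch. It is an assumption-laden stand-in, not the contents of test_us_ssn_recognizer.py: `invalidates`, `detected`, and `CASES` are hypothetical, and `detected` only models the two invalidation rules described in this PR, not the recognizer's regex or scoring.

```python
import re


# Hypothetical stand-in for the post-fix invalidation rule (not library code).
def invalidates(only_digits: str) -> bool:
    return only_digits[:3] in ("000", "666") or only_digits in (
        "123456789",
        "987654320",
        "078051120",
    )


def detected(text: str) -> bool:
    """Strip separators, then report whether a 9-digit candidate survives."""
    digits = re.sub(r"\D", "", text)
    return len(digits) == 9 and not invalidates(digits)


# (text, expected_detected) pairs mirroring the cases listed above.
CASES = [(f"987-65-432{d}", True) for d in range(1, 10)] + [
    ("987-65-4320", False),
    ("078-05-1120", False),
    ("123-45-6789", False),
    ("000-12-3456", False),
    ("666-12-3456", False),
    ("219-09-9999", True),
]

mismatches = [(text, want) for text, want in CASES if detected(text) != want]
print(mismatches)  # []
```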

Downstream blast radius: NVIDIA NeMo Guardrails' PII rail inherits this exact Presidio recognizer, so the same SSN leak reached that deployed guardrail too — real SSNs in the 987-65-432X family flowed through a configured PII guardrail unredacted.

Issue reference

Note on CHANGELOG

CHANGELOG entry omitted from this patch to avoid merge conflicts with sibling PRs that all insert at the same #### Fixed anchor. Happy to add the entry once the merge order is known, or maintainers can squash it in.

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

The invalidate_result() sample-SSN blocklist was matched with
str.startswith(), and the literal "98765432" is only 8 digits (a
truncation of the canonical fake "987654320"). The prefix match
therefore invalidated the entire 987-65-4320 .. 987-65-4329 family --
ten distinct SSN-shaped values, including the widely-printed 987-65-4321
-- so real SSNs were dropped to no result and left unredacted.

Split the two concerns: keep 000/666 as a never-issued area-number
(first-group) check, and match the full 9-digit sample SSNs
(123456789 / 987654320 / 078051120) by exact equality so they no longer
over-block neighbours. NeMo Guardrails' PII rail inherits this
recognizer, so the leak reached that deployed guardrail too.
Copilot AI review requested due to automatic review settings June 18, 2026 06:51
@AUTHENSOR

Contributor Author

@microsoft-github-policy-service agree

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes a false-negative in the US SSN predefined recognizer where an 8-digit truncated sample value ("98765432") was used with startswith(), unintentionally invalidating the entire 987-65-432X family of SSN-shaped values. The fix separates “never-issued area numbers” (prefix/area check) from “canonical sample SSNs” (exact match), and updates unit tests to prevent regressions.

Changes:

  • Refactors UsSsnRecognizer.invalidate_result() to use an area-number prefix check for 000/666 and exact equality for canonical 9-digit sample SSNs (including correcting 987654320).
  • Adds unit tests asserting 987-65-4321 .. 987-65-4329 are detected while 987-65-4320, 078-05-1120, and 123-45-6789 remain invalidated.
  • Adds explicit test coverage for 000/666 area-number invalidation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_ssn_recognizer.py | Splits SSN invalidation logic into area-number prefix checks vs exact-match canonical sample SSNs to avoid over-blocking. |
| presidio-analyzer/tests/test_us_ssn_recognizer.py | Expands test matrix to confirm the 987-65-432X family is no longer over-invalidated while canonical samples and invalid areas are still rejected. |
