fix(es): upper-case NIF/NIE before mod-23 checksum so valid lowercase IDs are detected #2076

Open
AUTHENSOR wants to merge 2 commits into data-privacy-stack:main from AUTHENSOR:fix/es-nif-nie-lowercase

@AUTHENSOR

Change Description

EsNifRecognizer and EsNieRecognizer were dropping genuinely valid, lowercase
Spanish national identifiers, so the PII passed through the
analyze → anonymize flow unredacted.

Both recognizers compile their pattern with IGNORECASE, so a lowercase ID
such as 12345678z (NIF/DNI) or x1234567l (NIE) is matched as a candidate
span. But validate_result then compares the extracted control letter
against the uppercase mod-23 table "TRWAGMYFPDXBNJZSQVHLCKE" without
normalizing case:

  • NIF: letter = pattern_text[-1] is 'z', which never equals the
    uppercase letters[number % 23] → checksum "fails" → result dropped.
  • NIE: in addition to the control letter, the leading prefix check
    (pattern_text[:1] not in "XYZ") and the "XYZ".index(pattern_text[0])
    lookup also assume uppercase, so a lowercase x… is rejected outright.
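
The failure mode above can be reproduced in isolation. The sketch below is not the Presidio source: `nif_checksum_ok` and `nie_checksum_ok` are hypothetical helpers that only mimic the case-sensitive comparison described here, and they assume the candidate has already matched the recognizer's regex.

```python
# Uppercase mod-23 control-letter table quoted in this PR.
LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def nif_checksum_ok(text: str) -> bool:
    """Hypothetical helper: case-sensitive NIF check, as before the fix."""
    number = int(text[:-1])
    # A lowercase control letter can never equal an uppercase table entry.
    return text[-1] == LETTERS[number % 23]


def nie_checksum_ok(text: str) -> bool:
    """Hypothetical helper: case-sensitive NIE check, as before the fix."""
    # The prefix test assumes uppercase, so 'x', 'y', 'z' are rejected here.
    if text[:1] not in "XYZ":
        return False
    # X, Y, Z map to 0, 1, 2 and are prepended to the seven digits.
    number = int(str("XYZ".index(text[0])) + text[1:-1])
    return text[-1] == LETTERS[number % 23]


print(nif_checksum_ok("12345678Z"))  # True
print(nif_checksum_ok("12345678z"))  # False
print(nie_checksum_ok("X1234567L"))  # True
print(nie_checksum_ok("x1234567l"))  # False
```

For 12345678 the remainder mod 23 is 14, which indexes `Z` in the table, so only the uppercase letter passes the comparison.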

The uppercase form of the same ID (12345678Z, X1234567L) is detected at
score 1.0, and for the NIE the lowercase form is missed by every default
recognizer, so this is purely a false-negative leak fix.

Fix

Upper-case the sanitized candidate text before the mod-23 lookup, mirroring how
sanitize_value already normalizes the string (it strips dashes/spaces):

  • es_nif_recognizer.py: ...sanitize_value(...).upper()
  • es_nie_recognizer.py: ...sanitize_value(...).upper() (this normalizes the
    control letter and the leading X/Y/Z prefix in one step).
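
A minimal sketch of the effect of this change, under stated assumptions: `sanitize` is a hypothetical stand-in for `sanitize_value` that only strips dashes and spaces, and `validate_nif` / `validate_nie` are illustrative names rather than the recognizers' actual methods. Both functions assume a non-empty candidate that already matched the regex.

```python
# Uppercase mod-23 control-letter table quoted in this PR.
LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def sanitize(text: str) -> str:
    """Hypothetical stand-in for sanitize_value: strip dashes and spaces."""
    return text.replace("-", "").replace(" ", "")


def validate_nif(text: str) -> bool:
    # Normalize case once, before any comparison against the table.
    candidate = sanitize(text).upper()
    number = int(candidate[:-1])
    return candidate[-1] == LETTERS[number % 23]


def validate_nie(text: str) -> bool:
    # One .upper() call covers both the X/Y/Z prefix and the control letter.
    candidate = sanitize(text).upper()
    if candidate[0] not in "XYZ":
        return False
    number = int(str("XYZ".index(candidate[0])) + candidate[1:-1])
    return candidate[-1] == LETTERS[number % 23]


print(validate_nif("12345678z"))  # True
print(validate_nif("12345678a"))  # False
print(validate_nie("x9613851n"))  # True
print(validate_nie("x9613851q"))  # False
```

Because the checksum comparison is unchanged, an ID with the wrong control letter is rejected in either case.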

The change is conservative: the mod-23 checksum still gates every match, so an
invalid-checksum ID is still rejected regardless of case — no new false
positives.

Verification

Targeted tests (run from the repo root):

$ PYTHONPATH=presidio-analyzer python -m pytest \
    presidio-analyzer/tests/test_es_nif_recognizer.py \
    presidio-analyzer/tests/test_es_nie_recognizer.py -q
.........................                                                [100%]
25 passed in 0.04s

Lint on the changed source files:

$ ruff check  .../spain/es_nif_recognizer.py .../spain/es_nie_recognizer.py
All checks passed!
$ ruff format --check .../spain/es_nif_recognizer.py .../spain/es_nie_recognizer.py
2 files already formatted

New parametrized cases assert:

  • Valid lowercase NIF (55555555k, 12345678z) and NIE (x9613851n,
    z8078221m) are now detected at score 1.0.
  • The uppercase forms (12345678Z, X9613851N) are still detected.
  • Invalid-checksum IDs (12345678a, x9613851q) stay rejected, and every
    previously detected/rejected case is unchanged.

Issue reference

Note on CHANGELOG

CHANGELOG entry omitted from this patch to avoid merge conflicts with sibling PRs that all insert at the same #### Fixed anchor. Happy to add the entry once the merge order is known, or maintainers can squash it in.

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Copilot AI review requested due to automatic review settings June 18, 2026 07:04

Copilot AI left a comment

✅ Ready to approve

The change is small, targeted, and backed by unit tests covering the reported false-negative scenarios.

Note: this review does not count toward required approvals for merging.

Pull request overview

Fixes false-negative leaks for Spanish NIF/NIE recognizers by normalizing candidate IDs to uppercase before performing the mod-23 checksum validation, ensuring valid lowercase identifiers are detected (analyzer → anonymizer flow).

Changes:

  • Uppercase sanitized NIF/NIE candidate text in validate_result() before checksum/prefix validation.
  • Add unit test cases covering valid lowercase NIF/NIE forms and invalid-checksum lowercase rejections.

File summaries

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/spain/es_nif_recognizer.py | Uppercases sanitized NIF text before mod-23 control-letter verification. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/spain/es_nie_recognizer.py | Uppercases sanitized NIE text before prefix handling and checksum verification. |
| presidio-analyzer/tests/test_es_nif_recognizer.py | Adds test coverage for valid lowercase NIF detection and invalid-checksum rejection. |
| presidio-analyzer/tests/test_es_nie_recognizer.py | Adds test coverage for valid lowercase NIE detection and invalid-checksum rejection. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 1

Comment on lines +35 to +36
# uppercase still detected
("X9613851N", 1, ((0, 9),),),