fix(es): upper-case NIF/NIE before mod-23 checksum so valid lowercase IDs are detected#2076
Open
AUTHENSOR wants to merge 2 commits into
Open
fix(es): upper-case NIF/NIE before mod-23 checksum so valid lowercase IDs are detected#2076AUTHENSOR wants to merge 2 commits into
AUTHENSOR wants to merge 2 commits into
Conversation
… IDs are detected
Contributor
There was a problem hiding this comment.
✅ Ready to approve
The change is small, targeted, and backed by unit tests covering the reported false-negative scenarios.
Note: this review does not count toward required approvals for merging.
Pull request overview
Fixes false-negative leaks for Spanish NIF/NIE recognizers by normalizing candidate IDs to uppercase before performing the mod-23 checksum validation, ensuring valid lowercase identifiers are detected (analyzer → anonymizer flow).
Changes:
- Uppercase sanitized NIF/NIE candidate text in
validate_result()before checksum/prefix validation. - Add unit test cases covering valid lowercase NIF/NIE forms and invalid-checksum lowercase rejections.
File summaries
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/spain/es_nif_recognizer.py | Uppercases sanitized NIF text before mod-23 control-letter verification. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/spain/es_nie_recognizer.py | Uppercases sanitized NIE text before prefix handling and checksum verification. |
| presidio-analyzer/tests/test_es_nif_recognizer.py | Adds test coverage for valid lowercase NIF detection and invalid-checksum rejection. |
| presidio-analyzer/tests/test_es_nie_recognizer.py | Adds test coverage for valid lowercase NIE detection and invalid-checksum rejection. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 1
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
Comment on lines
+35
to
+36
| # uppercase still detected | ||
| ("X9613851N", 1, ((0, 9),),), |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Change Description
EsNifRecognizerandEsNieRecognizerwere dropping genuinely valid,lowercase Spanish national identifiers, so the PII leaked unredacted
(analyze → anonymize).
Both recognizers compile their pattern with
IGNORECASE, so a lowercase IDsuch as
12345678z(NIF/DNI) orx1234567l(NIE) is matched as a candidatespan. But
validate_resultthen compares the extracted control letteragainst the uppercase mod-23 table
"TRWAGMYFPDXBNJZSQVHLCKE"withoutnormalizing case:
letter = pattern_text[-1]is'z', which never equals theuppercase
letters[number % 23]→ checksum "fails" → result dropped.(
pattern_text[:1] not in "XYZ") and the"XYZ".index(pattern_text[0])lookup also assume uppercase, so a lowercase
x…is rejected outright.The uppercase form of the same ID (
12345678Z,X1234567L) is detected atscore
1.0, and for the NIE the lowercase form is missed by every defaultrecognizer, so this is purely a false-negative leak fix.
Fix
Upper-case the sanitized candidate text before the mod-23 lookup, mirroring how
sanitize_valuealready normalizes the string (it strips dashes/spaces):es_nif_recognizer.py:...sanitize_value(...).upper()es_nie_recognizer.py:...sanitize_value(...).upper()(this normalizes thecontrol letter and the leading
X/Y/Zprefix in one step).The change is conservative: the mod-23 checksum still gates every match, so an
invalid-checksum ID is still rejected regardless of case — no new false
positives.
Verification
Targeted tests (run from the repo root):
Lint on the changed source files:
New parametrized cases assert:
55555555k,12345678z) and NIE (x9613851n,z8078221m) are now detected at score1.0.12345678Z,X9613851N) are still detected.12345678a,x9613851q) stay rejected, and everypreviously detected/rejected case is unchanged.
Issue reference
Note on CHANGELOG
CHANGELOG entry omitted from this patch to avoid merge conflicts with sibling PRs that all insert at the same
#### Fixedanchor. Happy to add the entry once the merge order is known, or maintainers can squash it in.Checklist