Fix FI HETU regex accepting illegal century separators #2108

Open
jichaowang02-lang wants to merge 1 commit into data-privacy-stack:main from jichaowang02-lang:fix/fi-hetu-separator-range
Conversation

@jichaowang02-lang

Contributor

Change Description

The Finnish personal identity code (HETU) patterns use this class for the century separator (7th character):

r"\b(\d{6})([+-ABCDEFYXWVU])(\d{3})([0123456789ABCDEFHJKLMNPRSTUVWXY])\b"
#               ^^^ "+-A" is a RANGE, not three literals

Because - sits between + (U+002B) and A (U+0041), the engine reads +-A as a character range, which silently admits , . / 0-9 : ; < = > ? @ as "separators". validate_result only re-checks the date and control character — never the separator — so an 11-character string with a junk separator but an otherwise-correct control digit is wrongly accepted:

FiPersonalIdentityCodeRecognizer().analyze("131052/308T", ["FI_PERSONAL_IDENTITY_CODE"])
#   -> recognized (false positive); "131052-308T" is the real code
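
The range behaviour described above can be checked with the standard `re` module alone. The following is a minimal sketch that isolates the separator class from the pattern; the names `BUGGY_CLASS`, `INTENDED`, and `extra` are illustrative and not taken from the recognizer.

```python
import re

# The separator class exactly as it appears in the pattern above.
BUGGY_CLASS = re.compile(r"[+-ABCDEFYXWVU]")

# The thirteen separators the class was meant to allow.
INTENDED = set("+-ABCDEFYXWVU")

# Every ASCII character the class matches that is not an intended separator.
extra = [
    chr(code)
    for code in range(128)
    if BUGGY_CLASS.fullmatch(chr(code)) and chr(code) not in INTENDED
]

print("".join(extra))  # ,./0123456789:;<=>?@
```

The twenty extra characters are exactly the code points between `+` (U+002B) and `A` (U+0041) that are not `+`, `-`, or `A` themselves.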

Per the DVV spec the separator must be exactly one of + - A B C D E F Y X W V U (+ = 1800s; - Y X W V U = 1900s; A B C D E F = 2000s).
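
As an illustration of the separator-to-century mapping listed above, a lookup table makes the allowed set explicit. This is a sketch only: `SEPARATOR_CENTURY` and `birth_year` are hypothetical names, not part of the recognizer, and the function assumes an 11-character input.

```python
# Separator -> century, as listed in the description above.
SEPARATOR_CENTURY = {"+": 1800}
SEPARATOR_CENTURY.update(dict.fromkeys("-YXWVU", 1900))
SEPARATOR_CENTURY.update(dict.fromkeys("ABCDEF", 2000))


def birth_year(code: str) -> int:
    """Return the four-digit birth year implied by an 11-character code."""
    separator = code[6]  # 7th character
    if separator not in SEPARATOR_CENTURY:
        raise ValueError(f"illegal century separator: {separator!r}")
    return SEPARATOR_CENTURY[separator] + int(code[4:6])


print(birth_year("131052-308T"))  # 1952
```

With this table, a code such as `131052/308T` raises `ValueError` because `/` maps to no century.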

Fix

Move the - to the start of the class ([-+ABCDEFYXWVU]) so it is a literal, in both the Medium and Very Weak patterns.
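
The effect of moving the hyphen can be sketched by compiling both versions of the pattern side by side. The pattern strings are taken from this description; the names `BUGGY` and `FIXED` are illustrative.

```python
import re

# Pattern before the fix: "+-A" inside the class is parsed as a range.
BUGGY = re.compile(
    r"\b(\d{6})([+-ABCDEFYXWVU])(\d{3})([0123456789ABCDEFHJKLMNPRSTUVWXY])\b"
)

# Pattern after the fix: a leading "-" in a class is always a literal.
FIXED = re.compile(
    r"\b(\d{6})([-+ABCDEFYXWVU])(\d{3})([0123456789ABCDEFHJKLMNPRSTUVWXY])\b"
)

for text in ("131052-308T", "131052/308T"):
    print(text, bool(BUGGY.search(text)), bool(FIXED.search(text)))
# 131052-308T True True
# 131052/308T True False
```

Placing the hyphen last in the class, or escaping it as `\-`, would also make it a literal; the leading position is simply the choice made in this change.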

Checklist

  • I have reviewed the contribution guidelines
  • I have added tests to cover my changes
  • All new and existing tests passed

Tests

$ pytest tests/test_fi_personal_identity_code_recognizer.py -q
35 passed

New cases 131052/308T, 131052:308T, 131052.308T (valid date + control character, illegal separator) are now rejected; these tests fail without the fix. The valid 131052-308T is still detected.
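
The sketch below shows why these cases need to be caught by the pattern. It is not the project's test file: `control_char_ok` is a hypothetical stand-in that applies the usual HETU check (the nine digits taken as one number, modulo 31, indexed into the control alphabet) and, like the validation described above, never looks at the separator. The date check is omitted for brevity.

```python
import re

CONTROL_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"  # 31 symbols

FIXED = re.compile(
    r"\b(\d{6})([-+ABCDEFYXWVU])(\d{3})([0123456789ABCDEFHJKLMNPRSTUVWXY])\b"
)


def control_char_ok(code: str) -> bool:
    """Check the control character only; the separator is ignored."""
    number = int(code[0:6] + code[7:10])
    return CONTROL_CHARS[number % 31] == code[10]


# (input, should the fixed pattern match?)
CASES = [
    ("131052-308T", True),
    ("131052/308T", False),
    ("131052:308T", False),
    ("131052.308T", False),
]

for text, expected in CASES:
    # The control character is valid in every case, so only the pattern
    # can tell the legal separator from the illegal ones.
    assert control_char_ok(text)
    assert bool(FIXED.search(text)) is expected
```

All four inputs share the digits `131052308`, which is why the control character `T` is valid for each of them.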

The Finnish personal identity code pattern used the character class
`[+-ABCDEFYXWVU]` for the century separator (the 7th character). Because the
`-` sits between `+` and `A`, the regex engine reads `+-A` as a character
*range* (U+002B..U+0041), which silently admits `, . / 0-9 : ; < = > ? @` as
"separators". `validate_result` only re-checks the date and control character,
never the separator, so an 11-char string with a junk separator but an
otherwise-correct control character (e.g. `131052/308T`) is wrongly accepted,
producing a false positive.

Per the DVV spec the separator must be exactly one of `+ - A B C D E F Y X W
V U`. Move the `-` to the start of the class so it is a literal, in both the
Medium and Very Weak patterns.

Adds regression cases: `131052/308T`, `131052:308T`, `131052.308T` (valid
date+control, illegal separator) are now rejected; the valid `131052-308T`
is still detected.
Copilot AI review requested due to automatic review settings June 27, 2026 09:06

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

2 participants