fix(analyzer): detect Mastercard 2-series and 18-19 digit credit cards #2075

Open
AUTHENSOR wants to merge 1 commit into data-privacy-stack:main from AUTHENSOR:fix/credit-card-2series-19digit

@AUTHENSOR AUTHENSOR commented Jun 18, 2026

Contributor

Change Description

CreditCardRecognizer was missing two whole families of real, current, Luhn-valid cards. Because the PAN regex never matched them, the recognizer returned no CREDIT_CARD result, so these card numbers passed Presidio as "clean" and would leak unredacted downstream (analyze → anonymize):

  1. Mastercard 2-series: BIN range 2221-2720 (the first four digits), issued since 2017 (billions of live cards). The regex only had branches for leading digits 4/5/6/1/3, so e.g. 2221000000000009 (Luhn-valid) returned [].
  2. 18-19 digit PANs: ISO/IEC 7812 allows PANs up to 19 digits
    (UnionPay, Maestro, some Visa). The old length window
    (\d{3,4} \d{3,4} \d{3,5} after a 4-digit prefix) capped matches at 17
    digits, so a Luhn-valid 19-digit card returned [].

A leading-4 Visa such as 4111111111111111 was, and still is, detected at
score 1.0, so this is purely a false-negative fix on the leak surface.
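
The gap can be reproduced with nothing but `re`. The sketch below copies the old and new pattern strings from this PR's diff and runs them over PANs quoted in this description. `OLD_PAN`, `NEW_PAN`, `SAMPLES` and `regex_hits` are names chosen for this illustration, and the check skips Luhn and scoring on purpose, so it shows only whether the regex yields a candidate at all.

```python
import re

# Pattern strings copied from the diff in this PR; the variable names are illustrative.
OLD_PAN = re.compile(
    r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))"
    r"[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b"
)
NEW_PAN = re.compile(
    r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})"
    r"|(2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720))"
    r"|(6\d{3})|(1\d{3})|(3\d{3}))"
    r"[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,7})\b"
)

# Sample PANs taken from the PR description.
SAMPLES = {
    "visa_16": "4111111111111111",
    "mastercard_2_series": "2221000000000009",
    "pan_19": "4109906958483040118",
}


def regex_hits(pattern, samples=SAMPLES):
    """Return the sample names for which the pattern yields a candidate match."""
    return {name for name, pan in samples.items() if pattern.search(pan)}


print(sorted(regex_hits(OLD_PAN)))  # ['visa_16']
print(sorted(regex_hits(NEW_PAN)))  # ['mastercard_2_series', 'pan_19', 'visa_16']
```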

Fix

Widen the PAN regex in
presidio_analyzer/predefined_recognizers/generic/credit_card_recognizer.py:

  • Add a Mastercard 2-series prefix branch,
    2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720), which matches exactly the
    four-digit values 2221-2720 (same 4-digit shape as the existing
    4\d{3} / 5[0-5]\d{2} / 6\d{3} branches).
  • Raise the final group from \d{3,5} to \d{3,7} so the pattern spans
    13-19 digit PANs.
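
Both claims in the list above can be checked mechanically. The following sketch, which is not part of the PR's test suite, enumerates every four-digit value against the 2-series alternation and adds up the group quantifiers to get the length bounds. `TWO_SERIES` and `matched_prefixes` are hypothetical names used only here.

```python
import re

# The 2-series alternation as written in the PR description.
TWO_SERIES = re.compile(r"2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720)")


def matched_prefixes(pattern=TWO_SERIES):
    """Return every four-digit value 0000-9999 that the pattern matches in full."""
    return [n for n in range(10_000) if pattern.fullmatch(f"{n:04d}")]


hits = matched_prefixes()
assert hits == list(range(2221, 2721))  # exactly 2221..2720, nothing else

# Length bounds implied by the quantifiers: prefix(4) + {3,4} + {3,4} + {3,7}.
shortest = 4 + 3 + 3 + 3
longest = 4 + 4 + 4 + 7
print(len(hits), shortest, longest)  # 500 13 19
```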

The change is conservative: Luhn (validate_result) still gates every
match, and the existing negative lookahead that rejects 13-digit Unix
timestamps starting with 1 ((?!1\d{12}(?!\d))) is untouched. Widening the
range therefore admits no Luhn-invalid numbers: a 19-digit run that is not
Luhn-valid is still rejected, exactly as today.
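
For readers unfamiliar with the gate, here is a minimal sketch of the textbook Luhn checksum. It is not the code behind `validate_result`, and `luhn_valid` is a name invented for this example. It only illustrates the kind of check that every regex candidate must pass.

```python
def luhn_valid(pan: str) -> bool:
    """Textbook Luhn check over the digits of `pan`; non-digit characters are ignored."""
    digits = [int(ch) for ch in pan if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit and folding results above 9.
    for offset, digit in enumerate(reversed(digits)):
        if offset % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("2221000000000009"))  # True
print(luhn_valid("2221000000000001"))  # False
```

Luhn is a single check digit, so this gate filters out Luhn-invalid runs rather than proving that a number belongs to a real card.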

Verification

Targeted tests (run from the presidio-analyzer source dir):

$ python -m pytest tests/test_credit_card_recognizer.py -q
...........................                                              [100%]
27 passed in 0.03s

Lint on the changed source file:

$ ruff check  .../generic/credit_card_recognizer.py   -> All checks passed!
$ ruff format --check .../generic/credit_card_recognizer.py -> 1 file already formatted

New parametrized cases assert:

  • Luhn-valid Mastercard 2-series PANs (2221000000000009, 2720990000000007,
    2223000048400011) are detected as CREDIT_CARD at score 1.0, including
    with context (my credit card: ...).
  • Luhn-valid 18- and 19-digit PANs (675919345145061238,
    4109906958483040118, 6298036494205552661) are detected.
  • Regression guards: every previously detected card is still detected, the
    Unix-timestamp case (1748503543012) is still not flagged, the
    Luhn-invalid Visa/Discover cases stay rejected, and a new Luhn-invalid
    2-series PAN (2221000000000001) is rejected.
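
The shape of these cases can be sketched without Presidio installed. The block below is a stand-in, not the repository's test file. `detect_spans` is a hypothetical helper that pairs the new pattern from this PR with a plain Luhn check and returns match spans, and the expected spans are what this stand-in computes, not values quoted from the real tests. Scores and context enhancement are not modelled.

```python
import re

# New pattern string copied from this PR's diff.
NEW_PAN = re.compile(
    r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})"
    r"|(2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720))"
    r"|(6\d{3})|(1\d{3})|(3\d{3}))"
    r"[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,7})\b"
)


def luhn_valid(digits: str) -> bool:
    """Textbook Luhn check over a string that contains only digits."""
    total = 0
    for offset, ch in enumerate(reversed(digits)):
        d = int(ch)
        if offset % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def detect_spans(text):
    """Hypothetical stand-in for the recognizer: regex candidates that also pass Luhn."""
    spans = []
    for m in NEW_PAN.finditer(text):
        digits = re.sub(r"[- ]", "", m.group())  # drop separators before the checksum
        if luhn_valid(digits):
            spans.append(m.span())
    return spans


# (text, expected spans) pairs drawn from the cases listed in the description.
CASES = [
    ("2221000000000009", [(0, 16)]),
    ("my credit card: 2720990000000007", [(16, 32)]),
    ("675919345145061238", [(0, 18)]),
    ("4109906958483040118", [(0, 19)]),
    ("1748503543012", []),      # Unix timestamp, blocked by the lookahead
    ("2221000000000001", []),   # 2-series shape but Luhn-invalid
]

for text, expected in CASES:
    assert detect_spans(text) == expected, text
```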

Issue reference

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

The CreditCardRecognizer PAN regex only matched leading digits 4/5/6/1/3
and a 13-17 digit window, so two families of real, Luhn-valid cards
produced no CREDIT_CARD result and leaked unredacted:

- Mastercard 2-series (BIN range 2221-2720, issued since 2017)
- 18-19 digit PANs (ISO/IEC 7812 allows up to 19, e.g. UnionPay/Maestro)

Widen the regex to cover these ranges and raise the length ceiling to 19.
Luhn (validate_result) still gates every match, so non-card numbers are
not flagged. Existing detections and Unix-timestamp/Luhn-invalid
rejections are unchanged.

Copilot AI review requested due to automatic review settings June 18, 2026 06:59

Copilot AI left a comment

Contributor

Pull request overview

This PR updates Presidio Analyzer’s CreditCardRecognizer to reduce false negatives by expanding the credit card PAN regex to detect Mastercard 2‑series BINs (2221–2720) and longer (18–19 digit) PANs, and it adds targeted unit tests plus a changelog entry documenting the fix.

Changes:

  • Expanded the credit card regex to include Mastercard 2‑series prefixes and allow matching up to 19-digit PANs.
  • Added unit tests covering Mastercard 2‑series and 18–19 digit Luhn-valid PANs (plus a Luhn-invalid regression case).
  • Documented the fix in CHANGELOG.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/credit_card_recognizer.py | Widened PAN regex to match Mastercard 2‑series and longer PAN lengths. |
| presidio-analyzer/tests/test_credit_card_recognizer.py | Added regression tests for newly supported PAN ranges and lengths. |
| CHANGELOG.md | Added a “Fixed” entry describing the credit-card false-negative fix. |

 Pattern(
 "All Credit Cards (weak)",
-r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b", # noqa: E501
+r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})|(2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720))|(6\d{3})|(1\d{3})|(3\d{3}))[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,7})\b", # noqa: E501
Comment on lines +48 to +50
("4109906958483040118", 1, (1.0,), ((0, 19),),),
("6298036494205552661", 1, (1.0,), ((0, 19),),),
("675919345145061238", 1, (1.0,), ((0, 18),),),
Comment on lines +25 to +26
# Luhn (validate_result) still gates every match, so widening the
# range does not flag non-card numbers.
Comment thread CHANGELOG.md
- Added `supported_entity` parameter to `PhoneRecognizer`. Previously, this recognizer hard-coded `["PHONE_NUMBER"]` as the only possible supported entity.

#### Fixed
- Fixed a false-negative in `CreditCardRecognizer` where real, Luhn-valid cards were passing as clean (no `CREDIT_CARD` result) and leaking unredacted. The PAN regex now also matches Mastercard 2-series cards (BIN range 2221-2720, the first four digits; issued since 2017) and 18-19 digit PANs (ISO/IEC 7812 allows up to 19 digits, e.g. UnionPay/Maestro), in addition to the existing ranges. Luhn (`validate_result`) still gates every match, so the widened range does not introduce false positives on non-card numbers.