fix(analyzer): stop quadratic backtracking in in_pan low pattern by uwezkhan · Pull Request #2068 · data-privacy-stack/presidio

uwezkhan · 2026-06-17T11:31:57Z

Change Description

in_pan_recognizer.py, PAN (Low) pattern, matched against a boundary-rich string with no four-digit run:

in_pan PAN (Low) regex, input = "a " repeated:
  length  25,000  ->   2.0 s
  length  50,000  ->   8.1 s
  length 100,000  ->  32.4 s   (larger input reaches REGEX_TIMEOUT_SECONDS)

Both lookaheads use .*?, so each one rescans the rest of the text at every word boundary and match time is quadratic in input length. IN_PAN is enabled by default for English, so this sits on the analyze path for any analyzed text.

Before: (?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10}. The dot (.) also matches non-token characters such as spaces, so each lookahead scan is unbounded and can run to the end of the text.

After: (?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10}. The lookaheads only traverse the PAN token's own character class, so each stops at the first non-token character. Matching is linear (100 KB now ~9 ms) and every valid PAN whose four digits sit inside the token still matches at the same span and score.
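
A minimal sketch for reproducing the difference outside Presidio. It assumes the standard-library `re` module with no flags, which is not necessarily how the analyzer compiles the pattern, so absolute timings will not match the table above. The helper name `time_search` is made up for this example.

```python
import re
import time

# Low-confidence pattern before the change: both lookaheads scan with `.*?`.
OLD_PAN_LOW = re.compile(
    r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)

# Pattern after the change: lookaheads only traverse the token character class.
NEW_PAN_LOW = re.compile(
    r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)


def time_search(pattern, text):
    """Return the wall-clock seconds one search over `text` takes."""
    start = time.perf_counter()
    pattern.search(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Boundary-rich input with no four-digit run, as in the PR description.
    for repetitions in (1000, 2000, 4000):
        text = "a " * repetitions
        old_s = time_search(OLD_PAN_LOW, text)
        new_s = time_search(NEW_PAN_LOW, text)
        print(f"len={len(text):>6}  old={old_s:.4f}s  new={new_s:.4f}s")
```

Doubling the input should roughly quadruple the old pattern's time and roughly double the new pattern's. A PAN whose digits sit inside the token, such as the made-up value `ABCDE1234F`, matches at the same span with both patterns.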

Tradeoff: scoping the lookaheads also drops one previous match. A ten-letter token is no longer reported as a PAN when four digits happen to appear later in the text; a PAN's digits live inside the number, so that case was a false positive.
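
The dropped case can be illustrated with a short sketch, again assuming plain `re` with no flags rather than the recognizer itself. The sample text is invented for the example.

```python
import re

# Pattern before the change (lookaheads may look past the token).
OLD_PAN_LOW = re.compile(
    r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)

# Pattern after the change (lookaheads stay inside the token).
NEW_PAN_LOW = re.compile(
    r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)

# A ten-letter token followed, after a space, by an unrelated four-digit run.
text = "abcdefghij 1234"

old_match = OLD_PAN_LOW.search(text)
new_match = NEW_PAN_LOW.search(text)

# Old: the digit lookahead is satisfied by "1234" outside the token.
print("old:", old_match.group(1) if old_match else None)
# New: the token itself has no four-digit run, so nothing is reported.
print("new:", new_match.group(1) if new_match else None)
```

With the old pattern the ten letters are reported because the digit lookahead is satisfied outside the token; with the new pattern the search returns no match.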

Issue reference

N/A

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the India PAN low-confidence regex to avoid expensive scanning/backtracking behavior and adds regression tests to validate both correctness and runtime characteristics.

Changes:

Refined the low-confidence PAN regex lookaheads to scope matching to the token character class.
Added a regression test for low-confidence detection with embedded digits.
Added a performance/regression test intended to catch quadratic-time behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/india/in_pan_recognizer.py	Adjusts the low-confidence PAN regex lookaheads to avoid scanning beyond the candidate token.
presidio-analyzer/tests/test_in_pan_recognizer.py	Adds functional + runtime regression tests for the updated low-confidence PAN regex.

@@ -1,3 +1,5 @@
+import time


+    assert_result(results[0], entities[0], 0, 10, 0.01)
+
+
+def test_low_confidence_pattern_does_not_backtrack(recognizer, entities):


+    text = "a " * 50000
+    start = time.time()
+    results = recognizer.analyze(text, entities)
+    elapsed = time.time() - start
+    assert results == []
+    assert elapsed < 10


        Pattern(
            "PAN (Low)",
-            r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b",
+            r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b",


fix(analyzer): stop quadratic backtracking in in_pan low pattern

e9aa8fa

Copilot AI review requested due to automatic review settings June 17, 2026 11:31

github-actions Bot added the external label Jun 17, 2026

Copilot AI reviewed Jun 17, 2026

uwezkhan wants to merge 1 commit into data-privacy-stack:main from uwezkhan:in-pan-low-backtracking
