fix(analyzer): stop quadratic backtracking in in_pan low pattern #2068

Open
uwezkhan wants to merge 1 commit into data-privacy-stack:main from uwezkhan:in-pan-low-backtracking

@uwezkhan

Contributor

Change Description

The PAN (Low) pattern in in_pan_recognizer.py takes quadratic time when matched against a boundary-rich string with no four-digit run:

in_pan PAN (Low) regex, input = "a " repeated:
  length  25,000  ->   2.0 s
  length  50,000  ->   8.1 s
  length 100,000  ->  32.4 s   (larger input reaches REGEX_TIMEOUT_SECONDS)

Both lookaheads use .*?, so each one rescans the rest of the text at every word boundary and match time is quadratic in input length. IN_PAN is enabled by default for English, so this sits on the analyze path for any analyzed text.
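
The scaling can be reproduced outside Presidio with a rough sketch like the one below. It assumes the standard-library re module with no flags rather than the recognizer's own matching path, and it uses much smaller inputs than the table above so it finishes quickly, so absolute timings will differ. The helper names time_finditer and scaling_table are hypothetical.

```python
import re
import time

# PAN (Low) pattern as it stood before this PR (copied from the diff).
OLD_PAN_LOW = re.compile(
    r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)


def time_finditer(pattern, text):
    """Return (elapsed seconds, number of matches) for one full scan."""
    start = time.perf_counter()
    count = sum(1 for _ in pattern.finditer(text))
    return time.perf_counter() - start, count


def scaling_table(pattern, lengths):
    """Scan "a " repeated up to each requested length and record the timing."""
    rows = []
    for length in lengths:
        text = "a " * (length // 2)
        elapsed, count = time_finditer(pattern, text)
        rows.append((length, elapsed, count))
    return rows


# Doubling the length should roughly quadruple the time if the scan is quadratic.
for length, elapsed, count in scaling_table(OLD_PAN_LOW, [1000, 2000, 4000]):
    print(f"length {length:>6,} -> {elapsed:.4f} s, {count} matches")
```

The input never contains four consecutive digits, so the second lookahead fails at every word boundary only after scanning to the end of the text.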

Before: (?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10}. The dot (.) walks across non-token characters, so the lookahead scan is unbounded.

After: (?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10}. The lookaheads only traverse the PAN token's own character class, so each stops at the first non-token character. Matching is linear (100 KB now ~9 ms) and every valid PAN whose four digits sit inside the token still matches at the same span and score.
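
As a quick sanity check of the "same span" claim, the sketch below compares the two patterns on a few sample strings. It assumes the standard-library re module with no flags, which may differ from the flags the recognizer applies; the sample strings and the spans helper are illustrative only.

```python
import re

# Patterns copied from the diff: OLD is the removed line, NEW is the added line.
OLD = re.compile(r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b")
NEW = re.compile(
    r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)


def spans(pattern, text):
    """Return the (start, end) span of capture group 1 for every match."""
    return [m.span(1) for m in pattern.finditer(text)]


samples = [
    "ABCDE1234F",             # bare token with four digits inside it
    "my PAN is ABCDE1234F.",  # same token embedded in a sentence
    "a " * 20,                # boundary-rich text with no digits at all
]

for text in samples:
    # Both patterns should report identical spans for these inputs.
    print(repr(text[:25]), spans(OLD, text), spans(NEW, text))
```

For these inputs the digits sit inside the ten-character token, so restricting the lookaheads to the token's character class does not change what is found.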

Tradeoff: scoping the lookaheads also drops one previous match. A ten-letter token is no longer reported as a PAN when four digits happen to appear later in the text; a PAN's digits live inside the number, so that case was a false positive.
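
The dropped case can be illustrated with a minimal sketch, again assuming the standard-library re module with no flags; the sample sentence is made up for illustration.

```python
import re

# Patterns copied from the diff: OLD is the removed line, NEW is the added line.
OLD = re.compile(r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b")
NEW = re.compile(
    r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)

# A ten-letter word followed, much later, by an unrelated four-digit number.
text = "abcdefghij and later 1234"

old_hits = [m.group(1) for m in OLD.finditer(text)]
new_hits = [m.group(1) for m in NEW.finditer(text)]

print(old_hits)  # ['abcdefghij']: the unbounded lookahead finds 1234 later in the text
print(new_hits)  # []: the scoped lookahead stops at the space after the token
```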

Issue reference

N/A

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Copilot AI review requested due to automatic review settings June 17, 2026 11:31

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the India PAN low-confidence regex to avoid expensive scanning/backtracking behavior and adds regression tests to validate both correctness and runtime characteristics.

Changes:

  • Refined the low-confidence PAN regex lookaheads to scope matching to the token character class.
  • Added a regression test for low-confidence detection with embedded digits.
  • Added a performance/regression test intended to catch quadratic-time behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Files and descriptions:

  • presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/india/in_pan_recognizer.py: Adjusts the low-confidence PAN regex lookaheads to avoid scanning beyond the candidate token.
  • presidio-analyzer/tests/test_in_pan_recognizer.py: Adds functional + runtime regression tests for the updated low-confidence PAN regex.

presidio-analyzer/tests/test_in_pan_recognizer.py

  @@ -1,3 +1,5 @@
  import time

Comment on lines +68 to +73

      assert_result(results[0], entities[0], 0, 10, 0.01)


  def test_low_confidence_pattern_does_not_backtrack(recognizer, entities):
      text = "a " * 50000
      start = time.time()
      results = recognizer.analyze(text, entities)
      elapsed = time.time() - start
      assert results == []
      assert elapsed < 10

presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/india/in_pan_recognizer.py

       Pattern(
           "PAN (Low)",
  -        r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b",
  +        r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b",