fix(analyzer): stop quadratic backtracking in in_pan low pattern #2068
Open
uwezkhan wants to merge 1 commit into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR updates the India PAN low-confidence regex to avoid expensive scanning/backtracking behavior and adds regression tests to validate both correctness and runtime characteristics.
Changes:
- Refined the low-confidence PAN regex lookaheads to scope matching to the token character class.
- Added a regression test for low-confidence detection with embedded digits.
- Added a performance/regression test intended to catch quadratic-time behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/india/in_pan_recognizer.py | Adjusts the low-confidence PAN regex lookaheads to avoid scanning beyond the candidate token. |
| presidio-analyzer/tests/test_in_pan_recognizer.py | Adds functional + runtime regression tests for the updated low-confidence PAN regex. |
    @@ -1,3 +1,5 @@
    import time

        assert_result(results[0], entities[0], 0, 10, 0.01)


    def test_low_confidence_pattern_does_not_backtrack(recognizer, entities):
Comment on lines +68 to +73
        text = "a " * 50000
        start = time.time()
        results = recognizer.analyze(text, entities)
        elapsed = time.time() - start
        assert results == []
        assert elapsed < 10
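
The test above relies on a single wall-clock threshold. As a rough complement, the sketch below times both versions of the pattern on two input sizes and prints how much slower the larger input is, so the growth rate is visible directly. It applies the regexes with plain `re` rather than through the recognizer, the helper name `timed_findall` and the input sizes are hypothetical, and the recognizer's own regex flags are not reproduced here.

```python
import re
import time

# Both patterns are copied from the diff in this PR.
OLD_PAN_LOW = re.compile(
    r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)
NEW_PAN_LOW = re.compile(
    r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)


def timed_findall(pattern, n_tokens):
    """Hypothetical helper: scan a boundary-rich text and time the scan."""
    text = "a " * n_tokens  # many word boundaries, no four-digit run
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start


found = []
# Sizes are kept small so the old pattern finishes quickly (assumed values).
for name, pattern in (("old", OLD_PAN_LOW), ("new", NEW_PAN_LOW)):
    small_matches, small_elapsed = timed_findall(pattern, 1_000)
    large_matches, large_elapsed = timed_findall(pattern, 2_000)
    found.extend(small_matches + large_matches)
    # Doubling the input should roughly double a linear scan
    # and roughly quadruple a quadratic one.
    ratio = large_elapsed / max(small_elapsed, 1e-9)
    print(f"{name}: {small_elapsed:.4f}s -> {large_elapsed:.4f}s (x{ratio:.1f})")
```

Timings vary by machine, so the printed ratios are indicative only; neither pattern should report a match on this input.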
      Pattern(
          "PAN (Low)",
    -     r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b",
    +     r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b",
Change Description
The slowdown is in the PAN (Low) pattern in `in_pan_recognizer.py`, and it shows up when the pattern is matched against a boundary-rich string with no four-digit run. Both lookaheads use `.*?`, so each one rescans the rest of the text at every word boundary, and match time is quadratic in input length. IN_PAN is enabled by default for English, so this sits on the analyze path for any analyzed text.

Before: `(?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10}`. The `.` walks across non-token characters, so the lookahead scan is unbounded.

After: `(?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10}`. The lookaheads only traverse the PAN token's own character class, so each stops at the first non-token character. Matching is linear (100 KB now takes ~9 ms), and every valid PAN whose four digits sit inside the token still matches at the same span and score.

Tradeoff: scoping the lookaheads also drops one previous match. A ten-letter token is no longer reported as a PAN when four digits happen to appear later in the text. A PAN's digits live inside the number, so that case was a false positive.
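
The behavior change can be checked with a small sketch that runs both patterns with plain `re`. The sample strings and the `spans` helper are hypothetical illustrations, not part of the PR's test suite, and the recognizer's regex flags and scoring are left out.

```python
import re

# Both patterns are copied from the diff in this PR.
OLD_PAN_LOW = re.compile(
    r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)
NEW_PAN_LOW = re.compile(
    r"\b((?=[\w@#$%^?~-]*?[a-zA-Z])(?=[\w@#$%^?~-]*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)


def spans(pattern, text):
    """Hypothetical helper: return the (start, end) span of every match."""
    return [m.span() for m in pattern.finditer(text)]


# Four digits sit inside the ten-character token: both patterns match.
embedded = "ABCDE1234F"
old_embedded = spans(OLD_PAN_LOW, embedded)
new_embedded = spans(NEW_PAN_LOW, embedded)

# Ten letters, with four digits only later in the text: the old pattern's
# unbounded lookahead finds the digits, the scoped lookahead does not.
later = "ABCDEFGHIJ then 1234"
old_later = spans(OLD_PAN_LOW, later)
new_later = spans(NEW_PAN_LOW, later)

print(old_embedded, new_embedded)  # [(0, 10)] [(0, 10)]
print(old_later, new_later)        # [(0, 10)] []
```

The second case is the dropped match described in the tradeoff: the old pattern reports the ten-letter token only because unrelated digits appear further along in the text.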
Issue reference
N/A
Checklist