fix: Return the leftmost match from ReverseSuffix#1355
Conversation
|
Hm, this might need more work. It returns the first match correctly, but matches get skipped when searching for all matches. Edit: This actually looks like the correct behavior for overlapping matches. |
|
That looks right to me for all non-overlapping successive matches. Overlapping is a totally different ball game and the meta regex engine (which is where this bug is) doesn't provide a way to do overlapping search. You can only get overlapping when you drop down to the lower layers: |
| input: &Input<'_>, | ||
| ) -> Result<Option<HalfMatch>, RetryError> { | ||
| let max_len = | ||
| self.core.info.props_union().maximum_len().unwrap_or(usize::MAX); |
There was a problem hiding this comment.
This is my main hesitation with this fix. In all but very simple cases for fixed length regexes, max_len is going to be usize::MAX here. Which means any branch below for x >= max_len is basically never going to be true.
I imagine this was added for performance reasons.
Anyway, I'm going to dig into benchmarking this a bit more tomorrow. Thank you for getting the ball rolling here.
There was a problem hiding this comment.
I think that's a fundamental issue with this strategy, you can't return a half-start until you know it's leftmost.
Previously, reverse suffix would often also bail early due to the quadratic behavior when the suffix could be matched elsewhere in the pattern, e.g.
regex-cli -- find match meta -p '[a-z]b' -y '000bbb', which I assume would be the case for many patterns that don't have max size.
Curious about the benchmarks.
There was a problem hiding this comment.
I ran the full suite of rebar benchmarks for rust/regex before and after this PR, and there aren't any regressions there from what I can see:
[andrew@duff rebar]$ rebar diff tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost.csv -t 1.2
benchmark engine tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost.csv
--------- ------ -------------------- -----------------------------------------
imported/sherlock/repeated-class-negation rust/regex 17.3 GB/s (1.39x) 24.0 GB/s (1.00x)
(Some benchmarks are noisy, so I'd ignore the improvement here.)
There was a problem hiding this comment.
OK, so I dove deeper into this and I flubbed the benchmark. This change has significant performance regressions unfortunately. I'm going to need to think about this one.
The reverse inner optimization has the same kind of bug too.
Disabling these optimizations would fix the bug, but these are critical load bearing optimizations.
There was a problem hiding this comment.
If you think about what this change is doing, it virtually guarantees that it will exhaustively search the entire remaining haystack for every occurrence of literal (assuming the existing quadratic trip-wire doesn't catch it first, which it won't in cases where the literal doesn't overlap with the first part of the regex). This is even worse than disabling the optimization in the first place and itself introduces more quadratic behavior when iterating over matches. So I think this particular fix is firmly bunk unfortunately.
There was a problem hiding this comment.
We might need to actually compute whether there is overlap between the suffix (or inner) literal and the parts of the pattern that come before that. My quadratic trip-wire was partially meant as a hack to avoid doing that.
There was a problem hiding this comment.
OK, so I dove deeper into this and I flubbed the benchmark.
Actual benchmark results:
[andrew@duff rebar]$ rebar diff tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost-04.csv -t 1.2
benchmark engine tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost-04.csv
--------- ------ -------------------- --------------------------------------------
curated/06-cloud-flare-redos/simplified-long rust/regex 83.9 GB/s (1.00x) 55.1 GB/s (1.52x)
hyperscan/literal-suffix-nosom rust/regex 19.2 GB/s (1.00x) 148.1 MB/s (132.48x)
hyperscan/literal-suffix-som rust/regex 19.4 GB/s (1.00x) 129.8 MB/s (153.06x)
hyperscan/literal-inner-nosom rust/regex 22.2 GB/s (1.00x) 311.7 MB/s (73.00x)
hyperscan/literal-inner-som rust/regex 22.2 GB/s (1.00x) 292.5 MB/s (77.79x)
imported/leipzig/shing rust/regex 25.7 GB/s (1.00x) 33.8 MB/s (776.71x)
imported/leipzig/word-ending-nn rust/regex 31.5 GB/s (1.00x) 348.3 MB/s (92.73x)
imported/leipzig/certain-long-strings-ending-x rust/regex 5.6 GB/s (1.00x) 330.9 MB/s (17.29x)
imported/leipzig/tom-sawyer-huckle-fin-prefix-short rust/regex 19.3 GB/s (1.00x) 377.8 MB/s (52.24x)
imported/leipzig/tom-sawyer-huckle-fin-prefix-long rust/regex 18.6 GB/s (1.00x) 410.9 MB/s (46.30x)
imported/leipzig/ing rust/regex 3.1 GB/s (1.00x) 99.5 MB/s (31.78x)
imported/leipzig/ing-whitespace rust/regex 3.0 GB/s (1.00x) 133.1 MB/s (22.76x)
imported/leipzig/awyer-inn rust/regex 14.6 GB/s (1.00x) 153.5 MB/s (97.55x)
imported/mariomka/email rust/regex 50.5 GB/s (1.00x) 1032.1 MB/s (50.11x)
imported/mariomka/uri rust/regex 9.2 GB/s (1.00x) 11.1 MB/s (848.70x)
imported/sherlock/before-holmes rust/regex 18.0 GB/s (1.00x) 147.8 MB/s (124.43x)
imported/sherlock/before-after-holmes rust/regex 20.9 GB/s (1.00x) 297.1 MB/s (71.94x)
imported/sherlock/word-ending-n rust/regex 838.4 MB/s (1.00x) 119.2 MB/s (7.03x)
imported/sherlock/repeated-class-negation rust/regex 17.3 GB/s (1.00x) 368.4 MB/s (48.05x)
imported/sherlock/ing-suffix rust/regex 3.5 GB/s (1.00x) 136.4 MB/s (25.91x)
imported/sherlock/ing-suffix-limited-space rust/regex 3.5 GB/s (1.00x) 166.9 MB/s (21.76x)
opt/reverse-inner/holmes rust/regex 19.1 GB/s (1.00x) 297.1 MB/s (65.75x)
opt/reverse-inner/email rust/regex 82.2 GB/s (1.00x) 29.4 GB/s (2.80x)
opt/reverse-inner/factored-prefix rust/regex 9.2 GB/s (1.00x) 40.0 MB/s (235.27x)
opt/reverse-suffix/holmes rust/regex 19.3 GB/s (1.00x) 148.5 MB/s (133.19x)
reported/i13-subset-regex/original-ascii rust/regex 5.6 GB/s (1.00x) 330.5 MB/s (17.37x)
reported/i13-subset-regex/original-unicode rust/regex 345.6 MB/s (1.00x) 17.9 MB/s (19.35x)
reported/i13-subset-regex/big-ascii rust/regex 92.1 MB/s (1.00x) 16.4 MB/s (5.62x)
reported/i13-subset-regex/big-unicode rust/regex 36.1 MB/s (1.00x) 15.6 MB/s (2.31x)
There was a problem hiding this comment.
Thanks for the updated benchmark!
This optimization looks for ends of the matches, but it needs to know the start to return the leftmost match, which sounds like quite a complex problem and might indeed need a change in approach or larger design changes rather than a "fix" like this PR.
I don't think I'm able to contribute further to resolve this issue (apart from high-level ideas) as I don't have enough experience with this crate. I appreciate you looking into it and I'm still curious about the eventual approach!
There was a problem hiding this comment.
Oh it's absolutely brutal. See: #1364
I will take it from here. Thank you for the report and attempt. This turned out to be quite the bug!
This changes the code organization a bit in a way that makes a subsequent bug fix change a little easier to reason about.
b2cf1a6 to
cf290f1
Compare
cf290f1 to
c8f0287
Compare
A minimal reproducer of this bug is on a haystack of `zabb` with the regex `.bb|b`. The `regex` crate will report a match at `2..3`, but the correct match is `1..4`. While this seems like a simple regex, there are a pretty specific set of circumstances required to trigger the bug: 1. There are no prefix literals that activate a standard prefix literal scan. 2. There needs to be an extractable *suffix* or *inner* literal. 3. An actual match needs to be present in the haystack. 4. The regex and haystack Crucially, note that because of (3), this bug will never lead to `Regex::is_match` providing a false positive *or* a false negative. This bug is strictly about leftmost-first match semantics being incorrect in some cases and will report an incorrect match span. (4) could do with a bit more explanation, since it's rather subtle. Let's trace the minimal example through the regex crate's "reverse suffix" optimization. During compilation, there is no prefix literal that can be extracted. The `.` defeats that class of optimization. Moreover, there is a suffix literal in the regex. That is, all matches for `.bb|b` must end with `b`. The regex crate sees this and will scan for matches of `b`. It will then attempt to match the regex in reverse at each candidate match of `b`. Let's see what happens: * Find first occurrence of `b` at offset `2` in `zabb`. * Start reverse confirmation step at offset `2`. * The second alternation branch, `b` in `.bb|b`, matches at `2..3`. * The second alternation branch is reported as the overall match. This happens because the first alternation branch, `.bb`, does _not_ have a match ending at offset `3`. The fundamental problem here is that there is an overlap between the reverse automaton for confirming the match and the literal scan. Small changes, even to the haystack, can result in the bug disappearing. For example, with a haystack of `zbb`, the correct match of `0..3` is reported. This occurs because there is a quadratic "trip wire" that triggers in this case that causes the search to bail out and fall back to a DFA without using any literal optimizations. This bug also applies to the "reverse inner" optimization. This can happen when the literal is extracted from inside the regex as opposed to it being a suffix literal. For example, the regex `(?:..acbb|b)a(?:c|d)` on the haystack `xzbacbbac` reported a match at `2..5`, but the correct match is `1..9`. Note that #1355 technically fixes this problem and is much simpler, but in so doing, makes the reverse suffix and inner optimizations completely ineffective. Fixes #1354, Closes #1355
This PR delays returning the first half-start match from
ReverseSuffixstrategy until we confirm that there isn't another match after it that would be more leftmost, which can happen for patterns where alternates are overlapping.I'm a first time contributor and went with a straightforward way to fix the issue, any feedback is welcome. If there's a better approach to fix it, I'm also happy to rework the PR.