Skip to content

fix: Return the leftmost match from ReverseSuffix#1355

Open
JakubValtar wants to merge 3 commits into
rust-lang:masterfrom
JakubValtar:fix-reverse-suffix-leftmost
Open

fix: Return the leftmost match from ReverseSuffix#1355
JakubValtar wants to merge 3 commits into
rust-lang:masterfrom
JakubValtar:fix-reverse-suffix-leftmost

Conversation

@JakubValtar

@JakubValtar JakubValtar commented Jun 2, 2026

Copy link
Copy Markdown

This PR delays returning the first half-start match from ReverseSuffix strategy until we confirm that there isn't another match after it that would be more leftmost, which can happen for patterns where alternates are overlapping.

I'm a first time contributor and went with a straightforward way to fix the issue, any feedback is welcome. If there's a better approach to fix it, I'm also happy to rework the PR.

@JakubValtar

JakubValtar commented Jun 2, 2026

Copy link
Copy Markdown
Author

Hm, this might need more work. It returns the first match correctly, but matches get skipped when searching for all matches.

regex-cli find match meta -p '.abb|b' -y 'zabbb'

  total matches:  2
0:0:4:zabb
0:4:5:b

Edit: This actually looks like the correct behavior for overlapping matches.

@BurntSushi

Copy link
Copy Markdown
Member

That looks right to me for all non-overlapping successive matches.

Overlapping is a totally different ball game and the meta regex engine (which is where this bug is) doesn't provide a way to do overlapping search. You can only get overlapping when you drop down to the lower layers:

$ cargo r -qp regex-cli -- find half hybrid --no-table -p '.abb|b' -y 'zabbb' -kall --overlapping
0:3
0:4
0:5

Comment thread regex-automata/src/meta/strategy.rs Outdated
input: &Input<'_>,
) -> Result<Option<HalfMatch>, RetryError> {
let max_len =
self.core.info.props_union().maximum_len().unwrap_or(usize::MAX);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my main hesitation with this fix. In all but very simple cases for fixed length regexes, max_len is going to be usize::MAX here. Which means any branch below for x >= max_len is basically never going to be true.

I imagine this was added for performance reasons.

Anyway, I'm going to dig into benchmarking this a bit more tomorrow. Thank you for getting the ball rolling here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a fundamental issue with this strategy, you can't return a half-start until you know it's leftmost.

Previously, reverse suffix would often also bail early due to the quadratic behavior when the suffix could be matched elsewhere in the pattern, e.g.
regex-cli -- find match meta -p '[a-z]b' -y '000bbb', which I assume would be the case for many patterns that don't have max size.

Curious about the benchmarks.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the full suite of rebar benchmarks for rust/regex before and after this PR, and there aren't any regressions there from what I can see:

[andrew@duff rebar]$ rebar diff tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost.csv -t 1.2
benchmark                                  engine      tmp/regex-master.csv  tmp/regex-fix-reverse-suffix-leftmost.csv
---------                                  ------      --------------------  -----------------------------------------
imported/sherlock/repeated-class-negation  rust/regex  17.3 GB/s (1.39x)     24.0 GB/s (1.00x)

(Some benchmarks are noisy, so I'd ignore the improvement here.)

@BurntSushi BurntSushi Jun 5, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I dove deeper into this and I flubbed the benchmark. This change has significant performance regressions unfortunately. I'm going to need to think about this one.

The reverse inner optimization has the same kind of bug too.

Disabling these optimizations would fix the bug, but these are critical load bearing optimizations.

@BurntSushi BurntSushi Jun 5, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think about what this change is doing, it virtually guarantees that it will exhaustively search the entire remaining haystack for every occurrence of literal (assuming the existing quadratic trip-wire doesn't catch it first, which it won't in cases where the literal doesn't overlap with the first part of the regex). This is even worse than disabling the optimization in the first place and itself introduces more quadratic behavior when iterating over matches. So I think this particular fix is firmly bunk unfortunately.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to actually compute whether there is overlap between the suffix (or inner) literal and the parts of the pattern that come before that. My quadratic trip-wire was partially meant as a hack to avoid doing that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I dove deeper into this and I flubbed the benchmark.

Actual benchmark results:

[andrew@duff rebar]$ rebar diff tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost-04.csv -t 1.2
benchmark                                            engine      tmp/regex-master.csv  tmp/regex-fix-reverse-suffix-leftmost-04.csv
---------                                            ------      --------------------  --------------------------------------------
curated/06-cloud-flare-redos/simplified-long         rust/regex  83.9 GB/s (1.00x)     55.1 GB/s (1.52x)
hyperscan/literal-suffix-nosom                       rust/regex  19.2 GB/s (1.00x)     148.1 MB/s (132.48x)
hyperscan/literal-suffix-som                         rust/regex  19.4 GB/s (1.00x)     129.8 MB/s (153.06x)
hyperscan/literal-inner-nosom                        rust/regex  22.2 GB/s (1.00x)     311.7 MB/s (73.00x)
hyperscan/literal-inner-som                          rust/regex  22.2 GB/s (1.00x)     292.5 MB/s (77.79x)
imported/leipzig/shing                               rust/regex  25.7 GB/s (1.00x)     33.8 MB/s (776.71x)
imported/leipzig/word-ending-nn                      rust/regex  31.5 GB/s (1.00x)     348.3 MB/s (92.73x)
imported/leipzig/certain-long-strings-ending-x       rust/regex  5.6 GB/s (1.00x)      330.9 MB/s (17.29x)
imported/leipzig/tom-sawyer-huckle-fin-prefix-short  rust/regex  19.3 GB/s (1.00x)     377.8 MB/s (52.24x)
imported/leipzig/tom-sawyer-huckle-fin-prefix-long   rust/regex  18.6 GB/s (1.00x)     410.9 MB/s (46.30x)
imported/leipzig/ing                                 rust/regex  3.1 GB/s (1.00x)      99.5 MB/s (31.78x)
imported/leipzig/ing-whitespace                      rust/regex  3.0 GB/s (1.00x)      133.1 MB/s (22.76x)
imported/leipzig/awyer-inn                           rust/regex  14.6 GB/s (1.00x)     153.5 MB/s (97.55x)
imported/mariomka/email                              rust/regex  50.5 GB/s (1.00x)     1032.1 MB/s (50.11x)
imported/mariomka/uri                                rust/regex  9.2 GB/s (1.00x)      11.1 MB/s (848.70x)
imported/sherlock/before-holmes                      rust/regex  18.0 GB/s (1.00x)     147.8 MB/s (124.43x)
imported/sherlock/before-after-holmes                rust/regex  20.9 GB/s (1.00x)     297.1 MB/s (71.94x)
imported/sherlock/word-ending-n                      rust/regex  838.4 MB/s (1.00x)    119.2 MB/s (7.03x)
imported/sherlock/repeated-class-negation            rust/regex  17.3 GB/s (1.00x)     368.4 MB/s (48.05x)
imported/sherlock/ing-suffix                         rust/regex  3.5 GB/s (1.00x)      136.4 MB/s (25.91x)
imported/sherlock/ing-suffix-limited-space           rust/regex  3.5 GB/s (1.00x)      166.9 MB/s (21.76x)
opt/reverse-inner/holmes                             rust/regex  19.1 GB/s (1.00x)     297.1 MB/s (65.75x)
opt/reverse-inner/email                              rust/regex  82.2 GB/s (1.00x)     29.4 GB/s (2.80x)
opt/reverse-inner/factored-prefix                    rust/regex  9.2 GB/s (1.00x)      40.0 MB/s (235.27x)
opt/reverse-suffix/holmes                            rust/regex  19.3 GB/s (1.00x)     148.5 MB/s (133.19x)
reported/i13-subset-regex/original-ascii             rust/regex  5.6 GB/s (1.00x)      330.5 MB/s (17.37x)
reported/i13-subset-regex/original-unicode           rust/regex  345.6 MB/s (1.00x)    17.9 MB/s (19.35x)
reported/i13-subset-regex/big-ascii                  rust/regex  92.1 MB/s (1.00x)     16.4 MB/s (5.62x)
reported/i13-subset-regex/big-unicode                rust/regex  36.1 MB/s (1.00x)     15.6 MB/s (2.31x)

@JakubValtar JakubValtar Jun 8, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated benchmark!

This optimization looks for ends of the matches, but it needs to know the start to return the leftmost match, which sounds like quite a complex problem and might indeed need a change in approach or larger design changes rather than a "fix" like this PR.

I don't think I'm able to contribute further to resolve this issue (apart from high-level ideas) as I don't have enough experience with this crate. I appreciate you looking into it and I'm still curious about the eventual approach!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it's absolutely brutal. See: #1364

I will take it from here. Thank you for the report and attempt. This turned out to be quite the bug!

This changes the code organization a bit in a way that makes
a subsequent bug fix change a little easier to reason about.
@BurntSushi BurntSushi force-pushed the fix-reverse-suffix-leftmost branch from b2cf1a6 to cf290f1 Compare June 4, 2026 03:43
@BurntSushi BurntSushi force-pushed the fix-reverse-suffix-leftmost branch from cf290f1 to c8f0287 Compare June 4, 2026 04:06
BurntSushi added a commit that referenced this pull request Jun 7, 2026
A minimal reproducer of this bug is on a haystack of `zabb` with the
regex `.bb|b`. The `regex` crate will report a match at `2..3`, but the
correct match is `1..4`.

While this seems like a simple regex, there are a pretty specific set
of circumstances required to trigger the bug:

1. There are no prefix literals that activate a standard prefix
   literal scan.
2. There needs to be an extractable *suffix* or *inner* literal.
3. An actual match needs to be present in the haystack.
4. The regex and haystack

Crucially, note that because of (3), this bug will never lead to
`Regex::is_match` providing a false positive *or* a false negative.
This bug is strictly about leftmost-first match semantics being
incorrect in some cases and will report an incorrect match span.

(4) could do with a bit more explanation, since it's rather subtle.
Let's trace the minimal example through the regex crate's "reverse
suffix" optimization.

During compilation, there is no prefix literal that can be extracted.
The `.` defeats that class of optimization. Moreover, there is a suffix
literal in the regex. That is, all matches for `.bb|b` must end with
`b`. The regex crate sees this and will scan for matches of `b`. It will
then attempt to match the regex in reverse at each candidate match of
`b`. Let's see what happens:

* Find first occurrence of `b` at offset `2` in `zabb`.
* Start reverse confirmation step at offset `2`.
* The second alternation branch, `b` in `.bb|b`, matches at `2..3`.
* The second alternation branch is reported as the overall match.
  This happens because the first alternation branch, `.bb`, does _not_
  have a match ending at offset `3`.

The fundamental problem here is that there is an overlap between the
reverse automaton for confirming the match and the literal scan. Small
changes, even to the haystack, can result in the bug disappearing.
For example, with a haystack of `zbb`, the correct match of `0..3` is
reported. This occurs because there is a quadratic "trip wire" that
triggers in this case that causes the search to bail out and fall back
to a DFA without using any literal optimizations.

This bug also applies to the "reverse inner" optimization. This
can happen when the literal is extracted from inside the regex
as opposed to it being a suffix literal. For example, the regex
`(?:..acbb|b)a(?:c|d)` on the haystack `xzbacbbac` reported a match at
`2..5`, but the correct match is `1..9`.

Note that #1355 technically fixes this problem and is much simpler, but
in so doing, makes the reverse suffix and inner optimizations completely
ineffective.

Fixes #1354, Closes #1355
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reverse suffix doesn't return the leftmost match when alternates overlap

2 participants