fix: Return the leftmost match from ReverseSuffix by JakubValtar · Pull Request #1355 · rust-lang/regex

JakubValtar · 2026-06-02T10:32:41Z

This PR delays returning the first half-start match from ReverseSuffix strategy until we confirm that there isn't another match after it that would be more leftmost, which can happen for patterns where alternates are overlapping.

I'm a first time contributor and went with a straightforward way to fix the issue, any feedback is welcome. If there's a better approach to fix it, I'm also happy to rework the PR.

Fixes Reverse suffix doesn't return the leftmost match when alternates overlap #1354

JakubValtar · 2026-06-02T11:26:30Z

Hm, this might need more work. It returns the first match correctly, but matches get skipped when searching for all matches.

regex-cli find match meta -p '.abb|b' -y 'zabbb'

  total matches:  2
0:0:4:zabb
0:4:5:b

Edit: This actually looks like the correct behavior for overlapping matches.

BurntSushi · 2026-06-03T03:44:25Z

That looks right to me for all non-overlapping successive matches.

Overlapping is a totally different ball game and the meta regex engine (which is where this bug is) doesn't provide a way to do overlapping search. You can only get overlapping when you drop down to the lower layers:

$ cargo r -qp regex-cli -- find half hybrid --no-table -p '.abb|b' -y 'zabbb' -kall --overlapping
0:3
0:4
0:5

BurntSushi · 2026-06-03T04:03:29Z

        input: &Input<'_>,
    ) -> Result<Option<HalfMatch>, RetryError> {
+        let max_len =
+            self.core.info.props_union().maximum_len().unwrap_or(usize::MAX);


This is my main hesitation with this fix. In all but very simple cases for fixed length regexes, max_len is going to be usize::MAX here. Which means any branch below for x >= max_len is basically never going to be true.

I imagine this was added for performance reasons.

Anyway, I'm going to dig into benchmarking this a bit more tomorrow. Thank you for getting the ball rolling here.

I think that's a fundamental issue with this strategy, you can't return a half-start until you know it's leftmost.

Previously, reverse suffix would often also bail early due to the quadratic behavior when the suffix could be matched elsewhere in the pattern, e.g.
regex-cli -- find match meta -p '[a-z]b' -y '000bbb', which I assume would be the case for many patterns that don't have max size.

Curious about the benchmarks.

I ran the full suite of rebar benchmarks for rust/regex before and after this PR, and there aren't any regressions there from what I can see:

[andrew@duff rebar]$ rebar diff tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost.csv -t 1.2 benchmark engine tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost.csv --------- ------ -------------------- ----------------------------------------- imported/sherlock/repeated-class-negation rust/regex 17.3 GB/s (1.39x) 24.0 GB/s (1.00x)

(Some benchmarks are noisy, so I'd ignore the improvement here.)

OK, so I dove deeper into this and I flubbed the benchmark. This change has significant performance regressions unfortunately. I'm going to need to think about this one.

The reverse inner optimization has the same kind of bug too.

Disabling these optimizations would fix the bug, but these are critical load bearing optimizations.

If you think about what this change is doing, it virtually guarantees that it will exhaustively search the entire remaining haystack for every occurrence of literal (assuming the existing quadratic trip-wire doesn't catch it first, which it won't in cases where the literal doesn't overlap with the first part of the regex). This is even worse than disabling the optimization in the first place and itself introduces more quadratic behavior when iterating over matches. So I think this particular fix is firmly bunk unfortunately.

We might need to actually compute whether there is overlap between the suffix (or inner) literal and the parts of the pattern that come before that. My quadratic trip-wire was partially meant as a hack to avoid doing that.

OK, so I dove deeper into this and I flubbed the benchmark.

Actual benchmark results:

[andrew@duff rebar]$ rebar diff tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost-04.csv -t 1.2 benchmark engine tmp/regex-master.csv tmp/regex-fix-reverse-suffix-leftmost-04.csv --------- ------ -------------------- -------------------------------------------- curated/06-cloud-flare-redos/simplified-long rust/regex 83.9 GB/s (1.00x) 55.1 GB/s (1.52x) hyperscan/literal-suffix-nosom rust/regex 19.2 GB/s (1.00x) 148.1 MB/s (132.48x) hyperscan/literal-suffix-som rust/regex 19.4 GB/s (1.00x) 129.8 MB/s (153.06x) hyperscan/literal-inner-nosom rust/regex 22.2 GB/s (1.00x) 311.7 MB/s (73.00x) hyperscan/literal-inner-som rust/regex 22.2 GB/s (1.00x) 292.5 MB/s (77.79x) imported/leipzig/shing rust/regex 25.7 GB/s (1.00x) 33.8 MB/s (776.71x) imported/leipzig/word-ending-nn rust/regex 31.5 GB/s (1.00x) 348.3 MB/s (92.73x) imported/leipzig/certain-long-strings-ending-x rust/regex 5.6 GB/s (1.00x) 330.9 MB/s (17.29x) imported/leipzig/tom-sawyer-huckle-fin-prefix-short rust/regex 19.3 GB/s (1.00x) 377.8 MB/s (52.24x) imported/leipzig/tom-sawyer-huckle-fin-prefix-long rust/regex 18.6 GB/s (1.00x) 410.9 MB/s (46.30x) imported/leipzig/ing rust/regex 3.1 GB/s (1.00x) 99.5 MB/s (31.78x) imported/leipzig/ing-whitespace rust/regex 3.0 GB/s (1.00x) 133.1 MB/s (22.76x) imported/leipzig/awyer-inn rust/regex 14.6 GB/s (1.00x) 153.5 MB/s (97.55x) imported/mariomka/email rust/regex 50.5 GB/s (1.00x) 1032.1 MB/s (50.11x) imported/mariomka/uri rust/regex 9.2 GB/s (1.00x) 11.1 MB/s (848.70x) imported/sherlock/before-holmes rust/regex 18.0 GB/s (1.00x) 147.8 MB/s (124.43x) imported/sherlock/before-after-holmes rust/regex 20.9 GB/s (1.00x) 297.1 MB/s (71.94x) imported/sherlock/word-ending-n rust/regex 838.4 MB/s (1.00x) 119.2 MB/s (7.03x) imported/sherlock/repeated-class-negation rust/regex 17.3 GB/s (1.00x) 368.4 MB/s (48.05x) imported/sherlock/ing-suffix rust/regex 3.5 GB/s (1.00x) 136.4 MB/s (25.91x) imported/sherlock/ing-suffix-limited-space rust/regex 3.5 GB/s (1.00x) 166.9 MB/s (21.76x) opt/reverse-inner/holmes rust/regex 19.1 GB/s (1.00x) 297.1 MB/s (65.75x) opt/reverse-inner/email rust/regex 82.2 GB/s (1.00x) 29.4 GB/s (2.80x) opt/reverse-inner/factored-prefix rust/regex 9.2 GB/s (1.00x) 40.0 MB/s (235.27x) opt/reverse-suffix/holmes rust/regex 19.3 GB/s (1.00x) 148.5 MB/s (133.19x) reported/i13-subset-regex/original-ascii rust/regex 5.6 GB/s (1.00x) 330.5 MB/s (17.37x) reported/i13-subset-regex/original-unicode rust/regex 345.6 MB/s (1.00x) 17.9 MB/s (19.35x) reported/i13-subset-regex/big-ascii rust/regex 92.1 MB/s (1.00x) 16.4 MB/s (5.62x) reported/i13-subset-regex/big-unicode rust/regex 36.1 MB/s (1.00x) 15.6 MB/s (2.31x)

Thanks for the updated benchmark!

This optimization looks for ends of the matches, but it needs to know the start to return the leftmost match, which sounds like quite a complex problem and might indeed need a change in approach or larger design changes rather than a "fix" like this PR.

I don't think I'm able to contribute further to resolve this issue (apart from high-level ideas) as I don't have enough experience with this crate. I appreciate you looking into it and I'm still curious about the eventual approach!

Oh it's absolutely brutal. See: #1364

I will take it from here. Thank you for the report and attempt. This turned out to be quite the bug!

This changes the code organization a bit in a way that makes a subsequent bug fix change a little easier to reason about.

A minimal reproducer of this bug is on a haystack of `zabb` with the regex `.bb|b`. The `regex` crate will report a match at `2..3`, but the correct match is `1..4`. While this seems like a simple regex, there are a pretty specific set of circumstances required to trigger the bug: 1. There are no prefix literals that activate a standard prefix literal scan. 2. There needs to be an extractable *suffix* or *inner* literal. 3. An actual match needs to be present in the haystack. 4. The regex and haystack Crucially, note that because of (3), this bug will never lead to `Regex::is_match` providing a false positive *or* a false negative. This bug is strictly about leftmost-first match semantics being incorrect in some cases and will report an incorrect match span. (4) could do with a bit more explanation, since it's rather subtle. Let's trace the minimal example through the regex crate's "reverse suffix" optimization. During compilation, there is no prefix literal that can be extracted. The `.` defeats that class of optimization. Moreover, there is a suffix literal in the regex. That is, all matches for `.bb|b` must end with `b`. The regex crate sees this and will scan for matches of `b`. It will then attempt to match the regex in reverse at each candidate match of `b`. Let's see what happens: * Find first occurrence of `b` at offset `2` in `zabb`. * Start reverse confirmation step at offset `2`. * The second alternation branch, `b` in `.bb|b`, matches at `2..3`. * The second alternation branch is reported as the overall match. This happens because the first alternation branch, `.bb`, does _not_ have a match ending at offset `3`. The fundamental problem here is that there is an overlap between the reverse automaton for confirming the match and the literal scan. Small changes, even to the haystack, can result in the bug disappearing. For example, with a haystack of `zbb`, the correct match of `0..3` is reported. This occurs because there is a quadratic "trip wire" that triggers in this case that causes the search to bail out and fall back to a DFA without using any literal optimizations. This bug also applies to the "reverse inner" optimization. This can happen when the literal is extracted from inside the regex as opposed to it being a suffix literal. For example, the regex `(?:..acbb|b)a(?:c|d)` on the haystack `xzbacbbac` reported a match at `2..5`, but the correct match is `1..9`. Note that #1355 technically fixes this problem and is much simpler, but in so doing, makes the reverse suffix and inner optimizations completely ineffective. Fixes #1354, Closes #1355

BurntSushi mentioned this pull request Jun 2, 2026

Reverse suffix doesn't return the leftmost match when alternates overlap #1354

Open

orlp mentioned this pull request Jun 2, 2026

regex and regex_lite disagree on lazy empty matches #1357

Open

BurntSushi reviewed Jun 3, 2026

View reviewed changes

automata: slight refactor

436e80a

This changes the code organization a bit in a way that makes a subsequent bug fix change a little easier to reason about.

BurntSushi force-pushed the fix-reverse-suffix-leftmost branch from b2cf1a6 to cf290f1 Compare June 4, 2026 03:43

BurntSushi added 2 commits June 4, 2026 00:06

automata: fix a bug in the reverse suffix/inner search

a7bfc8e

automata: another reverse suffix fix

c8f0287

BurntSushi force-pushed the fix-reverse-suffix-leftmost branch from cf290f1 to c8f0287 Compare June 4, 2026 04:06

BurntSushi mentioned this pull request Jun 7, 2026

automata: fix bug in reverse suffix/inner optimization #1364

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Return the leftmost match from ReverseSuffix#1355

fix: Return the leftmost match from ReverseSuffix#1355
JakubValtar wants to merge 3 commits into
rust-lang:masterfrom
JakubValtar:fix-reverse-suffix-leftmost

JakubValtar commented Jun 2, 2026 •

edited

Loading

Uh oh!

JakubValtar commented Jun 2, 2026 •

edited

Loading

Uh oh!

BurntSushi commented Jun 3, 2026

Uh oh!

BurntSushi Jun 3, 2026

Uh oh!

JakubValtar Jun 3, 2026

Uh oh!

BurntSushi Jun 4, 2026

Uh oh!

BurntSushi Jun 5, 2026 •

edited

Loading

Uh oh!

BurntSushi Jun 5, 2026 •

edited

Loading

Uh oh!

BurntSushi Jun 5, 2026

Uh oh!

BurntSushi Jun 5, 2026

Uh oh!

JakubValtar Jun 8, 2026 •

edited

Loading

Uh oh!

BurntSushi Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JakubValtar commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JakubValtar commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BurntSushi commented Jun 3, 2026

Uh oh!

BurntSushi Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

JakubValtar Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

BurntSushi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

BurntSushi Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BurntSushi Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BurntSushi Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

BurntSushi Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

JakubValtar Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BurntSushi Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JakubValtar commented Jun 2, 2026 •

edited

Loading

JakubValtar commented Jun 2, 2026 •

edited

Loading

BurntSushi Jun 5, 2026 •

edited

Loading

BurntSushi Jun 5, 2026 •

edited

Loading

JakubValtar Jun 8, 2026 •

edited

Loading