Cache compiled regex in unicode_regex_split_stl (ByteLevel hotspot) (#197) #197

Merged: meta-codesync[bot] merged 1 commit into meta-pytorch:main from joshuuuasu:export-D108634865 on Jun 15, 2026

Conversation

@joshuuuasu commented Jun 15, 2026

Contributor

Summary:

The ByteLevel pre-tokenizer's STL regex path in
third-party/llama.cpp-unicode/src/unicode.cpp recompiled its split regex on
every call to unicode_regex_split_stl (~497 compiles per request over a SID
prompt), dominating tokenize latency. A single std::regex/std::wregex compile
is expensive and the set of patterns is small and fixed, so we cache the
compiled regex per pattern.

This diff adds a function-local static
`unordered_map<pattern, shared_ptr<const regex>>` guarded by a std::mutex in both
unicode_regex_split_stl overloads (std::wregex and std::regex). Each lookup
returns the compiled regex as a `shared_ptr<const regex>`, so threads in the
multi-threaded tokenizer pool can match against it concurrently; matching on a
const std::regex from multiple threads is thread-safe. Behavior is identical by
construction (same pattern + flags -> same compiled regex -> same matches).
Adds `<memory>` and `<mutex>`.
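
The caching scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in unicode.cpp: the helper names `get_cached_regex` and `split_matches` are hypothetical, the sketch covers only the std::regex case (the std::wregex overload would need its own cache keyed on a wide string), and it assumes default regex flags, so the pattern string alone serves as the cache key.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (name not taken from unicode.cpp): returns a shared,
// immutable compiled regex for `pattern`, compiling it at most once.
static std::shared_ptr<const std::regex> get_cached_regex(const std::string & pattern) {
    // Function-local statics: initialized once, in a thread-safe way (C++11).
    static std::mutex mtx;
    static std::unordered_map<std::string, std::shared_ptr<const std::regex>> cache;

    // The lock covers lookup and insertion only; matching happens outside it.
    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(pattern);
    if (it == cache.end()) {
        // Only the first call for a given pattern pays the compile cost.
        it = cache.emplace(pattern, std::make_shared<std::regex>(pattern)).first;
    }
    return it->second;
}

// Simplified stand-in for the splitting step: collects every match of
// `pattern` in `text` using the cached, const regex object.
static std::vector<std::string> split_matches(const std::string & text, const std::string & pattern) {
    const std::shared_ptr<const std::regex> re = get_cached_regex(pattern);
    assert(re != nullptr);

    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), *re), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Returning a `shared_ptr<const std::regex>` rather than a reference keeps the compiled object alive for any caller still matching against it, and the mutex is held only for the map lookup, so concurrent matching is not serialized.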

Measured win (model 2119730608, constrained decoding on):

| Metric                         | Before   | After    |
|--------------------------------|----------|----------|
| Tokenizer.encode (bench)       | 144 ms   | 1.37 ms  |
| Server tokenize                | ~97 ms   | ~1.7 ms  |
| gr_loadgen greedy client p50   | 166.7 ms | 68.9 ms  |
| gr_loadgen beam=10 client p50  | 199.2 ms | 91.3 ms  |
| gr_loadgen client p99          | 272 ms   | 72 ms    |
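
To put the measurements above on a common scale, the before/after latencies can be turned into speedup ratios. The small helpers below are illustrative only (the names `speedup` and `reduction_percent` are not from the PR); they simply divide the "Before" value by the "After" value and compute the percentage saved.

```cpp
#include <cassert>

// Speedup factor: how many times faster "after" is than "before".
static double speedup(double before_ms, double after_ms) {
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

// Share of the original latency that was removed, in percent.
static double reduction_percent(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return 100.0 * (before_ms - after_ms) / before_ms;
}
```

Applied to the rows above, `speedup(144.0, 1.37)` is roughly 105x for the Tokenizer.encode benchmark, while `speedup(166.7, 68.9)` is roughly 2.4x for the greedy client p50, which is consistent with tokenization being only one part of end-to-end request latency.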

This is upstreamable to llama.cpp (MIT) and we intend to send it there.

Differential Revision: D108634865

meta-cla bot added the CLA Signed label on Jun 15, 2026

meta-codesync Bot commented Jun 15, 2026

Contributor

@joshuuuasu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108634865.

meta-codesync bot changed the title from "Cache compiled regex in unicode_regex_split_stl (ByteLevel hotspot)" to "Cache compiled regex in unicode_regex_split_stl (ByteLevel hotspot) (#197)" on Jun 15, 2026
meta-codesync bot merged commit dd727c3 into meta-pytorch:main on Jun 15, 2026
7 of 8 checks passed
Labels: CLA Signed, meta-exported
