Cache compiled regex in unicode_regex_split_stl (ByteLevel hotspot) (#197)
Merged
meta-codesync[bot] merged 1 commit on Jun 15, 2026
@joshuuuasu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108634865.
…eta-pytorch#197)

Summary:
The ByteLevel pre-tokenizer's STL regex path in third-party/llama.cpp-unicode/src/unicode.cpp recompiled its split regex on every call to unicode_regex_split_stl (~497 compiles per request over a SID prompt), dominating tokenize latency. A single std::regex/std::wregex compile is expensive and the set of patterns is small and fixed, so we cache the compiled regex per pattern.

This diff adds a function-local static unordered_map<pattern, shared_ptr<const regex>> guarded by a std::mutex in BOTH unicode_regex_split_stl overloads (std::wregex and std::regex). The compiled regex is returned as a shared_ptr<const regex> and matched concurrently across the multi-threaded tokenizer pool; matching on a const std::regex from multiple threads is thread-safe. Behavior is identical by construction (same pattern + flags -> same compiled regex -> same matches). Adds <memory> and <mutex>.

Measured win (model 2119730608, constrained decoding on):

| Metric                        | Before   | After   |
|-------------------------------|----------|---------|
| Tokenizer.encode (bench)      | 144 ms   | 1.37 ms |
| Server tokenize               | ~97 ms   | ~1.7 ms |
| gr_loadgen greedy client p50  | 166.7 ms | 68.9 ms |
| gr_loadgen beam=10 client p50 | 199.2 ms | 91.3 ms |
| gr_loadgen client p99         | 272 ms   | 72 ms   |

This is upstreamable to llama.cpp (MIT) and we intend to send it there.

Differential Revision: D108634865
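For readers without access to the diff, the caching pattern described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in unicode.cpp: the helper names `get_cached_regex` and `count_matches` are hypothetical, only the std::regex flavor is shown (the std::wregex overload would follow the same shape), and the default ECMAScript flags are assumed.

```cpp
#include <cstddef>
#include <iterator>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>

// Hypothetical helper (not the PR's actual code): returns the compiled regex
// for `pattern`, compiling it at most once per distinct pattern string.
static std::shared_ptr<const std::regex> get_cached_regex(const std::string & pattern) {
    // Function-local statics: initialized once, thread-safely, on first use.
    static std::mutex mtx;
    static std::unordered_map<std::string, std::shared_ptr<const std::regex>> cache;

    // The lock covers both lookup and insertion. Compiling under the lock
    // serializes first-time compiles, which is acceptable when the set of
    // patterns is small and fixed.
    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(pattern);
    if (it == cache.end()) {
        it = cache.emplace(pattern, std::make_shared<std::regex>(pattern)).first;
    }
    return it->second;
}

// Hypothetical caller: counts non-overlapping matches of `pattern` in `text`.
// Matching happens outside the lock, on a const regex that is never modified.
static std::size_t count_matches(const std::string & text, const std::string & pattern) {
    const std::shared_ptr<const std::regex> re = get_cached_regex(pattern);
    const auto n = std::distance(std::sregex_iterator(text.begin(), text.end(), *re),
                                 std::sregex_iterator());
    return static_cast<std::size_t>(n);
}
```

Repeated calls with the same pattern string return the same shared regex object, so only the first call pays the compile cost. Handing out a shared_ptr<const regex> keeps the object alive for any thread still matching against it and makes accidental mutation a compile error.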
08d703c to e7dfa41
larryliu0820 approved these changes Jun 15, 2026