perf(string): drop per-call Iter alloc in to_lower #3635
`String::to_lower` and `StringView::to_lower` both fall into a `for c in self.view(start_offset=idx)` loop after the all-lowercase fast-path guard. On the native target, that for-loop desugars into a heap-allocated `Iter` (a closure + size_hint pair), one per call. Combined with the `StringBuilder` and result `String` they already allocate, every uppercase-containing string passed through `to_lower` costs one extra heap object.

Replace the iterator loop with a plain UTF-16 code-unit loop over the view via `unsafe_get`. The conversion is unchanged (still ASCII-only per the existing `TODO`); any non-ASCII unit, including unpaired surrogate halves, is written through bit-identically because `write_char` of a code unit ≤ 0xFFFF emits exactly one UTF-16 unit.

## Numbers

Measured by running mizchi/pprof-mbt's `memprofile-native` against moonbitlang/async's `http_server_benchmark` under `wrk -t 8 -c 128 -d 8s` (~108k requests served, 4 headers each → ~430k `to_lower` calls):

* total allocations: 14 381 000 → 12 607 200 (−12.3 %, 1.77 M fewer)
* total bytes: 473.67 MB → 459.68 MB (−3.0 %, −14 MB)
* `to_lower` site: 8.44 MB / 466 100 allocs → gone from top sites
* `StringView::iter2` site (mostly the to_lower loop): 8.04 MB → 4.0 %

`StringBuilder::to_string` and the `StringBuilder` itself are still allocated; only the per-iter overhead goes away.

Tests: `moon test --target native -p builtin` (2806/2806) and `moon test --target wasm-gc -p builtin` (2848/2848). `moon fmt --check` clean.
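
The deltas quoted under Numbers can be re-derived from the before/after totals. The snippet below is only an arithmetic check in Python on the figures copied from the description; it measures nothing, and the helper name `delta` is made up for illustration.

```python
# Figures copied from the PR description; nothing here is newly measured.
allocs_before, allocs_after = 14_381_000, 12_607_200
mb_before, mb_after = 473.67, 459.68


def delta(before, after):
    """Return (absolute change, percent change relative to `before`)."""
    diff = after - before
    return diff, 100.0 * diff / before


alloc_diff, alloc_pct = delta(allocs_before, allocs_after)
mb_diff, mb_pct = delta(mb_before, mb_after)

# Rough call count: ~108k requests with 4 headers each.
approx_calls = 108_000 * 4

print(f"allocations: {alloc_diff:+,} ({alloc_pct:+.1f} %)")
print(f"bytes:       {mb_diff:+.2f} MB ({mb_pct:+.1f} %)")
print(f"to_lower calls (approx): {approx_calls:,}")
```

The results agree with the rounded values in the list above (about 1.77 M fewer allocations, about 14 MB fewer bytes, roughly 430k calls).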
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Optimizes `to_lower` for `StringView` and `String` by replacing iterator-based traversal with a manual loop intended to avoid a native-target heap allocation, while keeping the ASCII-only lowercase mapping behavior.

Changes:

- Replaced `for c in self.view(start_offset=idx)` with an index-based loop using `unsafe_get`.
- Implemented ASCII uppercase detection via integer range checks (`0x41..0x5A`) and conversion via `+ 32`.
- Added comments explaining the performance motivation and UTF-16/code-unit assumptions.
```diff
   let len = self.length()
   for i = idx; i < len; i = i + 1 {
     let cui = self.unsafe_get(i).to_int()
     if cui >= 0x41 && cui <= 0x5A {
       // 'A' is 65, 'a' is 97; difference is 32.
       buf.write_char((cui + 32).unsafe_to_char())
     } else {
-      buf.write_char(c)
+      buf.write_char(cui.unsafe_to_char())
     }
   }
```
The current version avoids the surrogate-half issue by not writing non-ASCII code units through `write_char(unsafe_to_char(...))`.

`write_ascii_lowercase_code_unit` is only called after `is_ascii_uppercase_code_unit`, so the unsafe conversion there is limited to ASCII `A..Z` -> `a..z`. All non-ASCII ranges, including valid surrogate pairs and unpaired surrogate halves, are copied with `write_substring`, preserving the original UTF-16 units verbatim.

I added coverage for both valid non-BMP surrogate-pair paths (`"🤣A"`, `"A🤣B"`, sliced `StringView`) and unpaired surrogate passthrough in 522ffda. Verified with:

- `moon test --target native -p builtin`: 2806 / 2806 pass
- `moon test --target wasm-gc -p builtin`: 2848 / 2848 pass
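
The span-copy approach described in this reply can be sketched outside MoonBit. The following is a Python model over lists of UTF-16 code units, not the PR's code: `utf16_units` and `to_lower_ascii` are hypothetical names, list slicing stands in for `write_substring`, and the model does not reproduce any abort behavior of the real substring path.

```python
def utf16_units(s: str) -> list[int]:
    """Encode a Python str as UTF-16 code units, keeping lone surrogates."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def to_lower_ascii(units: list[int]) -> list[int]:
    """Lowercase ASCII A..Z; copy every other code unit verbatim in spans."""
    out: list[int] = []
    span_start = 0  # start of the pending run of units that need no change
    for i, unit in enumerate(units):
        if 0x41 <= unit <= 0x5A:              # ASCII 'A'..'Z'
            out.extend(units[span_start:i])   # copy the pending span untouched
            out.append(unit + 32)             # 'A' (65) -> 'a' (97)
            span_start = i + 1
    out.extend(units[span_start:])            # trailing span
    return out


# Valid surrogate pair (U+1F923) before and between uppercase ASCII.
pair_then_upper = to_lower_ascii(utf16_units("\U0001F923A"))
upper_pair_upper = to_lower_ascii(utf16_units("A\U0001F923B"))
# Unpaired high surrogate followed by 'A'.
lone_high = to_lower_ascii([0xD83E, 0x41])
```

Because only units in `0x41..0x5A` are ever rewritten, surrogate halves are never converted to a character in this model; they are carried along inside the copied spans.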
Codex review pass: I reviewed the current head (522ffda) for correctness and did not find a blocker.

The refactor looks safe to me because it consistently uses UTF-16 code-unit offsets after the scan. That fixes the previous char-index/code-unit-index mismatch around non-BMP characters before uppercase ASCII. ASCII uppercase detection remains equivalent (`0x41..0x5A`), and non-ASCII spans are copied as substrings so valid surrogate pairs are preserved.

I also checked malformed-surrogate edge behavior around the added tests. The high-surrogate passthrough case is covered and preserved; a leading low-surrogate plus uppercase still aborts through the same substring-boundary class of path as the old implementation, so I do not see a compatibility regression there.

Validation run locally:

Note: a formal approval review could not be submitted because this account already has a pending review on the PR.
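
The char-index/code-unit-index mismatch mentioned in this review can be seen with a tiny Python model. Python's `str` indexes by code point, which plays the role of a char index here; `utf16_units` is a hypothetical helper, and none of this is MoonBit code.

```python
def utf16_units(s: str) -> list[int]:
    """Encode a Python str as UTF-16 code units, keeping lone surrogates."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


s = "\U0001F923A"            # one non-BMP character followed by 'A'
units = utf16_units(s)       # [0xD83E, 0xDD23, 0x41]

char_index = s.index("A")    # position counted in characters
unit_index = units.index(0x41)  # position counted in UTF-16 code units

# Using the char index as a code-unit offset lands on the low surrogate,
# not on 'A'.
unit_at_char_index = units[char_index]
```

Each non-BMP character ahead of the first uppercase letter shifts the two indices apart by one, which is why mixing them only goes wrong on inputs like `"🤣A"`.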
Note: we are working on making Unicode string iteration fast (it may land next week), so this is on hold so that we can make a comparison later.
Summary

`String::to_lower` and `StringView::to_lower` both fall into a `for c in self.view(start_offset=idx)` loop after the all-lowercase fast-path guard. On the native target, that for-loop desugars into a heap-allocated `Iter` (a closure + size_hint pair), one per call. Combined with the `StringBuilder` and result `String` they already allocate, every uppercase-containing string passed through `to_lower` costs one extra heap object.

Replace the iterator loop with a plain UTF-16 code-unit loop over the view via `unsafe_get`. The conversion is unchanged (still ASCII-only per the existing `TODO`); any non-ASCII unit, including unpaired surrogate halves, is written through bit-identically because `write_char` of a code unit ≤ 0xFFFF emits exactly one UTF-16 unit.

Numbers

Measured by running mizchi/pprof-mbt's `memprofile-native` against moonbitlang/async's `http_server_benchmark` under `wrk -t 8 -c 128 -d 8s` (~108 k requests served, ~4 headers each → ~430 k `to_lower` calls):

* total allocations: 14 381 000 → 12 607 200 (−12.3 %, 1.77 M fewer)
* total bytes: 473.67 MB → 459.68 MB (−3.0 %, −14 MB)
* `to_lower` attribution: 8.44 MB / 466 100 allocs → gone from top sites
* `StringView::iter2` (mostly the to_lower loop): 8.04 MB → 4.0 %

`StringBuilder::to_string` and the `StringBuilder` itself are still allocated; only the per-iter overhead goes away.
Why this is safe

The original iterator decoded the StringView into `Char` values (BMP and surrogate-pair-merged), then `write_char` would re-emit them: surrogate pairs as two code units, BMP chars as one.

The new loop iterates UTF-16 code units directly and writes each through `write_char` of the same code unit. Since every code unit (including surrogate halves 0xD800..0xDFFF) is ≤ 0xFFFF, the `<= 0xFFFFU` arm of `write_char` runs and writes the unit verbatim. Output is byte-identical to the original on all inputs.
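
The equivalence argued above can be modeled in Python. This is a sketch under stated assumptions, not the MoonBit implementation: `write_char` here follows only the behavior described in the text (one unit for values ≤ 0xFFFF, a surrogate pair otherwise), `iter_chars` assumes the old iterator yields unpaired halves unchanged, and the model says nothing about whether converting a surrogate half to a `Char` is valid in MoonBit, which the review thread discusses.

```python
def write_char(out: list[int], code_point: int) -> None:
    """Model of the described write_char: one unit if <= 0xFFFF, else a pair."""
    if code_point <= 0xFFFF:
        out.append(code_point)
    else:
        v = code_point - 0x10000
        out.append(0xD800 + (v >> 10))    # high surrogate
        out.append(0xDC00 + (v & 0x3FF))  # low surrogate


def iter_chars(units: list[int]):
    """Model of the old iterator: merge valid surrogate pairs.

    Assumption: an unpaired surrogate half is yielded as-is.
    """
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def via_chars(units: list[int]) -> list[int]:
    """Old path: decode to characters, then re-emit through write_char."""
    out: list[int] = []
    for cp in iter_chars(units):
        write_char(out, cp)
    return out


def via_units(units: list[int]) -> list[int]:
    """New path: pass each code unit straight to write_char."""
    out: list[int] = []
    for u in units:
        write_char(out, u)
    return out


samples = [
    [0x68, 0x69],                    # plain ASCII
    [0x41, 0xD83E, 0xDD23, 0x42],    # valid pair between ASCII
    [0xD83E, 0x41],                  # unpaired high surrogate
    [0xDD23],                        # unpaired low surrogate
]
results = [(via_chars(s), via_units(s)) for s in samples]
```

In this model both paths return the input units unchanged for every sample, which is the byte-identical property the paragraph claims.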
ASCII uppercase detection is the same `0x41..=0x5A` range, just expressed as raw code-unit comparison instead of `Char.is_ascii_uppercase()`.

Test plan

- `moon test --target native -p builtin`: 2806 / 2806 pass
- `moon test --target wasm-gc -p builtin`: 2848 / 2848 pass
- `moon fmt --check` clean

Same workflow as #3632 / #3633 / #3634 (native alloc profile a real workload → attack the top site), this time exercising a live HTTP server instead of a parser bench.