Skip to content

perf(string): drop per-call Iter alloc in to_lower#3635

Open
mizchi wants to merge 4 commits into
moonbitlang:mainfrom
mizchi:string/to-lower-no-iter-alloc
Open

perf(string): drop per-call Iter alloc in to_lower#3635
mizchi wants to merge 4 commits into
moonbitlang:mainfrom
mizchi:string/to-lower-no-iter-alloc

Conversation

@mizchi

@mizchi mizchi commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

String::to_lower and StringView::to_lower both fall into a
for c in self.view(start_offset=idx) loop after the all-lowercase
fast-path guard. On the native target, that for-loop desugars into a
heap-allocated Iter (a closure + size_hint pair) — one per call.
Combined with the StringBuilder and result String they already
allocate, every uppercase-containing string passed through to_lower
costs one extra heap object.

Replace the iterator loop with a plain UTF-16 code-unit loop over the
view via unsafe_get. The conversion is unchanged (still ASCII-only
per the existing TODO); any non-ASCII unit — including unpaired
surrogate halves — is written through bit-identically because
write_char of a code unit ≤ 0xFFFF emits exactly one UTF-16 unit.

Numbers

Measured by running mizchi/pprof-mbt's memprofile-native against
moonbitlang/async's http_server_benchmark
under wrk -t 8 -c 128 -d 8s (~108 k requests served, ~4 headers each
→ ~430 k to_lower calls):

metric before after Δ
total allocations 14 381 000 12 607 200 −12.3 % (−1.77 M)
total bytes 473.67 MB 459.68 MB −3.0 % (−14 MB)
to_lower attribution 8.44 MB / 466 100 allocs gone from top sites
StringView::iter2 (mostly the to_lower loop) 8.04 MB / 469 300 4.0 % / 434 500 partial

StringBuilder::to_string and the StringBuilder itself are still
allocated; only the per-iter overhead goes away.

Why this is safe

The original iterator decoded the StringView into Char values
(BMP and surrogate-pair-merged), then write_char would re-emit
them — surrogate pairs as two code units, BMP chars as one.

The new loop iterates UTF-16 code units directly and writes each
through write_char of the same code unit. Since every code unit
(including surrogate halves 0xD800..0xDFFF) is ≤ 0xFFFF, the
<= 0xFFFFU arm of write_char runs and writes the unit verbatim.
Output is byte-identical to the original on all inputs.

ASCII uppercase detection is the same 0x41..=0x5A range, just
expressed as raw code-unit comparison instead of Char.is_ascii_uppercase().

Test plan

  • moon test --target native -p builtin — 2806 / 2806 pass
  • moon test --target wasm-gc -p builtin — 2848 / 2848 pass
  • moon fmt --check clean

Same workflow as #3632 / #3633 / #3634 (native alloc profile a real
workload → attack the top site), this time exercising a live HTTP
server instead of a parser bench.

`String::to_lower` and `StringView::to_lower` both fall into a
`for c in self.view(start_offset=idx)` loop after the all-lowercase
fast-path guard. On the native target, that for-loop desugars into a
heap-allocated `Iter` (a closure + size_hint pair) — one per call.
Combined with the `StringBuilder` and result `String` they already
allocate, every uppercase-containing string passed through `to_lower`
costs one extra heap object.

Replace the iterator loop with a plain UTF-16 code-unit loop over the
view via `unsafe_get`. The conversion is unchanged (still ASCII-only
per the existing `TODO`); any non-ASCII unit, including unpaired
surrogate halves, is written through bit-identically because
`write_char` of a code unit ≤ 0xFFFF emits exactly one UTF-16 unit.

## Numbers

Measured by running mizchi/pprof-mbt's `memprofile-native` against
moonbitlang/async's `http_server_benchmark` under `wrk -t 8 -c 128 -d 8s`
(~108k requests served, 4 headers each → ~430k `to_lower` calls):

* total allocations:   14 381 000 → 12 607 200  (−12.3 %, 1.77 M fewer)
* total bytes:         473.67 MB  → 459.68 MB   (−3.0 %, −14 MB)
* `to_lower` site:     8.44 MB / 466 100 allocs → gone from top sites
* `StringView::iter2` site (mostly the to_lower loop): 8.04 MB → 4.0 %

`StringBuilder::to_string` and the `StringBuilder` itself are still
allocated; only the per-iter overhead goes away.

Tests: `moon test --target native -p builtin` (2806/2806) and
`moon test --target wasm-gc -p builtin` (2848/2848). `moon fmt --check`
clean.
Copilot AI review requested due to automatic review settings May 26, 2026 13:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Optimizes to_lower for StringView and String by replacing iterator-based traversal with a manual loop intended to avoid a native-target heap allocation, while keeping the ASCII-only lowercase mapping behavior.

Changes:

  • Replaced for c in self.view(start_offset=idx) with an index-based loop using unsafe_get.
  • Implemented ASCII uppercase detection via integer range checks (0x41..0x5A) and conversion via + 32.
  • Added comments explaining the performance motivation and UTF-16/code-unit assumptions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1680 to 1689
let len = self.length()
for i = idx; i < len; i = i + 1 {
let cui = self.unsafe_get(i).to_int()
if cui >= 0x41 && cui <= 0x5A {
// 'A' is 65, 'a' is 97; difference is 32.
buf.write_char((cui + 32).unsafe_to_char())
} else {
buf.write_char(c)
buf.write_char(cui.unsafe_to_char())
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version avoids the surrogate-half issue by not writing non-ASCII code units through write_char(unsafe_to_char(...)).

write_ascii_lowercase_code_unit is only called after is_ascii_uppercase_code_unit, so the unsafe conversion there is limited to ASCII A..Z -> a..z. All non-ASCII ranges, including valid surrogate pairs and unpaired surrogate halves, are copied with write_substring, preserving the original UTF-16 units verbatim.

I added coverage for both valid non-BMP surrogate-pair paths ("🤣A", "A🤣B", sliced StringView) and unpaired surrogate passthrough in 522ffda. Verified with:

  • moon test --target native -p builtin — 2806 / 2806 pass
  • moon test --target wasm-gc -p builtin — 2848 / 2848 pass

Comment thread builtin/string_methods.mbt Outdated
Comment on lines +1682 to +1687
let cui = self.unsafe_get(i).to_int()
if cui >= 0x41 && cui <= 0x5A {
// 'A' is 65, 'a' is 97; difference is 32.
buf.write_char((cui + 32).unsafe_to_char())
} else {
buf.write_char(c)
buf.write_char(cui.unsafe_to_char())
@bobzhang

Copy link
Copy Markdown
Contributor

Codex review pass: I reviewed the current head (522ffda) for correctness and did not find a blocker.

The refactor looks safe to me because it consistently uses UTF-16 code-unit offsets after the scan. That fixes the previous char-index/code-unit-index mismatch around non-BMP characters before uppercase ASCII. ASCII uppercase detection remains equivalent (0x41..0x5A), and non-ASCII spans are copied as substrings so valid surrogate pairs are preserved.

I also checked malformed-surrogate edge behavior around the added tests. The high-surrogate passthrough case is covered and preserved; a leading low-surrogate plus uppercase still aborts through the same substring-boundary class of path as the old implementation, so I do not see a compatibility regression there.

Validation run locally:

  • moon test --target all -p builtin
  • moon check --target all -p builtin
  • moon fmt --check
  • moon info
  • git diff --check

Note: a formal approval review could not be submitted because this account already has a pending review on the PR.

@bobzhang

Copy link
Copy Markdown
Contributor

note we are working on to make string unicode iteration fast (may come next week), so put on hold to make a comparison later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants