perf(codegen): Use char codes for simple classes #660

Open
scttcper wants to merge 1 commit into peggyjs:main from scttcper:scttcper/charcode-classes

scttcper commented May 14, 2026

This optimizes generated parser code for simple character classes.

A lot of grammars have hot loops that look like this:

identifier = [a-zA-Z_$] [a-zA-Z0-9_$]*
number = [0-9]+ ("." [0-9]+)?
whitespace = [ \t\n\r]*

Peggy currently emits a regexp test for each character consumed by those classes. This PR emits direct `charCodeAt` comparisons when the class is simple enough for the comparison to be exact: no `i` flag, no unicode mode, and only ranges and characters that fit in a single code unit.

Anything more complicated still uses the existing regexp path.
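To make the difference concrete, here is a minimal sketch of the two styles of check for the class `[a-zA-Z_$]`. The function names are hypothetical and this is not Peggy's actual generated output; it only illustrates the idea of replacing a per-character regexp test with numeric comparisons.

```javascript
// Hypothetical helpers for the class [a-zA-Z_$]; not Peggy's real output.

// Regexp-style check: one RegExp test per character consumed.
const identStartRe = /^[a-zA-Z_$]/;
function isIdentStartRegexp(input, pos) {
  return identStartRe.test(input.charAt(pos));
}

// Char-code check: read the code unit once, then compare numerically.
function isIdentStartCharCode(input, pos) {
  // charCodeAt returns NaN past the end of input, so every comparison fails.
  const c = input.charCodeAt(pos);
  return (
    (c >= 97 && c <= 122) || // a-z
    (c >= 65 && c <= 90) ||  // A-Z
    c === 95 ||              // _
    c === 36                 // $
  );
}
```

Both functions should agree for every single code unit and for positions past the end of the input, which is the property that makes the substitution safe for simple classes.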

As one real-world benchmark, I used Sentry’s search grammar, which is the grammar we use for parsing search bar queries: https://github.com/getsentry/sentry/blob/master/static/app/components/searchSyntax/grammar.pegjs. That grammar has a lot of key/value filters, so longer searches spend plenty of time in these character-class loops.

| Input | Before | After | Speedup |
| --- | --- | --- | --- |
| 200 short search strings, cycling through common filter/query shapes | 3.1664 ms | 2.5784 ms | 18.6% |
| one 2.7 KB query with 120 mixed free-text, filter, and grouped terms | 0.7804 ms | 0.6653 ms | 14.7% |
| two copies of the long query joined with AND, about 5.4 KB | 1.6284 ms | 1.4060 ms | 13.7% |
| the long query plus a parenthesized copy joined with OR, about 5.4 KB | 1.6405 ms | 1.3890 ms | 15.3% |

For that generated parser, repeated class-check code went from 177 charCodeAt calls to 76 because each check computes the character code once and reuses it.
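As a rough illustration of what "computes the character code once and reuses it" means, the sketch below contrasts two ways of checking `[ \t\n\r]`. The helper names are hypothetical, and the repeated form is shown only for contrast; it is not necessarily what Peggy emitted before this change.

```javascript
// Hypothetical helpers for the class [ \t\n\r].

// Without a temporary: the code unit is re-read for every alternative.
function isSpaceRepeated(input, pos) {
  return (
    input.charCodeAt(pos) === 32 || // space
    input.charCodeAt(pos) === 9 ||  // \t
    input.charCodeAt(pos) === 10 || // \n
    input.charCodeAt(pos) === 13    // \r
  );
}

// With a temporary: one charCodeAt call per check, reused in each comparison.
function isSpaceReused(input, pos) {
  const c = input.charCodeAt(pos);
  return c === 32 || c === 9 || c === 10 || c === 13;
}

// A whitespace* loop built on the reused form. It stops at the end of
// input because charCodeAt returns NaN there.
function skipWhitespace(input, pos) {
  while (isSpaceReused(input, pos)) pos++;
  return pos;
}
```

The two checks return the same result; the second simply contains one `charCodeAt` call where the first contains four.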

Simple character classes were emitted as regexp tests for every accepted character. Hot parser loops like keys and identifiers pay for that over and over.

Emit direct charCodeAt comparisons for simple non-unicode, case-sensitive classes and reuse a temp char code inside each check. Keeps the regexp path for the complex cases.
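The decision described above can be sketched as a small eligibility check plus an emitter. Everything here is an assumption made for illustration: the class shape (`parts`, `ignoreCase`, `unicode`) and the function names are hypothetical, and the sketch does not model inverted classes.

```javascript
// Hypothetical class description: each part is a single character ("_")
// or a [start, end] range (["a", "z"]). This shape is an assumption.

// A class is "simple" when it is case-sensitive, non-unicode, and every
// character fits in one UTF-16 code unit (string length 1).
function isSimpleClass(cls) {
  if (cls.ignoreCase || cls.unicode) return false;
  return cls.parts.every((part) =>
    Array.isArray(part)
      ? part.length === 2 && part[0].length === 1 && part[1].length === 1
      : typeof part === "string" && part.length === 1
  );
}

// Build a JavaScript expression that tests the variable named by codeVar.
function emitCharCodeTest(cls, codeVar) {
  const tests = cls.parts.map((part) => {
    if (Array.isArray(part)) {
      const lo = part[0].charCodeAt(0);
      const hi = part[1].charCodeAt(0);
      return `(${codeVar} >= ${lo} && ${codeVar} <= ${hi})`;
    }
    return `${codeVar} === ${part.charCodeAt(0)}`;
  });
  return tests.length > 0 ? tests.join(" || ") : "false";
}

// Returns null when the caller should fall back to the regexp path.
function emitClassTest(cls, codeVar) {
  return isSimpleClass(cls) ? emitCharCodeTest(cls, codeVar) : null;
}
```

For example, calling `emitClassTest` with parts `[["a", "z"], ["A", "Z"], "_", "$"]` and the variable name `c` yields a chain of range and equality comparisons on `c`, while a class with `ignoreCase: true` or an astral character yields `null`.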

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
@scttcper scttcper marked this pull request as ready for review May 14, 2026 15:08
@scttcper (Author)

Let me know if you need the benchmark or anything else.
