use js-regexp for wasm targets #64

Merged
waltzofpearls merged 8 commits into waltzofpearls:main from Jazzpirate:main on Nov 6, 2025
@Jazzpirate Jazzpirate commented Oct 24, 2025

Contributor

The regex crate is a relatively complex state machine that includes the entire Unicode table for parsing. This of course makes perfect sense in general, but it also means that it adds ~0.5MB of binary size. When compiling to wasm and targeting the browser, that cost is unnecessary, since JavaScript already has a regex implementation that could be used instead, so avoiding regex in wasm is strongly recommended.

This PR adds a feature "wasm" that uses js-regexp instead of the regex crate. Since js-regexp has a very different interface, I had to do some minor refactoring, but I was careful to change as little as possible, offloading things to a common trait implemented by both regex::Regex and (a custom wrapper around) js_regexp::RegExp.
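
For readers following along, the "common trait" idea can be sketched roughly as below. This is only an illustration under stated assumptions: `Matcher`, `NativeBackend`, `JsLikeBackend`, and `parse_number` are hypothetical names, and both backends are std-only stand-ins that match a run of ASCII digits rather than real regex engines. The actual trait in the PR may look quite different.

```rust
// Hypothetical names throughout; none of these identifiers come from the PR.

/// The small surface the parsing code needs from any regex backend.
trait Matcher {
    /// Returns true if the pattern matches anywhere in `haystack`.
    fn is_match(&self, haystack: &str) -> bool;

    /// Runs `f` on the first match, if any, and returns its result.
    fn with_first_match<R>(&self, haystack: &str, f: impl FnOnce(&str) -> Option<R>) -> Option<R>;
}

/// Stand-in for the backend built on `regex::Regex`.
struct NativeBackend;

/// Stand-in for the wrapper around a JavaScript regex.
struct JsLikeBackend;

impl Matcher for NativeBackend {
    fn is_match(&self, haystack: &str) -> bool {
        haystack.chars().any(|c| c.is_ascii_digit())
    }

    fn with_first_match<R>(&self, haystack: &str, f: impl FnOnce(&str) -> Option<R>) -> Option<R> {
        let start = haystack.find(|c: char| c.is_ascii_digit())?;
        let rest = &haystack[start..];
        let end = rest.find(|c: char| !c.is_ascii_digit()).unwrap_or(rest.len());
        f(&rest[..end])
    }
}

impl Matcher for JsLikeBackend {
    fn is_match(&self, haystack: &str) -> bool {
        haystack.bytes().any(|b| b.is_ascii_digit())
    }

    fn with_first_match<R>(&self, haystack: &str, f: impl FnOnce(&str) -> Option<R>) -> Option<R> {
        let bytes = haystack.as_bytes();
        let start = bytes.iter().position(|b| b.is_ascii_digit())?;
        let len = bytes[start..].iter().take_while(|b| b.is_ascii_digit()).count();
        // ASCII digits are single bytes, so these offsets are valid char boundaries.
        f(&haystack[start..start + len])
    }
}

/// Parsing code is written once against the trait and works with either backend.
fn parse_number<M: Matcher>(matcher: &M, input: &str) -> Option<u32> {
    matcher.with_first_match(input, |digits| digits.parse::<u32>().ok())
}
```

With this shape, a call such as `parse_number(&NativeBackend, "year 2025 release")` and the same call with `&JsLikeBackend` go through identical parsing code; only the backend type changes.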

@waltzofpearls waltzofpearls left a comment

Owner

Hey @Jazzpirate, thanks for adding js-regexp for wasm!

I left some comments in the diffs. The implementation looks good, but the CI ran into issues with tests. When running cargo test --all-features on non-wasm targets, the tests fail because js-regexp requires a wasm runtime.

Other than that, I have some feedback/questions:

  • I assume you've tested the changes with wasm. Have you observed any performance regression compared to the non-wasm version? I'm curious because your wasm-specific implementation of the new/constructor doesn't create/compile the regex. As a result, every single call to the date and time format (and the regex is_match method) will trigger a recompile of the regex. That could significantly slow down date and time parsing if the code iterates over a large set of data.
  • The wasm target doesn't have any unit test coverage. It would be nice to add some for wasm with the support of wasm-bindgen-test. That said, if you feel this is out of scope, I can add some tests later.
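
To make the recompilation concern in the first point concrete, here is a minimal std-only sketch. It assumes a hypothetical `Pattern` type whose `compile` step merely stores a substring and bumps a counter; it is not the dateparser code, just a way to see how often "compilation" runs in each style.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many times the stand-in "compile" step runs on this thread.
    static COMPILES: Cell<usize> = Cell::new(0);
}

/// Hypothetical stand-in for a compiled regex; "compiling" just stores the needle.
struct Pattern {
    needle: String,
}

impl Pattern {
    fn compile(src: &str) -> Self {
        COMPILES.with(|c| c.set(c.get() + 1));
        Pattern { needle: src.to_owned() }
    }

    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(self.needle.as_str())
    }
}

/// Compiles on every call, which is the behavior the review is worried about.
fn is_match_recompiling(haystack: &str) -> bool {
    Pattern::compile("UTC").is_match(haystack)
}

thread_local! {
    // Initialized lazily on first access, then reused for the rest of the thread's life.
    static CACHED: Pattern = Pattern::compile("UTC");
}

/// Compiles at most once per thread.
fn is_match_cached(haystack: &str) -> bool {
    CACHED.with(|p| p.is_match(haystack))
}

/// Number of compile steps seen so far on the current thread.
fn compile_count() -> usize {
    COMPILES.with(|c| c.get())
}
```

Calling `is_match_recompiling` N times raises `compile_count()` by N, while N calls to `is_match_cached` on a fresh thread raise it by one. How much that matters in practice depends on how expensive the real compile step is.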

Comment thread dateparser/src/datetime.rs Outdated
) -> Option<R>;
}

#[cfg(not(feature = "wasm"))]

Owner

Can you please change this to

#[cfg(not(all(feature = "wasm", target_arch = "wasm32")))]

Contributor Author

done
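
As a small illustration of why the combined condition helps, the sketch below gates two versions of a hypothetical function `backend_name` (not part of dateparser) on the same pair of attributes requested above. On a native host the second version is compiled even when the `wasm` feature is switched on, which is the situation `cargo test --all-features` creates.

```rust
// Sketch only: `backend_name` is a hypothetical function used for illustration.

/// Compiled only when the `wasm` feature is enabled AND the target is wasm32.
#[cfg(all(feature = "wasm", target_arch = "wasm32"))]
fn backend_name() -> &'static str {
    "js"
}

/// Compiled everywhere else, including native builds with every feature enabled.
#[cfg(not(all(feature = "wasm", target_arch = "wasm32")))]
fn backend_name() -> &'static str {
    "regex"
}
```

Because the two conditions are exact negations of each other, exactly one definition exists for any combination of feature flags and targets.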

Comment thread dateparser/src/datetime.rs Outdated

#[cfg(not(feature = "wasm"))]
use regex::Regex;
#[cfg(not(feature = "wasm"))]

Owner

Can you please change this to

#[cfg(not(all(feature = "wasm", target_arch = "wasm32")))]

Contributor Author

done

Comment thread dateparser/src/datetime.rs Outdated
}
}

#[cfg(feature = "wasm")]

Owner

Can you please change this to

#[cfg(all(feature = "wasm", target_arch = "wasm32"))]

Contributor Author

done

Comment thread dateparser/src/datetime.rs Outdated
}
}
}
#[cfg(feature = "wasm")]

Owner

Can you please change this to

#[cfg(all(feature = "wasm", target_arch = "wasm32"))]

Contributor Author

done

Comment thread dateparser/Cargo.toml Outdated
chrono = "0.4.31"
lazy_static = "1.4.0"
regex = "1.10.2"
js-regexp = {version="0.2",optional=true}

Owner

Can you please remove it from [dependencies], and add another section just for wasm:

[target.'cfg(target_arch = "wasm32")'.dependencies]
js-regexp = { version = "0.2", optional = true }

Contributor Author

done

@Jazzpirate

Contributor Author
> I assume you've tested the changes with wasm. Have you observed any performance regression compared to the non-wasm version?

I have not compared them directly, but I am actively using this branch for my own purposes now, and yes, I would suspect there is a performance regression involved; there is naturally a tradeoff here between binary size and performance. Using js-regexp trades performance for size.

> I'm curious because your wasm-specific implementation of the new/constructor doesn't create/compile the regex. As a result, every single call to the date and time format (and the regex is_match method) will trigger a recompile of the regex. That could significantly slow down date and time parsing if the code iterates over a large set of data.

Yes, the API for js-regexp differs significantly from regex because it constructs a JavaScript regex. A js_regexp::RegExp<'p> is a thin wrapper around a js_sys::Object, which holds a *mut u8 pointer into JavaScript memory and is therefore, among other things, not Send. That means we can't have a lazy_static! { static ref ... }. For some reason I admittedly don't entirely understand, the RegExp::exec method (which actually applies the regex to a haystack) also takes &mut self.
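
The two constraints described here (a handle that is not Send, and a matching method that takes `&mut self`) can be sketched with std only. Everything below is an assumption for illustration: `JsHandle` is a hypothetical stand-in, not `js_sys::Object` or `js_regexp::RegExp`, and its cursor field exists only so that `exec` has some state to mutate. The sketch shows one way a per-thread static plus `RefCell` accommodates both constraints.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;

/// Hypothetical stand-in for a handle to a JavaScript object.
/// The raw-pointer marker makes the type !Send and !Sync, so it cannot live
/// in an ordinary `static` (which is what `lazy_static!` generates).
struct JsHandle {
    last_index: usize,
    _not_send: PhantomData<*mut u8>,
}

impl JsHandle {
    fn new() -> Self {
        JsHandle { last_index: 0, _not_send: PhantomData }
    }

    /// Takes `&mut self`, mirroring the signature constraint described above.
    /// Returns the byte position of the next ASCII digit at or after the cursor.
    fn exec(&mut self, haystack: &str) -> Option<usize> {
        let start = self.last_index.min(haystack.len());
        let found = haystack.as_bytes()[start..]
            .iter()
            .position(|b| b.is_ascii_digit())
            .map(|offset| start + offset);
        // Advance past the match, or rewind when nothing was found.
        self.last_index = found.map_or(0, |pos| pos + 1);
        found
    }
}

thread_local! {
    // Each thread owns its own handle, so Send/Sync are not required.
    // `RefCell` supplies the mutable access that `exec` needs.
    static DIGIT: RefCell<JsHandle> = RefCell::new(JsHandle::new());
}

/// Finds the first ASCII digit, reusing the per-thread handle on every call.
fn first_digit(haystack: &str) -> Option<usize> {
    DIGIT.with(|cell| {
        let mut handle = cell.borrow_mut();
        handle.last_index = 0; // reset the cursor before each fresh search
        handle.exec(haystack)
    })
}
```

A call such as `first_digit("on 6 Nov")` borrows the handle mutably only for the duration of the closure, so repeated calls on the same thread reuse one handle without constructing a new one.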

There are ways to optimize around that. For example, I could happily experiment with thread_local! to work around the !Send restriction, but that would entail more invasive changes in all the *_family methods. My thinking was to change as little as possible in the first PR. If you don't mind, I could make some more invasive changes: possibly replace all lazy_static!s with a custom macro that branches depending on target_arch (and maybe, while I'm at it, get rid of lazy_static altogether, which has been largely superseded by std::sync::LazyLock, or possibly use a thread_local! { std::cell::LazyCell }?)

> The wasm target doesn't have any unit test coverage. It would be nice to add some for wasm with the support of `wasm-bindgen-test`. That said, if you feel this is out of scope, I can add some tests later.

Sounds reasonable. I haven't worked with wasm-bindgen-test yet, but sounds like a useful thing to concern myself with anyway.

@Jazzpirate

Contributor Author

update: I started toying around with thread_local, and benchmarks show significant performance improvements over lazy_static on non-wasm targets.

…ic (seems to also improve performance), 3. wasm tests
@Jazzpirate

Contributor Author

ok, added another commit that refactors things to avoid constructing the same regex in wasm every time. I also got rid of lazy_static in favor of thread_local for more uniformity (it seems to improve performance as well) and added wasm tests, including in CI.

@Jazzpirate

Jazzpirate commented Oct 28, 2025

Contributor Author

update: made a mistake in the wasm test and didn't even use the feature -.- getting errors now. Will fix

@Jazzpirate

Contributor Author

ok, several rabbit holes later, it turned out there is a bug in js_regexp: it illegally tries to use js_sys::Reflect to get at a field of something that isn't an object, which panics. I don't see how to work around that.

So instead, I got rid of js_regexp and now use js_sys directly.

I also store all field names as JsStrings in thread_local statics (including "tz" as a group name, since that's the only one that occurs anyway). This avoids allocating a new string and translating it across the JS-wasm boundary to create a new JavaScript string for every field access, which should improve performance in wasm even further.

@waltzofpearls waltzofpearls left a comment

Owner

LGTM.

Thanks for the contribution. Great additions and improvements!

The clippy errors are not caused by your changes. I will fix them separately in this repo.

@waltzofpearls waltzofpearls merged commit e7e4ebb into waltzofpearls:main Nov 6, 2025
15 of 16 checks passed
@Jazzpirate

Contributor Author

awesome, thank you :)
