Skip to content

feat(common): Improve _xorBuffer() performance#66

Open
dwatteau wants to merge 2 commits into
mainfrom
feat/common-xorbuffer-potential-performance-improvement
Open

feat(common): Improve _xorBuffer() performance#66
dwatteau wants to merge 2 commits into
mainfrom
feat/common-xorbuffer-potential-performance-improvement

Conversation

@dwatteau

Copy link
Copy Markdown
Owner

Thomas' original code had some logic to xor the values with 4 bytes at once, to improve performance.

However, this caused -Wcast-align issues on some platforms (see issue #10 for more details). This code was then removed in commit ecff30b.

This new approach restores a bit of this logic, but it's now done in a way to avoid unaligned loads. It seems to work on that old mips64el board I have.

TODO:

  • do more tests
  • see if it does make a difference, regarding performance (is there a real bottleneck there?)
  • don't forget to test with different compilers: GCC, Clang, MSVC…
  • check everything's still OK on a big-endian environment
  • see if it improves issue [Performance] GCC builds are ~30% faster than Clang/MSVC builds #4, in some way

@dwatteau dwatteau force-pushed the feat/common-xorbuffer-potential-performance-improvement branch 2 times, most recently from 1efd148 to 6c5cba5 Compare April 18, 2025 21:34
@dwatteau

Copy link
Copy Markdown
Owner Author

It looks like most of the _xorBuffer() calls are made on very small buffers (few bytes), where this optimization pipeline won't bring much (it could even be a bit counter-productive, so I've added a if (X > xxx) check).

==> to be checked a bit more, though.

@dwatteau

dwatteau commented Apr 19, 2025

Copy link
Copy Markdown
Owner Author

I'd need to add a logger to register how often the if (n >= 16) case is triggered, in some typical imports/exports (on various games).

If it's very rare, then this optimization is probably just not worth it, and could be discarded (still, some unrelated parts of this PR would be worth a cherry-pick).

==> also see if 12 or 16 is the best value?

@dwatteau

Copy link
Copy Markdown
Owner Author

Well, it's just a single example, but with scummtr -g monkeysega -if original-text.txt I have:

  • 2802 hits for the (n >= 16) case
  • 617778 hits for the (n < 16) case

==> so, at least for a V5 game like Monkey1 Sega CD, it looks like _xorBuffer is called many times, mainly on very small buffers. So there, the optimization is called very rarely, and so it won't much effect, in the end.

And if I move the threshold to (n >= 12) (probably the lowest value where setting up the "optimization pipeline" could be worth it, maybe), I have 3476 hits for 614308 misses.

@dwatteau dwatteau force-pushed the feat/common-xorbuffer-potential-performance-improvement branch 2 times, most recently from dd9a7f5 to 4ef0862 Compare May 13, 2025 21:54
dwatteau added 2 commits July 8, 2025 10:03
Thomas' original code had some logic to xor the values with 4 bytes
at once, to improve performance.

However, this caused -Wcast-align issues on some platforms (see issue
ecff30b.

This new approach restores a bit of this logic, but it's now done in
a way to avoid unaligned loads. It seems to work on that old mips64el
board I have.

Since it appears that _xorBuffer() is called on a majority of very
small buffers, it doesn't make sense to set up this "pipeline" if
there are just a few bytes to copy, though.
@dwatteau dwatteau force-pushed the feat/common-xorbuffer-potential-performance-improvement branch from 0966627 to 273c51d Compare July 8, 2025 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant