Add compiling for specific architectures section to README #542

Merged: timbray merged 5 commits into timbray:main from sayrer:cpu_target on Jun 13, 2026

sayrer commented Jun 12, 2026

Contributor

This did get a win on the matcher: Go's default AMD64 target is really old stuff.

Added section on compiling for specific architectures in Go.

timbray left a comment

Owner

Might it add value to say what you tested it on and what level you used and what the observed perf boost was, just as an example to make this concrete?

timbray commented Jun 12, 2026

Owner

… by which I mean "I'm curious about what you tested on and what you saw" and I suspect if I am lots of other people will be too.

sayrer commented Jun 13, 2026

Contributor Author
 Final result: v1 vs v3, -count=6, benchstat

  Overall geomean: −0.75% (v3 faster). The instruction-set bump produces a small but real, 
  consistent improvement — almost everything that moved, moved down, with most changes
  in the 1–5% range.

  Notable significant wins (p < 0.05, n=6)

  ┌─────────────────────────────────────────────┬──────────┬──────────┬────────────────┐
  │                  Benchmark                  │    v1    │    v3    │       Δ        │              
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_ParallelMatchers/G=8               │ 1075 ns  │ 980 ns   │ −8.9%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Nfa2Dfa/two_stars                           │ 270.3 µs │ 253.0 µs │ −6.4%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_ExactString                        │ 300 ns   │ 285 ns   │ −5.2%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_LiteralInRegex                     │ 117 ns   │ 112 ns   │ −4.5%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Nfa2Dfa/five_stars                          │ 87.8 µs  │ 84.1 µs  │ −4.2%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_ParallelMatchers/G=32              │ 1066 ns  │ 1020 ns  │ −4.3%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ ShellstyleSimpleWildcardScaling (all sizes) │ ~600 ns  │ ~580 ns  │ −1.6% to −3.2% │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_SingleShellstyle                   │ 663 ns   │ 640 ns   │ −3.5%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ CityLots                                    │ 10.70 µs │ 10.56 µs │ −1.3%          │
  └─────────────────────────────────────────────┴──────────┴──────────┴────────────────┘

  Regressions (small, likely noise)
  
  ┌────────────────────────────────────────────┬───────┬─────────────────────────────────────────────────────────────┐
  │                 Benchmark                  │   Δ   │                            note                             │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ Workload_ManyOverlappingWildcards/N=64     │ +5.9% │ high variance (±6%), inconsistent with N=8/32 which         │
  │                                            │       │ improved                                                    │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ ShellstyleWidePatternsScaling/patterns=512 │ +2.0% │ ±9% at v1 — noisy                                           │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ ShellstyleZWJEmoji/patterns=8              │ +0.8% │ within noise                                                │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ _JsonFlattener_MiddleNestedField           │ +0.9% │ flattener is unaffected by ISA, as expected                 │
  └────────────────────────────────────────────┴───────┴─────────────────────────────────────────────────────────────┘

  The JSON-flattener benchmarks are essentially flat (±1%, mostly "~"), which makes sense — 
  that path is byte-scanning with no arithmetic the compiler would vectorize or BMI2-optimize.

  Bottom line

  v4 cannot be measured on this box — the Threadripper 2920X has no AVX‑512, and Go refuses 
  to run a v4 binary. You'd need a Zen 4+ or Intel Skylake-X/Ice-Lake+ machine to test it.

AMD Threadripper 2920X. Got the idea from Daniel Lemire, and had Claude do the benchmarking duties while I did other things.

sayrer commented Jun 13, 2026

Contributor Author

I had other runs that showed about a 9.6% improvement on Workload_ParallelMatchers/G=8, but this is not a pristine, quiet benchmark machine.

sayrer added 4 commits June 12, 2026 23:58
Added testing details for AMD Threadripper 2920X machine.
Added details about test machines for AMD64 and ARM64.
Corrected formatting and spelling for test machine descriptions.
Clarified performance testing results for AMD64 and ARM64 architectures.
@timbray timbray merged commit bcd660c into timbray:main Jun 13, 2026
7 checks passed

2 participants