Add compiling for specific architectures section to README by sayrer · Pull Request #542 · timbray/quamina

sayrer · 2026-06-12T07:14:12Z

This did get a win on the matcher. Go's default AMD64 target (`GOAMD64=v1`) is really old stuff.

Added section on compiling for specific architectures in Go.

timbray

Might it add value to say what you tested it on and what level you used and what the observed perf boost was, just as an example to make this concrete?

timbray · 2026-06-12T18:59:37Z

… by which I mean "I'm curious about what you tested on and what you saw", and I suspect that if I am, lots of other people will be too.

sayrer · 2026-06-13T06:50:22Z

 Final result: v1 vs v3, -count=6, benchstat

  Overall geomean: −0.75% (v3 faster). The instruction-set bump produces a small but real, 
  consistent improvement — almost everything that moved, moved down, with most changes
  in the 1–5% range.

  Notable significant wins (p < 0.05, n=6)

  ┌─────────────────────────────────────────────┬──────────┬──────────┬────────────────┐
  │                  Benchmark                  │    v1    │    v3    │       Δ        │              
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_ParallelMatchers/G=8               │ 1075 ns  │ 980 ns   │ −8.9%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Nfa2Dfa/two_stars                           │ 270.3 µs │ 253.0 µs │ −6.4%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_ExactString                        │ 300 ns   │ 285 ns   │ −5.2%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_LiteralInRegex                     │ 117 ns   │ 112 ns   │ −4.5%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Nfa2Dfa/five_stars                          │ 87.8 µs  │ 84.1 µs  │ −4.2%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_ParallelMatchers/G=32              │ 1066 ns  │ 1020 ns  │ −4.3%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ ShellstyleSimpleWildcardScaling (all sizes) │ ~600 ns  │ ~580 ns  │ −1.6% to −3.2% │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ Workload_SingleShellstyle                   │ 663 ns   │ 640 ns   │ −3.5%          │
  ├─────────────────────────────────────────────┼──────────┼──────────┼────────────────┤
  │ CityLots                                    │ 10.70 µs │ 10.56 µs │ −1.3%          │
  └─────────────────────────────────────────────┴──────────┴──────────┴────────────────┘

  Regressions (small, likely noise)
  
  ┌────────────────────────────────────────────┬───────┬─────────────────────────────────────────────────────────────┐
  │                 Benchmark                  │   Δ   │                            note                             │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ Workload_ManyOverlappingWildcards/N=64     │ +5.9% │ high variance (±6%), inconsistent with N=8/32 which         │
  │                                            │       │ improved                                                    │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ ShellstyleWidePatternsScaling/patterns=512 │ +2.0% │ ±9% at v1 — noisy                                           │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ ShellstyleZWJEmoji/patterns=8              │ +0.8% │ within noise                                                │
  ├────────────────────────────────────────────┼───────┼─────────────────────────────────────────────────────────────┤
  │ _JsonFlattener_MiddleNestedField           │ +0.9% │ flattener is unaffected by ISA, as expected                 │
  └────────────────────────────────────────────┴───────┴─────────────────────────────────────────────────────────────┘

  The JSON-flattener benchmarks are essentially flat (±1%, mostly "~"), which makes sense — 
  that path is byte-scanning with no arithmetic the compiler would vectorize or BMI2-optimize.

  Bottom line

  v4 cannot be measured on this box — the Threadripper 2920X has no AVX‑512, and Go refuses 
  to run a v4 binary. You'd need a Zen 4+ or Intel Skylake-X/Ice-Lake+ machine to test it.
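Before picking a `GOAMD64` level, it can help to check what the machine supports. The following is a Linux-only sketch that reads the `flags` line from `/proc/cpuinfo` and reports the highest x86-64 microarchitecture level whose required features are all present. The function name `levelFromFlags` is hypothetical, the flag spellings are the ones the Linux kernel reports (for example `pni` for SSE3 and `abm` for LZCNT), and the v1 baseline is assumed rather than checked.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// levelFlags lists the /proc/cpuinfo flags required at each level
// beyond the v1 baseline, in order: v2, v3, v4.
var levelFlags = [][]string{
	{"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"},
	{"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"},
	{"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}

// levelFromFlags (hypothetical helper) returns the highest level, 1 to 4,
// whose required flags all appear in the space-separated flags string.
// It assumes an amd64 CPU, so the v1 baseline is taken for granted.
func levelFromFlags(flags string) int {
	have := map[string]bool{}
	for _, f := range strings.Fields(flags) {
		have[f] = true
	}
	level := 1
	for _, need := range levelFlags {
		for _, f := range need {
			if !have[f] {
				return level
			}
		}
		level++
	}
	return level
}

func main() {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		name, value, ok := strings.Cut(sc.Text(), ":")
		if ok && strings.TrimSpace(name) == "flags" {
			fmt.Printf("highest supported level: v%d\n", levelFromFlags(value))
			return
		}
	}
	fmt.Println("no flags line found")
}
```

On a CPU without AVX-512, such as the one described above, this sketch would be expected to stop at v3 or lower.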

AMD Threadripper 2920X. Got the idea from Daniel Lemire, and had Claude do the benchmarking duties while I did other things.

sayrer · 2026-06-13T06:53:49Z

I had other runs that got like 9.6% on Workload_ParallelMatchers/G=8, but this is not a pristine quiet benchmark machine.

Add compiling for specific architectures section

8b50c0a

Added section on compiling for specific architectures in Go.

timbray reviewed Jun 12, 2026

sayrer added 4 commits June 12, 2026 23:58

Update README with AMD Threadripper testing info

0fb502d

Added testing details for AMD Threadripper 2920X machine.

Update README with test machine specifications

e2ab69b

Added details about test machines for AMD64 and ARM64.

Fix formatting and spelling in README.md

091ec0b

Corrected formatting and spelling for test machine descriptions.

Update README with performance testing details

11ff439

Clarified performance testing results for AMD64 and ARM64 architectures.

timbray approved these changes Jun 13, 2026

timbray merged commit bcd660c into timbray:main Jun 13, 2026
7 checks passed

timbray merged 5 commits into timbray:main from sayrer:cpu_target
