Add support for gpu computations #15
Merging this PR will improve performance by 17.04%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Memory | rf_diverse | 1,034.3 KB | 846.9 KB | +22.14% |
| ⚡ | Memory | rf_similar | 164.6 KB | 146.7 KB | +12.15% |
Comparing GPU-implmentation (b05906d) with master (b46f8a9)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #15 +/- ##
===========================================
- Coverage 96.64% 86.54% -10.10%
===========================================
Files 6 8 +2
Lines 2501 2950 +449
===========================================
+ Hits 2417 2553 +136
- Misses 84 397 +313
…m for perpetuity.
Dropping the GPU (wgpu) path; keeping the fast CPU path
I explored GPU acceleration for the pairwise distance matrix (branch GPU-implmentation, wgpu → Vulkan/Metal). I'm not merging it: the payoff doesn't justify the amount of code, the additional layer of complexity, and the additional param.
Result (Tesla P100 vs 8-core CPU, RF): only 1.2–2.5× on large jobs, a slowdown on small ones, and the gain shrinks as taxa grow. RF is a memory-bandwidth-bound, near-zero-arithmetic kernel, which makes it a poor GPU fit, and our CPU path (bit-packed popcount + rayon, upper-triangle only) is already near-optimal, so there's little to win back.
Why not worth it:
The effort wasn't wasted. Debugging the GPU's memory limits surfaced a real CPU-side win: snapshot construction now interns bipartitions incrementally in bounded chunks instead of holding every tree's raw bitsets at once. Peak construction memory drops ~15× (e.g. 4.4 GB → ~0.3 GB at 2000×5000), large runs that used to OOM now complete, and results are byte-identical. Plus some CI speedups. These land in a separate GPU-free PR.
Can revisit this if we are consistently hitting time limits instead of memory limits.
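For readers unfamiliar with the two CPU-side ideas above, here is a minimal sketch of chunked interning followed by RF as a popcount over bit-packed sets. It is an illustration under assumptions, not the code in this repo: every name (`Bipartition`, `Interner`, `build_snapshot`, `pack`, `rf`, `rf_upper_triangle`) is hypothetical, bipartitions are assumed to be already canonicalized, each tree is assumed to be stored as a bitset over interned bipartition ids, and the sketch is single-threaded and std-only where the real path uses rayon.

```rust
use std::collections::HashMap;

/// One bipartition as a bit-packed set of taxa (hypothetical layout).
/// Assumed to be canonicalized already, so equal splits compare equal.
type Bipartition = Vec<u64>;

/// Maps each distinct bipartition to a small integer id.
struct Interner {
    ids: HashMap<Bipartition, usize>,
}

impl Interner {
    fn new() -> Self {
        Interner { ids: HashMap::new() }
    }

    /// Consumes one tree's raw bitsets and returns its sorted, unique ids.
    /// The raw bitsets are dropped (or moved into the map) as we go.
    fn intern_tree(&mut self, raw: Vec<Bipartition>) -> Vec<usize> {
        let mut out: Vec<usize> = raw
            .into_iter()
            .map(|b| {
                let next = self.ids.len();
                *self.ids.entry(b).or_insert(next)
            })
            .collect();
        out.sort_unstable();
        out.dedup();
        out
    }
}

/// Packs a tree's bipartition ids into a bitset of `words` 64-bit words.
fn pack(ids: &[usize], words: usize) -> Vec<u64> {
    let mut bits = vec![0u64; words];
    for &id in ids {
        bits[id / 64] |= 1u64 << (id % 64);
    }
    bits
}

/// Builds the snapshot while holding at most `chunk_size` trees' raw
/// bitsets at once, instead of materializing every tree up front.
fn build_snapshot<I>(mut trees: I, chunk_size: usize) -> Vec<Vec<u64>>
where
    I: Iterator<Item = Vec<Bipartition>>,
{
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut interner = Interner::new();
    let mut per_tree: Vec<Vec<usize>> = Vec::new();
    loop {
        // Pull the next bounded chunk of raw trees from the source.
        let chunk: Vec<Vec<Bipartition>> = trees.by_ref().take(chunk_size).collect();
        if chunk.is_empty() {
            break;
        }
        for raw in chunk {
            per_tree.push(interner.intern_tree(raw));
        }
    }
    // The bitset width is only known once every bipartition is interned.
    let words = (interner.ids.len() + 63) / 64;
    per_tree.iter().map(|ids| pack(ids, words)).collect()
}

/// Unnormalized RF distance: size of the symmetric difference,
/// computed as popcount(a XOR b) word by word.
fn rf(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Pairwise RF distances for i < j only, in row-major order.
fn rf_upper_triangle(trees: &[Vec<u64>]) -> Vec<u32> {
    let n = trees.len();
    let mut out = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in (i + 1)..n {
            out.push(rf(&trees[i], &trees[j]));
        }
    }
    out
}
```

Because ids are assigned in tree order no matter where the chunk boundaries fall, the output does not depend on `chunk_size`, which is the property behind "results byte-identical". The inner loop is just XOR plus popcount over a few words per pair, so memory traffic dominates the cost.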
No description provided.