Skip to content

perf: devirtualize BVH traversal via templated collide/distance overloads#841

Closed
facontidavide wants to merge 2 commits into
coal-library:develfrom
facontidavide:perf/bvh-devirt
Closed

perf: devirtualize BVH traversal via templated collide/distance overloads#841
facontidavide wants to merge 2 commits into
coal-library:develfrom
facontidavide:perf/bvh-devirt

Conversation

@facontidavide

Copy link
Copy Markdown

Summary

  • Adds templated collisionRecurseT<Node> and distanceRecurseT<Node> next to the existing virtual-dispatch versions in include/coal/internal/traversal_recurse.h.
  • Adds templated collide<Node> / distance<Node> overloads in src/collision_node.h that route through the templated recursers.
  • Per overload-resolution rules, the templated overload outranks the existing base-pointer non-template (exact-type match beats derived-to-base conversion). Internal callers in collision_func_matrix.cpp and distance_func_matrix.cpp, which already create concrete MeshCollisionTraversalNode<BV> / MeshDistanceTraversalNode<BV> instances, automatically pick up the devirtualised path with zero source changes at the call site.

Implementation notes

  • traversal_recurse.h gains collisionRecurseT and distanceRecurseT (header-only templates) that take Node* instead of CollisionTraversalNodeBase*. The 9 virtual calls per BVH node pair (isFirstNodeLeaf, isSecondNodeLeaf, BVDisjoints, leafCollides, firstOverSecond, getFirstLeftChild, getFirstRightChild, getSecondLeftChild, getSecondRightChild, canStop) all become statically-resolved member calls on the concrete Node type — the inliner can fully inline them.
  • src/collision_node.h gains collide<Node> / distance<Node> template overloads that wrap the T-prefixed recursers. Front-list propagation and queue-based search paths remain on the virtual codepath (those are rare and not worth the extra instantiations).
  • src/collision_node.cpp exports checkResultLowerBound as COAL_DLLAPI so the header-only template can call it without a link-time visibility hole.

Performance impact

Methodology

  • Hardware: x86-64 desktop, P-core pinned (taskset -c 4), turbo locked.
  • Build: cmake --build build -j12 -DCMAKE_BUILD_TYPE=Release in isolated worktrees; separate libcoal.so per variant. Both base and variant compiled with stock upstream flags (no -march=native).
  • Runs: N = 15 interleaved (base, variant, base, variant, …); median reported.
  • Workload: coal-test-benchmark (test/benchmark.cpp).
Workload Before (µs, median) After (µs, median) Δ N stdev (µs)
coal-test-benchmark total 62591.4 60417.5 -3.47% 15 base 313.8 / variant 324.4

Correctness gate: full ctest suite passes (37/37, including the new coal-devirt_correctness test).

Tests

  • The cherry-picked commit already adds test/benchmark_devirt.cpp (a perf benchmark that calls collisionRecurseT directly so the templated dispatch route is exercised at build time).
  • Added test/devirt_correctness.cpp — a focused correctness test that, for each MeshCollisionTraversalNode<BV> / MeshDistanceTraversalNode<BV> instantiation (OBB, RSS, kIOS, OBBRSS; OBB has no distance node), runs both the templated path (collide(&node, ...) / distance(&node)) and the virtual-dispatch path (collide(static_cast<CollisionTraversalNodeBase*>(&node), ...)) on identical inputs and asserts the two CollisionResults / DistanceResults are pairwise-equal.
  • The whole point of the change is overload-resolution dispatch; a benchmark proves it routes to the templated path, but only this correctness test proves the templated path produces exactly the same result.

Risk & regression analysis

  • Overload resolution silently changing: the templated overloads outrank the non-template base-pointer overloads by C++ rules. If a future change introduced an ambiguous overload, the wrong path could be selected silently. The new correctness test catches this — if either path drifts, results disagree.
  • Front-list / queue-based variants: not templated. Callers that opt into those still get the virtual-dispatch path. No change in behaviour.
  • ABI: no public header signature changes. The added COAL_DLLAPI on checkResultLowerBound is a visibility expansion — safe for consumers.

Davide Faconti and others added 2 commits April 30, 2026 17:53
…oads

Adds templated `collisionRecurseT<Node>` and `distanceRecurseT<Node>`
in include/coal/internal/traversal_recurse.h, plus templated `collide<Node>`
and `distance<Node>` overloads in src/collision_node.h. The templated
overloads outrank the existing base-pointer non-templates by overload
resolution (exact-type match beats derived-to-base conversion), so
internal callers in collision_func_matrix.cpp and distance_func_matrix.cpp
that already create concrete traversal-node types automatically pick up
the devirtualised path with zero source changes at the call site.

The templated path eliminates the 9 virtual calls per BVH node pair
(isFirstNodeLeaf, isSecondNodeLeaf, BVDisjoints, leafCollides,
firstOverSecond, getFirst/SecondLeftChild, getFirst/SecondRightChild,
canStop) that perf identified as ~5% of total benchmark runtime in the
recursive driver. Front-list propagation and queue-based search paths
remain on the virtual codepath since they are rare.

Measured on coal-test-benchmark, taskset CPU 4, 8 interleaved rounds:
- BASE  median 46062 µs, stdev 311
- NEW   median 44204 µs, stdev 1478  (-4.04%)
- DEVI  median 44090 µs, stdev 184   (-4.28%, dedicated devirt benchmark)

Win is reproducible (8/8 rounds NEW < BASE), tight, and above noise floor.
All 19 affected correctness tests pass.

Also adds:
- src/collision_node.cpp: exports checkResultLowerBound as COAL_DLLAPI
  so the header-only template can call it.
- test/benchmark_devirt.cpp -> coal-test-benchmark-devirt: dedicated
  benchmark that calls collisionRecurseT directly, used to validate
  that the overload-resolution dispatch in coal-test-benchmark routes
  to the devirtualised path correctly.

No public API changes. No installed-header changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loads

Locks down the dispatch contract introduced by collisionRecurseT /
distanceRecurseT and the templated collide/distance overloads in
collision_node.h: for each MeshCollisionTraversalNode<BV> (RSS, kIOS,
OBB, OBBRSS) and MeshDistanceTraversalNode<BV> (RSS, kIOS, OBBRSS), the
templated path must produce *exactly* the same CollisionResult /
DistanceResult as the existing virtual-dispatch base-pointer path on
identical inputs.

Without this gate, a regression in either path would only surface as
a perf shift in benchmark_devirt — never as a wrong answer in the
test suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lmontaut

Copy link
Copy Markdown
Contributor

See #858

@lmontaut lmontaut closed this May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr status wip To not review in weekly meeting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants