BVHAccelerator.compute_multipath broadcasts thread-0 result across all sats when n_sat > 1

## Summary

`BVHAccelerator.compute_multipath(rx_ecef, sat_ecef)` (CUDA kernel `multipath_bvh_kernel` in `src/raytrace/bvh.cu`) returns thread-0's result for **every** satellite when `n_sat > 1`. Calling it once per satellite (`n_sat=1`) gives the correct, distinct per-satellite values. The new `compute_multipath_batch` added in #48 is **not** affected.

## Reproduction

```python
import numpy as np
from gnss_gpu.raytrace import BuildingModel
from gnss_gpu.bvh import BVHAccelerator

building = BuildingModel.create_box(center=[100.0, 0.0, 25.0],
                                     width=20.0, depth=20.0, height=50.0)
bvh = BVHAccelerator.from_building_model(building)

rx = np.array([22.49339292, -7.97963954, 4.7373692])

# Each sat called ALONE (correct):
for s in [[200, -10, 25], [210, -10, 26], [220, -10, 27], [230, -10, 28]]:
    d, _ = bvh.compute_multipath(rx, np.array([s]))
    print(f"alone {s}: delay = {d[0]:.4f}")
# alone [200,-10,25]: 3.9809
# alone [210,-10,26]: 0.0000
# alone [220,-10,27]: 0.0000
# alone [230,-10,28]: 0.0000

# All four sats together (BUG):
sats = np.array([[200,-10,25],[210,-10,26],[220,-10,27],[230,-10,28]], dtype=float)
d, _ = bvh.compute_multipath(rx, sats)
print(f"together: delays = {d}")
# together: delays = [3.9809 3.9809 3.9809 3.9809]   ← all thread-0's answer
```

Reversing the satellite order gives the opposite broadcast:

```python
sats_rev = sats[::-1].copy()
d, _ = bvh.compute_multipath(rx, sats_rev)
# delays = [0. 0. 0. 0.]  (because sat 0 in this order has no reflection)
```

So thread 0's `best_delay` becomes the value seen by every other thread in the launch.
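The comparison above (one batched call versus one call per satellite) could be turned into a regression check along these lines. This is only a sketch: `find_batch_mismatches` and `broadcasting_stub` are hypothetical names, not part of `gnss_gpu`, and the stub merely mimics the broadcast symptom so the example runs without a GPU. The only assumption about the real API is that the callable returns a `(delays, extra)` pair, as in the reproduction.

```python
import numpy as np

def find_batch_mismatches(compute, rx, sats, atol=1e-6):
    """Compare one batched call against per-satellite calls.

    `compute(rx, sats)` is any callable returning (delays, extra),
    e.g. bvh.compute_multipath from the reproduction above.
    Returns a list of (index, single_value, batched_value) for every
    satellite whose batched delay differs from its stand-alone delay.
    """
    batched, _ = compute(rx, sats)
    mismatches = []
    for i, sat in enumerate(sats):
        single, _ = compute(rx, sat[None, :])  # shape (1, 3): one satellite
        if not np.isclose(batched[i], single[0], atol=atol):
            mismatches.append((i, float(single[0]), float(batched[i])))
    return mismatches

def broadcasting_stub(rx, sats):
    """Hypothetical stand-in that reproduces the symptom: the value
    computed for satellite 0 is returned for every satellite."""
    per_sat = np.linalg.norm(sats - rx, axis=1)
    return np.full(len(sats), per_sat[0]), None

rx = np.array([22.49339292, -7.97963954, 4.7373692])
sats = np.array([[200, -10, 25], [210, -10, 26],
                 [220, -10, 27], [230, -10, 28]], dtype=float)

mismatches = find_batch_mismatches(broadcasting_stub, rx, sats)
print([i for i, _, _ in mismatches])
```

With the real accelerator, `find_batch_mismatches(bvh.compute_multipath, rx, sats)` would be expected to return an empty list once the kernel is fixed.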

## Geometry sanity check (why sat 0 is the only one with a reflection)

Box: x∈[90,110], y∈[-10,10], z∈[0,50]. Sats are at x=200..230, y=-10, near elevation 0. Mirroring sat 0 across the y=+10 wall puts the reflection point at (≈106.5, 10, ≈14.3), which is inside the wall's extent (x<110), with excess delay ≈3.98 m. For sat 1 the reflection point lands at x≈111.3, outside the wall, so there is no valid reflection. The same logic excludes sats 2 and 3 (x≈116.0 and x≈120.7). Calling `compute_multipath` per satellite confirms this.
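The mirror-image argument can be checked independently of the GPU code with a few lines of NumPy. This is a simplified sketch, not the kernel's algorithm: it assumes a single specular reflection off the plane y=+10, bounded by the box's x and z extent, and ignores occlusion. `reflect_off_y_wall` is a hypothetical helper name.

```python
import numpy as np

rx = np.array([22.49339292, -7.97963954, 4.7373692])
sats = np.array([[200, -10, 25], [210, -10, 26],
                 [220, -10, 27], [230, -10, 28]], dtype=float)

WALL_Y = 10.0            # plane of the y=+10 wall
X_RANGE = (90.0, 110.0)  # wall extent in x
Z_RANGE = (0.0, 50.0)    # wall extent in z

def reflect_off_y_wall(rx, sat):
    """Return (hit_point, excess_delay_m, on_wall) for the plane y = WALL_Y."""
    image = sat.copy()
    image[1] = 2.0 * WALL_Y - sat[1]           # mirror the satellite across the plane
    t = (WALL_Y - rx[1]) / (image[1] - rx[1])  # where rx -> image crosses the plane
    hit = rx + t * (image - rx)
    # Reflected path length equals |image - rx|; subtract the direct path.
    excess = np.linalg.norm(image - rx) - np.linalg.norm(sat - rx)
    on_wall = (X_RANGE[0] <= hit[0] <= X_RANGE[1]
               and Z_RANGE[0] <= hit[2] <= Z_RANGE[1])
    return hit, excess, on_wall

results = [reflect_off_y_wall(rx, s) for s in sats]
for i, (hit, excess, on_wall) in enumerate(results):
    print(f"sat {i}: hit x = {hit[0]:.2f}, z = {hit[2]:.2f}, "
          f"excess = {excess:.4f} m, on wall = {on_wall}")
```

Only sat 0's reflection point falls within the wall's x range, and its excess delay agrees with the 3.9809 m returned by the per-satellite call.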

## Diagnosis pointer

`BVHAccelerator.check_los` (LOS kernel) does **not** show this behaviour, so the issue seems specific to `multipath_bvh_kernel`. Likely candidates worth checking:
- Register pressure with `int stack[64]` + the inner mirror/intersect math forcing local-memory spills that alias across threads
- Some compiler optimization that hoists the read of `sat_ecef[sid * 3 + …]` out of per-thread context

## Suggested fix

Replace the single-rx `compute_multipath` with a thin wrapper around the batched kernel from #48 (`raytrace_multipath_bvh_batch` with `n_epoch=1`). The batched kernel uses one thread per `(epoch, sat)` and was verified bug-free against per-satellite reference calls in tests added in #48.
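A thin wrapper of that kind might look like the sketch below. The batched entry point's actual signature is not shown in this issue, so the shapes here are assumptions: receivers as `(n_epoch, 3)`, satellites as `(n_epoch, n_sat, 3)`, and two outputs that each carry a leading epoch axis. `single_epoch_via_batch` and `_stub_batch` are hypothetical names; the stub only stands in for the real batched call so the example runs on its own.

```python
import numpy as np

def single_epoch_via_batch(batch_fn, rx_ecef, sat_ecef):
    """Run a single-receiver query through a batched routine with n_epoch=1.

    ASSUMPTION: batch_fn(rx, sats) takes rx of shape (n_epoch, 3) and
    sats of shape (n_epoch, n_sat, 3), and returns two arrays whose
    first axis is the epoch. Adjust to the real signature from #48.
    """
    rx = np.asarray(rx_ecef, dtype=np.float64).reshape(1, 3)
    sats = np.asarray(sat_ecef, dtype=np.float64).reshape(1, -1, 3)
    delays, extra = batch_fn(rx, sats)
    # Drop the epoch axis so callers see the single-rx output shape.
    return np.asarray(delays)[0], np.asarray(extra)[0]

def _stub_batch(rx, sats):
    """Hypothetical stand-in for the batched call: one value per (epoch, sat)."""
    delays = np.linalg.norm(sats - rx[:, None, :], axis=2)
    return delays, np.zeros_like(delays)

rx = np.array([22.49339292, -7.97963954, 4.7373692])
sats = np.array([[200, -10, 25], [210, -10, 26],
                 [220, -10, 27], [230, -10, 28]], dtype=float)

delays, extra = single_epoch_via_batch(_stub_batch, rx, sats)
print(delays.shape)
```

Routing the single-rx path through the batched kernel leaves one multipath code path to maintain and test.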

## Impact

Anyone calling `BVHAccelerator.compute_multipath` with `n_sat > 1` is silently getting wrong per-satellite delays. Users of the linear-scan `BuildingModel.compute_multipath` are unaffected (different kernel in `src/raytrace/raytrace.cu` that uses one thread per `(sat, triangle)` pair).
