4 changes: 3 additions & 1 deletion README.md
@@ -205,6 +205,8 @@ This entire section is also optional and similar to the `custom_mean` section ab

`n_marks_min`: Only used if `use_filter` is `true`. Minimum number of marks that must lie in the n-cube surrounding a candidate spike mark for that spike to be considered for decoding.

`exact_histogram`: Optional, default `false`. The per-spike joint probability uses a fast path (a uniform-bin histogram and a fused distance sum) that matches the original computation to within floating-point rounding (~1e-13). Set to `true` to restore the exact original numerics if you need bit-for-bit reproducibility against an older run. See `docs/latency_analysis.md`.
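
To make the uniform-bin idea concrete, here is a minimal sketch, not the package's actual kernel. It assumes equal-width position bins over `[lo, hi]` and compares a general-purpose `np.histogram` call against direct index arithmetic plus `np.bincount`. The function names, the bin range, and the random data are illustrative assumptions.

```python
import numpy as np


def weighted_hist_reference(positions, weights, lo, hi, num_bins):
    """Reference: general-purpose histogram driven by explicit bin edges."""
    edges = np.linspace(lo, hi, num_bins + 1)
    hist, _ = np.histogram(positions, bins=edges, weights=weights)
    return hist


def weighted_hist_uniform(positions, weights, lo, hi, num_bins):
    """Fast path: with equal-width bins, the bin index is plain arithmetic.

    Assumes every position lies in [lo, hi], so truncation equals floor.
    """
    idx = ((positions - lo) * (num_bins / (hi - lo))).astype(np.intp)
    # np.histogram puts values equal to `hi` in the last bin; mirror that.
    idx = np.clip(idx, 0, num_bins - 1)
    return np.bincount(idx, weights=weights, minlength=num_bins)


# Hypothetical data: 50,000 stored marks spread over 41 position bins.
rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 41.0, size=50_000)
weights = rng.random(50_000)

ref = weighted_hist_reference(positions, weights, 0.0, 41.0, 41)
fast = weighted_hist_uniform(positions, weights, 0.0, 41.0, 41)

# The two paths sum the same weights in a different order, so they agree
# only up to floating-point rounding, not bit for bit.
max_abs_diff = float(np.max(np.abs(ref - fast)))
```

The two functions accumulate the same weights in a different order, which is why a fast path of this kind can differ from the original in the last few bits and why an exact mode is useful for bit-for-bit comparisons.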

### `dead_channels`

This section is optional.
@@ -229,7 +231,7 @@ For both options below, the reference sampling rate is the `spikes` sampling rat

`samples`: How many samples make up a time bin. For example, if the `spikes` sampling rate is 30 kHz and `samples` is 180, the time bin size is 6 ms.

`delay_samples`: How many samples behind the current LFP timestamp the right edge of the current time bin should be. For example, if the `spikes` sampling rate is 30 kHz and `delay_samples` is 90, the right edge of the current time bin will be 3 ms behind the current LFP timestamp.
`delay_samples`: How many samples behind the current LFP timestamp the right edge of the current time bin should be. For example, if the `spikes` sampling rate is 30 kHz and `delay_samples` is 90, the right edge of the current time bin will be 3 ms behind the current LFP timestamp. This is a spike jitter buffer, and it is the single largest controllable term in end-to-end latency: tune it down to your measured worst-case encoder-to-decoder arrival time rather than copying a round number between configs. See `docs/latency_analysis.md`.
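
The arithmetic in the two examples above can be checked with a short sketch. The helper names are hypothetical, and the left edge (`right edge - samples`) is an inference from the definitions of `samples` and `delay_samples`, not something the decoder code is confirmed to compute this way. Timestamps are assumed to be in units of `spikes` samples.

```python
def samples_to_ms(n_samples, sampling_rate_hz):
    """Convert a sample count at the spikes sampling rate to milliseconds."""
    return 1000.0 * n_samples / sampling_rate_hz


def current_time_bin(lfp_timestamp, samples, delay_samples):
    """Return (left, right) edges of the current time bin, in samples.

    Assumption: the bin spans `samples` samples and ends `delay_samples`
    behind the current LFP timestamp.
    """
    right = lfp_timestamp - delay_samples
    left = right - samples
    return left, right


bin_size_ms = samples_to_ms(180, 30_000)   # time bin size from the example
delay_ms = samples_to_ms(90, 30_000)       # jitter buffer from the example
edges = current_time_bin(1_000_000, samples=180, delay_samples=90)
```

Under these assumptions, lowering `delay_samples` moves the right edge closer to the current LFP timestamp by exactly that many samples, which is why it translates directly into end-to-end latency.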

## `clusterless_decoder`

41 changes: 41 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,41 @@
# Hot-path latency benchmarks

`bench_hotpath.py` measures the two compute kernels on the real-time critical
path, in isolation, without needing MPI or Trodes:

- `Encoder.get_joint_prob` - the per-spike clusterless KDE (encoder ranks)
- `ClusterlessDecoder.compute_posterior` - the per-time-bin posterior (decoder ranks)

It drives the *actual* classes from `realtime_decoder`, not a reimplementation,
so the numbers reflect the shipped code. It is the "measure" half of the
measure -> change one thing -> measure loop in `docs/realtime_tuning.md`.

## Usage

```bash
# latency distributions (p50/p99/p99.9/max) at SC66 sizes and stress sizes
python benchmarks/bench_hotpath.py

# prove the optimized encoder kernel still matches the original algorithm
python benchmarks/bench_hotpath.py --verify

# deterministic digest of kernel outputs (compare across code versions)
python benchmarks/bench_hotpath.py --checksum
```
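
As an illustration of what a deterministic digest of kernel outputs could look like, here is a minimal sketch. It is an assumption about the general technique, not the code behind `--checksum`; `array_digest` is a hypothetical helper.

```python
import hashlib

import numpy as np


def array_digest(*arrays):
    """SHA-256 over the shape and raw float64 bytes of each array."""
    h = hashlib.sha256()
    for arr in arrays:
        a = np.ascontiguousarray(arr, dtype=np.float64)
        h.update(str(a.shape).encode("ascii"))
        h.update(a.tobytes())
    return h.hexdigest()


baseline = np.linspace(0.0, 1.0, 41)
same = baseline.copy()
perturbed = baseline.copy()
perturbed[0] += 1e-13  # rounding-sized change in one element

digest_baseline = array_digest(baseline)
digest_same = array_digest(same)
digest_perturbed = array_digest(perturbed)
```

A byte-level digest like this changes on any difference, including rounding-level ones, so it suits bit-for-bit comparisons across code versions, while a tolerance-based check is the better fit for confirming that an optimized kernel still matches the original algorithm.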

Only `numpy` is required. The harness installs a tiny `mpi4py` stub so the
package imports without an MPI runtime; the kernels themselves do not use MPI.
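
For readers unfamiliar with the technique, a stub of this kind can be sketched as follows. This is a guess at the general approach, not the harness's actual stub: which `mpi4py` attributes the package needs at import time is not stated here, so `COMM_WORLD`, `Get_rank`, and `Get_size` are illustrative assumptions.

```python
import sys
import types


class _FakeComm:
    """Stands in for a communicator in a single-process run."""

    def Get_rank(self):
        return 0

    def Get_size(self):
        return 1


def install_mpi4py_stub():
    """Register a fake `mpi4py` so `from mpi4py import MPI` succeeds.

    Note: this overrides any real mpi4py in the current process; a real
    harness might install the stub only when the import fails.
    """
    pkg = types.ModuleType("mpi4py")
    mpi = types.ModuleType("mpi4py.MPI")
    mpi.COMM_WORLD = _FakeComm()
    pkg.MPI = mpi
    sys.modules["mpi4py"] = pkg
    sys.modules["mpi4py.MPI"] = mpi


install_mpi4py_stub()
from mpi4py import MPI  # resolved from sys.modules, no MPI runtime needed

rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()
```

The stub must be installed before the package under test is imported, because Python resolves `mpi4py` from `sys.modules` at import time.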

## What to look at

Report **p99 and p99.9**, not the mean. Closed-loop latency is a tail problem:
a kernel that is fast on average but occasionally stalls for milliseconds will
force a larger spike jitter buffer (`decoder.time_bin.delay_samples`), which is
the single biggest controllable term in end-to-end latency. See
`docs/latency_analysis.md`.
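
A small synthetic example shows why the mean hides the problem. The sketch below is illustrative and independent of `bench_hotpath.py`; the helper names and the latency values are made up for the demonstration.

```python
import time

import numpy as np


def time_calls(fn, n_calls):
    """Time `fn` n_calls times; return per-call latencies in microseconds."""
    out = np.empty(n_calls)
    for i in range(n_calls):
        t0 = time.perf_counter_ns()
        fn()
        out[i] = (time.perf_counter_ns() - t0) / 1000.0
    return out


def latency_summary(latencies_us):
    """Summarize a latency sample by its tail, with the mean for contrast."""
    lat = np.asarray(latencies_us, dtype=np.float64)
    p50, p99, p999 = np.percentile(lat, [50, 99, 99.9])
    return {
        "p50": float(p50),
        "p99": float(p99),
        "p99.9": float(p999),
        "max": float(lat.max()),
        "mean": float(lat.mean()),
    }


# Synthetic sample: 9,980 calls at 100 us plus 20 stalls at 5,000 us.
synthetic = np.concatenate([np.full(9_980, 100.0), np.full(20, 5_000.0)])
summary = latency_summary(synthetic)
```

In this synthetic sample the mean rises by less than 10% over the typical call, while p99.9 and the maximum land on the 5,000 us stalls, which is the figure a jitter buffer would have to absorb.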

## Sizes

Defaults mirror `config/SC66.yml`: `num_bins=41`, encoder mark buffer
`bufsize=50000`, `mark_dim=4`. `get_joint_prob` cost scales with the number of
stored marks, so the buffer-full case (50000) is the steady-state worst case.