LorenFrankLab · jl33-ai · Jun 10, 2026
diff --git a/README.md b/README.md
@@ -205,6 +205,8 @@ This entire section is also optional and similar to the `custom_mean` section ab
 
 `n_marks_min`: Only used if `use_filter` is `true`. Minimum number of marks that must be in the n-cube surrounding a candidate spike mark to be considered for decoding.
 
+`exact_histogram`: Optional, default `false`. By default, the per-spike joint probability is computed on a fast path (a uniform-bin histogram and a fused distance sum) that matches the original computation to within floating-point rounding (~1e-13). Set to `true` to restore the exact original numerics, for example when you need bit-for-bit reproducibility against an older run. See `docs/latency_analysis.md`.
+
 ### `dead_channels`
 
 This section is optional.
@@ -229,7 +231,7 @@ For both options below, the reference sampling rate is the `spikes` sampling rat
 
 `samples`: How many samples make up a time bin. For example, if the `spikes` sampling rate is 30 kHz and `samples` is 180, the time bin size is 6 ms.
 
-`delay_samples`: How many samples behind the current LFP timestamp the right edge of the current time bin should be. For example, if the `spikes` sampling rate is 30 kHz and `delay_samples` is 90, the right edge of the current time bin will be 3 ms behind the current LFP timestamp.
+`delay_samples`: How many samples behind the current LFP timestamp the right edge of the current time bin should be. For example, if the `spikes` sampling rate is 30 kHz and `delay_samples` is 90, the right edge of the current time bin will be 3 ms behind the current LFP timestamp. This delay acts as a spike jitter buffer and is the single largest controllable term in end-to-end latency: reduce it to your measured worst-case encoder-to-decoder arrival time rather than copying a round number from another config. See `docs/latency_analysis.md`.
 
 ## `clusterless_decoder`
 

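Note on the `time_bin` hunk above: the sample-to-millisecond arithmetic in the `samples` and `delay_samples` descriptions can be sketched as follows. This is a minimal illustration, not code from `realtime_decoder`; the helper names `samples_to_ms` and `delay_samples_for` and the 1.25 ms worst-case arrival time are hypothetical, chosen only to show the conversion and the round-up.

```python
import math

def samples_to_ms(n_samples, sampling_rate_hz):
    """Convert a sample count at the `spikes` sampling rate to milliseconds."""
    return 1000.0 * n_samples / sampling_rate_hz

def delay_samples_for(worst_case_ms, sampling_rate_hz):
    """Smallest whole number of samples that covers a measured delay.

    Hypothetical helper: rounds up so the jitter buffer is never shorter
    than the measured worst-case encoder-to-decoder arrival time.
    """
    return math.ceil(worst_case_ms * sampling_rate_hz / 1000.0)

# The two examples from the README text, at a 30 kHz spikes sampling rate.
bin_ms = samples_to_ms(180, 30_000)    # 6.0 ms time bin
delay_ms = samples_to_ms(90, 30_000)   # 3.0 ms delay

# Assumed measurement (not from the source): 1.25 ms worst-case arrival time.
suggested = delay_samples_for(1.25, 30_000)  # 37.5 samples, rounded up to 38

print(bin_ms, delay_ms, suggested)
```

Rounding up rather than to the nearest sample is a deliberate choice here: a buffer one sample too short would drop the latest-arriving spikes from the bin.
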
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -0,0 +1,41 @@
+# Hot-path latency benchmarks
+
+`bench_hotpath.py` measures the two compute kernels on the real-time critical
+path, in isolation, without needing MPI or Trodes:
+
+- `Encoder.get_joint_prob` - the per-spike clusterless KDE (encoder ranks)
+- `ClusterlessDecoder.compute_posterior` - the per-time-bin posterior (decoder ranks)
+
+It drives the *actual* classes from `realtime_decoder`, not a reimplementation,
+so the numbers reflect the shipped code. It is the "measure" half of the
+measure -> change one thing -> measure loop in `docs/realtime_tuning.md`.
+
+## Usage
+
+```bash
+# latency distributions (p50/p99/p99.9/max) at SC66 sizes and stress sizes
+python benchmarks/bench_hotpath.py
+
+# verify that the optimized encoder kernel still matches the original algorithm
+python benchmarks/bench_hotpath.py --verify
+
+# deterministic digest of kernel outputs (compare across code versions)
+python benchmarks/bench_hotpath.py --checksum
+```
+
+Only `numpy` is required. The harness installs a tiny `mpi4py` stub so the
+package imports without an MPI runtime; the kernels themselves do not use MPI.
+
+## What to look at
+
+Report **p99 and p99.9**, not the mean. Closed-loop latency is a tail problem:
+a kernel that is fast on average but occasionally stalls for milliseconds will
+force a larger spike jitter buffer (`decoder.time_bin.delay_samples`), which is
+the single biggest controllable term in end-to-end latency. See
+`docs/latency_analysis.md`.
+
+## Sizes
+
+Defaults mirror `config/SC66.yml`: `num_bins=41`, encoder mark buffer
+`bufsize=50000`, `mark_dim=4`. `get_joint_prob` cost scales with the number of
+stored marks, so the buffer-full case (50000) is the steady-state worst case.
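
Note on the "What to look at" section above: the reason to report p99 and p99.9 rather than the mean can be illustrated with a small sketch. It does not reproduce how `bench_hotpath.py` computes its statistics; `summarize_latencies` is a hypothetical helper, and the latency samples are synthetic (9,950 calls at 100 µs plus 50 stalls at 5,000 µs), made up purely for illustration.

```python
import numpy as np

def summarize_latencies(latencies_us):
    """Return mean and tail percentiles for per-call latencies in microseconds."""
    a = np.asarray(latencies_us, dtype=float)
    p50, p99, p999 = np.percentile(a, [50, 99, 99.9])
    return {
        "mean": float(a.mean()),
        "p50": float(p50),
        "p99": float(p99),
        "p99.9": float(p999),
        "max": float(a.max()),
    }

# Synthetic data (assumption): a kernel that is fast on most calls
# but stalls for 5 ms on 0.5% of them.
samples = np.concatenate([np.full(9_950, 100.0), np.full(50, 5_000.0)])
stats = summarize_latencies(samples)

for name, value in stats.items():
    print(f"{name:>6}: {value:8.1f} us")
```

With these synthetic numbers the mean moves only from 100 µs to about 124.5 µs and p99 stays at 100 µs, while p99.9 and max sit at 5,000 µs. A jitter buffer sized from the mean, or even from p99, would miss those stalls.
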