Reproducible end-to-end first-token latency benchmark across the six leading AI interview copilots.
This repository contains the raw audio, screen recordings, frame-count spreadsheet, and harness used to measure perceived latency (last syllable of the interviewer's question to the first visible token in the copilot's UI) across the tools listed in the results table below.
The data published here is the source of the results posted on mirly.co.uk/latency and the analysis post at mirly.co.uk/blog/latency-teardown-6-copilots.
Most vendors in this category publish a latency claim, but no two claims measure the same thing:
| Vendor | Public claim | What it measures |
|---|---|---|
| LockedIn AI | 116 ms | LLM token-generation only |
| Final Round AI | "real-time" | no number |
| Parakeet AI | "instant" | no number |
| Cluely | <300 ms | cache-best-case |
| Sensei AI | nothing published | — |
| Mirly | <150 ms p50 | full pipeline, end-to-end |
To compare them honestly, someone has to run them all through the same harness on the same machine with the same audio. This is that harness.
Milliseconds from last syllable to first visible token. p50 and p95 are computed over the warm runs; rows are sorted by p50:
| Tool | p50 | p95 | Cold-start | Vendor version |
|---|---|---|---|---|
| Mirly | 127 | 189 | 412 | 0.0.1 |
| Parakeet AI | 421 | 690 | 1,210 | 3.0 |
| LockedIn AI | 478 | 820 | 1,560 | 1.8 |
| Verve AI | 581 | 970 | 1,520 | 1.6 |
| Cluely | 612 | 1,184 | 1,920 | 2.4 |
| Sensei AI | 718 | 1,030 | 1,470 | 2.2 |
| Final Round AI | 1,810 | 2,940 | 3,820 | 4.1 |
Raw per-run data: data/runs-2026-05-15.csv.
- Machine: MacBook Air M2, 16 GB RAM, macOS 14.5, plugged in, single-app foreground
- Network: gigabit ethernet, London (deliberate worst-case for US-East-hosted vendors)
- Audio source: 16 kHz mono WAV (`audio/question-behavioural-12s.wav`) played into system audio via BlackHole so every copilot receives identical bytes. v1 of the dataset uses synthesised audio generated with macOS `say -v Daniel -r 175` for reproducibility (anyone re-running can regenerate the exact bytes from a stock Mac). A native British-English human recording replaces it in the 2026-Q3 run: synthetic cadence stresses STT slightly differently from a real interviewer, and we want both numbers in the public history
- Question text: "Tell me about a time you led a contentious technical decision."
- Metric: time from the last syllable of the question (measured against the WAV's timestamp) to the first visible token in the copilot's UI
- Capture: 60 fps screen recording (QuickTime), frame-counted with `harness/count-frames.mjs`
- Runs: 10 per tool (1 cold + 9 warm); `p50 = median(warm)`; `p95 = nth-percentile-rank(warm)`
- Date of run: 2026-05-15
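
The summary statistics above can be sketched in JavaScript. This is a minimal illustration, not the repository's code: `median`, `percentileNearestRank`, and `summarise` are hypothetical names, the sample latencies are made up, and reading `nth-percentile-rank` as the nearest-rank method is an assumption. Under that reading, p95 over 9 warm runs is simply the slowest warm run.

```javascript
// Sketch only: function names and sample values are illustrative assumptions.

// Median of a numeric array (mean of the two middle values for even lengths).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are at or below it. ASSUMPTION: this is what the README's
// "nth-percentile-rank" refers to.
function percentileNearestRank(values, p) {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// ASSUMPTION: the first entry is the cold run, the rest are warm runs.
function summarise(runsMs) {
  const [coldStart, ...warm] = runsMs;
  return {
    coldStart,
    p50: median(warm),
    p95: percentileNearestRank(warm, 95),
  };
}

// Made-up latencies in ms: 1 cold run followed by 9 warm runs.
const example = summarise([900, 310, 290, 300, 330, 350, 280, 320, 340, 400]);
console.log(example);
```

With only 9 warm samples, the tail statistic is sensitive to a single slow run, so it is worth reading p95 alongside the raw per-run data.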
The audio file is `audio/question-behavioural-12s.wav`. Play it from your own DAW or route it through BlackHole to reproduce.
- Install BlackHole (free, MIT)
- Set BlackHole as the system audio output and input device
- Open QuickTime → New Screen Recording → set Microphone to BlackHole
- For each vendor: install, sign in, start a session, press Play on the WAV, stop the recording when the copilot finishes streaming
- Run `node harness/count-frames.mjs path/to/recording.mov`. It prompts for the frame index of (a) the last-syllable cue and (b) the first visible token, then prints the delta in ms
- Append the row to `data/runs-<date>.csv`
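
The frame-to-millisecond conversion in the steps above can be sketched as follows. At 60 fps one frame is about 16.7 ms, which sets the resolution of each measurement. The function names, the rounding to whole milliseconds, and the CSV column order are assumptions for illustration; the actual `harness/count-frames.mjs` and the files in `data/` may differ.

```javascript
// Sketch only: not the repository's harness. Names and CSV layout are hypothetical.

const FPS = 60; // capture rate stated in the methodology

// Convert two frame indices from the recording into a latency in ms.
// ASSUMPTION: the result is rounded to the nearest whole millisecond.
function framesToMs(cueFrame, tokenFrame, fps = FPS) {
  if (!Number.isInteger(cueFrame) || !Number.isInteger(tokenFrame)) {
    throw new TypeError("frame indices must be integers");
  }
  if (tokenFrame < cueFrame) {
    throw new RangeError("first-token frame precedes the last-syllable cue");
  }
  return Math.round(((tokenFrame - cueFrame) * 1000) / fps);
}

// Format one CSV row. ASSUMPTION: columns are tool, run, kind, latency_ms.
// No quoting is done, so tool names must not contain commas.
function toCsvRow(tool, runIndex, kind, latencyMs) {
  return [tool, runIndex, kind, latencyMs].join(",");
}

// Example: cue at frame 1200, first token at frame 1208 (8 frames apart).
const latencyMs = framesToMs(1200, 1208);
console.log(latencyMs);
console.log(toCsvRow("ExampleTool", 2, "warm", latencyMs));
```

Because each reading is a whole number of frames, two runs that differ by less than one frame interval cannot be told apart from this capture alone.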
A full vendor sweep takes ~3 hours. The harness is intentionally low-tech — every step is auditable.
- Synthesised audio in v1. `audio/question-behavioural-12s.wav` is generated via macOS `say -v Daniel -r 175`. Pros: bit-identical reproducibility on any Mac, zero recording artefacts. Cons: a real interviewer's cadence has slightly different prosody and timing, which can stress the STT layer differently. The 2026-Q3 run re-measures with a native British-English human recording and publishes both columns side-by-side so the synth-vs-human delta is visible.
- One question type. Behavioural. Coding and technical-deep-dive questions stress the LLM differently. Per-category benchmarks are scheduled for the 2026-Q3 run.
- One machine. Apple Silicon M2. Intel Mac + Windows runs in flight.
- One geography. London. US-East candidates should see lower absolute numbers for US-hosted competitors; relative ordering should hold.
- No coding-interview screen-capture latency — the OCR step in Parakeet / Final Round is a separate measurement and isn't included here.
This benchmark is re-run on the 15th of February, May, August, and November each year. Diffs are published in the data/ directory. The repository is the source of truth; the marketing site quotes it.
Manu Mahadheer — manu@wayanerd.co.uk. Filing an issue with diff numbers from your own machine is the best way to call BS on anything in here.
MIT. Use the harness, cite the dataset, run your own benchmarks. Pull requests welcome — especially additional vendors and additional question types.