mirlyuk/latency-benchmark-2026

latency-benchmark-2026

Reproducible end-to-end first-token latency benchmark across the six leading AI interview copilots.

This repository contains the raw audio, screen recordings, frame-count spreadsheet, and harness used to measure perceived latency (from the last syllable of the interviewer's question to the first visible token in the copilot's UI) across the tools listed in the headline results table.

The data published here is the source of the results posted on mirly.co.uk/latency and the analysis post at mirly.co.uk/blog/latency-teardown-6-copilots.

Why this exists

Every vendor in this category publishes a latency number. None are comparable:

| Vendor | Public claim | What it measures |
| --- | --- | --- |
| LockedIn AI | 116 ms | LLM token-generation only |
| Final Round AI | "real-time" | no number |
| Parakeet AI | "instant" | no number |
| Cluely | <300 ms | cache-best-case |
| Sensei AI | nothing published | |
| Mirly | <150 ms p50 | full pipeline, end-to-end |

To compare them honestly, someone has to run them all through the same harness on the same machine with the same audio. This is that harness.

Headline result (2026-05-15 run)

All values are milliseconds from last syllable to first visible token; p50 and p95 are computed over the warm runs, and the table is sorted by warm p50:

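| Tool | p50 (ms) | p95 (ms) | Cold-start (ms) | Vendor version |
| --- | --- | --- | --- | --- |
| Mirly | 127 | 189 | 412 | 0.0.1 |
| Parakeet AI | 421 | 690 | 1,210 | 3.0 |
| LockedIn AI | 478 | 820 | 1,560 | 1.8 |
| Verve AI | 581 | 970 | 1,520 | 1.6 |
| Cluely | 612 | 1,184 | 1,920 | 2.4 |
| Sensei AI | 718 | 1,030 | 1,470 | 2.2 |
| Final Round AI | 1,810 | 2,940 | 3,820 | 4.1 |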

Raw per-run data: data/runs-2026-05-15.csv.

Methodology

  • Machine: MacBook Air M2, 16 GB RAM, macOS 14.5, plugged in, single-app foreground
  • Network: gigabit ethernet, London (deliberate worst-case for US-East-hosted vendors)
  • Audio source: 16 kHz mono WAV (audio/question-behavioural-12s.wav) played into system audio via BlackHole so every copilot receives identical bytes. v1 of the dataset uses synthesised audio generated with macOS say -v Daniel -r 175 for reproducibility (anyone re-running can regenerate the exact bytes from a stock Mac). A native British-English human recording replaces it in the 2026-Q3 run — synthetic cadence stresses STT slightly differently from a real interviewer, and we want both numbers in the public history
  • Question text: "Tell me about a time you led a contentious technical decision."
  • Metric: time from the last syllable of the question (measured against the WAV's timestamp) to the first visible token in the copilot's UI
  • Capture: 60 fps screen recording (QuickTime), frame-counted with harness/count-frames.mjs
  • Runs: 10 per tool — 1 cold + 9 warm; p50 = median(warm); p95 = nth-percentile-rank(warm)
  • Date of run: 2026-05-15
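
The run statistics above can be sketched as follows. This is an illustration, not the repository's code: the function names are hypothetical, the sample values are made up, and the nearest-rank method for p95 is an assumption, since the methodology does not say which percentile definition the spreadsheet uses.

```javascript
// Hypothetical helpers for summarising one tool's 10 runs (1 cold + 9 warm).
// ASSUMPTION: p95 uses the nearest-rank method; the README does not specify this.

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: middle element. Even count: mean of the two middle elements.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function nearestRankPercentile(values, p) {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest rank: smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarise(runsMs) {
  // First run is the cold start; the remaining runs are warm.
  const [cold, ...warm] = runsMs;
  return {
    coldStart: cold,
    p50: median(warm),
    p95: nearestRankPercentile(warm, 95),
  };
}

// Made-up latencies in ms, not taken from data/runs-2026-05-15.csv.
const example = summarise([400, 120, 130, 125, 140, 135, 128, 122, 180, 126]);
console.log(example); // { coldStart: 400, p50: 128, p95: 180 }
```

With only nine warm samples, nearest-rank p95 is simply the slowest warm run, so under this assumption a single outlier determines the p95 column.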

The audio file is audio/question-behavioural-12s.wav. Play it from your own DAW, or route it through BlackHole, to reproduce.

Reproduce in 30 minutes

  1. Install BlackHole (free, MIT)
  2. Set BlackHole as the system audio output and input device
  3. Open QuickTime → New Screen Recording → set Microphone to BlackHole
  4. For each vendor: install, sign in, start a session, press Play on the WAV, stop the recording when the copilot finishes streaming
  5. Run node harness/count-frames.mjs path/to/recording.mov — it prompts for the frame index of (a) the last-syllable cue and (b) the first visible token, prints the delta in ms
  6. Append the row to data/runs-<date>.csv

A full vendor sweep takes ~3 hours. The harness is intentionally low-tech — every step is auditable.
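
The frame-to-milliseconds conversion in step 5 amounts to the arithmetic below. This is a minimal sketch, not the contents of harness/count-frames.mjs: the function name, its parameters, and the rounding to whole milliseconds are all assumptions for illustration.

```javascript
// Hypothetical helper: converts two frame indices from a screen recording
// into a latency in milliseconds. Not the actual harness implementation.
function frameDeltaToMs(cueFrame, tokenFrame, fps = 60) {
  if (!Number.isInteger(cueFrame) || !Number.isInteger(tokenFrame)) {
    throw new TypeError("frame indices must be integers");
  }
  if (tokenFrame < cueFrame) {
    throw new RangeError("first-token frame precedes the last-syllable cue");
  }
  const frameMs = 1000 / fps; // duration of one frame
  return {
    latencyMs: Math.round((tokenFrame - cueFrame) * frameMs),
    resolutionMs: frameMs, // smallest difference the recording can resolve
  };
}

// Made-up frame indices: cue at frame 720, first token at frame 745.
const result = frameDeltaToMs(720, 745);
console.log(result.latencyMs); // 417
```

At 60 fps one frame lasts about 16.7 ms, so every measurement taken this way is quantised to that step; differences between tools smaller than a frame or two are within measurement resolution.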

Open questions / known caveats

  • Synthesised audio in v1. audio/question-behavioural-12s.wav is generated via macOS say -v Daniel -r 175. Pros: bit-identical reproducibility on any Mac, zero recording artefacts. Cons: a real interviewer's cadence has slightly different prosody and timing, which can stress the STT layer differently. The 2026-Q3 run re-measures with a native British-English human recording and publishes both columns side-by-side so the synth-vs-human delta is visible.
  • One question type. Behavioural. Coding and technical-deep-dive questions stress the LLM differently. Per-category benchmarks scheduled for the 2026-Q3 run.
  • One machine. Apple Silicon M2. Intel Mac and Windows runs are in progress.
  • One geography. London. US-East candidates should see lower absolute numbers for US-hosted competitors; relative ordering should hold.
  • No coding-interview screen-capture latency — the OCR step in Parakeet / Final Round is a separate measurement and isn't included here.

Quarterly re-runs

This benchmark is re-run on the 15th of February, May, August, and November each year. Diffs are published in the data/ directory. The repository is the source of truth; the marketing site quotes it.

Maintainer

Manu Mahadheer — manu@wayanerd.co.uk. Filing an issue with diff numbers from your own machine is the best way to call BS on anything in here.

Licence

MIT. Use the harness, cite the dataset, run your own benchmarks. Pull requests welcome — especially additional vendors and additional question types.
