Problem
The hardware table in README.md:218-225 is the maintainer's best guess — there's no field data validating which models actually load, run, and produce reasonable timings on which hardware. The maintainer doesn't have a Mac or NVIDIA GPU, so collecting this manually isn't an option.
Goal: one command, zero effort
The whole design of this feature is anchored on a single principle: a contributor with the right hardware should be able to type `sentrysearch benchmark`, walk away, and come back to a copy-pasteable report block. Nothing else.
- No supplying their own clip.
- No typing their own hardware specs.
- No choosing queries.
- No editing config.
- No running multiple commands or post-processing the output.
Every other design choice below follows from this principle. If something on the list adds a step for the contributor, it's wrong.
Suggested feature
A new CLI command, sentrysearch benchmark, that runs the entire flow end-to-end:
-
Auto-detect hardware — OS, CPU, Python version, GPU (CUDA name + VRAM via torch.cuda.get_device_properties, or MPS detection, or "CPU only"), system RAM.
-
Download the project's fixed sample clip — URL is hardcoded in the script. Cache under ~/.sentrysearch/benchmark/ so repeated runs don't re-download. Show a progress bar (it's ~141 MB). Do not use YouTube/yt-dlp.
- Asset URL:
https://github.com/ssrajadh/sentrysearch/releases/download/benchmark-clip-v1/benchmark_video.mp4
-
Run the fixed query list — hardcoded in the script (not CLI args). The clip is highway driving so the first three queries are broad "floor" queries that should hit near-ceiling on any working setup, and the rest test fine-grained vehicle identification:
QUERIES = [
# Floor — should hit near-ceiling on any working setup
"car driving on road",
"highway driving",
"black car",
# Fine-grained vehicle ID — all verified present in the clip
"white toyota pickup truck",
"amazon prime van",
"black toyota 4runner",
"black gmc pickup truck",
"range rover",
"tesla suv",
"black ford explorer",
"silver acura mdx",
"black toyota camry",
"blue garbage truck",
]
Floor queries give a diagnostic baseline — if those score low, the contributor's setup is broken, not just the model being weak at fine-grained recognition. Queries are cheap (one text embed + one cosine search against ~75 vectors per query, adds ~1s total to a 15-20 min run).
-
Keep the embedder warm across queries. The benchmark runs as a single Python process: call get_embedder() once, run indexing, then loop through all queries against the same loaded model. Do not spawn sentrysearch shell as a subprocess, and do not invoke sentrysearch search once per query — both would reload the model. For qwen8b that reload alone is 30s-2min, which would dwarf the actual query work.
-
Single run, report mean and stddev of per-chunk times. The clip is long enough (~31 min, ~75 chunks at default settings) that one run gives plenty of samples — median-of-N runs is unnecessary and would balloon contributor time. A longer run also surfaces thermal-throttling on laptops, which is real usage signal.
-
Emit a markdown report block — print to stdout and also write to ./sentrysearch-benchmark.md. Drop-in-pastable into a PR appending to docs/hardware-reports.md. Example:
### M2 Max 32 GB — qwen8b (full bf16)
- **OS / Python:** macOS 14.5 / 3.12
- **Install:** \`uv tool install \".[local]\"\`
- **Auto-detected model:** qwen8b
- **Quantized:** no
- **Per-chunk time:** 4.2s ± 0.6s (n=75)
- **Total run time:** 5m 18s
- **Peak memory:** 18.3 GB
- **Status:** worked
| Query | Top score |
|---|---|
| car driving on road | 0.78 |
| highway driving | 0.74 |
| black car | 0.69 |
| white toyota pickup truck | 0.64 |
| amazon prime van | 0.61 |
| ... | ... |
-
Print clear next steps at the end — "Open a PR adding this block to `docs/hardware-reports.md`." Do not attempt to open the PR automatically: requires `gh` auth + fork detection, contributors will run the script without reading what it does, and it doesn't change data quality vs. manual paste.
-
Create docs/hardware-reports.md as part of this PR with a header explaining the format and a placeholder section. Future PRs append entries to this file. Eventually the README hardware table cites it as the source of truth.
Prerequisites (maintainer)
Design constraints
- Expected total run time: ~15-20 min on GPU/Mac, ~30+ min on tight setups. CPU users self-select out (README already steers them to Gemini).
- Benchmark must work offline after the first download (clip cached).
- Should not require `--backend local` flag — the command implies local. Fail clearly on systems where the local backend can't load at all (e.g. local extras not installed → print exactly what to install).
- Use a separate ChromaDB collection (e.g. `_benchmark`) and delete it on exit so contributors aren't left with junk in their real index.
- Single Python process for the whole run — embedder loaded once, reused across all queries.
Non-goals (explicitly)
- No auto-PR creation.
- No YouTube / yt-dlp.
- No user-supplied clips or queries — the whole point is a fixed workload.
- No spawning `sentrysearch shell` or per-query `sentrysearch search` subprocesses (would reload the model each time).
- No bench harness for the Gemini or qwen-cloud backends (their performance is API-bound, not hardware-bound).
Problem
The hardware table in
README.md:218-225is the maintainer's best guess — there's no field data validating which models actually load, run, and produce reasonable timings on which hardware. The maintainer doesn't have a Mac or NVIDIA GPU, so collecting this manually isn't an option.Goal: one command, zero effort
The whole design of this feature is anchored on a single principle: a contributor with the right hardware should be able to type `sentrysearch benchmark`, walk away, and come back to a copy-pasteable report block. Nothing else.
Every other design choice below follows from this principle. If something on the list adds a step for the contributor, it's wrong.
Suggested feature
A new CLI command,
sentrysearch benchmark, that runs the entire flow end-to-end:Auto-detect hardware — OS, CPU, Python version, GPU (CUDA name + VRAM via
torch.cuda.get_device_properties, or MPS detection, or "CPU only"), system RAM.Download the project's fixed sample clip — URL is hardcoded in the script. Cache under
~/.sentrysearch/benchmark/so repeated runs don't re-download. Show a progress bar (it's ~141 MB). Do not use YouTube/yt-dlp.https://github.com/ssrajadh/sentrysearch/releases/download/benchmark-clip-v1/benchmark_video.mp4Run the fixed query list — hardcoded in the script (not CLI args). The clip is highway driving so the first three queries are broad "floor" queries that should hit near-ceiling on any working setup, and the rest test fine-grained vehicle identification:
Floor queries give a diagnostic baseline — if those score low, the contributor's setup is broken, not just the model being weak at fine-grained recognition. Queries are cheap (one text embed + one cosine search against ~75 vectors per query, adds ~1s total to a 15-20 min run).
Keep the embedder warm across queries. The benchmark runs as a single Python process: call
get_embedder()once, run indexing, then loop through all queries against the same loaded model. Do not spawnsentrysearch shellas a subprocess, and do not invokesentrysearch searchonce per query — both would reload the model. For qwen8b that reload alone is 30s-2min, which would dwarf the actual query work.Single run, report mean and stddev of per-chunk times. The clip is long enough (~31 min, ~75 chunks at default settings) that one run gives plenty of samples — median-of-N runs is unnecessary and would balloon contributor time. A longer run also surfaces thermal-throttling on laptops, which is real usage signal.
Emit a markdown report block — print to stdout and also write to
./sentrysearch-benchmark.md. Drop-in-pastable into a PR appending todocs/hardware-reports.md. Example:Print clear next steps at the end — "Open a PR adding this block to `docs/hardware-reports.md`." Do not attempt to open the PR automatically: requires `gh` auth + fork detection, contributors will run the script without reading what it does, and it doesn't change data quality vs. manual paste.
Create
docs/hardware-reports.mdas part of this PR with a header explaining the format and a placeholder section. Future PRs append entries to this file. Eventually the README hardware table cites it as the source of truth.Prerequisites (maintainer)
benchmark-clip-v1(31 min, 480p h264, ~141 MB).Design constraints
Non-goals (explicitly)