Add `sentrysearch benchmark` to collect local-backend hardware reports #68

## Problem

The hardware table in `README.md:218-225` is the maintainer's best guess — there's no field data validating which models actually load, run, and produce reasonable timings on which hardware. The maintainer doesn't have a Mac or NVIDIA GPU, so collecting this manually isn't an option.

## Goal: one command, zero effort

The whole design of this feature is anchored on a single principle: a contributor with the right hardware should be able to type `sentrysearch benchmark`, walk away, and come back to a copy-pasteable report block. Nothing else.

- No supplying their own clip.
- No typing their own hardware specs.
- No choosing queries.
- No editing config.
- No running multiple commands or post-processing the output.

Every other design choice below follows from this principle. If something on the list adds a step for the contributor, it's wrong.

## Suggested feature

A new CLI command, `sentrysearch benchmark`, that runs the entire flow end-to-end:

1. **Auto-detect hardware** — OS, CPU, Python version, GPU (CUDA name + VRAM via `torch.cuda.get_device_properties`, or MPS detection, or "CPU only"), system RAM.
2. **Download the project's fixed sample clip** — URL is hardcoded in the script. Cache under `~/.sentrysearch/benchmark/` so repeated runs don't re-download. Show a progress bar (it's ~141 MB). **Do not use YouTube/yt-dlp.**
   - Asset URL: `https://github.com/ssrajadh/sentrysearch/releases/download/benchmark-clip-v1/benchmark_video.mp4`
3. **Run the fixed query list** — hardcoded in the script (not CLI args). The clip is highway driving so the first three queries are broad "floor" queries that should hit near-ceiling on any working setup, and the rest test fine-grained vehicle identification:

   ```python
   QUERIES = [
       # Floor — should hit near-ceiling on any working setup
       "car driving on road",
       "highway driving",
       "black car",
       # Fine-grained vehicle ID — all verified present in the clip
       "white toyota pickup truck",
       "amazon prime van",
       "black toyota 4runner",
       "black gmc pickup truck",
       "range rover",
       "tesla suv",
       "black ford explorer",
       "silver acura mdx",
       "black toyota camry",
       "blue garbage truck",
   ]
   ```

   Floor queries give a diagnostic baseline — if those score low, the contributor's setup is broken, not just the model being weak at fine-grained recognition. Queries are cheap (one text embed + one cosine search against ~75 vectors per query, adds ~1s total to a 15-20 min run).
4. **Keep the embedder warm across queries.** The benchmark runs as a single Python process: call `get_embedder()` once, run indexing, then loop through all queries against the same loaded model. Do **not** spawn `sentrysearch shell` as a subprocess, and do **not** invoke `sentrysearch search` once per query — both would reload the model. For qwen8b that reload alone is 30s-2min, which would dwarf the actual query work.
5. **Single run, report mean and stddev of per-chunk times.** The clip is long enough (~31 min, ~75 chunks at default settings) that one run gives plenty of samples — median-of-N runs is unnecessary and would balloon contributor time. A longer run also surfaces thermal-throttling on laptops, which is real usage signal.
6. **Emit a markdown report block** — print to stdout and also write to `./sentrysearch-benchmark.md`. Drop-in-pastable into a PR appending to `docs/hardware-reports.md`. Example:

   ```markdown
   ### M2 Max 32 GB — qwen8b (full bf16)
   - **OS / Python:** macOS 14.5 / 3.12
   - **Install:** `uv tool install ".[local]"`
   - **Auto-detected model:** qwen8b
   - **Quantized:** no
   - **Per-chunk time:** 4.2s ± 0.6s (n=75)
   - **Total run time:** 5m 18s
   - **Peak memory:** 18.3 GB
   - **Status:** worked

   | Query | Top score |
   |---|---|
   | car driving on road | 0.78 |
   | highway driving | 0.74 |
   | black car | 0.69 |
   | white toyota pickup truck | 0.64 |
   | amazon prime van | 0.61 |
   | ... | ... |
   ```

7. **Print clear next steps** at the end: "Open a PR adding this block to `docs/hardware-reports.md`." Do not attempt to open the PR automatically. That would require `gh` auth and fork detection, contributors will run the script without reading what it does, and it doesn't improve data quality over a manual paste.
8. **Create `docs/hardware-reports.md` as part of this PR** with a header explaining the format and a placeholder section. Future PRs append entries to this file. Eventually the README hardware table cites it as the source of truth.
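
As a rough illustration of steps 1, 5, and 6, the sketch below detects hardware, summarizes per-chunk timings from a single run, and renders a report block. It is a minimal sketch, not the project's implementation: the function names (`detect_hardware`, `summarize_chunk_times`, `format_duration`, `render_report`) are hypothetical, it emits only a subset of the report fields, it assumes sample standard deviation is the intended "stddev", and it skips system RAM and peak-memory measurement.

```python
import platform
import statistics


def detect_hardware():
    """Collect report fields without asking the contributor anything."""
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "cpu": platform.processor() or platform.machine(),
        "python": platform.python_version(),
        "gpu": "CPU only",
    }
    try:
        import torch  # only available when the local extras are installed
    except Exception:
        return info  # any import failure is treated as "no accelerator"
    mps = getattr(torch.backends, "mps", None)
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["gpu"] = f"{props.name} ({props.total_memory / 2**30:.1f} GB VRAM)"
    elif mps is not None and mps.is_available():
        info["gpu"] = "Apple MPS"
    return info


def summarize_chunk_times(seconds):
    """Return (mean, sample stddev, n) for one run's per-chunk timings."""
    n = len(seconds)
    mean = statistics.mean(seconds)
    stddev = statistics.stdev(seconds) if n > 1 else 0.0
    return mean, stddev, n


def format_duration(total_seconds):
    """Format seconds as e.g. '5m 18s'."""
    minutes, secs = divmod(int(round(total_seconds)), 60)
    return f"{minutes}m {secs:02d}s"


def render_report(title, hardware, chunk_times, total_seconds, scores):
    """Build the markdown block; `scores` is a list of (query, top_score)."""
    mean, stddev, n = summarize_chunk_times(chunk_times)
    lines = [
        f"### {title}",
        f"- **OS / Python:** {hardware['os']} / {hardware['python']}",
        f"- **Per-chunk time:** {mean:.1f}s ± {stddev:.1f}s (n={n})",
        f"- **Total run time:** {format_duration(total_seconds)}",
        "",
        "| Query | Top score |",
        "|---|---|",
    ]
    lines += [f"| {query} | {score:.2f} |" for query, score in scores]
    return "\n".join(lines)
```

The command would then print `render_report(...)` to stdout and write the same string to `./sentrysearch-benchmark.md`, so the contributor never post-processes anything.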

## Prerequisites (maintainer)

- [x] Upload a benchmark clip to a GitHub release — [`benchmark-clip-v1`](https://github.com/ssrajadh/sentrysearch/releases/tag/benchmark-clip-v1) (31 min, 480p h264, ~141 MB).
- [x] Decide the fixed query list — see step 3 above.

## Design constraints

- Expected total run time: ~15-20 min on GPU/Mac, ~30+ min on tight setups. CPU users self-select out (README already steers them to Gemini).
- Benchmark must work offline after the first download (clip cached).
- Should not require a `--backend local` flag; the command implies local. Fail clearly on systems where the local backend can't load at all (e.g. local extras not installed → print exactly what to install).
- Use a separate ChromaDB collection (e.g. `_benchmark`) and delete it on exit so contributors aren't left with junk in their real index.
- Single Python process for the whole run — embedder loaded once, reused across all queries.
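
The offline-after-first-download constraint could be met with a small cache helper along these lines. This is a sketch under assumptions: `ensure_clip` and `download` are hypothetical names, the progress bar is omitted, and writing to a `.part` file before renaming is an added safeguard (not something the issue specifies) so an interrupted download is never mistaken for a cached clip.

```python
import shutil
import urllib.request
from pathlib import Path

CLIP_URL = (
    "https://github.com/ssrajadh/sentrysearch/releases/download/"
    "benchmark-clip-v1/benchmark_video.mp4"
)
CACHE_DIR = Path.home() / ".sentrysearch" / "benchmark"


def download(url, dest):
    """Stream `url` to `dest` without holding the whole file in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def ensure_clip(cache_dir=CACHE_DIR, url=CLIP_URL, fetch=download):
    """Return the cached clip path, downloading only if it is missing."""
    cache_dir = Path(cache_dir)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target  # cached: no network access needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    fetch(url, partial)
    partial.replace(target)  # rename only after the download completed
    return target
```

Calling `ensure_clip()` at the start of the benchmark touches the network only on the first run; later runs return the cached path immediately.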

## Non-goals (explicitly)

- No auto-PR creation.
- No YouTube / yt-dlp.
- No user-supplied clips or queries — the whole point is a fixed workload.
- No spawning `sentrysearch shell` or per-query `sentrysearch search` subprocesses (would reload the model each time).
- No bench harness for the Gemini or qwen-cloud backends (their performance is API-bound, not hardware-bound).
