Benchmark local LLMs running under Ollama — score quality, measure speed, and compare models side by side, all offline.
The UI has a categorised suite picker, live progress, per-model breakdown tables, and full light/dark themes.
| | Light | Dark |
|---|---|---|
| New Run | ![]() | ![]() |
| History | ![]() | ![]() |
You're running one or more LLMs locally through Ollama (say llama3:8b, qwen3.6:27b, gpt-oss:120b). You want to answer questions like:
- How fast is each model on my hardware? tokens/s, time-to-first-token.
- How accurate is it on the kinds of prompts I actually care about?
- Which model should I pick for this task?
- Did the newer model regress on anything the older one got right?
This tool runs a suite of prompts through each model, scores the responses, measures throughput, and produces a JSON + Markdown report plus a live web UI.
Everything runs on your own hardware. Nothing is sent to a cloud service.
- 🧪 33 ready-made evaluation suites · 4 910 test cases across reasoning, knowledge, coding, math, instruction-following, multilingual, long-context, safety, and judge-scored open-ended tasks.
- 📊 Public-benchmark adapters — MMLU · HellaSwag · TruthfulQA · GSM8K · HumanEval · BigBench-Hard · ARC · PIQA · WinoGrande · C-Eval · MATH-500 · MBPP · SQuAD v2 · IFEval · MT-Bench · PubMedQA · CRUXEval · Spider.
- 🌐 Web UI with a categorised suite picker, live progress over WebSocket, per-model × per-suite breakdown tables, light/dark theme, URL-encoded filters, and a diff view.
- 🧰 CLI + REST + WebSocket — same features from your terminal, your CI pipeline, or a custom dashboard.
- 🎯 Multi-model comparison — run 2–N models against the same suites; reports surface each model's pass rate independently, not just the aggregate.
- 🛡️ Safety by design — metric errors are isolated per test case; runs can be cooperatively cancelled; artefacts write atomically.
- 📦 One-button install + launch: `./scripts/start.sh` (or `start.bat` on Windows) brings up Ollama, backend, and UI together.
- ✅ Extensively tested: 447 backend tests (unit + property + integration) and 18 UI tests, including 44 Hypothesis / fast-check property tests derived from a formal spec.
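The atomic artefact writes mentioned in the feature list can be illustrated with the usual write-to-temp-then-rename pattern. This is a minimal sketch, not the project's actual code: the function name `write_json_atomically` is hypothetical, and it assumes the artefact is a JSON document.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload to path so readers never see a half-written file.

    Hypothetical helper for illustration; the project's own writer may differ.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader that opens `report.json` while a run is finishing sees either the previous complete file or the new complete file, never a truncated one.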
- Python 3.11+
- Node.js 18+
- Ollama 0.5+ running on port 11434
- At least one Ollama model pulled (`ollama pull llama3:8b`)
```bash
git clone https://github.com/<your-username>/ollama-model-evaluator.git
cd ollama-model-evaluator
./scripts/start.sh          # Linux/macOS
```

or on Windows:

```powershell
.\scripts\start.ps1         # or double-click start.bat
```

The launcher:
- Runs the installer if `.venv/` or `ui/dist/` are missing.
- Ensures Ollama is running (or starts it if the CLI is installed).
- Starts the backend (FastAPI on `:8765`) with the built UI mounted at `/`.
- Health-checks every service before handing back.
Open http://localhost:8765/ and submit your first run.
Stop everything later with:
```bash
./scripts/stop.sh           # or Ctrl-C if start is still in the foreground
```

To run only the installer:

```bash
./scripts/install.sh        # Linux/macOS
.\scripts\install.ps1       # Windows
```

To deploy to a remote host:

```bash
./scripts/deploy-remote.sh user@host /target/dir
```

This tars the repo, copies it with `scp`, runs `install.sh` on the remote, and optionally starts the server. See docs/USER_MANUAL.md for the full flag list.
```text
┌──────────────────────┐       ┌──────────────────────────────────┐
│ React UI (Vite)      │<─WS──→│ Backend · FastAPI + asyncio      │
│ Tailwind + Radix     │─REST─→│ port 8765                        │
└──────────────────────┘       └────────────┬─────────────────────┘
                                            │
                         ┌──────────────────┼──────────────────┐
                         │                  │                  │
                         ▼                  ▼                  ▼
              ┌──────────────────┐  ┌──────────────┐  ┌────────────────────┐
              │ Ollama HTTP      │  │ SQLite       │  │ Evaluation_Suites  │
              │ localhost:11434  │  │ history.db   │  │ YAML / HF adapters │
              └──────────────────┘  └──────────────┘  └────────────────────┘
```
- Preflight: verifies Ollama is reachable, pulls any missing models (opt-in), and materialises public-benchmark datasets when needed.
- Scheduling: expands `(model, test_case, repetition)` tuples into a pending queue, gated by `concurrency`.
- Generation: streams from Ollama, measuring time-to-first-token and tokens/second per response.
- Scoring: runs every configured metric; a metric error does not fail the test case (per-metric isolation).
- Reporting: writes `report.json`, renders `report.md`, and persists to the SQLite history store.
- Streaming: every event (`run-started`, `test-case-completed`, `run-progress`, terminal) is broadcast over WebSocket to connected clients.
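The Generation step can be sketched as follows. This is an illustration under stated assumptions, not the backend's actual code: `stream_generate` and `summarise_stream` are hypothetical names. It relies on Ollama's `/api/generate` endpoint streaming newline-delimited JSON objects, where the final object has `done: true` and carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds).

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Iterator


def stream_generate(model: str, prompt: str,
                    host: str = "http://localhost:11434") -> Iterator[dict]:
    """Yield the JSON chunks that Ollama streams for a single prompt."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for line in response:  # one JSON object per line
            if line.strip():
                yield json.loads(line)


def summarise_stream(chunks: Iterable[dict],
                     clock: Callable[[], float] = time.perf_counter) -> dict:
    """Collect the response text, time-to-first-token and tokens/second."""
    started = clock()
    first_token_at = None
    parts = []
    final = {}
    for chunk in chunks:
        piece = chunk.get("response", "")
        if piece and first_token_at is None:
            first_token_at = clock()  # first non-empty piece of text
        parts.append(piece)
        if chunk.get("done"):
            final = chunk
    eval_count = final.get("eval_count", 0)
    eval_duration_ns = final.get("eval_duration", 0)
    return {
        "text": "".join(parts),
        "ttft_seconds": None if first_token_at is None else first_token_at - started,
        # eval_duration is reported in nanoseconds, hence the 1e9 factor.
        "tokens_per_second": (eval_count * 1e9 / eval_duration_ns
                              if eval_duration_ns else None),
    }
```

With Ollama running, `summarise_stream(stream_generate("llama3:8b", "What is 2 + 2?"))` would return the three measurements for one response. Because the generator opens the connection lazily, the time-to-first-token measured this way includes request setup and prompt processing.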
Suites are plain YAML or JSON — easy to author, version-control, and diff. Minimum shape:
```yaml
name: my-suite
test_cases:
  - id: arithmetic-simple
    prompt: "What is 2 + 2? Answer with just the number."
    expected_output: "4"
    metrics:
      - name: regex-match
        params:
          pattern: "\\b4\\b"
```

Built-in metrics: `exact-match`, `regex-match`, `contains`, `json-schema-valid`, `length-range`, `llm-as-judge`, `response-capture`.
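To make the `regex-match` example concrete, here is a minimal sketch of how such a metric could score a response. It assumes the pattern may match anywhere in the response (search semantics); the tool's real implementation may differ, and the function name `regex_match` is only illustrative.

```python
import re


def regex_match(response: str, pattern: str) -> bool:
    """Pass when the pattern occurs anywhere in the model's response.

    Assumption: search semantics rather than a full-string match.
    """
    return re.search(pattern, response) is not None


# The YAML double-quoted string "\\b4\\b" decodes to the regex \b4\b,
# which matches a standalone 4 but not the 4 inside 42.
pattern = r"\b4\b"
print(regex_match("The answer is 4.", pattern))  # True
print(regex_match("It is 42.", pattern))         # False
print(regex_match("2 + 2 = 4", pattern))         # True
```

The word boundaries are what keep a verbose answer such as "42" or "14" from counting as correct.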
Materialise any supported benchmark into a standard suite file:
```bash
python -m ollama_evaluator.cli convert mmlu \
  --output examples/suites --limit 200 --seed 42
```

From then on the benchmark looks identical to a hand-authored suite.
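The `--limit` and `--seed` flags suggest a reproducible subsample of the benchmark. How the converter actually picks cases is not documented here, so the following is only a sketch of one common approach, with a hypothetical `sample_cases` helper: a private seeded random generator selects `limit` cases and keeps them in dataset order.

```python
import random


def sample_cases(cases: list[dict], limit: int, seed: int) -> list[dict]:
    """Pick at most `limit` cases, identically on every run with the same seed.

    Hypothetical helper; the real converter may sample differently.
    """
    if limit >= len(cases):
        return list(cases)
    rng = random.Random(seed)  # private RNG, global random state is untouched
    picked = sorted(rng.sample(range(len(cases)), limit))
    return [cases[index] for index in picked]


dataset = [{"id": f"case-{n}"} for n in range(1000)]
first = sample_cases(dataset, limit=200, seed=42)
second = sample_cases(dataset, limit=200, seed=42)
print(len(first), first == second)
```

Fixing the seed means two people converting the same benchmark end up with the same test cases, which keeps pass rates comparable across machines.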
Running more than one model in a single submission gives you per-model stats and a model × suite breakdown automatically:
| Model | Passed | Failed | Pass rate | Mean tokens/s |
|---|---|---|---|---|
| `qwen3.5:35b-a3b` | 3 | 0 | 100.0% | 48.97 |
| `qwen3.6:27b` | 2 | 1 | 66.7% | 10.54 |
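The columns in a table like this can be derived from per-execution records. The sketch below assumes a hypothetical record shape (`model`, `passed`, `tokens_per_s`) and uses made-up model names and numbers; it is not the report generator's actual code.

```python
from collections import defaultdict


def per_model_stats(executions: list[dict]) -> dict[str, dict]:
    """Group executions by model and compute pass rate and mean throughput."""
    grouped = defaultdict(list)
    for execution in executions:
        grouped[execution["model"]].append(execution)

    stats = {}
    for model, rows in grouped.items():
        passed = sum(1 for row in rows if row["passed"])
        stats[model] = {
            "passed": passed,
            "failed": len(rows) - passed,
            "pass_rate": round(100.0 * passed / len(rows), 1),
            "mean_tokens_per_s": round(
                sum(row["tokens_per_s"] for row in rows) / len(rows), 2
            ),
        }
    return stats


# Illustrative data only: two hypothetical models, three test cases each.
executions = [
    {"model": "model-a", "passed": True, "tokens_per_s": 50.0},
    {"model": "model-a", "passed": True, "tokens_per_s": 48.0},
    {"model": "model-a", "passed": True, "tokens_per_s": 49.0},
    {"model": "model-b", "passed": True, "tokens_per_s": 10.0},
    {"model": "model-b", "passed": False, "tokens_per_s": 11.0},
    {"model": "model-b", "passed": True, "tokens_per_s": 12.0},
]
print(per_model_stats(executions))
```

Two passes out of three executions gives 100 × 2 / 3 ≈ 66.7%, the same arithmetic behind the 66.7% row in the table.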
- User Manual — hands-on walkthrough (installation, first run, suite authoring, HuggingFace loader, remote SSH, report reading, troubleshooting).
- Requirements — functional requirements.
- Design — system design and architecture.
- UI style previews: open `docs/ui-previews/index.html` locally to see the three design directions that were evaluated before the final pick.
Validate a suite file:

```bash
python -m ollama_evaluator.cli validate-suite examples/suites/reasoning-basics.yaml
# → OK: reasoning-basics (3 test cases)
```

Run an evaluation:

```bash
python -m ollama_evaluator.cli --config examples/config.qwen.yaml run
# → Run <id>: 3 executions, passed=2, failed=1, error=0, timeout=0
```

The exit code is 0 only if every test passed, 1 on any failure, and 2 on preflight errors, which makes it useful for CI pipelines.
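A CI job can branch on those exit codes. The wrapper below is a sketch: the command line is the one shown above, while the helper names `describe_exit` and `run_evaluation` are hypothetical and not part of the tool.

```python
import subprocess
import sys

# Exit codes as documented for the `run` subcommand.
EXIT_MEANINGS = {
    0: "all tests passed",
    1: "at least one test failed",
    2: "preflight error",
}


def describe_exit(code: int) -> str:
    """Translate an evaluator exit code into a short message for CI logs."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_evaluation(config_path: str) -> int:
    """Run the evaluator CLI and return its exit code without raising."""
    command = [
        sys.executable, "-m", "ollama_evaluator.cli",
        "--config", config_path, "run",
    ]
    completed = subprocess.run(command, check=False)
    print(f"evaluator: {describe_exit(completed.returncode)}")
    return completed.returncode
```

Calling `sys.exit(run_evaluation("examples/config.qwen.yaml"))` from a pipeline step propagates the result. Keeping code 2 separate from code 1 lets a pipeline tell an unreachable Ollama apart from a genuine model regression.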
Compare two runs:

```bash
python -m ollama_evaluator.cli --config examples/config.qwen.yaml compare <RUN_A> <RUN_B>
```

Serve the API and UI:

```bash
OLLAMA_EVAL_UI_DIR=$PWD/ui/dist \
  python -m ollama_evaluator.cli --config examples/config.qwen.yaml \
  serve --host 0.0.0.0 --port 8765
```

```text
.
├── backend/        # Python backend (FastAPI + CLI + 447 tests)
├── ui/             # Vite + React + TypeScript UI (Tailwind + Radix)
├── shared/         # OpenAPI + JSON Schemas shared by both
├── examples/       # 33 evaluation suites + sample configs
├── scripts/        # install / start / stop / deploy helpers
├── docs/           # User manual, screenshots, UI style previews
└── .kiro/specs/    # Requirements, design, task list
```
```bash
make test              # All 447 backend tests (~45 s)
make test-unit         # Fastest path
make test-property     # 44 Hypothesis property tests
make test-integration  # FastAPI + fake Ollama
make ui-test           # Vitest
```

The property tests derive from a formal specification in `.kiro/specs/` and exercise invariants like:
- Evaluation_Suite YAML↔JSON round-trip equivalence
- Metric error isolation — one broken metric never fails a test case
- Test-case-completed events bijectively cover executed tuples
- UI event-stream reducer is replay-deterministic
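The metric error isolation invariant can be illustrated with a small sketch. The names `score_response` and `case_passed` are hypothetical, and so is the pass rule, which assumes that errored metrics are ignored and every scored metric must pass; the real scorer may aggregate differently.

```python
from typing import Callable

Metric = Callable[[str], bool]


def score_response(response: str, metrics: dict[str, Metric]) -> dict[str, dict]:
    """Run every metric, recording an error instead of letting it propagate."""
    results = {}
    for name, metric in metrics.items():
        try:
            results[name] = {"status": "scored", "passed": bool(metric(response))}
        except Exception as exc:  # the failure stays local to this one metric
            results[name] = {"status": "error", "detail": repr(exc)}
    return results


def case_passed(results: dict[str, dict]) -> bool:
    """Assumed rule: ignore errored metrics; all scored metrics must pass."""
    scored = [entry for entry in results.values() if entry["status"] == "scored"]
    return all(entry["passed"] for entry in scored)


metrics = {
    "contains-4": lambda response: "4" in response,
    "broken": lambda response: 1 / 0 > 0,  # always raises ZeroDivisionError
}
results = score_response("The answer is 4.", metrics)
print(results["broken"]["status"], case_passed(results))  # error True
```

A property test for this invariant would generate arbitrary sets of raising metrics and assert that the outcome of the remaining metrics is unchanged.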
Issues and pull requests welcome. Please:
- Open an issue first for anything bigger than a one-line fix.
- Add or update tests for new behaviour.
- Match the existing code style (`ruff` + `mypy` for Python; TypeScript strict for the UI).
- Run `make test` and `make ui-test` locally before pushing.
MIT — do what you like, no warranty.
- Ollama for making it effortless to run LLMs locally.
- The open-source benchmark authors — Hendrycks et al. (MMLU), Zellers et al. (HellaSwag), Lin et al. (TruthfulQA), Cobbe et al. (GSM8K), Chen et al. (HumanEval), Suzgun et al. (BBH), Clark et al. (ARC), and many more whose datasets this tool materialises.
- The FastAPI, Pydantic, TanStack Query, Vite, React, Tailwind, Radix UI, and Hypothesis / fast-check communities.



