Local-first LLM evaluation workbench for Ollama users who want trustworthy metrics, fair model comparisons, and faster iteration loops.
Status: v1.0.0 - Local-first eval workbench with trusted runs, optional judge scoring, frontier comparisons, Arena battles, and custom dataset tooling; 3 published releases with canonical changelog in GitHub Releases.
- Why EvalBench
- Why It's Different
- Quickstart (60 Seconds)
- Real Usage Example
- Demo
- Architecture
- Core Concepts
- Features
- Try in 5 Minutes
- Public Status
- Releases
- Technical Stack
- Setup & Installation
- Platform Notes
- How to Run and Stop
- Validation
- Roadmap
- Contributing
EvalBench is for builders who run local models and want evidence, not vibes.
It gives you one practical loop:
- Benchmark quality with real metrics across tasks like summarization, code, RAG, knowledge, and embeddings.
- Compare local models against frontier models in the same run when you need an external baseline.
- Inspect reliability and failure context so run quality and run health are both visible.
- Create your own golden datasets so evaluations match your actual use case, not generic demos.
This project started as a teaching tool for students to learn golden datasets, metrics, and LLM-as-Judge without writing heavy pipeline code.
If you want "LM Studio for evaluation" with stronger rigor and dataset control, this is it.
- Local-first by design: EvalBench is intentionally single-user and local-first, with SQLite on-device and encrypted key storage.
- Hybrid evaluation without platform lock-in: keep Ollama as your center, then optionally add OpenAI/Gemini/Claude/Groq models for comparison.
- Objective + subjective scoring in one flow: combine reference metrics with optional LLM-as-Judge scoring and rationale.
- Dataset creation is a core feature: build, import, version, and safely manage custom datasets from inside the product.
- Two modes of truth: metric-based head-to-head comparison plus human preference testing via blind Arena battles and ELO.
- Educational layer included: built-in metric guidance helps teams learn why each score exists, not just what the number is.
- Start Ollama locally and ensure at least one model is pulled.
- Install dependencies and start EvalBench:

```bash
npm install
npm run dev
```

- Open http://localhost:5173.
- Go to Eval Wizard, choose a task, select models, and run your first evaluation.
Example: compare two local models on Question Answering.
- Open Eval Wizard.
- Choose Question Answering.
- Select two local Ollama models (for example tinyllama:1.1b and gemma3:270m).
- Start the run and open Run Details.
- Review Exact Match and Token F1 plus run health fields (failed pairs, retries, cache hits).
- Export results as JSON, Markdown, or CSV.
Expected outcome:
- You get model-by-model score differences on the same dataset/context.
- You can inspect best/worst examples before deciding which model to ship.
- You can share reproducible output with your team from the export files.
EvalBench uses a local-first architecture optimized for privacy and speed, separating a lightweight, reactive frontend from a computation-heavy Python backend.
```
┌─────────────────────────────────────────────────────────────┐
│                   Frontend (React + Vite)                    │
│  ┌──────────────┬──────────────┬──────────────┬──────────┐   │
│  │  Dashboard   │ Eval Wizard  │   Compare    │  Arena   │   │
│  └──────────────┴──────────────┴──────────────┴──────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │  REST API + Server-Sent Events (SSE)
┌─────────────────────────────────────────────────────────────┐
│                  Backend (Python FastAPI)                     │
│  ┌──────────────┬──────────────┬──────────────┐              │
│  │   Scoring    │  Eval Runner │    Ollama    │              │
│  │  Algorithms  │ (vLLM later) │  Integration │              │
│  └──────────────┴──────────────┴──────────────┘              │
└─────────────────────────────────────────────────────────────┘
                              │
                 SQLite Database (evalbench.db)
```
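The frontend talks to the backend over plain REST calls plus an SSE stream for live run progress. As a rough illustration of that streaming channel, a FastAPI endpoint can emit progress events as shown below; the route path and payload fields are hypothetical, not EvalBench's actual API.

```python
# Minimal sketch of streaming run progress over SSE from FastAPI.
# The route name and payload fields are illustrative assumptions only.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def run_progress_events(run_id: str):
    # The real eval runner would report actual progress; here we fake three ticks.
    for step in range(1, 4):
        payload = {"run_id": run_id, "completed_pairs": step, "total_pairs": 3}
        yield f"data: {json.dumps(payload)}\n\n"
        await asyncio.sleep(1)

@app.get("/api/runs/{run_id}/events")
async def stream_run(run_id: str):
    return StreamingResponse(run_progress_events(run_id), media_type="text/event-stream")
```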
We use established Python libraries (rouge-score, sacrebleu, nltk) to compute metrics like ROUGE, BLEU, Exact Match, Token F1, and Distinct-1/2 locally against Ground-Truth Golden Datasets. Datasets are seeded from inline subsets at startup (no external downloads).
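For a concrete sense of what those library calls look like, here is a small standalone sketch (not the project's actual scorer code) computing ROUGE-L, sentence-level BLEU, Exact Match, and a SQuAD-style Token F1 for a single prediction/reference pair:

```python
# Standalone metric sketch; not the actual EvalBench scorer modules.
from collections import Counter

import sacrebleu
from rouge_score import rouge_scorer

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "the cat sat on the mat"
reference = "the cat is on the mat"

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, prediction)
bleu = sacrebleu.sentence_bleu(prediction, [reference]).score
exact_match = float(prediction.strip() == reference.strip())

print(rouge["rougeL"].fmeasure, bleu, exact_match, token_f1(prediction, reference))
```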
For subjective generation tasks, EvalBench can optionally use a configured judge model to score outputs on criteria such as coherence, fluency, and relevance, returning both a score and rationale. Judge providers are loaded lazily so optional SDKs do not block the core app, and judge scoring can be turned off entirely from Settings when you want objective-only runs.
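The judge flow amounts to prompting the configured model for criterion scores plus a rationale in a structured shape. The Pydantic model and prompt below are an illustrative sketch; EvalBench's real schema, criteria names, and prompt wording may differ.

```python
# Illustrative judge-output shape and prompt; field names are hypothetical.
from pydantic import BaseModel, Field

class JudgeVerdict(BaseModel):
    coherence: int = Field(ge=1, le=5)
    fluency: int = Field(ge=1, le=5)
    relevance: int = Field(ge=1, le=5)
    rationale: str

JUDGE_PROMPT = """You are an evaluation judge. Score the candidate answer
against the question on coherence, fluency, and relevance (1-5 each),
then explain your reasoning. Respond as JSON with keys
coherence, fluency, relevance, rationale."""

# A judge model's raw JSON reply can be validated straight into the schema (Pydantic v2).
raw = '{"coherence": 4, "fluency": 5, "relevance": 3, "rationale": "Fluent but drifts off-topic."}'
verdict = JudgeVerdict.model_validate_json(raw)
print(verdict.rationale)
```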
EvalBench computes mean scores and margin of error where supported, and now separates quality from reliability by tracking failed pairs, retries, cache hits, cancellation state, and success rate for each run.
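The quality half of that summary is ordinary interval estimation. A minimal sketch with scipy, illustrating the statistics rather than the project's exact implementation:

```python
# Mean and 95% margin of error over per-example scores (illustrative values).
import numpy as np
from scipy import stats

scores = [0.62, 0.71, 0.55, 0.68, 0.73, 0.60]  # per-example metric values

mean = np.mean(scores)
sem = stats.sem(scores)                          # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(scores) - 1)  # two-sided 95% t critical value
margin_of_error = t_crit * sem

print(f"{mean:.3f} ± {margin_of_error:.3f}")
```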
- Local model benchmarking first: Auto-discovers Ollama models and keeps the primary catalog local-first.
- Mixed local + frontier evaluation: Add selected OpenAI/Gemini/Claude/Groq models as comparison baselines in the same run.
- Optional LLM-as-Judge: Turn judge scoring on or off from Settings; when enabled, judge rationale is available in run analysis.
- Task-aware eval wizard: Task selection drives metrics and benchmark dataset defaults with estimated runtime context.
- Custom dataset builder and registry: Manually create or import CSV/JSON datasets, version by name, and safely delete unused user-authored sets.
- Trusted run lifecycle: Live progress, cancellation, retries, partial-failure visibility, cache-hit tracking, and clear run health indicators.
- Head-to-head compare: Fair model comparison based on shared completed run contexts with clearer significance framing.
- Arena battles with ELO: Run blind pairwise matchups (random or manual model-vs-model) and update leaderboard ratings (an ELO update sketch follows this list).
- Export anywhere: Download results as JSON, Markdown, or CSV from Run Details.
- Built to teach: Learn tab and metric guidance explain why each metric fits each task.
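The Arena leaderboard uses ELO ratings; the standard update rule looks like the sketch below. The K-factor of 32 and the 1500 starting rating are assumptions for illustration, not confirmed project defaults.

```python
# Standard ELO update after one blind Arena battle; K=32 and 1500 start are assumptions.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: model A (1500) beats model B (1520) in a blind vote.
print(update_elo(1500, 1520, a_won=True))  # A gains ~17 points; B loses the same
```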
- Pull or discover 2 local Ollama models in Models.
- Open Eval Wizard and start a benchmark run on a built-in dataset.
- (Optional) Enable one frontier comparison model in Settings and run mixed local + cloud.
- Open Run Details to inspect quality metrics, run health, and example outputs.
- Export the run as JSON, Markdown, or CSV.
- Launch Arena to run a blind battle and vote your preferred output.
- Stable version: v1.0.0
- Release history: 3 tagged releases
- Canonical changelog: GitHub Releases
- Validation baseline: `npm run check` and `pytest -q`
GitHub Releases are the canonical changelog for EvalBench and include shipped features, user impact, validation evidence, and upgrade notes.
- Scope summary: what shipped
- User impact: why it matters
- Explicit upgrade notes: breaking changes, migration steps, config/env changes
- Validation evidence: `npm run check`, `pytest -q`, plus commit/tag reference
| Layer | Technology |
|---|---|
| Frontend UI | React 18, Vite, Tailwind CSS, Shadcn UI |
| Routing & State | Wouter, TanStack React Query |
| Charts | Recharts |
| Backend Framework | Python 3, FastAPI |
| Database | SQLite (via SQLAlchemy ORM) |
| Validation | Pydantic v2 (Backend) + Zod (Frontend) |
| Scoring Libs | rouge-score, sacrebleu, nltk, scipy |
- Node.js: v18 or higher (for the frontend React app)
- Python: v3.11 or higher (Windows users: Python 3.12 is recommended for best dependency compatibility)
- Ollama: Installed locally and running on http://localhost:11434 (with at least one model pulled)
- uv: Required for backend install/run; you can install it with `pip` if it is missing
- Clone and Install Frontend

```bash
git clone https://github.com/tatwan/EvalBench.git
cd EvalBench
npm install
```

- Install Backend Dependencies

EvalBench uses `pyproject.toml` for dependency management (lockfile: `uv.lock`) as the source of truth.

If you want to install Python deps explicitly up front, run:

```bash
uv sync
```

If you do not have uv, install it first with pip, then rerun the sync:

```bash
python -m pip install uv
uv sync
```

Or use the npm helper:

```bash
npm run py:install
```

macOS is the primary development environment. The notes below are for Windows users.
Project metadata currently allows Python 3.11+ (see pyproject.toml), but plain uv sync on Windows may select a very new interpreter (for example 3.14) and fail due to binary dependency and toolchain gaps.
Use Python 3.12 explicitly for reliable setup:
```bash
uv sync --python 3.12
```

If npm commands fail with a "running scripts is disabled" error, run:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

If combined startup fails in your shell, run the frontend and backend separately:

```bash
# Terminal 1
npm run dev:frontend

# Terminal 2
npm run dev:backend
```

If combined startup works in your environment, you can keep using `npm run dev`.
On first run, EvalBench auto-generates an encryption key at ~/.evalbench_key (chmod 600). All API keys entered in Settings are encrypted with this key before being stored in evalbench.db.
Back up ~/.evalbench_key. If you lose this file, stored API keys become permanently unreadable and must be re-entered.
This app is designed for single-user local use. Do not expose the backend port over a network.
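The README does not state which cipher is used, so the following sketch only illustrates the general pattern of protecting API keys with a file-held secret. Fernet (from the cryptography package) is an assumed stand-in; the actual implementation may differ.

```python
# Illustration of key-at-rest encryption with a file-held secret.
# Fernet is an assumption here; EvalBench's actual cipher may differ.
import os
from pathlib import Path

from cryptography.fernet import Fernet

KEY_PATH = Path.home() / ".evalbench_key"

if not KEY_PATH.exists():
    KEY_PATH.write_bytes(Fernet.generate_key())
    os.chmod(KEY_PATH, 0o600)  # restrict the key file to the current user

fernet = Fernet(KEY_PATH.read_bytes())
ciphertext = fernet.encrypt(b"sk-example-api-key")  # what gets stored in evalbench.db
plaintext = fernet.decrypt(ciphertext)              # decrypted only when a provider call needs it
```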
You can start the entire stack (both the Vite Frontend and the FastAPI Backend) with a single command from the root EvalBench directory:
```bash
npm run dev
```

This command uses `concurrently` to spin up two processes:

- Frontend runs on http://localhost:5173
- Backend runs on http://localhost:8001 (the frontend automatically proxies `/api` requests to this port)
Open http://localhost:5173 in your browser to view EvalBench.
To stop the application, simply go to the terminal window where it is currently running and press:
Ctrl + C
This will gracefully terminate both the Frontend Vite server and the Backend FastAPI server simultaneously. Make sure to close the browser tab to avoid any lingering connection attempts.
Use these commands before shipping changes:
```bash
npm run check
pytest -q
```

Release notes in GitHub Releases are the canonical changelog for this project.
Both checks are kept green as part of the active audit/remediation work.
- Broader provider and model coverage for evaluated comparison baselines
- Additional benchmark packs and deeper task-specific metrics
- Richer shareable reporting and export workflows
- Performance scaling improvements (parallelism, caching, larger-run ergonomics)
- Stronger dataset provenance, governance, and collaboration workflows
Ideas, issues, and PRs are welcome. Here are the most impactful ways to contribute:
Task types live in `backend/scoring/`; each task has its own scorer module. To add a new task (a hedged sketch of a scorer module follows these steps):

- Create a new scorer file in `backend/scoring/` following the pattern of an existing one (e.g., `summarization.py`).
- Register the task name and its default metrics in `backend/schemas.py`, and keep shared response shapes aligned in `shared/routes.ts` if needed.
- Add a seed dataset entry in `backend/services/dataset_seeder.py` so the Eval Wizard can surface your task with a working built-in example.
- Add at least one `pytest` test in `tests/` covering expected score outputs.
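Because the internal scorer interface is not documented here, the following is only a hypothetical shape for a new scorer module, meant to show the kind of per-example function you would pattern after an existing file such as `summarization.py`:

```python
# backend/scoring/sentiment.py -- hypothetical scorer module shape, not the real interface.

def score_pair(prediction: str, reference: str) -> dict:
    """Return per-example metric values; the runner aggregates these across the dataset."""
    normalized_pred = prediction.strip().lower()
    normalized_ref = reference.strip().lower()
    return {
        "exact_match": float(normalized_pred == normalized_ref),
        "label_in_output": float(normalized_ref in normalized_pred),
    }
```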
Metrics are computed inside the task scorer files in `backend/scoring/`. To add a metric (a sketch follows the list below):
- Implement the scoring function and return it as part of the scorer's output dictionary.
- Add the metric label to the frontend's display config so it renders in Run Details and Compare views.
- Document the metric in the Learn tab guidance (the educational layer) so users understand when to use it.
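As an example of the first step, a lexical-diversity metric like Distinct-1/2 (already among the metrics EvalBench reports) could be implemented and merged into a scorer's output dictionary roughly like this sketch; function and key names here are illustrative.

```python
# Sketch of adding a Distinct-n metric to a scorer's output dictionary.

def distinct_n(text: str, n: int) -> float:
    """Unique n-grams divided by total n-grams; a simple lexical-diversity metric."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def score_pair(prediction: str, reference: str) -> dict:
    scores = {"exact_match": float(prediction.strip() == reference.strip())}
    scores["distinct_1"] = distinct_n(prediction, 1)  # new metric merged into the output
    scores["distinct_2"] = distinct_n(prediction, 2)
    return scores
```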
When proposing a new built-in dataset, please include:
- The benchmark source + license
- Expected metric behavior on known-good outputs
- A small seed subset (10–20 examples) for quick local tests
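Seed subsets are easiest to review as a small list of input/reference pairs. The exact fields the dataset seeder expects are not documented here, so the shape below is purely illustrative:

```python
# Hypothetical shape for a 10-20 example seed subset; actual seeder fields may differ.
SEED_EXAMPLES = [
    {
        "input": "What is the capital of France?",
        "reference": "Paris",
    },
    {
        "input": "What is the chemical symbol for gold?",
        "reference": "Au",
    },
    # ... extend to 10-20 examples covering easy, typical, and edge cases
]
```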
- Open an issue before large PRs to align on scope.
- Keep PRs focused: one feature or fix per PR.
- Run `npm run check` and `pytest -q` before submitting and include the output in your PR description.
MIT