
EvalBench


Local-first LLM evaluation workbench for Ollama users who want trustworthy metrics, fair model comparisons, and faster iteration loops.


Status: v1.0.0. A local-first eval workbench with trusted runs, optional judge scoring, frontier comparisons, Arena battles, and custom dataset tooling. Three releases have been published; the canonical changelog lives in GitHub Releases.


Why EvalBench

EvalBench is for builders who run local models and want evidence, not vibes.

It gives you one practical loop:

  1. Benchmark quality with real metrics across tasks like summarization, code, RAG, knowledge, and embeddings.
  2. Compare local models against frontier models in the same run when you need an external baseline.
  3. Inspect reliability and failure context so run quality and run health are both visible.
  4. Create your own golden datasets so evaluations match your actual use case, not generic demos.

This project started as a teaching tool for students to learn golden datasets, metrics, and LLM-as-Judge without writing heavy pipeline code.

If you want "LM Studio for evaluation" with stronger rigor and dataset control, this is it.

Why It's Different

  • Local-first by design: EvalBench is intentionally single-user and local-first, with SQLite on-device and encrypted key storage.
  • Hybrid evaluation without platform lock-in: keep Ollama as your center, then optionally add OpenAI/Gemini/Claude/Groq models for comparison.
  • Objective + subjective scoring in one flow: combine reference metrics with optional LLM-as-Judge scoring and rationale.
  • Dataset creation is a core feature: build, import, version, and safely manage custom datasets from inside the product.
  • Two modes of truth: metric-based head-to-head comparison plus human preference testing via blind Arena battles and ELO.
  • Educational layer included: built-in metric guidance helps teams learn why each score exists, not just what the number is.

Quickstart (60 Seconds)

  1. Start Ollama locally and ensure at least one model is pulled.
  2. Install dependencies and start EvalBench:
npm install
npm run dev
  3. Open http://localhost:5173.
  4. Go to Eval Wizard, choose a task, select models, and run your first evaluation.

Real Usage Example

Example: compare two local models on Question Answering.

  1. Open Eval Wizard.
  2. Choose Question Answering.
  3. Select two local Ollama models (for example tinyllama:1.1b and gemma3:270m).
  4. Start the run and open Run Details.
  5. Review Exact Match and Token F1 plus run health fields (failed pairs, retries, cache hits).
  6. Export results as JSON, Markdown, or CSV.

Expected outcome:

  • You get model-by-model score differences on the same dataset/context.
  • You can inspect best/worst examples before deciding which model to ship.
  • You can share reproducible output with your team from the export files.
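
Exact Match and Token F1 follow the usual SQuAD-style definitions; a minimal sketch of what these two scores measure (EvalBench's own normalization rules may differ):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact Match rewards only perfect answers, while Token F1 gives partial credit for overlapping words, which is why the two often diverge on long-form answers.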

Demo

Product Overview (GIF)


Eval Wizard Walkthrough (GIF)


Screenshots

Eval Wizard — Run a benchmark in seconds


Arena — Blind pairwise voting with ELO


Head-to-Head Compare


Architecture

EvalBench uses a local-first architecture optimized for privacy and speed. It separates a lightweight, reactive frontend from a computation-heavy Python backend.

┌─────────────────────────────────────────────────────────────┐
│                    Frontend (React + Vite)                  │
│  ┌──────────────┬──────────────┬──────────────┬──────────┐  │
│  │  Dashboard   │ Eval Wizard  │  Compare     │  Arena   │  │
│  └──────────────┴──────────────┴──────────────┴──────────┘  │
└─────────────────────────────────────────────────────────────┘
                             ↕ REST API + Server-Sent Events (SSE)
┌─────────────────────────────────────────────────────────────┐
│            Backend (Python FastAPI)                         │
│  ┌──────────────┬──────────────┬──────────────┐             │
│  │  Scoring     │ Eval Runner  │  Ollama      │             │
│  │  Algorithms  │ (vLLM later) │  Integration │             │
│  └──────────────┴──────────────┴──────────────┘             │
└─────────────────────────────────────────────────────────────┘
                             ↕
               SQLite Database (evalbench.db)
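
The frontend and backend communicate over REST plus Server-Sent Events for live run progress. A minimal sketch of how a FastAPI backend could frame SSE progress events (the event name and payload shape are assumptions, not EvalBench's actual wire format):

```python
import json
from typing import Iterator

def format_sse(data: dict, event: str = "progress") -> str:
    """Serialize one Server-Sent Event frame: event name plus JSON payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def run_progress_events(total_pairs: int) -> Iterator[str]:
    """Yield one SSE frame per completed evaluation pair (sketch)."""
    for done in range(1, total_pairs + 1):
        yield format_sse({"done": done, "total": total_pairs})

# In FastAPI, a generator like this would be wrapped in
# StreamingResponse(run_progress_events(n), media_type="text/event-stream")
# so the frontend can render live progress without polling.
```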

Core Concepts

1. Traditional Reference Metrics

We use established Python libraries (rouge-score, sacrebleu, nltk) to compute metrics like ROUGE, BLEU, Exact Match, Token F1, and Distinct-1/2 locally against Ground-Truth Golden Datasets. Datasets are seeded from inline subsets at startup (no external downloads).
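
ROUGE and BLEU come from the libraries above; a metric like Distinct-1/2 is simple enough to sketch with the standard library (EvalBench's own tokenization may differ):

```python
def distinct_n(text: str, n: int) -> float:
    """Ratio of unique n-grams to total n-grams; higher means more lexical diversity."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

Distinct-1 uses n=1 (unigrams) and Distinct-2 uses n=2 (bigrams); repetitive outputs drive both scores toward zero.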

2. LLM-as-Judge (Optional)

For subjective generation tasks, EvalBench can optionally use a configured judge model to score outputs on criteria such as coherence, fluency, and relevance, returning both a score and rationale. Judge providers are loaded lazily so optional SDKs do not block the core app, and judge scoring can be turned off entirely from Settings when you want objective-only runs.
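
A judge call typically boils down to a prompt template plus strict parsing of the model's reply. A hedged sketch of that pattern (the prompt wording and JSON contract are illustrative, not EvalBench's actual templates):

```python
import json

# Hypothetical judge prompt; EvalBench's real templates may differ.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE on {criterion}
from 1 (poor) to 5 (excellent). Reply with JSON: {{"score": <int>, "rationale": "<why>"}}

PROMPT: {prompt}
RESPONSE: {response}"""

def build_judge_prompt(criterion: str, prompt: str, response: str) -> str:
    """Fill the judge template for one (prompt, response) pair."""
    return JUDGE_PROMPT.format(criterion=criterion, prompt=prompt, response=response)

def parse_judge_reply(raw: str) -> tuple[int, str]:
    """Extract (score, rationale) from the judge model's JSON reply."""
    data = json.loads(raw)
    return int(data["score"]), str(data["rationale"])
```

The parsed score and rationale can then be stored alongside the objective metrics for the same example.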

3. Statistical Rigor And Reliability

EvalBench computes mean scores and margin of error where supported, and now separates quality from reliability by tracking failed pairs, retries, cache hits, cancellation state, and success rate for each run.
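
Margin of error over per-example scores can be sketched as follows; this uses a normal approximation with z = 1.96, while EvalBench may instead use a t-distribution via scipy:

```python
import math
from statistics import stdev

def margin_of_error(scores: list[float], z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for the mean score."""
    if len(scores) < 2:
        return 0.0  # not enough samples to estimate variance
    return z * stdev(scores) / math.sqrt(len(scores))
```

Reporting mean ± margin of error makes it visible when two models' score difference is smaller than the run's own noise.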


Features

  • Local model benchmarking first: Auto-discovers Ollama models and keeps the primary catalog local-first.
  • Mixed local + frontier evaluation: Add selected OpenAI/Gemini/Claude/Groq models as comparison baselines in the same run.
  • Optional LLM-as-Judge: Turn judge scoring on or off from Settings; when enabled, judge rationale is available in run analysis.
  • Task-aware eval wizard: Task selection drives metrics and benchmark dataset defaults with estimated runtime context.
  • Custom dataset builder and registry: Manually create or import CSV/JSON datasets, version by name, and safely delete unused user-authored sets.
  • Trusted run lifecycle: Live progress, cancellation, retries, partial-failure visibility, cache-hit tracking, and clear run health indicators.
  • Head-to-head compare: Fair model comparison based on shared completed run contexts with clearer significance framing.
  • Arena battles with ELO: Run blind pairwise matchups (random or manual model-vs-model) and update leaderboard ratings.
  • Export anywhere: Download results as JSON, Markdown, or CSV from Run Details.
  • Built to teach: Learn tab and metric guidance explain why each metric fits each task.
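
The Arena leaderboard follows the standard Elo update rule; a minimal sketch (the K-factor of 32 is an assumption, and EvalBench's actual constants may differ):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 if A loses.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

An upset win against a higher-rated model moves both ratings more than an expected win, which is why blind pairwise votes converge to a meaningful ranking over time.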

Try in 5 Minutes

  1. Pull or discover 2 local Ollama models in Models.
  2. Open Eval Wizard and start a benchmark run on a built-in dataset.
  3. (Optional) Enable one frontier comparison model in Settings and run mixed local + cloud.
  4. Open Run Details to inspect quality metrics, run health, and example outputs.
  5. Export the run as JSON, Markdown, or CSV.
  6. Launch Arena to run a blind battle and vote your preferred output.

Public Status

  • Stable version: v1.0.0
  • Release history: 3 tagged releases
  • Canonical changelog: GitHub Releases
  • Validation baseline: npm run check and pytest -q

Releases

GitHub Releases are the canonical changelog for EvalBench and include shipped features, user impact, validation evidence, and upgrade notes.

Release Quality Checklist (recommended)

  • Scope summary (what shipped) and user impact (why it matters)
  • Explicit upgrade notes: breaking changes, migration steps, config/env changes
  • Validation evidence: npm run check, pytest -q, plus commit/tag reference

Technical Stack

Layer              Technology
Frontend UI        React 18, Vite, Tailwind CSS, Shadcn UI
Routing & State    Wouter, TanStack React Query
Charts             Recharts
Backend Framework  Python 3, FastAPI
Database           SQLite (via SQLAlchemy ORM)
Validation         Pydantic v2 (backend) + Zod (frontend)
Scoring Libs       rouge-score, sacrebleu, nltk, scipy

Setup & Installation

Prerequisites

  1. Node.js: v18 or higher (for the frontend React app)
  2. Python: v3.11 or higher (Windows users: Python 3.12 is recommended for best dependency compatibility)
  3. Ollama: Installed locally and running on http://localhost:11434 (with at least one model pulled)
  4. uv: Required to install and run the backend; if it is missing, you can install it with pip

Installation Steps

  1. Clone and Install Frontend
git clone https://github.com/tatwan/EvalBench.git
cd EvalBench
npm install
  2. Install Backend Dependencies. EvalBench uses pyproject.toml dependency management (lockfile: uv.lock) as the source of truth.

If you want to install Python deps explicitly up front, run:

uv sync

If you do not have uv, install it first with pip:

python -m pip install uv

Then rerun:

uv sync

Or use the npm helper:

npm run py:install

Platform Notes

macOS is the primary development environment. The notes below are for Windows users.

Python Version On Windows

Project metadata currently allows Python 3.11+ (see pyproject.toml), but plain uv sync on Windows may select a very new interpreter (for example 3.14) and fail due to binary dependency and toolchain gaps.

Use Python 3.12 explicitly for reliable setup:

uv sync --python 3.12

PowerShell Execution Policy

If npm commands fail with a "running scripts is disabled" error, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

npm run dev On Windows

If combined startup fails in your shell, run frontend and backend separately:

# Terminal 1
npm run dev:frontend

# Terminal 2
npm run dev:backend

If combined startup works in your environment, you can keep using npm run dev.

Security Note — Encryption Key Backup

On first run, EvalBench auto-generates an encryption key at ~/.evalbench_key (chmod 600). All API keys entered in Settings are encrypted with this key before being stored in evalbench.db.

Back up ~/.evalbench_key. If you lose this file, stored API keys become permanently unreadable and must be re-entered.

This app is designed for single-user local use. Do not expose the backend port over a network.
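
The cipher and key format EvalBench uses are not shown here, but the generate-and-protect step for the key file can be sketched with the standard library (the function name is hypothetical):

```python
import os
import secrets
from pathlib import Path

def ensure_key(path: Path) -> bytes:
    """Load the key file if present, else create a random 32-byte key
    with owner-only permissions (the equivalent of chmod 600)."""
    if path.exists():
        return path.read_bytes()
    key = secrets.token_bytes(32)
    path.write_bytes(key)
    os.chmod(path, 0o600)  # owner read/write only
    return key
```

Because the key lives outside the database, copying evalbench.db alone is not enough for a full backup; the key file must travel with it.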


How to Run and Stop

🟢 Starting the App

You can start the entire stack (both the Vite Frontend and the FastAPI Backend) with a single command from the root EvalBench directory:

npm run dev

This command uses concurrently to spin up two processes:

  • Frontend runs on http://localhost:5173
  • Backend runs on http://localhost:8001 (Note: The frontend automatically proxies /api requests to this port).

Open http://localhost:5173 in your browser to view EvalBench.

🔴 Stopping the App

To stop the application, simply go to the terminal window where it is currently running and press:

Ctrl + C

This gracefully terminates both the Vite frontend server and the FastAPI backend server. Closing the browser tab as well prevents the frontend from making lingering reconnection attempts.


Validation

Use these commands before shipping changes:

npm run check
pytest -q

Both are kept green as part of the active audit/remediation work.


Roadmap

  • Broader provider and model coverage for evaluated comparison baselines
  • Additional benchmark packs and deeper task-specific metrics
  • Richer shareable reporting and export workflows
  • Performance scaling improvements (parallelism, caching, larger-run ergonomics)
  • Stronger dataset provenance, governance, and collaboration workflows

Contributing

Ideas, issues, and PRs are welcome. Here are the most impactful ways to contribute:

Adding a New Task Type

Task types live in backend/scoring/ — each task has its own scorer module. To add a new task:

  1. Create a new scorer file in backend/scoring/ following the pattern of an existing one (e.g., summarization.py).
  2. Register the task name and its default metrics in backend/schemas.py and keep shared response shapes aligned in shared/routes.ts if needed.
  3. Add a seed dataset entry in backend/services/dataset_seeder.py so the Eval Wizard can surface your task with a working built-in example.
  4. Add at least one pytest test in tests/ covering expected score outputs.
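
A new scorer module could look roughly like this; the file name, function name, and return shape are assumptions that follow the pattern described above, not EvalBench's exact interface:

```python
# backend/scoring/paraphrase.py — hypothetical new task scorer (all names assumed)

def score_pair(prediction: str, reference: str) -> dict[str, float]:
    """Score one (prediction, reference) pair and return a
    metric-name -> value dict for the Eval Runner to aggregate."""
    pred = prediction.strip().lower()
    ref = reference.strip().lower()
    overlap = len(set(pred.split()) & set(ref.split()))
    return {
        "exact_match": float(pred == ref),
        "word_overlap": overlap / max(len(ref.split()), 1),
    }
```

Whatever the real interface looks like, keeping each task's metrics in one dict per example is what makes the pytest test in step 4 straightforward to write.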

Adding a New Metric

Metrics are computed inside the task scorer files in backend/scoring/. To add a metric:

  1. Implement the scoring function and return it as part of the scorer's output dictionary.
  2. Add the metric label to the frontend's display config so it renders in Run Details and Compare views.
  3. Document the metric in the Learn tab guidance (the educational layer) so users understand when to use it.
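
As an illustration of step 1, a hypothetical compression_ratio metric for summarization could be merged into the scorer's output dictionary like this (all names are invented for the example):

```python
def compression_ratio(prediction: str, source: str) -> float:
    """Hypothetical new metric: prediction length relative to the source text."""
    return len(prediction.split()) / max(len(source.split()), 1)

def score_with_new_metric(prediction: str, source: str,
                          existing: dict[str, float]) -> dict[str, float]:
    """Merge the new metric into the scorer's existing output dict."""
    return {**existing, "compression_ratio": compression_ratio(prediction, source)}
```

Once the value appears in the output dict, steps 2 and 3 are about surfacing it: the frontend display config and the Learn tab guidance.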

Adding a New Dataset

When proposing a new built-in dataset, please include:

  • The benchmark source + license
  • Expected metric behavior on known-good outputs
  • A small seed subset (10–20 examples) for quick local tests

General Guidelines

  • Open an issue before large PRs to align on scope.
  • Keep PRs focused — one feature or fix per PR.
  • Run npm run check and pytest -q before submitting and include the output in your PR description.

License

MIT
