
EvalBench


Local-first LLM evaluation workbench for Ollama users who want trustworthy metrics, fair model comparisons, and faster iteration loops.


Status: v1.0.0. A local-first eval workbench with trusted runs, optional judge scoring, frontier comparisons, Arena battles, and custom dataset tooling. Three releases have been published; the canonical changelog lives in GitHub Releases.


Why EvalBench

EvalBench is for builders who run local models and want evidence, not vibes.

It gives you one practical loop:

  1. Benchmark quality with real metrics across tasks like summarization, code, RAG, knowledge, and embeddings.
  2. Compare local models against frontier models in the same run when you need an external baseline.
  3. Inspect reliability and failure context so run quality and run health are both visible.
  4. Create your own golden datasets so evaluations match your actual use case, not generic demos.

This project started as a teaching tool for students to learn golden datasets, metrics, and LLM-as-Judge without writing heavy pipeline code.

If you want "LM Studio for evaluation" with stronger rigor and dataset control, this is it.

Why It's Different

  • Local-first by design: EvalBench is intentionally single-user and local-first, with SQLite on-device and encrypted key storage.
  • Hybrid evaluation without platform lock-in: keep Ollama as your center, then optionally add OpenAI/Gemini/Claude/Groq models for comparison.
  • Objective + subjective scoring in one flow: combine reference metrics with optional LLM-as-Judge scoring and rationale.
  • Dataset creation is a core feature: build, import, version, and safely manage custom datasets from inside the product.
  • Two modes of truth: metric-based head-to-head comparison plus human preference testing via blind Arena battles and ELO.
  • Educational layer included: built-in metric guidance helps teams learn why each score exists, not just what the number is.

Quickstart (60 Seconds)

  1. Start Ollama locally and ensure at least one model is pulled.
  2. Install dependencies and start EvalBench:
npm install
npm run dev
  3. Open http://localhost:5173.
  4. Go to Eval Wizard, choose a task, select models, and run your first evaluation.

Real Usage Example

Example: compare two local models on Question Answering.

  1. Open Eval Wizard.
  2. Choose Question Answering.
  3. Select two local Ollama models (for example tinyllama:1.1b and gemma3:270m).
  4. Start the run and open Run Details.
  5. Review Exact Match and Token F1 plus run health fields (failed pairs, retries, cache hits).
  6. Export results as JSON, Markdown, or CSV.

Expected outcome:

  • You get model-by-model score differences on the same dataset/context.
  • You can inspect best/worst examples before deciding which model to ship.
  • You can share reproducible output with your team from the export files.
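
Exact Match and Token F1 follow the usual SQuAD-style definitions; a minimal sketch of what these two scores measure (EvalBench's own normalization rules may differ):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact Match rewards only perfect answers, while Token F1 gives partial credit for overlapping words, which is why the two often diverge on long-form answers.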

Demo

Product Overview (GIF)


Eval Wizard Walkthrough (GIF)


Screenshots

Eval Wizard — Run a benchmark in seconds


Arena — Blind pairwise voting with ELO


Head-to-Head Compare


Architecture

EvalBench uses a local-first architecture optimized for privacy and speed. It separates a lightweight, reactive frontend from a computation-heavy Python backend.

┌─────────────────────────────────────────────────────────────┐
│                    Frontend (React + Vite)                  │
│  ┌──────────────┬──────────────┬──────────────┬──────────┐  │
│  │  Dashboard   │ Eval Wizard  │  Compare     │  Arena   │  │
│  └──────────────┴──────────────┴──────────────┴──────────┘  │
└─────────────────────────────────────────────────────────────┘
                             ↕ REST API + Server-Sent Events (SSE)
┌─────────────────────────────────────────────────────────────┐
│            Backend (Python FastAPI)                         │
│  ┌──────────────┬──────────────┬──────────────┐             │
│  │  Scoring     │ Eval Runner  │  Ollama      │             │
│  │  Algorithms  │ (vLLM later) │  Integration │             │
│  └──────────────┴──────────────┴──────────────┘             │
└─────────────────────────────────────────────────────────────┘
                             ↕
               SQLite Database (evalbench.db)
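
The frontend and backend communicate over REST plus Server-Sent Events for live run progress. A minimal sketch of how a FastAPI backend could frame SSE progress events (the event name and payload shape are assumptions, not EvalBench's actual wire format):

```python
import json
from typing import Iterator

def format_sse(data: dict, event: str = "progress") -> str:
    """Serialize one Server-Sent Event frame: event name plus JSON payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def run_progress_events(total_pairs: int) -> Iterator[str]:
    """Yield one SSE frame per completed evaluation pair (sketch)."""
    for done in range(1, total_pairs + 1):
        yield format_sse({"done": done, "total": total_pairs})

# In FastAPI, a generator like this would be wrapped in
# StreamingResponse(run_progress_events(n), media_type="text/event-stream")
# so the frontend can render live progress without polling.
```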

Core Concepts

1. Traditional Reference Metrics

We use established Python libraries (rouge-score, sacrebleu, nltk) to compute metrics like ROUGE, BLEU, Exact Match, Token F1, and Distinct-1/2 locally against Ground-Truth Golden Datasets. Datasets are seeded from inline subsets at startup (no external downloads).
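
ROUGE and BLEU come from the libraries above; a metric like Distinct-1/2 is simple enough to sketch with the standard library (EvalBench's own tokenization may differ):

```python
def distinct_n(text: str, n: int) -> float:
    """Ratio of unique n-grams to total n-grams; higher means more lexical diversity."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

Distinct-1 uses n=1 (unigrams) and Distinct-2 uses n=2 (bigrams); repetitive outputs drive both scores toward zero.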

2. LLM-as-Judge (Optional)

For subjective generation tasks, EvalBench can optionally use a configured judge model to score outputs on criteria such as coherence, fluency, and relevance, returning both a score and rationale. Judge providers are loaded lazily so optional SDKs do not block the core app, and judge scoring can be turned off entirely from Settings when you want objective-only runs.
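
A judge call typically boils down to a prompt template plus strict parsing of the model's reply. A hedged sketch of that pattern (the prompt wording and JSON contract are illustrative, not EvalBench's actual templates):

```python
import json

# Hypothetical judge prompt; EvalBench's real templates may differ.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE on {criterion}
from 1 (poor) to 5 (excellent). Reply with JSON: {{"score": <int>, "rationale": "<why>"}}

PROMPT: {prompt}
RESPONSE: {response}"""

def build_judge_prompt(criterion: str, prompt: str, response: str) -> str:
    """Fill the judge template for one (prompt, response) pair."""
    return JUDGE_PROMPT.format(criterion=criterion, prompt=prompt, response=response)

def parse_judge_reply(raw: str) -> tuple[int, str]:
    """Extract (score, rationale) from the judge model's JSON reply."""
    data = json.loads(raw)
    return int(data["score"]), str(data["rationale"])
```

The parsed score and rationale can then be stored alongside the objective metrics for the same example.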

3. Statistical Rigor And Reliability

EvalBench computes mean scores and margin of error where supported, and now separates quality from reliability by tracking failed pairs, retries, cache hits, cancellation state, and success rate for each run.
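
Margin of error over per-example scores can be sketched as follows; this uses a normal approximation with z = 1.96, while EvalBench may instead use a t-distribution via scipy:

```python
import math
from statistics import stdev

def margin_of_error(scores: list[float], z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for the mean score."""
    if len(scores) < 2:
        return 0.0  # not enough samples to estimate variance
    return z * stdev(scores) / math.sqrt(len(scores))
```

Reporting mean ± margin of error makes it visible when two models' score difference is smaller than the run's own noise.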


Features

  • Local model benchmarking first: Auto-discovers Ollama models and keeps the primary catalog local-first.
  • Mixed local + frontier evaluation: Add selected OpenAI/Gemini/Claude/Groq models as comparison baselines in the same run.
  • Optional LLM-as-Judge: Turn judge scoring on or off from Settings; when enabled, judge rationale is available in run analysis.
  • Task-aware eval wizard: Task selection drives metrics and benchmark dataset defaults with estimated runtime context.
  • Custom dataset builder and registry: Manually create or import CSV/JSON datasets, version by name, and safely delete unused user-authored sets.
  • Trusted run lifecycle: Live progress, cancellation, retries, partial-failure visibility, cache-hit tracking, and clear run health indicators.
  • Head-to-head compare: Fair model comparison based on shared completed run contexts with clearer significance framing.
  • Arena battles with ELO: Run blind pairwise matchups (random or manual model-vs-model) and update leaderboard ratings.
  • Export anywhere: Download results as JSON, Markdown, or CSV from Run Details.
  • Built to teach: Learn tab and metric guidance explain why each metric fits each task.
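
The Arena leaderboard follows the standard Elo update rule; a minimal sketch (the K-factor of 32 is an assumption, and EvalBench's actual constants may differ):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 if A loses.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

An upset win against a higher-rated model moves both ratings more than an expected win, which is why blind pairwise votes converge to a meaningful ranking over time.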

Try in 5 Minutes

  1. Pull or discover 2 local Ollama models in Models.
  2. Open Eval Wizard and start a benchmark run on a built-in dataset.
  3. (Optional) Enable one frontier comparison model in Settings and run mixed local + cloud.
  4. Open Run Details to inspect quality metrics, run health, and example outputs.
  5. Export the run as JSON, Markdown, or CSV.
  6. Launch Arena to run a blind battle and vote your preferred output.

Public Status

  • Stable version: v1.0.0
  • Release history: 3 tagged releases
  • Canonical changelog: GitHub Releases
  • Validation baseline: npm run check and pytest -q

Releases

GitHub Releases are the canonical changelog for EvalBench and include shipped features, user impact, validation evidence, and upgrade notes.

Release Quality Checklist (recommended)

  • Scope summary (what shipped) and user impact (why it matters)
  • Explicit upgrade notes: breaking changes, migration steps, config/env changes
  • Validation evidence: npm run check, pytest -q, plus commit/tag reference

Technical Stack

Layer              Technology
Frontend UI        React 18, Vite, Tailwind CSS, Shadcn UI
Routing & State    Wouter, TanStack React Query
Charts             Recharts
Backend Framework  Python 3, FastAPI
Database           SQLite (via SQLAlchemy ORM)
Validation         Pydantic v2 (backend) + Zod (frontend)
Scoring Libs       rouge-score, sacrebleu, nltk, scipy

Setup & Installation

Prerequisites

  1. Node.js: v18 or higher (for the frontend React app)
  2. Python: v3.11 or higher (Windows users: Python 3.12 is recommended for best dependency compatibility)
  3. Ollama: Installed locally and running on http://localhost:11434 (with at least one model pulled)
  4. uv: Required to install and run the backend; if it is missing, you can install it with pip

Installation Steps

  1. Clone and Install Frontend
git clone https://github.com/tatwan/EvalBench.git
cd EvalBench
npm install
  2. Install Backend Dependencies. EvalBench uses pyproject.toml dependency management (lockfile: uv.lock) as the source of truth.

If you want to install Python deps explicitly up front, run:

uv sync

If you do not have uv, install it first with pip:

python -m pip install uv

Then rerun:

uv sync

Or use the npm helper:

npm run py:install

Platform Notes

macOS is the primary development environment. The notes below are for Windows users.

Python Version On Windows

Project metadata currently allows Python 3.11+ (see pyproject.toml), but plain uv sync on Windows may select a very new interpreter (for example 3.14) and fail due to binary dependency and toolchain gaps.

Use Python 3.12 explicitly for reliable setup:

uv sync --python 3.12

PowerShell Execution Policy

If npm commands fail with a "running scripts is disabled" error, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

npm run dev On Windows

If combined startup fails in your shell, run frontend and backend separately:

# Terminal 1
npm run dev:frontend

# Terminal 2
npm run dev:backend

If combined startup works in your environment, you can keep using npm run dev.

Security Note — Encryption Key Backup

On first run, EvalBench auto-generates an encryption key at ~/.evalbench_key (chmod 600). All API keys entered in Settings are encrypted with this key before being stored in evalbench.db.

Back up ~/.evalbench_key. If you lose this file, stored API keys become permanently unreadable and must be re-entered.

This app is designed for single-user local use. Do not expose the backend port over a network.
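
The cipher and key format EvalBench uses are not shown here, but the generate-and-protect step for the key file can be sketched with the standard library (the function name is hypothetical):

```python
import os
import secrets
from pathlib import Path

def ensure_key(path: Path) -> bytes:
    """Load the key file if present, else create a random 32-byte key
    with owner-only permissions (the equivalent of chmod 600)."""
    if path.exists():
        return path.read_bytes()
    key = secrets.token_bytes(32)
    path.write_bytes(key)
    os.chmod(path, 0o600)  # owner read/write only
    return key
```

Because the key lives outside the database, copying evalbench.db alone is not enough for a full backup; the key file must travel with it.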


How to Run and Stop

🟢 Starting the App

You can start the entire stack (both the Vite Frontend and the FastAPI Backend) with a single command from the root EvalBench directory:

npm run dev

This command uses concurrently to spin up two processes:

  • Frontend runs on http://localhost:5173
  • Backend runs on http://localhost:8001 (Note: The frontend automatically proxies /api requests to this port).

Open http://localhost:5173 in your browser to view EvalBench.

🔴 Stopping the App

To stop the application, simply go to the terminal window where it is currently running and press:

Ctrl + C

This gracefully terminates both the Vite frontend server and the FastAPI backend server. Closing the browser tab as well prevents the frontend from making lingering reconnection attempts.


Validation

Use these commands before shipping changes:

npm run check
pytest -q

Both are kept green as part of the active audit/remediation work.


Roadmap

  • Broader provider and model coverage for evaluated comparison baselines
  • Additional benchmark packs and deeper task-specific metrics
  • Richer shareable reporting and export workflows
  • Performance scaling improvements (parallelism, caching, larger-run ergonomics)
  • Stronger dataset provenance, governance, and collaboration workflows

Contributing

Ideas, issues, and PRs are welcome. Here are the most impactful ways to contribute:

Adding a New Task Type

Task types live in backend/scoring/ — each task has its own scorer module. To add a new task:

  1. Create a new scorer file in backend/scoring/ following the pattern of an existing one (e.g., summarization.py).
  2. Register the task name and its default metrics in backend/schemas.py and keep shared response shapes aligned in shared/routes.ts if needed.
  3. Add a seed dataset entry in backend/services/dataset_seeder.py so the Eval Wizard can surface your task with a working built-in example.
  4. Add at least one pytest test in tests/ covering expected score outputs.
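
A new scorer module could look roughly like this; the file name, function name, and return shape are assumptions that follow the pattern described above, not EvalBench's exact interface:

```python
# backend/scoring/paraphrase.py — hypothetical new task scorer (all names assumed)

def score_pair(prediction: str, reference: str) -> dict[str, float]:
    """Score one (prediction, reference) pair and return a
    metric-name -> value dict for the Eval Runner to aggregate."""
    pred = prediction.strip().lower()
    ref = reference.strip().lower()
    overlap = len(set(pred.split()) & set(ref.split()))
    return {
        "exact_match": float(pred == ref),
        "word_overlap": overlap / max(len(ref.split()), 1),
    }
```

Whatever the real interface looks like, keeping each task's metrics in one dict per example is what makes the pytest test in step 4 straightforward to write.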

Adding a New Metric

Metrics are computed inside the task scorer files in backend/scoring/. To add a metric:

  1. Implement the scoring function and return it as part of the scorer's output dictionary.
  2. Add the metric label to the frontend's display config so it renders in Run Details and Compare views.
  3. Document the metric in the Learn tab guidance (the educational layer) so users understand when to use it.
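
As an illustration of step 1, a hypothetical compression_ratio metric for summarization could be merged into the scorer's output dictionary like this (all names are invented for the example):

```python
def compression_ratio(prediction: str, source: str) -> float:
    """Hypothetical new metric: prediction length relative to the source text."""
    return len(prediction.split()) / max(len(source.split()), 1)

def score_with_new_metric(prediction: str, source: str,
                          existing: dict[str, float]) -> dict[str, float]:
    """Merge the new metric into the scorer's existing output dict."""
    return {**existing, "compression_ratio": compression_ratio(prediction, source)}
```

Once the value appears in the output dict, steps 2 and 3 are about surfacing it: the frontend display config and the Learn tab guidance.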

Adding a New Dataset

When proposing a new built-in dataset, please include:

  • The benchmark source + license
  • Expected metric behavior on known-good outputs
  • A small seed subset (10–20 examples) for quick local tests

General Guidelines

  • Open an issue before large PRs to align on scope.
  • Keep PRs focused — one feature or fix per PR.
  • Run npm run check and pytest -q before submitting and include the output in your PR description.

License

MIT
