feat(eval+frontend): /api/eval routes and React eval UI #3
Open
elkaix wants to merge 12 commits into
Introduces RunRegistry backed by a threading.Lock-guarded dict to track in-flight eval runs across queued/running/completed/failed states. Supports incremental progress updates, TTL-based eviction of finished runs, and never evicts active runs regardless of age.
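The registry behaviour described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this PR: the `RunRecord` fields, the method names (`register`, `update_progress`, `finish`, `evict_expired`) and the default TTL are all hypothetical; only the lock-guarded dict, the four states, and the "never evict active runs" rule come from the description.

```python
import threading
import time
from dataclasses import dataclass, replace
from typing import Dict, Optional

# The four states named in the commit message.
FINISHED_STATES = {"completed", "failed"}


@dataclass
class RunRecord:
    # Hypothetical record shape; the real fields are not shown in the PR text.
    run_id: str
    state: str = "queued"
    done: int = 0                        # questions finished so far
    total: int = 0                       # questions in the run
    finished_at: Optional[float] = None  # set when the run completes or fails


class RunRegistry:
    """Tracks in-flight eval runs in a dict guarded by a threading.Lock."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._lock = threading.Lock()
        self._runs: Dict[str, RunRecord] = {}
        self._ttl = ttl_seconds

    def register(self, run_id: str, total: int) -> None:
        with self._lock:
            self._runs[run_id] = RunRecord(run_id=run_id, total=total)

    def update_progress(self, run_id: str, done: int) -> None:
        with self._lock:
            rec = self._runs[run_id]
            rec.state = "running"
            rec.done = done

    def finish(self, run_id: str, ok: bool, now: Optional[float] = None) -> None:
        with self._lock:
            rec = self._runs[run_id]
            rec.state = "completed" if ok else "failed"
            rec.finished_at = time.time() if now is None else now

    def get(self, run_id: str) -> Optional[RunRecord]:
        # Return a copy so callers never mutate shared state outside the lock.
        with self._lock:
            rec = self._runs.get(run_id)
            return replace(rec) if rec is not None else None

    def evict_expired(self, now: Optional[float] = None) -> int:
        """Drop finished runs older than the TTL; active runs are never touched."""
        now = time.time() if now is None else now
        with self._lock:
            stale = [
                rid for rid, rec in self._runs.items()
                if rec.state in FINISHED_STATES
                and rec.finished_at is not None
                and now - rec.finished_at > self._ttl
            ]
            for rid in stale:
                del self._runs[rid]
            return len(stale)
```

Because eviction filters on state before it looks at age, a queued or running entry survives any number of sweeps, which matches the "regardless of age" guarantee.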
Implements 8 endpoints under /api/eval:
GET /configs — list config names from configs/eval/
POST /run — submit run (202), dispatches via BackgroundTasks
GET /runs — list persisted runs (disk), sorted by started_at desc
GET /runs/{id} — full run detail (metadata + aggregated + cost)
GET /runs/{id}/results — paginated per-question results (?page=1&page_size=50)
GET /runs/{id}/results/{qid} — full EvalResult for one question
GET /runs/{id}/status — poll registry then disk; 404 if neither
GET /compare?a=&b= — metric deltas; 409 on eval-set version mismatch
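Two of the endpoints above carry logic that is easy to get subtly wrong: the page arithmetic behind `?page=1&page_size=50` and the version check behind the 409 on `/compare`. The sketch below illustrates both in plain Python, without FastAPI. The function names, the `EvalSetMismatch` exception, the `eval_set_version` and `metrics` keys, and the `b - a` delta direction are assumptions made for illustration; none of them are taken from the PR's code.

```python
from typing import Any, Dict, Sequence


class EvalSetMismatch(Exception):
    """Hypothetical error that a route handler could translate into HTTP 409."""


def paginate(items: Sequence[Any], page: int = 1, page_size: int = 50) -> Dict[str, Any]:
    """Return one 1-indexed page of items plus the total count."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return {
        "page": page,
        "page_size": page_size,
        "total": len(items),
        # Slicing past the end yields an empty list rather than raising.
        "items": list(items[start:start + page_size]),
    }


def compare_runs(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, float]:
    """Per-metric deltas (b minus a) for metrics present in both runs."""
    if a["eval_set_version"] != b["eval_set_version"]:
        # Deltas across different eval sets are not meaningful, so refuse.
        raise EvalSetMismatch(
            f"{a['eval_set_version']} != {b['eval_set_version']}"
        )
    shared = a["metrics"].keys() & b["metrics"].keys()
    return {name: b["metrics"][name] - a["metrics"][name] for name in sorted(shared)}
```

In a route handler, `EvalSetMismatch` would be caught and mapped to the 409 response, keeping the comparison logic testable without an HTTP client.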
Also:
- EvalRunner: add optional run_id_override param so the API can pre-compute
the run_id and register it in RunRegistry before the run starts
- main.py: register eval_router; create RunRegistry on app.state at lifespan
- _get_registry: lazy-init fallback so TestClient without context manager works
- EVAL_LLM_OVERRIDE_DUMMY=1 respected in background worker (imports _DummyLLM
from cli, same pattern as the CLI)
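The point of run_id_override is ordering: the API needs an ID it can hand back in the 202 response and register as queued before any work begins. A minimal sketch of that flow follows, with everything simplified. The stand-in `EvalRunner`, the `make_run_id` helper, the plain dict used in place of RunRegistry, and the `schedule` callable used in place of BackgroundTasks are all hypothetical; whether the real parameter lives on the constructor or on a run method is not stated in the PR.

```python
import uuid
from typing import Callable, Dict, Optional


def make_run_id() -> str:
    # Hypothetical ID scheme; the real format is not shown in the PR.
    return uuid.uuid4().hex


class EvalRunner:
    """Stand-in runner that only demonstrates the optional override."""

    def run(self, config_name: str, run_id_override: Optional[str] = None) -> str:
        # Use the caller-supplied ID when given, otherwise generate one.
        run_id = run_id_override or make_run_id()
        # ... the real evaluation work would happen here ...
        return run_id


def submit_run(
    config_name: str,
    states: Dict[str, str],
    schedule: Callable[[Callable[[], None]], None],
) -> str:
    """Pre-compute the run ID, mark it queued, then hand the work off."""
    run_id = make_run_id()
    states[run_id] = "queued"  # visible to status polls before work starts

    def worker() -> None:
        states[run_id] = "running"
        try:
            EvalRunner().run(config_name, run_id_override=run_id)
            states[run_id] = "completed"
        except Exception:
            states[run_id] = "failed"

    schedule(worker)
    return run_id
```

Without the override, the runner would mint its own ID inside the background task, and a client polling the status endpoint immediately after the 202 would have nothing to look up.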
Routes file is 270 lines — slightly over the 250-line target, accepted per spec.
Implements the /eval index route table component:
- Sort by any column (config_name, started_at, n_questions, n_errors, headline_metric) with asc/desc toggle; nulls always sort last
- Search filter debounced 300ms via extracted useDebounce hook
- Multi-select via row checkboxes; "Compare Selected" enabled at exactly 2
- Row click navigates to /eval/runs/:runId
- Loading skeleton, error banner, two distinct empty states (no runs vs no filter match)
- NewRunButton placeholder with clear TODO for Task 10 (NewEvalRunDialog)
Also extracts hooks/use-debounce.ts as a shared utility.
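The "nulls always sort last" rule is the one place where a naive asc/desc toggle goes wrong: reversing the whole sorted list would move missing values to the top. The component itself is React/TypeScript; the Python sketch below only illustrates the ordering rule, and `sort_rows` is a hypothetical name that does not appear in the PR.

```python
from typing import Any, Dict, List


def sort_rows(
    rows: List[Dict[str, Any]], column: str, descending: bool = False
) -> List[Dict[str, Any]]:
    """Sort rows by one column; rows with a missing value always come last."""
    present = [r for r in rows if r.get(column) is not None]
    missing = [r for r in rows if r.get(column) is None]
    # Only the rows that have a value are reversed by the direction toggle.
    present.sort(key=lambda r: r[column], reverse=descending)
    return present + missing


rows = [
    {"config_name": "a", "n_errors": 3},
    {"config_name": "b", "n_errors": None},
    {"config_name": "c", "n_errors": 1},
]
ascending = [r["config_name"] for r in sort_rows(rows, "n_errors")]
descending = [r["config_name"] for r in sort_rows(rows, "n_errors", descending=True)]
```

Splitting the rows before sorting keeps the comparison function simple and makes the null placement independent of the direction.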
- Create eval-page.tsx as a nested route dispatcher for /eval/*,
/eval/runs/:runId, and /eval/compare
- Add { path: "eval/*", element: <EvalPage /> } to App.tsx router config
- Add Evaluation nav item (BarChart3 icon) to sidebar NAV_ITEMS array
Summary
Phase 1 / Sub-plan 1C — surface the eval harness through the API and a React UI on top of #2.
Backend
- RunRegistry
- /api/eval/* routes (runs, results, compare, configs) using FastAPI BackgroundTasks
Frontend
- MetricBars chart (recharts) with CI whiskers and significance markers (★)
- RunsList with sort, filter, multi-select for compare
- RunDetail: aggregated metrics chart + per-question table
- CompareView: side-by-side bars + per-question diff
- NewEvalRunDialog: config picker + progress-poll toast
- /eval/* routes + Evaluation sidebar link
Deps added: recharts.
Stack
Targets feature/eval-harness-1b (#2). Will retarget after the lower stack merges.
Notes
Several frontend files exceed the 250-line CLAUDE.md ceiling (largest: run-detail.tsx, 653 lines). The excess is the educational portfolio comment density required by CLAUDE.md; it is logged in task metadata and acknowledged in review.
Test plan
/eval, click "New Run", confirm progress toast