feat(eval+frontend): /api/eval routes and React eval UI#3

Open
elkaix wants to merge 12 commits into feature/eval-harness-1b from feature/eval-harness-1c

Conversation

elkaix (Owner) commented Apr 27, 2026

Summary

Phase 1 / Sub-plan 1C — surface the eval harness through the API and a React UI on top of #2.

Backend

  • Pydantic DTOs for eval API
  • Thread-safe in-process RunRegistry
  • /api/eval/* routes (runs, results, compare, configs) using FastAPI BackgroundTasks

Frontend

  • TanStack Query hooks for the eval API
  • MetricBars chart (recharts) with CI whiskers and significance markers (★)
  • RunsList with sort, filter, multi-select for compare
  • RunDetail — aggregated metrics chart + per-question table
  • CompareView — side-by-side bars + per-question diff
  • NewEvalRunDialog — config picker + progress-poll toast
  • Wired /eval/* routes + Evaluation sidebar link

Deps added: recharts.

Stack

Targets feature/eval-harness-1b (#2). Will retarget after the lower stack merges.

Notes

Several frontend files exceed the 250-line CLAUDE.md ceiling (largest: run-detail.tsx at 653 lines). The excess comes from the dense educational comments that CLAUDE.md requires for portfolio code; it is logged in task metadata and acknowledged in review.

Test plan

  • Backend tests green for new routes and registry
  • Reviewer: start backend + frontend, navigate to /eval, click "New Run", confirm progress toast
  • Reviewer: select two runs, click "Compare Selected", confirm side-by-side render

elkaix added 11 commits April 26, 2026 22:20
Introduces RunRegistry backed by a threading.Lock-guarded dict to track
in-flight eval runs across queued/running/completed/failed states.
Supports incremental progress updates, TTL-based eviction of finished
runs, and never evicts active runs regardless of age.
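
A minimal sketch of the registry behaviour described above, not the code from this PR. The names `RunState`, `update_progress`, `finish`, `evict_expired`, and `ttl_seconds` are assumptions made for illustration; only the lock-guarded dict, the four states, and the eviction rule come from the commit message.

```python
import threading
import time
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Runs in these states are never evicted, regardless of age.
ACTIVE_STATES = {"queued", "running"}


@dataclass
class RunState:  # hypothetical record shape
    run_id: str
    status: str = "queued"
    completed: int = 0
    total: int = 0
    finished_at: Optional[float] = None  # set once the run completes or fails


class RunRegistry:
    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._lock = threading.Lock()
        self._runs: Dict[str, RunState] = {}
        self._ttl = ttl_seconds

    def register(self, run_id: str, total: int) -> None:
        with self._lock:
            self._runs[run_id] = RunState(run_id=run_id, total=total)

    def update_progress(self, run_id: str, completed: int) -> None:
        with self._lock:
            run = self._runs[run_id]
            run.status = "running"
            run.completed = completed

    def finish(self, run_id: str, failed: bool = False,
               now: Optional[float] = None) -> None:
        with self._lock:
            run = self._runs[run_id]
            run.status = "failed" if failed else "completed"
            run.finished_at = time.monotonic() if now is None else now

    def get(self, run_id: str) -> Optional[RunState]:
        # Return a copy so callers never mutate shared state outside the lock.
        with self._lock:
            run = self._runs.get(run_id)
            return replace(run) if run is not None else None

    def evict_expired(self, now: Optional[float] = None) -> int:
        """Drop finished runs older than the TTL; return how many were dropped."""
        now = time.monotonic() if now is None else now
        with self._lock:
            expired = [
                rid for rid, run in self._runs.items()
                if run.status not in ACTIVE_STATES
                and run.finished_at is not None
                and now - run.finished_at > self._ttl
            ]
            for rid in expired:
                del self._runs[rid]
            return len(expired)
```

Taking `now` as an optional argument is a choice made here so that eviction can be tested without sleeping; the real registry may read the clock directly.
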
Implements 8 endpoints under /api/eval:
  GET  /configs         — list config names from configs/eval/
  POST /run             — submit run (202), dispatches via BackgroundTasks
  GET  /runs            — list persisted runs (disk), sorted by started_at desc
  GET  /runs/{id}       — full run detail (metadata + aggregated + cost)
  GET  /runs/{id}/results — paginated per-question results (?page=1&page_size=50)
  GET  /runs/{id}/results/{qid} — full EvalResult for one question
  GET  /runs/{id}/status — poll registry then disk; 404 if neither
  GET  /compare?a=&b=  — metric deltas; 409 on eval-set version mismatch
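
The pagination and compare rules above can be sketched as two plain functions, independent of FastAPI. This is an illustration under assumptions: the dict keys `eval_set_version` and `metrics`, the `EvalSetMismatch` exception, and the `b - a` delta direction are hypothetical and not taken from the PR. A route handler would presumably translate the exception into the 409 response.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def paginate(items: Sequence[T], page: int = 1, page_size: int = 50) -> List[T]:
    """Return the 1-indexed page of items, mirroring ?page=1&page_size=50."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    # Slicing past the end yields a short or empty page rather than an error.
    return list(items[start:start + page_size])


class EvalSetMismatch(Exception):
    """Hypothetical error a handler could map to HTTP 409."""


def metric_deltas(a: Dict, b: Dict) -> Dict[str, float]:
    """Compute b - a for every metric both runs share."""
    if a["eval_set_version"] != b["eval_set_version"]:
        raise EvalSetMismatch(
            f"{a['eval_set_version']} != {b['eval_set_version']}"
        )
    shared = a["metrics"].keys() & b["metrics"].keys()
    return {name: b["metrics"][name] - a["metrics"][name]
            for name in sorted(shared)}
```

Refusing the comparison on a version mismatch keeps deltas from being computed across different question sets, where they would not be meaningful.
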

Also:
  - EvalRunner: add optional run_id_override param so the API can pre-compute
    the run_id and register it in RunRegistry before the run starts
  - main.py: register eval_router; create RunRegistry on app.state at lifespan
  - _get_registry: lazy-init fallback so TestClient without context manager works
  - EVAL_LLM_OVERRIDE_DUMMY=1 respected in background worker (imports _DummyLLM
    from cli, same pattern as the CLI)
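
To illustrate why `run_id_override` exists, here is a hedged sketch of the submit flow: the API chooses the id, records it as queued, and only then schedules the work, so a status poll that arrives immediately after the 202 can already find the run. The `EvalRunner` below is a stand-in, not the real class; `submit_run`, the dict-based registry, and the `schedule` callable (playing the role of `BackgroundTasks.add_task`) are all assumptions.

```python
import uuid
from typing import Callable, Dict, Optional


class EvalRunner:  # hypothetical stand-in for the real runner
    def __init__(self, config_name: str, registry: Dict[str, str]) -> None:
        self.config_name = config_name
        self.registry = registry

    def run(self, run_id_override: Optional[str] = None) -> str:
        # Without an override the runner generates its own id, as the CLI would.
        run_id = run_id_override or uuid.uuid4().hex
        self.registry[run_id] = "running"
        # ... evaluation work would happen here ...
        self.registry[run_id] = "completed"
        return run_id


def submit_run(
    config_name: str,
    registry: Dict[str, str],
    schedule: Callable[[Callable[[], object]], None],
) -> str:
    """Register the run first, then hand the work to the scheduler."""
    run_id = uuid.uuid4().hex
    registry[run_id] = "queued"  # visible to status polls before work starts
    runner = EvalRunner(config_name, registry)
    schedule(lambda: runner.run(run_id_override=run_id))
    return run_id
```

If the runner generated the id itself, the API could not return it in the 202 response until the background task had started.
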

Routes file is 270 lines, slightly over the 250-line target; accepted per spec.

Implements the /eval index route table component:
- Sort by any column (config_name, started_at, n_questions, n_errors,
  headline_metric) with asc/desc toggle; nulls always sort last
- Search filter debounced 300ms via extracted useDebounce hook
- Multi-select via row checkboxes; "Compare Selected" enabled at exactly 2
- Row click navigates to /eval/runs/:runId
- Loading skeleton, error banner, two distinct empty states (no runs vs
  no filter match)
- NewRunButton placeholder with clear TODO for Task 10 (NewEvalRunDialog)

Also extracts hooks/use-debounce.ts as a shared utility.

- Create eval-page.tsx as a nested route dispatcher for /eval/*,
  /eval/runs/:runId, and /eval/compare
- Add { path: "eval/*", element: <EvalPage /> } to App.tsx router config
- Add Evaluation nav item (BarChart3 icon) to sidebar NAV_ITEMS array