feat(cli): add `stats` subcommand — per-agent React Doctor leaderboard by aidenybai · Pull Request #932 · millionco/react-doctor

aidenybai · 2026-06-22T01:06:38Z

Summary

Adds react-doctor stats: a per-model/per-tool code-quality leaderboard built from your local AI agent history. It answers one question — which agent writes the cleanest React code in my repo?

It reads local agent history (Claude Code + Codex transcripts, and every place Cursor stores its chats), reconstructs the file content each model actually wrote, lints it with the existing engine, and ranks models and providers by a confidence-weighted React Doctor score.

Faithful reconstruction per provider: Claude post-edit snapshots, Cursor full-content blobs / CLI tool calls (real model attribution, not "Auto"), Codex apply_patch replay. Only actual React files (JSX/TSX, use client/use server, or a React-ecosystem import) are scored, so backend/util/config files don't dilute the result.
Every Cursor store is read: the GUI composer database (state.vscdb) for both the stable and Nightly builds, plus the cursor-agent CLI per-session stores under ~/.cursor / ~/.cursor-nightly (binary manifest + Write/ApplyPatch/StrReplace/Delete tool calls, with Read results captured as reconstruction bases). A database a running editor holds locked is read via SQLite's immutable mode rather than skipped.
Confidence-weighted ranking: each group's raw score regresses toward the global mean by its evidence (files are the dominant signal, lightly discounted by sessions since many files from one chat are one correlated sample), bounded by a floor. A tiny clean sample can't top the board; the raw and weighted scores both ship in --json.
Plain-language terminal UI: ranked model table + by-tool table with color-coded tools (cursor gray, claude orange, codex cyan), a best/worst callout, and honest "skipped" notes. Adds an orange formatter to the shared highlighter (honors --no-color).
Flags: --global (all repos), --since, --limit, --provider, --json. Default scope is the current repo.
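The React-only filter described in the list above (JSX/TSX, a use client/use server directive, or a React-ecosystem import) could be approximated as follows. This is a minimal sketch under stated assumptions: the package allowlist, the regexes, and the function names are illustrative and are not taken from the PR's is-react-source module.

```typescript
// Illustrative only: this package list is an assumption, not the PR's real allowlist.
const REACT_ECOSYSTEM_PACKAGES = ["react", "react-dom", "next", "react-router"];

// Matches a "use client" / "use server" directive that sits on its own line.
const DIRECTIVE_PATTERN = /^\s*["']use (?:client|server)["'];?\s*$/m;

function isReactEcosystemImport(specifier: string): boolean {
  // Accept the bare package ("react") and its subpaths ("react/jsx-runtime").
  return REACT_ECOSYSTEM_PACKAGES.some(
    (packageName) => specifier === packageName || specifier.indexOf(packageName + "/") === 0,
  );
}

function isReactSource(filePath: string, content: string): boolean {
  if (/\.(?:jsx|tsx)$/i.test(filePath)) return true;
  if (DIRECTIVE_PATTERN.test(content)) return true;

  // Created per call so the global regex's lastIndex never leaks between files.
  // This is a text heuristic: it can also match inside comments or strings.
  const importPattern = /(?:\bfrom\s*|\bimport\s*\(?\s*|\brequire\(\s*)["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(content)) !== null) {
    if (isReactEcosystemImport(match[1])) return true;
  }
  return false;
}
```

A regex heuristic keeps the example short; a real implementation might parse imports properly to avoid false positives.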

Coverage is honest about limits: Codex shell edits aren't reconstructable (surfaced as skipped), reading any Cursor database needs node:sqlite (Node 22.13+), and the score requires network access.

Test plan

pnpm --filter react-doctor test (32 new stats tests: adapters, reconstruct, apply-patch, aggregate/weighting, is-react-source, render)
pnpm typecheck / pnpm lint / pnpm format:check
react-doctor stats in a repo with local Cursor/Claude/Codex history renders a sane leaderboard
react-doctor stats --json emits { schemaVersion, models, providers, best, worst, … } with both score and weightedScore
react-doctor stats --global ranks across repos; --provider cursor narrows the source

Update — broader Cursor/Codex coverage + engine deslop

Follow-up commit on this branch:

Coverage: scan the Cursor Nightly GUI build and the cursor-agent CLI stores (~/.cursor, ~/.cursor-nightly), not just the stable GUI state.vscdb. A live, editor-locked database is now read via SQLite immutable mode instead of crashing the run, and every SQLite close is guarded so an unreadable store degrades to "skip".
Deslop / DRY (behavior-preserving): consolidated the zip-slip path-inside guard into one audited @react-doctor/core util, shared the node:sqlite read-only open and the empty-string-preserving string narrow, dropped dead code, and replaced forbidden nested ternaries. Verified by typecheck/lint/format, the full react-doctor suite, and a real-data smoke (451 GUI + 12 CLI sessions across claude-opus-4-8/composer-2.5/gpt-5.5).

Note

Medium Risk
Large new CLI surface that reads homedir agent/SQLite data and optionally sends anonymized scores to react.doctor and Sentry; mitigations include path guards, React-only linting, and telemetry gating, but coverage of transcript formats remains inherently fragile.

Overview
Adds react-doctor stats, which mines local Claude Code, Codex, and Cursor history (GUI state.vscdb for stable/Nightly plus CLI store.db under ~/.cursor / ~/.cursor-nightly), replays edits into file snapshots, lints only React sources, and ranks models/tools by a confidence-weighted score (raw score regressed toward the global mean so tiny samples can’t win). Repo scope is default; --global, --since, --limit, --provider, and --json control scope and output.
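To illustrate how regressing a raw score toward the global mean keeps tiny samples off the top, here is a minimal shrinkage sketch. The formula and all three constants (`PRIOR_FILES`, `SESSION_DISCOUNT`, `CONFIDENCE_FLOOR`) are assumptions made for the example; the PR text states the idea but not the exact weighting.

```typescript
interface GroupEvidence {
  rawScore: number; // mean React Doctor score of the group's scanned files
  files: number; // scanned React files
  contributingSessions: number; // sessions that contributed at least one scanned file
}

// All three constants are hypothetical tuning values chosen for this example.
const PRIOR_FILES = 20; // how many files of evidence the global mean is "worth"
const SESSION_DISCOUNT = 0.25; // light discount when many files come from few sessions
const CONFIDENCE_FLOOR = 0.1; // lower bound on how much the raw score counts

function confidence(group: GroupEvidence): number {
  if (group.files <= 0) return CONFIDENCE_FLOOR;
  // 1 when every file came from its own session, near 0 when one chat wrote them all.
  const sessionSpread = Math.min(1, group.contributingSessions / group.files);
  const effectiveFiles = group.files * (1 - SESSION_DISCOUNT * (1 - sessionSpread));
  return Math.max(CONFIDENCE_FLOOR, effectiveFiles / (effectiveFiles + PRIOR_FILES));
}

function weightedScore(group: GroupEvidence, globalMean: number): number {
  // Shrink the raw score toward the global mean in proportion to missing evidence.
  return globalMean + confidence(group) * (group.rawScore - globalMean);
}

const exampleGlobalMean = 80;
const tinyCleanSample: GroupEvidence = { rawScore: 100, files: 2, contributingSessions: 1 };
const largeSample: GroupEvidence = { rawScore: 90, files: 200, contributingSessions: 40 };
```

With these assumed constants, `largeSample` outranks `tinyCleanSample` even though its raw score is lower, which is the behavior the description asks for.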

The pipeline drops failed/skipped scans and non-faithful reconstructions (e.g. Codex shell edits), uses shared isPathInside zip-slip checks for temp trees, and optionally POSTs code-free leaderboard rows to /api/stats and Sentry cli.stats spans when telemetry/score is on (--no-score / --no-telemetry keeps runs local). Terminal UI adds provider-colored tables (including highlighter.orange for Claude).

Also wires STATS_API_URL in core, registers stats in CLI help/flag stripping/run context, and includes deslop-js in the pkg-pr-new publish workflow.

Reviewed by Cursor Bugbot for commit 83a9210. Bugbot is set up for automated code reviews on this repo.

…d from agent history

Adds `react-doctor stats`, which reads local AI agent history (Claude Code + Codex transcripts, the Cursor composer database), reconstructs the React code each model actually wrote, lints it with the existing engine, and ranks models and providers by a confidence-weighted React Doctor score.

- Reconstructs faithful post-edit file content per provider (Claude snapshots, Cursor `afterContentId` blobs, Codex `apply_patch`), filtered to real React.
- Confidence-weighted ranking: each group's raw score regresses toward the global mean by its evidence (files dominant, lightly discounted by sessions), so a tiny clean sample can't top the board.
- Plain-language terminal leaderboard with color-coded tools (adds an `orange` to the shared highlighter for Claude); `--json` for the machine-readable report.

pkg-pr-new · 2026-06-22T01:07:17Z

npm i https://pkg.pr.new/deslop-js@932

npm i https://pkg.pr.new/eslint-plugin-react-doctor@932

npm i https://pkg.pr.new/oxlint-plugin-react-doctor@932

npm i https://pkg.pr.new/react-doctor@932

commit: 83a9210

Discovery loaded each candidate session from the Cursor SQLite DB synchronously, blocking the event loop so the ora spinner appeared frozen for a few seconds. Yield to the event loop periodically and report live "(N found)" progress during the history walk.

Cap the terminal table to the top 5 with a "+ N more" pointer to --json; the full ranking still ships in the JSON report and the best/worst callout.
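The top-5 cap with a "+ N more" pointer can be sketched as a small pure helper. The helper name, the generic row type, and the exact footer wording are assumptions for illustration; only the cap of 5 and the pointer to --json come from the commit message.

```typescript
const MAX_TABLE_ROWS = 5; // the cap named in the commit message

interface CappedTable<Row> {
  visible: Row[];
  footer: string | null; // null when nothing was hidden
}

// Hypothetical helper: keeps the first maxRows rows and describes the remainder.
function capTableRows<Row>(rows: Row[], maxRows: number = MAX_TABLE_ROWS): CappedTable<Row> {
  const visible = rows.slice(0, maxRows);
  const hiddenCount = rows.length - visible.length;
  // Footer wording is illustrative, not the CLI's actual string.
  const footer = hiddenCount > 0 ? "+ " + hiddenCount + " more (see --json)" : null;
  return { visible, footer };
}
```

Keeping the cap in the renderer rather than in the aggregate step means the JSON report can still carry the full ranking.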

Consolidate the asString/asRecord/asArray/parseJson coercers (copied across the Claude/Codex/Cursor adapters) into a shared coerce.ts, extract the "most common model" tally into most-common-key.ts, and reuse statMtimeMs in findJsonlFiles. Behavior unchanged.

The static `node:sqlite` import crashed the whole adapter test file on Node 20 (where the module doesn't exist), failing the 20.19 CI matrix job. Load it via a guarded require and skip the Cursor suite when unavailable, mirroring cursor-db.ts's runtime degradation.

- closeCursorDb now closes the underlying node:sqlite database instead of only dropping the cached reference, so the fixture file is unlocked and Windows can unlink the temp dir (was EBUSY in the adapter test teardown).
- The reconstruct test compared emitted absolute paths against hardcoded POSIX strings; on Windows resolveAgainstCwd normalizes to backslashes, so expectations now mirror that normalization.

Production code unchanged.

- A failed apply_patch update hunk left the prior in-session buffer in place and still emitted the file as faithfully reconstructed; drop it to unreconstructable so stale content is never linted as the model's output.
- Sessions touching only non-lintable files (e.g. markdown) had zero reconstructed files and zero failures but were counted as "unreconstructable"; require an actual reconstruction failure for that bucket so the skip note stays accurate.

Replace the readFileSync + split("\n") transcript reader with a streaming node:readline parser so memory stays flat on large Claude/Codex transcripts. Makes session loading async (SessionCandidate.load + the parse adapters); the Cursor composer load wraps its sync DB walk to match.

…--since)

- Drop scans that error/skip/lint-fail instead of counting them as clean code, which was inflating the leaderboard.
- Emit structured JSON on failure in --json mode (reuse enableJsonMode), which also silences the incidental score-API stderr warning.
- Exclude unknown-timestamp candidates under --since so the filter is consistent.
- Consolidate the path-inside predicate, move render magic numbers to constants, type-guard the provider flag, throw on invalid --limit, rename op -> operation.

… content

A replace/Edit whose oldString isn't in the in-session buffer now marks the file unreconstructable (like a failed apply_patch hunk) rather than keeping the stale snapshot and scoring it as the model's final output.
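The rule that a replace with a missing oldString makes the file unreconstructable can be shown with a small replay function. This is a sketch, not the PR's reconstruct module: the type names, the `outcome` discriminant, and treating an empty oldString as a failure are all assumptions.

```typescript
interface ReplaceEdit {
  oldString: string;
  newString: string;
}

type Reconstruction =
  | { outcome: "reconstructed"; content: string }
  | { outcome: "unreconstructable"; failedEditIndex: number };

// Hypothetical replay: applies edits in order and refuses to continue on a miss.
function replayReplaceEdits(base: string, edits: ReplaceEdit[]): Reconstruction {
  let buffer = base;
  for (let index = 0; index < edits.length; index += 1) {
    const edit = edits[index];
    // Assumption: an empty oldString cannot be located, so it counts as a miss.
    const at = edit.oldString === "" ? -1 : buffer.indexOf(edit.oldString);
    if (at === -1) {
      // The buffer no longer matches what the model saw, so any result would be stale.
      return { outcome: "unreconstructable", failedEditIndex: index };
    }
    buffer = buffer.slice(0, at) + edit.newString + buffer.slice(at + edit.oldString.length);
  }
  return { outcome: "reconstructed", content: buffer };
}
```

Returning a distinct outcome instead of the last good buffer is what prevents stale content from being linted as the model's output.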

Confidence weighting now counts only sessions that contributed scanned files, so non-React/failed/skipped sessions no longer raise session reliability or effective file weight. The reported per-group session count still reflects every analyzed session.

react-doctor has a runtime `deslop-js: workspace:*` dependency, but the Continuous Releases workflow didn't publish deslop-js, so pkg.pr.new couldn't rewrite the ref and `npx https://pkg.pr.new/react-doctor@<pr>` failed with EUNSUPPORTEDPROTOCOL ("workspace:"). Add deslop-js to the publish set.

…slop the engine

Broaden which local agent history `stats` reads, and refine the engine.

Coverage:
- Cursor GUI: scan both the stable and Nightly builds' composer databases (was stable-only, so a Nightly-only user got zero GUI sessions), and read a live, editor-locked database via SQLite's immutable mode instead of letting the lock crash the run.
- Cursor CLI agent: new source for the per-session content-addressed stores under ~/.cursor and ~/.cursor-nightly: decode the hex meta row, parse the binary message manifest, and map Write/ApplyPatch/StrReplace/Delete tool calls to edits, capturing Read results as reconstruction bases.
- Codex (~/.codex) was already covered; verified.

Engine deslop (behavior-preserving):
- consolidate the zip-slip path-inside guard into one audited core util (@react-doctor/core isPathInside), and share the node:sqlite read-only open and the empty-string-preserving string narrow (coerce asNullableString) instead of hand-rolling copies
- drop dead code (write-only session timestamps, an unused export, an unreachable branch), replace forbidden nested ternaries with if/else and a lookup table, collapse a redundant variable and a pass-through wrapper
- guard every SQLite close so a locked/unreadable store degrades to "skip" rather than sinking the whole stats run

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…board rows)

Wrap `react-doctor stats` in a `cli.stats` root span with a discover/scan/aggregate latency waterfall, and emit one queryable `stats.leaderboard_row` span per ranked model carrying its model, harness, confidence-weighted score, and files scored (the four leaderboard columns). Same gating as the scan path (no-op under --no-score, in tests, and for @react-doctor/api).

- Extract the shared `modelLabel` helper (render + tracing) into one util.
- Pure, exported `buildStatsRowAttributes` for testability, mirroring `buildRunEventAttributes`.
- Fix `detectCommand`: `stats` runs were mis-tagged `command=inspect`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nity board

Send the same per-model rows the stats command puts on Sentry ({model, harness, score, files}) to our own /api/stats so we store them and get back the community leaderboard, shown beneath the local board.

- stats/leaderboard-row.ts: one shared projection feeds BOTH the Sentry span attributes and the /api/stats payload, so they can't drift and both stay code-free (no source, paths, or identity ever leaves the machine).
- stats/report-stats-run.ts: best-effort gzip POST (null on any failure), honoring an optional REACT_DOCTOR_STATS_API_URL override for local e2e.
- stats command: honors --no-score/--no-telemetry by skipping the score API (scores n/a, ranked by diagnostics-per-file) AND the /api/stats report, so a --no-telemetry run is fully local and less rich.
- render-stats: appends a "Community leaderboard" table (with run counts) when one is returned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
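The "one shared projection" idea can be sketched as a single function that both outputs go through. Everything here is hypothetical except the four row fields ({model, harness, score, files}) named in the commit message: the input shape, the attribute keys, and the `{ rows: [...] }` payload envelope are assumptions for illustration.

```typescript
interface RankedModel {
  model: string;
  harness: string;
  weightedScore: number;
  filesScored: number;
  sampleFilePaths: string[]; // local-only detail that must never be reported
}

interface LeaderboardRow {
  model: string;
  harness: string;
  score: number;
  files: number;
}

// The single projection: anything not copied here cannot reach either sink.
function toLeaderboardRow(ranked: RankedModel): LeaderboardRow {
  return {
    model: ranked.model,
    harness: ranked.harness,
    score: ranked.weightedScore,
    files: ranked.filesScored,
  };
}

// Attribute keys are illustrative, not the PR's actual span attribute names.
function toSpanAttributes(row: LeaderboardRow): { [key: string]: string | number } {
  return {
    "stats.model": row.model,
    "stats.harness": row.harness,
    "stats.score": row.score,
    "stats.files": row.files,
  };
}

// The payload envelope is an assumption; the real /api/stats schema is not shown in the PR text.
function toApiPayload(rankedModels: RankedModel[]): string {
  return JSON.stringify({ rows: rankedModels.map(toLeaderboardRow) });
}
```

Because both sinks consume `LeaderboardRow`, adding a field to one without the other would require changing the shared type, which is what keeps them from drifting.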

cursor · 2026-06-23T02:45:32Z

  ["version", VERSION_FLAG_SPEC],
  ["rules", RULES_FLAG_SPEC],
  ["why", WHY_FLAG_SPEC],
+  ["stats", STATS_FLAG_SPEC],


Stats strips trailing no-score

Medium Severity

For react-doctor stats --no-score (or --no-telemetry), the pre-Commander flag stripper drops those globals when they appear after stats, so statsAction still treats telemetry as on and calls the score and /api/stats endpoints even though Sentry already opted out via raw process.argv.

Additional Locations (1)

packages/react-doctor/src/cli/commands/stats.ts#L98-L136

Reviewed by Cursor Bugbot for commit db52fc6.
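One way to make the opt-out independent of flag position, offered as an assumption rather than the fix this PR adopted, is to read it from the raw argument list, the way the comment says Sentry already does. The helper name and the handling of `--` are hypothetical.

```typescript
// The two opt-out flags come from the review comment above.
const LOCAL_ONLY_FLAGS = ["--no-score", "--no-telemetry"];

// Hypothetical helper: true when either flag appears anywhere before a "--" separator.
function hasLocalOnlyFlag(argv: string[]): boolean {
  for (const argument of argv) {
    if (argument === "--") break; // everything after "--" is positional
    if (LOCAL_ONLY_FLAGS.indexOf(argument) !== -1) return true;
  }
  return false;
}
```

For example, `hasLocalOnlyFlag(["node", "react-doctor", "stats", "--no-score"])` is true even though the flag trails the subcommand.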

cursor · 2026-06-23T02:45:32Z

+        sessionsRanked: results.filter((result) => result.filesScanned > 0).length,
+        sessionsNonReact: results.filter(
+          (result) => result.filesScanned === 0 && result.reconstructedFiles > 0,
+        ).length,


Lint failures labeled non-React

Low Severity

Sessions where React files were reconstructed but linting failed or was skipped are counted in sessionsNonReact, so the footer can say they “changed only non-React files” even when the pipeline dropped them for scan errors.

Additional Locations (1)

packages/react-doctor/src/stats/run-stats-scan.ts#L115-L120

Reviewed by Cursor Bugbot for commit db52fc6.

cursor · 2026-06-23T02:45:32Z

+      await new Promise<void>((resolve) => setImmediate(resolve));
+    }
+    if (sessions.length >= scope.limit) break;
+  }


Repo scope scans all history

Medium Severity

With repo scope, if no candidate session passes the repo filter, discovery keeps loading every sorted candidate until the list ends, ignoring --limit, which can mean thousands of synchronous transcript/DB loads on a large machine.

Reviewed by Cursor Bugbot for commit db52fc6.
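A possible mitigation for the unbounded walk, sketched here as an assumption and not as what the PR does, is to cap how many candidates are examined relative to the limit. `CandidateStub`, `EXAMINED_PER_LIMIT`, and `discoverForRepo` are hypothetical names; the increment of `examined` stands in for the expensive transcript/DB load.

```typescript
interface CandidateStub {
  id: string;
  repoPath: string;
}

interface DiscoveryResult {
  sessions: CandidateStub[];
  examined: number;
}

// Hypothetical budget: examine at most this many candidates per requested session.
const EXAMINED_PER_LIMIT = 10;

function discoverForRepo(
  candidates: CandidateStub[],
  repoPath: string,
  limit: number,
): DiscoveryResult {
  const maxExamined = limit * EXAMINED_PER_LIMIT;
  const sessions: CandidateStub[] = [];
  let examined = 0;
  for (const candidate of candidates) {
    // Stop on either bound, so a repo with no matches cannot force a full walk.
    if (sessions.length >= limit || examined >= maxExamined) break;
    examined += 1; // stands in for loading the transcript or DB row
    if (candidate.repoPath === repoPath) sessions.push(candidate);
  }
  return { sessions, examined };
}
```

The trade-off is that a repo whose sessions are all older than the budget would report nothing, so a real fix might prefer a cheap pre-filter over a hard cap.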

The leaderboard payload is a handful of tiny rows ({model, harness, score, files}), so gzip cost more than it saved; it was cargo-culted from the diagnostics-heavy score API. Send plain JSON and drop the Content-Encoding header.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 83a9210.

cursor · 2026-06-23T03:04:11Z

+    // phase failed yields zero diagnostics for reasons unrelated to code
+    // quality. Counting its files as clean would reward un-lintable code and
+    // inflate the leaderboard, so it joins the empty bucket instead.
+    if (!result.ok || result.skipped || result.didLintFail) return empty;


Failed lint labeled non-React

Medium Severity

When a session’s React files reconstruct successfully but runEditorScan errors, is skipped, or lint fails, the scan returns the same empty result shape as a non-React session (filesScanned === 0, reconstructedFiles > 0). The report then increments sessionsNonReact and shows “changed only non-React files” even though React code was replayed and lint never ran.

Additional Locations (1)

packages/react-doctor/src/cli/commands/stats.ts#L147-L150

Reviewed by Cursor Bugbot for commit 83a9210.
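The ambiguity described above goes away if a failed scan is distinguishable from a genuinely non-React session. The sketch below assumes a hypothetical `scanFailed` flag on the per-session result; the field, the bucket names, and the function are illustrative and not part of the PR.

```typescript
interface SessionScanResult {
  filesScanned: number;
  reconstructedFiles: number;
  scanFailed: boolean; // hypothetical: set when the scan errored, was skipped, or lint failed
}

type SessionBucket = "ranked" | "scan-failed" | "non-react" | "empty";

// Checks failure first so a dropped scan is never reported as "only non-React files".
function classifySession(result: SessionScanResult): SessionBucket {
  if (result.scanFailed) return "scan-failed";
  if (result.filesScanned > 0) return "ranked";
  if (result.reconstructedFiles > 0) return "non-react";
  return "empty";
}
```

With a separate bucket, the footer can report scan failures on their own line instead of folding them into the non-React count.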