feat(cli): add stats subcommand — per-agent React Doctor leaderboard
#932
Open
aidenybai wants to merge 17 commits into main from feat/stats-agent-leaderboard
17 commits
a35e16e feat(cli): add `stats` subcommand — per-model React Doctor leaderboar… (aidenybai)
7b0e7b1 fix(cli): keep stats spinner responsive during session discovery (aidenybai)
fe9f110 feat(cli): show only the top 5 models in the stats leaderboard (aidenybai)
8769924 refactor(stats): deduplicate transcript coercion helpers (aidenybai)
755b8aa test(stats): guard cursor adapter test behind node:sqlite availability (aidenybai)
721470d fix(stats): make cursor DB + reconstruct tests pass on Windows (aidenybai)
f15f640 fix(stats): correct reconstruction fidelity and skip bucketing (Bugbot) (aidenybai)
0d13c86 refactor(stats): stream JSONL transcripts via node:readline (aidenybai)
dad2a5c fix(stats): address review feedback (score correctness, JSON errors, … (aidenybai)
9a20e3d fix(stats): drop unfaithful StrReplace edits instead of linting stale… (aidenybai)
f6b2f03 fix(stats): weight scores by productive sessions, not dead ones (aidenybai)
3f50df7 chore(stats): bump changeset to patch (aidenybai)
509f229 fix(ci): publish deslop-js to pkg.pr.new so previews install (aidenybai)
f26f960 feat(stats): scan every Cursor store (Nightly GUI + CLI agent) and de… (rayhanadev)
ac04d51 feat(stats): trace stats runs in Sentry (cli.stats + per-model leader… (rayhanadev)
db52fc6 feat(stats): report leaderboard rows to /api/stats + render the commu… (rayhanadev)
83a9210 refactor(stats): send /api/stats payload as plain JSON, not gzip (rayhanadev)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| --- | ||
| "react-doctor": patch | ||
| --- | ||
|
|
||
| Add a `react-doctor stats` subcommand — a per-model code-quality leaderboard built from local AI agent chat history. | ||
|
|
||
| `stats` reads local agent history — Claude Code (`~/.claude`) and Codex (`~/.codex`) transcripts, plus Cursor's GUI composer databases and CLI agent stores (`~/.cursor`, `~/.cursor-nightly`) — reconstructs the file content each model actually wrote (Claude post-edit snapshots, Cursor full post-edit file snapshots, Codex `apply_patch` envelopes), lints that content with the existing engine, and ranks models and providers by their React Doctor score and diagnostics-per-file. The job: answer "which agent/model writes the cleanest React code in my repo". | ||
|
|
||
| - Only the React code each model wrote is scored. Reconstructed files are filtered to actual React (JSX/TSX, `use client`/`use server` directives, or a React-ecosystem import) before linting, so a model's plain backend/util/config files don't pad its file count or dilute its diagnostics-per-file. A scan that errors, is skipped, or whose lint phase fails is dropped rather than counted as zero-diagnostic "clean" code, so un-lintable output can't inflate a model's score. | ||
| - Ranking is by a confidence-weighted score, not the raw score: each group's score is regressed toward the global mean by its evidence, so a model with a handful of clean files can't top the board on a tiny sample. Files are the dominant signal; sessions only lightly discount the file weight (many files from one session are one correlated sample) and never below a floor. | ||
| - Cursor is read from every place it stores chats: the GUI composer database (`state.vscdb`) for both the stable and Nightly builds, and the CLI agent's per-session stores under `~/.cursor` and `~/.cursor-nightly`. Each session carries its real model (e.g. `claude-opus-4-8`, `gpt-5.5`, `composer-2.5`) and a faithful reconstruction of every edited file (full GUI post-edit snapshots; CLI `Write`/`ApplyPatch`/`StrReplace`/`Delete` tool calls replayed against captured reads). A database a running editor holds locked is read via SQLite's `immutable` mode rather than skipped. Attribution falls back to `unknown` only for GUI chats left on the "Auto" model. | ||
| - Default scope is the current repository (sessions whose cwd or edits touch the repo root); `--global` ranks across every repo on the machine. `--since`, `--limit`, and `--provider` bound the work. | ||
| - `--json` emits a structured leaderboard (`{ schemaVersion, scope, models, providers, best, worst, … }`); the terminal output shows the top models and per-tool tables with a single score bar (the confidence-weighted score) and a best/worst callout. | ||
| - Coverage is honest about its limits: Codex shell-based edits are not faithfully reconstructable (surfaced as skipped), reading any Cursor database requires `node:sqlite` (Node 22.13+), and the score requires network access. | ||
| - Anonymized Sentry tracing (CLI only, same gating as the scan path — off under `--no-score`, in tests, and for the programmatic API): each run is one `cli.stats` trace with a discover/scan/aggregate latency waterfall, and every ranked model is a queryable `stats.leaderboard_row` span carrying its model, harness, confidence-weighted score, and files scored — so the leaderboard is sliceable in Sentry's Trace Explorer. |
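The changeset describes the ranking as a score regressed toward the global mean in proportion to evidence, with files as the dominant signal and sessions applying only a light, floored discount. The PR's actual formula is not shown in this excerpt, so the following is only a minimal sketch of one way such a shrinkage could work. Every name and constant (`GroupEvidence`, `PRIOR_WEIGHT`, `SESSION_DISCOUNT_FLOOR`, the fourth-root discount) is an assumption made for illustration.

```typescript
// Illustrative only: every name and constant here is hypothetical, not taken from the PR.
interface GroupEvidence {
  readonly rawScore: number; // mean score of the group's files, 0-100
  readonly files: number; // React files scored for this group
  readonly sessions: number; // sessions those files came from
}

const PRIOR_WEIGHT = 20; // assumed: pseudo-files of evidence granted to the global mean
const SESSION_DISCOUNT_FLOOR = 0.5; // assumed: sessions never cut file weight below half

// Files carry the weight. Many files from few sessions are correlated samples,
// so the weight is discounted lightly (fourth root) and clamped at the floor.
const evidenceWeight = ({ files, sessions }: GroupEvidence): number => {
  if (files <= 0) return 0;
  const filesPerSession = files / Math.max(sessions, 1);
  const discount = Math.max(SESSION_DISCOUNT_FLOOR, Math.min(1, filesPerSession ** -0.25));
  return files * discount;
};

// Shrink the raw score toward the global mean; less evidence means more shrinkage.
const confidenceWeightedScore = (group: GroupEvidence, globalMean: number): number => {
  const weight = evidenceWeight(group);
  return (weight * group.rawScore + PRIOR_WEIGHT * globalMean) / (weight + PRIOR_WEIGHT);
};

const tinySample: GroupEvidence = { rawScore: 100, files: 2, sessions: 1 };
const largeSample: GroupEvidence = { rawScore: 90, files: 200, sessions: 40 };
const globalMean = 70;

console.log(confidenceWeightedScore(tinySample, globalMean).toFixed(1));
console.log(confidenceWeightedScore(largeSample, globalMean).toFixed(1));
```

Under these assumed constants, the two-file group with a perfect raw score lands close to the global mean, while the 200-file group keeps most of its raw score and ranks above it. That is the behavior the changeset asks for.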
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| import * as path from "node:path"; | ||
|
|
||
| export interface IsPathInsideOptions { | ||
| /** When `true`, `childPath` equal to `parentPath` counts as inside. */ | ||
| readonly allowSame?: boolean; | ||
| } | ||
|
|
||
| /** | ||
| * `true` when `childPath` resolves within `parentPath`. By default the parent | ||
| * directory itself does not count (the strict zip-slip guard); pass | ||
| * `allowSame: true` to treat an exact match as inside (scope membership). | ||
| * | ||
| * Zip-Slip defense: relative paths can arrive from untrusted sources — a | ||
| * crafted git index/pack/symlinked tree, or a reconstructed agent transcript — | ||
| * and smuggle `..` segments that escape a temp root. Resolve against the parent | ||
| * and reject anything that lands outside before writing. This is the one | ||
| * audited copy of that guard, shared across the staged/baseline scan paths and | ||
| * the stats reconstruction tree so the two cannot drift. | ||
| */ | ||
| export const isPathInside = ( | ||
| childPath: string, | ||
| parentPath: string, | ||
| options: IsPathInsideOptions = {}, | ||
| ): boolean => { | ||
| const relative = path.relative(parentPath, childPath); | ||
| if (!relative) return Boolean(options.allowSame); | ||
| return !relative.startsWith("..") && !path.isAbsolute(relative); | ||
| }; |
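One property of the guard above is worth noting. `relative.startsWith("..")` also matches entry names that merely begin with two dots (for example `..config/a.ts`), so such paths are reported as outside the parent. That fails closed, which is the safe direction for a zip-slip guard. If those names should count as inside, a stricter segment check is one option. The sketch below is an assumption for illustration; `isPathInsideStrict` and `tempRoot` are hypothetical names and are not part of the PR.

```typescript
import * as path from "node:path";

// Hypothetical variant of the PR's `isPathInside`; not part of the diff.
const isPathInsideStrict = (
  childPath: string,
  parentPath: string,
  allowSame = false,
): boolean => {
  const relative = path.relative(parentPath, childPath);
  if (relative === "") return allowSame;
  // Only a leading ".." *segment* escapes the parent; "..config" is an ordinary name.
  const escapesParent = relative === ".." || relative.startsWith(`..${path.sep}`);
  return !escapesParent && !path.isAbsolute(relative);
};

const tempRoot = path.resolve("stats-temp-root");

// A normal nested file is inside.
console.log(isPathInsideStrict(path.join(tempRoot, "src", "App.tsx"), tempRoot));
// A path that climbs out with ".." is outside.
console.log(isPathInsideStrict(path.join(tempRoot, "..", "escape.ts"), tempRoot));
// The root itself counts only when allowSame is set.
console.log(isPathInsideStrict(tempRoot, tempRoot), isPathInsideStrict(tempRoot, tempRoot, true));
// A directory whose name starts with ".." is still inside.
console.log(isPathInsideStrict(path.join(tempRoot, "..config", "a.ts"), tempRoot));
```

The original check has the advantage of being simpler to audit. The variant accepts unusual but legitimate directory names, at the cost of one extra comparison.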
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,186 @@ | ||
| import * as path from "node:path"; | ||
| import { resolveScanTarget, type ReactDoctorConfig } from "@react-doctor/core"; | ||
| import { aggregateStats } from "../../stats/aggregate-stats.js"; | ||
| import { STATS_DEFAULT_SESSION_LIMIT } from "../../stats/constants.js"; | ||
| import { discoverSessions } from "../../stats/discover-sessions.js"; | ||
| import { renderStatsReport } from "../../stats/render-stats.js"; | ||
| import { reportStatsRun } from "../../stats/report-stats-run.js"; | ||
| import { runStatsScan } from "../../stats/run-stats-scan.js"; | ||
| import type { | ||
| CommunityLeaderboard, | ||
| StatsProvider, | ||
| StatsReport, | ||
| StatsScopeOptions, | ||
| } from "../../stats/types.js"; | ||
| import { METRIC } from "../utils/constants.js"; | ||
| import { enableJsonMode } from "../utils/json-mode.js"; | ||
| import { recordCount } from "../utils/record-metric.js"; | ||
| import { spinner } from "../utils/spinner.js"; | ||
| import { | ||
| recordStatsLeaderboard, | ||
| traceStatsPhase, | ||
| withSentryStatsSpan, | ||
| } from "../utils/with-sentry-stats-span.js"; | ||
|
|
||
| export interface StatsFlags { | ||
| global?: boolean; | ||
| since?: string; | ||
| limit?: string; | ||
| provider?: string; | ||
| json?: boolean; | ||
| cwd?: string; | ||
| // Commander negations from the root program: `--no-score` → `score: false`, | ||
| // `--no-telemetry` → `telemetry: false`. Both opt out of the network. | ||
| score?: boolean; | ||
| telemetry?: boolean; | ||
| } | ||
|
|
||
| const VALID_PROVIDERS = new Set<string>(["claude", "codex", "cursor"]); | ||
|
|
||
| const isStatsProvider = (value: string): value is StatsProvider => VALID_PROVIDERS.has(value); | ||
|
|
||
| const parseProvider = (value: string | undefined): StatsProvider | undefined => { | ||
| if (value === undefined) return undefined; | ||
| if (!isStatsProvider(value)) { | ||
| throw new Error(`Unknown provider "${value}". Expected one of: claude, codex, cursor.`); | ||
| } | ||
| return value; | ||
| }; | ||
|
|
||
| const parseSince = (value: string | undefined): Date | undefined => { | ||
| if (value === undefined) return undefined; | ||
| const parsed = new Date(value); | ||
| if (Number.isNaN(parsed.getTime())) { | ||
| throw new Error(`Invalid --since date "${value}". Use e.g. 2026-06-01.`); | ||
| } | ||
| return parsed; | ||
| }; | ||
|
|
||
| const parseLimit = (value: string | undefined): number => { | ||
| if (value === undefined) return STATS_DEFAULT_SESSION_LIMIT; | ||
| const parsed = Number.parseInt(value, 10); | ||
| if (!Number.isFinite(parsed) || parsed <= 0) { | ||
| throw new Error(`Invalid --limit "${value}". Use a positive integer, e.g. 200.`); | ||
| } | ||
| return parsed; | ||
| }; | ||
|
|
||
| const resolveTarget = async ( | ||
| directory: string, | ||
| ): Promise<{ root: string; userConfig: ReactDoctorConfig | null }> => { | ||
| try { | ||
| const target = await resolveScanTarget(directory); | ||
| return { root: target.resolvedDirectory, userConfig: target.userConfig }; | ||
| } catch { | ||
| return { root: path.resolve(directory), userConfig: null }; | ||
| } | ||
| }; | ||
|
|
||
| export const statsAction = async (flags: StatsFlags): Promise<void> => { | ||
| const directory = flags.cwd ?? process.cwd(); | ||
| // Register JSON mode up front so any throw (flag parsing, scan, or score API | ||
| // failure) is emitted as a structured JSON error by the top-level handler | ||
| // instead of plain text — and so incidental logs (e.g. a score-API warning) | ||
| // never corrupt the report on stdout. | ||
| if (flags.json) enableJsonMode({ compact: false, directory }); | ||
| const scope: StatsScopeOptions = { | ||
| global: flags.global ?? false, | ||
| since: parseSince(flags.since), | ||
| limit: parseLimit(flags.limit), | ||
| provider: parseProvider(flags.provider), | ||
| }; | ||
|
|
||
| const { root, userConfig } = await resolveTarget(directory); | ||
|
|
||
| // `--no-score` / `--no-telemetry` (or `noScore` in config) opt out of the | ||
| // network entirely — same signal `resolve-cli-inspect-options` uses. When off, | ||
| // we skip the score API (scores show n/a, ranked by diagnostics-per-file) and | ||
| // the `/api/stats` report, so a `--no-telemetry` run is fully local. | ||
| const telemetryEnabled = !( | ||
| flags.score === false || | ||
| flags.telemetry === false || | ||
| Boolean(userConfig?.noScore) | ||
| ); | ||
|
|
||
| // ora renders to stderr; suppress it in JSON mode so the run stays quiet. | ||
| // The whole run is one Sentry trace: each phase below is a child span, and | ||
| // every ranked model becomes a queryable leaderboard-row span. | ||
| const { report, community } = await withSentryStatsSpan<{ | ||
| report: StatsReport; | ||
| community: CommunityLeaderboard | null; | ||
| }>(async (rootSpan) => { | ||
| const progress = flags.json ? null : spinner("Looking through your agent history…").start(); | ||
| try { | ||
| const sessions = await traceStatsPhase("discover sessions", () => | ||
| discoverSessions(root, scope, (foundCount) => | ||
| progress?.update(`Looking through your agent history… (${foundCount} found)`), | ||
| ), | ||
| ); | ||
| progress?.update("Checking the code each agent wrote…"); | ||
| const results = await traceStatsPhase("scan sessions", () => | ||
| runStatsScan(sessions, scope.global ? null : root, { | ||
| onProgress: (completedCount, totalCount) => | ||
| progress?.update( | ||
| `Checking the code each agent wrote… (${completedCount}/${totalCount})`, | ||
| ), | ||
| }), | ||
| ); | ||
| progress?.update(telemetryEnabled ? "Scoring…" : "Ranking…"); | ||
| const aggregated = await traceStatsPhase("aggregate + score", () => | ||
| // Skip the score API when telemetry is off: a null scorer leaves every | ||
| // score null, and ranking falls back to diagnostics-per-file. | ||
| aggregateStats( | ||
| results, | ||
| userConfig, | ||
| telemetryEnabled ? undefined : () => Promise.resolve(null), | ||
| ), | ||
| ); | ||
|
|
||
| const built: StatsReport = { | ||
| scope: scope.global ? "global" : "repo", | ||
| directory: root, | ||
| models: aggregated.models, | ||
| providers: aggregated.providers, | ||
| best: aggregated.best, | ||
| worst: aggregated.worst, | ||
| sessionsAnalyzed: results.length, | ||
| sessionsRanked: results.filter((result) => result.filesScanned > 0).length, | ||
| sessionsNonReact: results.filter( | ||
| (result) => result.filesScanned === 0 && result.reconstructedFiles > 0, | ||
| ).length, | ||
| sessionsUnreconstructable: results.filter( | ||
| (result) => | ||
| result.filesScanned === 0 && | ||
| result.reconstructedFiles === 0 && | ||
| result.unreconstructable > 0, | ||
| ).length, | ||
| generatedAt: new Date().toISOString(), | ||
| }; | ||
| recordStatsLeaderboard(built.models, rootSpan); | ||
| // Send the same leaderboard rows to our own store and get the community | ||
| // board back. Best-effort and telemetry-gated; never blocks the result. | ||
| progress?.update("Comparing with the community…"); | ||
| const communityBoard = telemetryEnabled | ||
| ? await traceStatsPhase("report leaderboard", () => reportStatsRun(built)) | ||
| : null; | ||
| progress?.succeed("Done."); | ||
| return { report: built, community: communityBoard }; | ||
| } finally { | ||
| progress?.stop(); | ||
| } | ||
| }); | ||
|
|
||
| recordCount(METRIC.statsRun, 1, { | ||
| scope: report.scope, | ||
| sessions: report.sessionsAnalyzed, | ||
| providers: report.providers.length, | ||
| provider: scope.provider ?? "all", | ||
| }); | ||
|
|
||
| if (flags.json) { | ||
| process.stdout.write(`${JSON.stringify({ schemaVersion: 1, ...report }, null, 2)}\n`); | ||
| return; | ||
| } | ||
|
|
||
| process.stdout.write(`${renderStatsReport(report, community)}\n`); | ||
| }; | ||
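A note on `parseLimit` in the diff above: `Number.parseInt` stops at the first character that is not a digit, so inputs such as `10abc` or `1.5` parse as `10` and `1` and pass validation. If rejecting those inputs is desired, a stricter parser could look like the sketch below. `parseLimitStrict` is a hypothetical helper, and the `fallback` parameter stands in for `STATS_DEFAULT_SESSION_LIMIT`, whose value is not shown in this excerpt.

```typescript
// Hypothetical stricter counterpart to the PR's `parseLimit`. `fallback` stands in
// for STATS_DEFAULT_SESSION_LIMIT, whose value is not shown in this excerpt.
const parseLimitStrict = (value: string | undefined, fallback: number): number => {
  if (value === undefined) return fallback;
  const trimmed = value.trim();
  // Accept only plain decimal digits, so "10abc", "1.5", and "1e3" are rejected.
  const parsed = /^\d+$/.test(trimmed) ? Number(trimmed) : Number.NaN;
  if (!Number.isSafeInteger(parsed) || parsed <= 0) {
    throw new Error(`Invalid --limit "${value}". Use a positive integer, e.g. 200.`);
  }
  return parsed;
};

// The fallback of 100 is an arbitrary value chosen for this example.
console.log(parseLimitStrict("200", 100));
console.log(parseLimitStrict(undefined, 100));
```

The error message matches the one in the diff, so switching parsers would not change what the user sees for inputs that are already rejected.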
Lint failures labeled non-React
Low Severity
Sessions where React files were reconstructed but linting failed or was skipped are counted in `sessionsNonReact`, so the footer can say they “changed only non-React files” even when the pipeline dropped them for scan errors.
Additional Locations (1)
packages/react-doctor/src/stats/run-stats-scan.ts#L115-L120
Reviewed by Cursor Bugbot for commit db52fc6.
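One possible way to address this finding is to classify each session into exactly one bucket and to test for a failed scan before the non-React test. The sketch below assumes the scan step can expose a `scanFailed` flag. That flag, the `SessionBucket` type, and both function names are hypothetical. Only `filesScanned`, `reconstructedFiles`, and `unreconstructable` appear in the diff.

```typescript
// filesScanned / reconstructedFiles / unreconstructable appear in the diff;
// `scanFailed` is a hypothetical flag the scan step would need to set.
interface SessionResult {
  readonly filesScanned: number;
  readonly reconstructedFiles: number;
  readonly unreconstructable: number;
  readonly scanFailed: boolean;
}

type SessionBucket = "ranked" | "scanFailed" | "nonReact" | "unreconstructable" | "empty";

const classifySession = (result: SessionResult): SessionBucket => {
  if (result.filesScanned > 0) return "ranked";
  // Check failure before the non-React test so dropped scans are not mislabeled.
  if (result.scanFailed) return "scanFailed";
  if (result.reconstructedFiles > 0) return "nonReact";
  if (result.unreconstructable > 0) return "unreconstructable";
  return "empty";
};

// Tally sessions per bucket; each session lands in exactly one bucket.
const countBuckets = (results: readonly SessionResult[]): Record<SessionBucket, number> => {
  const counts: Record<SessionBucket, number> = {
    ranked: 0,
    scanFailed: 0,
    nonReact: 0,
    unreconstructable: 0,
    empty: 0,
  };
  for (const result of results) counts[classifySession(result)] += 1;
  return counts;
};

console.log(
  countBuckets([
    { filesScanned: 4, reconstructedFiles: 4, unreconstructable: 0, scanFailed: false },
    { filesScanned: 0, reconstructedFiles: 3, unreconstructable: 0, scanFailed: true },
    { filesScanned: 0, reconstructedFiles: 2, unreconstructable: 0, scanFailed: false },
  ]),
);
```

With a separate bucket, the footer could report failed scans on their own line instead of folding them into the "changed only non-React files" count.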