Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed)#61
Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed)#61agjs wants to merge 4 commits into
Conversation
The analyzer read run dirs from evals/ but sweep writes them to evals/runs/ (sweep.ts), so every `analyze-runs <seed> N` returned 0 runs. Its TIMING regex also still required the dropped `⏱` prefix, so even direct-path runs parsed no turn timings. Both fixed — the discovery instrument works again.
A script calling the natural JS idiom read("a.ts") / run("bun test") sent a
bare string to the RPC server, which coerced any non-object arg to {} — the tool
then rejected for the missing field and its rejection TEXT came back as the
call's RESULT, which the script silently treats as data. Observed in a migrate
run: the model called read("svcN.ts") 16x inside a script, each silently
returned the rejection string, its regex found "no tier", and it spiralled to a
stuck-fail never seeing the reads were rejected.
Stubs for tools whose entire required-arg set is a single string param (read→file,
run→command, search→pattern) now wrap a bare string into the named arg. Multi-arg
tools (edit/create/…) stay object-only. Map is drift-guarded against the real
tool schemas by a test.
There was a problem hiding this comment.
Code Review
This pull request updates the run analyzer to resolve run directories from evals/runs instead of the root evals directory, and makes the timing emoji prefix optional in the log parser. It also introduces support for positional string arguments in script-exposable tools (read, run, search) by wrapping them into their named argument objects in generateToolStubs, and adds corresponding tests. Feedback suggests wrapping the readdir call on runsRoot in a try-catch block to prevent the script from crashing with an ENOENT error if the directory does not exist.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…rors The deeper root cause behind the positional-args fix: ANY tool call rejected inside a script returned its rejection reason as the call's RESULT string, so the script silently treated the reason as data. The read positional fix removed one trigger, but a migrate run then hit the same class on edit — 8 `tool_input_ rejected:edit` inside one script, which printed "Updated …" and exited 0 while the model never saw the edits failed (it guessed "the edit stub might not have worked" and recovered by editing directly, costing turns). The RPC handler now watches for a `tool_(input_)?rejected:` event during a call and returns it as an RPC error so the stub THROWS — the script fails loudly with the real reason on the first rejection instead of carrying on with bad data. A script that wants to tolerate failures can try/catch. Scope/policy denials use a different signal and are unaffected.
Overnight discovery findings (validated by sweeps)✅ Fixed + validated (commits in this PR)
🔎 Found but NOT fixed — needs your call (touches deliberate guardrails)1.
2. Large-edit rejection spiral (contributing to query failures). 3. Minor: a Sweeps: batch1 slugify/debounce/rate-limit/validators 100%; migrate 3/3; checkout 3/3 (churny: 9–21 turns); query 1/3. |
Gemini review: readdir(runsRoot) crashes with ENOENT on a clean checkout where no sweep has run yet. Treat a missing dir as "no runs" via .catch(() => []).
|
Minor finding (investigated, reverted — needs your call): the sweep's |
|
Thinking-budget breadth check (task: validate
Neutral. Pass rate identical (8/8 both arms). Wall-clock differences are within run-to-run noise at n=2 (a 47s math-A outlier; judge Q0 hiccups in both arms). These seeds finish in 1–3 cycles and never trigger the over-think spirals the cap targets (the |
Overnight harness-improvement run (eval-driven)
Autonomous, eval-driven work on a single branch (per request). North star: lift the local model by removing harness friction. Every commit keeps
bun run validategreen; CI is green; all review threads resolved. Not merged — your call in the morning.✅ Landed (4 commits, all validated)
fix(eval)analyze-runs path+timingevals/notevals/runs/and used a stale⏱regex → returned 0 runs / no timings. Repaired.fix(script-tool)positional argsread("a.ts")/run(...)/search(...)inside a script sent a bare string the server dropped to{}→ silent rejection masquerading as data. Single-string-arg stubs now accept positional strings.fix(script-tool)rejections throwedits all silently failed). Now they throw with the reason.fix(eval)ENOENT guardevals/runs/.🔎 Found but NOT fixed — your call (touch deliberate guardrails)
test-sibling-requiredtransitive false-positive (query 2/3 FAIL). A facade-tested multi-file spec (query.test.tsimports onlyquery.ts; lexer/parser/executor covered transitively) deadlocks the model demandinglexer.test.ts. The rule'scoveredByExistingTestcounts direct imports only — your explicit deliberate choice — so it under-shoots the facade pattern it was meant to fix. Extending to one transitive hop softens a guardrail you scoped on purpose, so I left it.avgQualityphantom 0 — an unparseable judge response recordsquality:0, whichscore.tsaverages in. Behavior is correct (fix(loop): never feed a no-signal judge result to the generator as a critique #58 skip fires) but reporting is skewed. Clean fix needs a newjudge.tssignal (the no-signal flag is overloaded with the empty-window floor + genuine-0). Reverted my attempt; flagged.Sweeps run
batch1 slugify/debounce/rate-limit/validators 100%; migrate 3/3; checkout 3/3 (churny 9–21 turns); query 1/3 (finding #1); fix-regression/math 100%; auth 2/2 (churny, model iteration not a harness bug). Thinking-budget A/B (2048 vs uncapped, 4 scratch seeds): neutral, default confirmed safe — no change.
See PR comments for full evidence on each finding.