Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed) by agjs · Pull Request #61 · agjs/tsforge

agjs · 2026-06-29T00:15:18Z

Overnight harness-improvement run (eval-driven)

Autonomous, eval-driven work on a single branch (per request). North star: lift the local model by removing harness friction. Every commit keeps bun run validate green; CI is green; all review threads resolved. Not merged — your call in the morning.

✅ Landed (4 commits, all validated)

Commit	What	Validation
`fix(eval)` analyze-runs path+timing	The discovery analyzer read `evals/` not `evals/runs/` and used a stale `⏱` regex → returned 0 runs / no timings. Repaired.	This is what surfaced the bugs below.
`fix(script-tool)` positional args	`read("a.ts")`/`run(...)`/`search(...)` inside a script sent a bare string the server dropped to `{}` → silent rejection masquerading as data. Single-string-arg stubs now accept positional strings.	migrate 3/4 → 3/3; 0 read-rejections where there were 16.
`fix(script-tool)` rejections throw	Deeper root cause: ANY rejected in-script tool call returned its rejection text as a normal value, so scripts carried on with bad data (a migrate run's 8 `edit`s all silently failed). Now they throw with the reason.	Unit test + observed edit-rejection case.
`fix(eval)` ENOENT guard	Gemini review: analyze-runs crashed on a clean checkout with no `evals/runs/`.	—

🔎 Found but NOT fixed — your call (touch deliberate guardrails)

test-sibling-required transitive false-positive (query 2/3 FAIL). A facade-tested multi-file spec (query.test.ts imports only query.ts; lexer/parser/executor covered transitively) deadlocks the model demanding lexer.test.ts. The rule's coveredByExistingTest counts direct imports only — your explicit deliberate choice — so it under-shoots the facade pattern it was meant to fix. Extending to one transitive hop softens a guardrail you scoped on purpose, so I left it.
Large-edit rejection spiral — the model retried 113/106/95-line replacements, each correctly rejected ("too large"), never adapting. Guardrail works as designed; model recovery is poor. Possible UX lever (your call): a split-into-per-line-edits hint on first too-large rejection.
avgQuality phantom 0 — an unparseable judge response records quality:0, which score.ts averages in. Behavior is correct (fix(loop): never feed a no-signal judge result to the generator as a critique #58 skip fires) but reporting is skewed. Clean fix needs a new judge.ts signal (the no-signal flag is overloaded with the empty-window floor + genuine-0). Reverted my attempt; flagged.

Sweeps run

batch1 slugify/debounce/rate-limit/validators 100%; migrate 3/3; checkout 3/3 (churny 9–21 turns); query 1/3 (finding #1); fix-regression/math 100%; auth 2/2 (churny, model iteration not a harness bug). Thinking-budget A/B (2048 vs uncapped, 4 scratch seeds): neutral, default confirmed safe — no change.

See PR comments for full evidence on each finding.

The analyzer read run dirs from evals/ but sweep writes them to evals/runs/ (sweep.ts), so every `analyze-runs <seed> N` returned 0 runs. Its TIMING regex also still required the dropped `⏱` prefix, so even direct-path runs parsed no turn timings. Both fixed — the discovery instrument works again.

A script calling the natural JS idiom read("a.ts") / run("bun test") sent a bare string to the RPC server, which coerced any non-object arg to {} — the tool then rejected for the missing field and its rejection TEXT came back as the call's RESULT, which the script silently treats as data. Observed in a migrate run: the model called read("svcN.ts") 16x inside a script, each silently returned the rejection string, its regex found "no tier", and it spiralled to a stuck-fail never seeing the reads were rejected. Stubs for tools whose entire required-arg set is a single string param (read→file, run→command, search→pattern) now wrap a bare string into the named arg. Multi-arg tools (edit/create/…) stay object-only. Map is drift-guarded against the real tool schemas by a test.

gemini-code-assist

Code Review

This pull request updates the run analyzer to resolve run directories from evals/runs instead of the root evals directory, and makes the timing emoji prefix optional in the log parser. It also introduces support for positional string arguments in script-exposable tools (read, run, search) by wrapping them into their named argument objects in generateToolStubs, and adds corresponding tests. Feedback suggests wrapping the readdir call on runsRoot in a try-catch block to prevent the script from crashing with an ENOENT error if the directory does not exist.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…rors The deeper root cause behind the positional-args fix: ANY tool call rejected inside a script returned its rejection reason as the call's RESULT string, so the script silently treated the reason as data. The read positional fix removed one trigger, but a migrate run then hit the same class on edit — 8 `tool_input_ rejected:edit` inside one script, which printed "Updated …" and exited 0 while the model never saw the edits failed (it guessed "the edit stub might not have worked" and recovered by editing directly, costing turns). The RPC handler now watches for a `tool_(input_)?rejected:` event during a call and returns it as an RPC error so the stub THROWS — the script fails loudly with the real reason on the first rejection instead of carrying on with bad data. A script that wants to tolerate failures can try/catch. Scope/policy denials use a different signal and are unaffected.

agjs · 2026-06-29T00:56:54Z

Overnight discovery findings (validated by sweeps)

✅ Fixed + validated (commits in this PR)

read positional spiral → migrate went 3/4 → 3/3 after the positional-args fix. The run that used a script with 16 in-script reads now reads cleanly (0 tool_input_rejected:read).
invisible in-script rejections → the same migrate run revealed the deeper root: its 8 in-script edit calls were all tool_input_rejected:edit, the script printed "Updated …" and exited 0, and the model never saw they failed. Now rejections throw with the reason (covered by the visibility-fix commit + test).

🔎 Found but NOT fixed — needs your call (touches deliberate guardrails)

1. test-sibling-required false-positive on facade-tested multi-file specs (query: 2/3 FAIL).
The query spec is files: lexer.ts, parser.ts, executor.ts, query.ts with one provided test query.test.ts that imports only the facade: import { query } from "./query". lexer/parser/executor are covered transitively through the pipeline. coveredByExistingTest counts direct imports only ("transitive coverage is deliberately NOT counted" — your explicit comment), so the rule fires on lexer.ts and the model deadlocks adding lexer.test.ts (5 attempts → blocked). Both query failures are this; the 2nd burned 111s + 118s thinking turns fighting it.

This is the same rule #59 touched, and its coveredByExistingTest comment literally names "query deadlocking" as the case it was meant to fix — the direct-only implementation under-shoots the facade pattern. Decision for you: extend coverage to follow one transitive hop (test → facade → module)? That softens a guardrail you deliberately scoped to direct imports, so I left it for you rather than flipping it overnight.

2. Large-edit rejection spiral (contributing to query failures).
The model tried to fix lexer.ts with 113/113/106/95-line replacements, each correctly rejected ("replacement too large — make targeted edits"), but never adapted to smaller edits. The guardrail is working as designed; the model just doesn't recover well. Possible levers (your call): a one-shot "here's how to split this into per-line edits" example on the first too-large rejection, or auto-splitting an over-large replacement. Both change a deliberate guardrail's UX.

3. Minor: a checkout run hit edit_lines:parse-error (Unrecognized line: "test(...") — the model fed edit_lines a non-hashline body. Recovered (checkout 3/3), low priority.

Sweeps: batch1 slugify/debounce/rate-limit/validators 100%; migrate 3/3; checkout 3/3 (churny: 9–21 turns); query 1/3.

Gemini review: readdir(runsRoot) crashes with ENOENT on a clean checkout where no sweep has run yet. Treat a missing dir as "no runs" via .catch(() => []).

agjs · 2026-06-29T01:16:08Z

Minor finding (investigated, reverted — needs your call): the sweep's avgQuality includes a phantom 0 when the judge returns an unparseable response (seen on fix-regression: Q0 = "unparseable judge response"). The harness behavior is correct — UNPARSEABLE sets scored:false so #58's quality-repair skip fires — but the run is recorded with quality:0, which score.ts averages in (its filter is quality !== undefined, and 0 !== undefined). I tried returning undefined from qualityRepair's no-signal path, but that flag is overloaded: it also covers the empty-window deliberate floor (quality:0, asserted by a test) and a genuine 0-score. Cleanly splitting "measurement failed → exclude" from "deliberate 0 → keep" needs a new signal threaded through judge.ts's IJudgeScore — so I reverted rather than rush it. Flagging for when you want honest quality aggregation.

agjs · 2026-06-29T01:41:48Z

Thinking-budget breadth check (task: validate scratchThinkingBudget=2048 beyond money). A/B of default 2048 vs uncapped (TSFORGE_THINKING_BUDGET=200000), n=2 each, t=0:

Seed	2048 (default)	uncapped	pass
slugify	7.1s	6.8s	100% / 100%
debounce	10.7s	7.5s	100% / 100%
validators	23.6s	24.8s	100% / 100%
math	31.9s	20.6s	100% / 100%

Neutral. Pass rate identical (8/8 both arms). Wall-clock differences are within run-to-run noise at n=2 (a 47s math-A outlier; judge Q0 hiccups in both arms). These seeds finish in 1–3 cycles and never trigger the over-think spirals the cap targets (the money 92–198s case), so neither arm spirals here. No evidence to change the default — it's confirmed safe (no regression) on 4 more scratch seeds. No code change.

agjs added 2 commits June 29, 2026 02:14

gemini-code-assist Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread packages/core/scripts/analyze-runs.ts Outdated

fix(eval): analyze-runs tolerates a missing evals/runs dir (review)

a2450ae

Gemini review: readdir(runsRoot) crashes with ENOENT on a clean checkout where no sweep has run yet. Treat a missing dir as "no runs" via .catch(() => []).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed)#61

Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed)#61
agjs wants to merge 4 commits into
mainfrom
overnight/harness-improvements

agjs commented Jun 29, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

agjs commented Jun 29, 2026

Uh oh!

agjs commented Jun 29, 2026

Uh oh!

agjs commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

agjs commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!