Skip to content

Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed)#61

Open
agjs wants to merge 4 commits into
mainfrom
overnight/harness-improvements
Open

Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed)#61
agjs wants to merge 4 commits into
mainfrom
overnight/harness-improvements

Conversation

@agjs

@agjs agjs commented Jun 29, 2026

Copy link
Copy Markdown
Owner

Overnight harness-improvement run (eval-driven)

Autonomous, eval-driven work on a single branch (per request). North star: lift the local model by removing harness friction. Every commit keeps bun run validate green; CI is green; all review threads resolved. Not merged — your call in the morning.

✅ Landed (4 commits, all validated)

Commit What Validation
fix(eval) analyze-runs path+timing The discovery analyzer read evals/ not evals/runs/ and used a stale regex → returned 0 runs / no timings. Repaired. This is what surfaced the bugs below.
fix(script-tool) positional args read("a.ts")/run(...)/search(...) inside a script sent a bare string the server dropped to {} → silent rejection masquerading as data. Single-string-arg stubs now accept positional strings. migrate 3/4 → 3/3; 0 read-rejections where there were 16.
fix(script-tool) rejections throw Deeper root cause: ANY rejected in-script tool call returned its rejection text as a normal value, so scripts carried on with bad data (a migrate run's 8 edits all silently failed). Now they throw with the reason. Unit test + observed edit-rejection case.
fix(eval) ENOENT guard Gemini review: analyze-runs crashed on a clean checkout with no evals/runs/.

🔎 Found but NOT fixed — your call (touch deliberate guardrails)

  1. test-sibling-required transitive false-positive (query 2/3 FAIL). A facade-tested multi-file spec (query.test.ts imports only query.ts; lexer/parser/executor covered transitively) deadlocks the model demanding lexer.test.ts. The rule's coveredByExistingTest counts direct imports only — your explicit deliberate choice — so it under-shoots the facade pattern it was meant to fix. Extending to one transitive hop softens a guardrail you scoped on purpose, so I left it.
  2. Large-edit rejection spiral — the model retried 113/106/95-line replacements, each correctly rejected ("too large"), never adapting. Guardrail works as designed; model recovery is poor. Possible UX lever (your call): a split-into-per-line-edits hint on first too-large rejection.
  3. avgQuality phantom 0 — an unparseable judge response records quality:0, which score.ts averages in. Behavior is correct (fix(loop): never feed a no-signal judge result to the generator as a critique #58 skip fires) but reporting is skewed. Clean fix needs a new judge.ts signal (the no-signal flag is overloaded with the empty-window floor + genuine-0). Reverted my attempt; flagged.

Sweeps run

batch1 slugify/debounce/rate-limit/validators 100%; migrate 3/3; checkout 3/3 (churny 9–21 turns); query 1/3 (finding #1); fix-regression/math 100%; auth 2/2 (churny, model iteration not a harness bug). Thinking-budget A/B (2048 vs uncapped, 4 scratch seeds): neutral, default confirmed safe — no change.

See PR comments for full evidence on each finding.

agjs added 2 commits June 29, 2026 02:14
The analyzer read run dirs from evals/ but sweep writes them to evals/runs/
(sweep.ts), so every `analyze-runs <seed> N` returned 0 runs. Its TIMING regex
also still required the dropped `⏱` prefix, so even direct-path runs parsed
no turn timings. Both fixed — the discovery instrument works again.
A script calling the natural JS idiom read("a.ts") / run("bun test") sent a
bare string to the RPC server, which coerced any non-object arg to {} — the tool
then rejected for the missing field and its rejection TEXT came back as the
call's RESULT, which the script silently treats as data. Observed in a migrate
run: the model called read("svcN.ts") 16x inside a script, each silently
returned the rejection string, its regex found "no tier", and it spiralled to a
stuck-fail never seeing the reads were rejected.

Stubs for tools whose entire required-arg set is a single string param (read→file,
run→command, search→pattern) now wrap a bare string into the named arg. Multi-arg
tools (edit/create/…) stay object-only. Map is drift-guarded against the real
tool schemas by a test.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the run analyzer to resolve run directories from evals/runs instead of the root evals directory, and makes the timing emoji prefix optional in the log parser. It also introduces support for positional string arguments in script-exposable tools (read, run, search) by wrapping them into their named argument objects in generateToolStubs, and adds corresponding tests. Feedback suggests wrapping the readdir call on runsRoot in a try-catch block to prevent the script from crashing with an ENOENT error if the directory does not exist.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/core/scripts/analyze-runs.ts Outdated
…rors

The deeper root cause behind the positional-args fix: ANY tool call rejected
inside a script returned its rejection reason as the call's RESULT string, so the
script silently treated the reason as data. The read positional fix removed one
trigger, but a migrate run then hit the same class on edit — 8 `tool_input_
rejected:edit` inside one script, which printed "Updated …" and exited 0 while
the model never saw the edits failed (it guessed "the edit stub might not have
worked" and recovered by editing directly, costing turns).

The RPC handler now watches for a `tool_(input_)?rejected:` event during a call
and returns it as an RPC error so the stub THROWS — the script fails loudly with
the real reason on the first rejection instead of carrying on with bad data. A
script that wants to tolerate failures can try/catch. Scope/policy denials use a
different signal and are unaffected.
@agjs

agjs commented Jun 29, 2026

Copy link
Copy Markdown
Owner Author

Overnight discovery findings (validated by sweeps)

✅ Fixed + validated (commits in this PR)

  • read positional spiralmigrate went 3/4 → 3/3 after the positional-args fix. The run that used a script with 16 in-script reads now reads cleanly (0 tool_input_rejected:read).
  • invisible in-script rejections → the same migrate run revealed the deeper root: its 8 in-script edit calls were all tool_input_rejected:edit, the script printed "Updated …" and exited 0, and the model never saw they failed. Now rejections throw with the reason (covered by the visibility-fix commit + test).

🔎 Found but NOT fixed — needs your call (touches deliberate guardrails)

1. test-sibling-required false-positive on facade-tested multi-file specs (query: 2/3 FAIL).
The query spec is files: lexer.ts, parser.ts, executor.ts, query.ts with one provided test query.test.ts that imports only the facade: import { query } from "./query". lexer/parser/executor are covered transitively through the pipeline. coveredByExistingTest counts direct imports only ("transitive coverage is deliberately NOT counted" — your explicit comment), so the rule fires on lexer.ts and the model deadlocks adding lexer.test.ts (5 attempts → blocked). Both query failures are this; the 2nd burned 111s + 118s thinking turns fighting it.

This is the same rule #59 touched, and its coveredByExistingTest comment literally names "query deadlocking" as the case it was meant to fix — the direct-only implementation under-shoots the facade pattern. Decision for you: extend coverage to follow one transitive hop (test → facade → module)? That softens a guardrail you deliberately scoped to direct imports, so I left it for you rather than flipping it overnight.

2. Large-edit rejection spiral (contributing to query failures).
The model tried to fix lexer.ts with 113/113/106/95-line replacements, each correctly rejected ("replacement too large — make targeted edits"), but never adapted to smaller edits. The guardrail is working as designed; the model just doesn't recover well. Possible levers (your call): a one-shot "here's how to split this into per-line edits" example on the first too-large rejection, or auto-splitting an over-large replacement. Both change a deliberate guardrail's UX.

3. Minor: a checkout run hit edit_lines:parse-error (Unrecognized line: "test(...") — the model fed edit_lines a non-hashline body. Recovered (checkout 3/3), low priority.

Sweeps: batch1 slugify/debounce/rate-limit/validators 100%; migrate 3/3; checkout 3/3 (churny: 9–21 turns); query 1/3.

Gemini review: readdir(runsRoot) crashes with ENOENT on a clean checkout where
no sweep has run yet. Treat a missing dir as "no runs" via .catch(() => []).
@agjs

agjs commented Jun 29, 2026

Copy link
Copy Markdown
Owner Author

Minor finding (investigated, reverted — needs your call): the sweep's avgQuality includes a phantom 0 when the judge returns an unparseable response (seen on fix-regression: Q0 = "unparseable judge response"). The harness behavior is correct — UNPARSEABLE sets scored:false so #58's quality-repair skip fires — but the run is recorded with quality:0, which score.ts averages in (its filter is quality !== undefined, and 0 !== undefined). I tried returning undefined from qualityRepair's no-signal path, but that flag is overloaded: it also covers the empty-window deliberate floor (quality:0, asserted by a test) and a genuine 0-score. Cleanly splitting "measurement failed → exclude" from "deliberate 0 → keep" needs a new signal threaded through judge.ts's IJudgeScore — so I reverted rather than rush it. Flagging for when you want honest quality aggregation.

@agjs

agjs commented Jun 29, 2026

Copy link
Copy Markdown
Owner Author

Thinking-budget breadth check (task: validate scratchThinkingBudget=2048 beyond money). A/B of default 2048 vs uncapped (TSFORGE_THINKING_BUDGET=200000), n=2 each, t=0:

Seed 2048 (default) uncapped pass
slugify 7.1s 6.8s 100% / 100%
debounce 10.7s 7.5s 100% / 100%
validators 23.6s 24.8s 100% / 100%
math 31.9s 20.6s 100% / 100%

Neutral. Pass rate identical (8/8 both arms). Wall-clock differences are within run-to-run noise at n=2 (a 47s math-A outlier; judge Q0 hiccups in both arms). These seeds finish in 1–3 cycles and never trigger the over-think spirals the cap targets (the money 92–198s case), so neither arm spirals here. No evidence to change the default — it's confirmed safe (no regression) on 4 more scratch seeds. No code change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant