An LLM agent as the optimizer in a code-search loop.
TL;DR — An AI agent rewrites one file, a fast scorer grades each version on data it was never allowed to practice on, and the best survives — thousands of iterations a night. The scorer is built to resist overfitting, so the search chases edges that generalise. And because the loop is driven by a Claude Code session, not metered API calls, a search that long costs a flat subscription fee instead of a per-token bill. New here? Read the 2-minute ELI5.
The agent edits one source file — a candidate — rebuilds it, runs a fast deterministic evaluator, reads a single fitness score, and keeps or discards the change. Then it does it again. And again. Overnight, thousands of times.
It is FunSearch / AlphaEvolve-class program search, with three deliberate choices: the mutation operator is a general-purpose LLM editing real compilable code; the fitness function is built to resist overfitting so the search chases edges that generalise instead of artifacts that memorise; and the loop is driven by a Claude Code session rather than the metered API. The last point — plus a Rust evaluator fast enough to iterate — is what makes a thousands-of-candidates search practical at all (see why we built our own harness).
This repo is the harness. It ships with two worked examples — a trading-strategy search and a bin-packing-heuristic search — that share the same ~250-line domain-free core.
Each iteration: the agent reads the candidate and recent results, rewrites
the candidate file, the harness builds it and scores it on training,
held-out, and walk-forward data, the agent compares final_score to its
best so far, and keeps the change (a git commit) or discards it (a git
reset). Then it does it again — see the diagram above.
The agent never sees the evaluator's internals — only the candidate file it
edits and the final_score it reads back. Every kept step is a git commit, so
the search history is the git history and any step is revertible.
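The keep/discard step described above can be sketched as a small ratchet. This is a minimal illustration, not the harness's own code: the names `Ratchet` and `Decision` are hypothetical, and the strict-improvement rule (ties are discarded) is an assumption, since the text only says the agent compares final_score to its best so far.

```rust
/// Outcome of one iteration of the keep/discard ratchet (hypothetical names).
#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Corresponds to the `git commit` in the loop described above.
    Keep,
    /// Corresponds to the `git reset`.
    Discard,
}

/// Tracks the best final_score seen so far.
pub struct Ratchet {
    best: f64,
}

impl Ratchet {
    /// Start from the score of the starter candidate.
    pub fn new(baseline: f64) -> Self {
        Self { best: baseline }
    }

    pub fn best(&self) -> f64 {
        self.best
    }

    /// Keep only strict improvements (an assumption; ties are discarded).
    /// A NaN score compares false against everything, so it is discarded too.
    pub fn step(&mut self, final_score: f64) -> Decision {
        if final_score > self.best {
            self.best = final_score;
            Decision::Keep
        } else {
            Decision::Discard
        }
    }
}
```

Starting from `Ratchet::new(0.8651)`, a candidate scoring 0.8600 is discarded and one scoring 0.8700 is kept and becomes the new best.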
A search that maximises an in-sample score will find ways to cheat it. The harness makes cheating not pay:
- Holdout — a slice of data the candidate's author never iterates against.
- Walk-forward — the training data re-cut into overlapping sub-windows; the worst window's score is the one that counts.
final_score is the minimum across the train, holdout, and walk-forward scores.
A candidate that overfits scores high on train and collapses elsewhere — and
the minimum drags its reported score down to the collapse. The agent,
maximising final_score, is pushed toward edges that hold up everywhere. This
is the core idea, and it lives in core/ in about 250 lines.
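To make the minimum rule concrete, here is a minimal sketch of the two pieces: cutting overlapping walk-forward windows and taking the minimum across all scores. The function names, the fixed window length, and the stride are assumptions for illustration; the code in core/ may cut windows and combine scores differently.

```rust
use std::ops::Range;

/// Cut `len` samples into windows of `window` samples, advancing by `stride`.
/// Windows overlap whenever `stride < window`. (Hypothetical helper.)
pub fn walk_forward_windows(len: usize, window: usize, stride: usize) -> Vec<Range<usize>> {
    assert!(window > 0 && stride > 0, "window and stride must be positive");
    let mut out = Vec::new();
    let mut start = 0;
    // Stop as soon as a full window no longer fits.
    while start + window <= len {
        out.push(start..start + window);
        start += stride;
    }
    out
}

/// final_score as the minimum of train, holdout, and the worst walk-forward window.
/// Note: `f64::min` returns the non-NaN operand, so a NaN score would be
/// silently skipped here; a real harness would likely want to reject it.
pub fn final_score(train: f64, holdout: f64, walk_forward: &[f64]) -> f64 {
    let wf_min = walk_forward.iter().copied().fold(f64::INFINITY, f64::min);
    train.min(holdout).min(wf_min)
}
```

With made-up scores, an overfit candidate at 0.95 on train but 0.40 on holdout reports 0.40, so a high training score cannot hide a collapse elsewhere.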
A search is only as good as it is long — thousands of candidates, not dozens. Two costs decide whether a search that long is practical at all, and getting both right is the reason this is a purpose-built harness rather than a generic framework.
Every iteration is an LLM turn, and a search is thousands of them.
Priced through the metered Anthropic API, that is a per-token bill that grows with the length of the search. Concretely — roughly four weeks of autoresearch in the project this harness grew out of logged ~10,300 model messages; priced at Anthropic's published Opus 4.7 API rates with prompt caching already applied, that comes to ≈ $5,900.
That figure is not a hand-wave. It comes from
scripts/session_cost.py, which reads the Claude
Code session JSONL files on disk, sums per-message input_tokens,
output_tokens, cache_creation_input_tokens and cache_read_input_tokens
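across every assistant turn, and prices each class at its published rate
($15 input / $75 output / $18.75 and $30 cache creation / $1.50 cache read
per MTok; the two cache-creation rates match the 5-minute and 1-hour
cache-write tiers). Point it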
at any session you have and it reports the same number for that session — so
the methodology is auditable, and you can re-run it on your own runs.
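The pricing arithmetic itself is simple. The sketch below is not scripts/session_cost.py (which is Python and reads the session files); it only illustrates the per-class sum under stated assumptions. The `Usage` and `Rates` types are hypothetical, and it applies a single cache-creation rate for simplicity, whereas the rates quoted above list two.

```rust
/// Token counts summed over the assistant turns of a session.
/// Field names follow the usage fields named in the text.
#[derive(Default, Clone, Copy)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub cache_creation_input_tokens: u64,
    pub cache_read_input_tokens: u64,
}

/// Prices in USD per million tokens (hypothetical struct).
pub struct Rates {
    pub input: f64,
    pub output: f64,
    /// Simplification: one cache-creation rate instead of two.
    pub cache_creation: f64,
    pub cache_read: f64,
}

/// Total cost in USD: each token class times its own rate, per million tokens.
pub fn cost_usd(u: &Usage, r: &Rates) -> f64 {
    const MTOK: f64 = 1_000_000.0;
    (u.input_tokens as f64 * r.input
        + u.output_tokens as f64 * r.output
        + u.cache_creation_input_tokens as f64 * r.cache_creation
        + u.cache_read_input_tokens as f64 * r.cache_read)
        / MTOK
}
```

As a sanity check with made-up counts, one million tokens in each class at 15 / 75 / 18.75 / 1.50 comes to $110.25.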
Run as Claude Code sessions, the identical work falls under a flat subscription — on the order of $100–200/month (approximate, tier-dependent), within its usage limits. That is roughly a 30× difference in the cost of the same search.
It is also zero orchestration to build: Claude Code already edits files, runs
builds, parses output, commits to git, and loops. FunSearch and AlphaEvolve had
to engineer all of that machinery — here autoresearch-core is ~250 lines
because the orchestration is a tool you already have.
A search only makes as many attempts a night as the evaluator allows. The trading harness is built on a Rust backtest engine, ported from a Python reference implementation — and the project benchmarked the port:
| Workload | Python | Rust | Speedup |
|---|---|---|---|
| One backtest — 1M ticks, 24 h | ~30 s | 161 ms | ~190× |
| Grid sweep — 1920 configs | ~16 h | 112 s | ~500× |
At ~30 s per backtest an overnight run gets through a few hundred candidates; at ~160 ms it gets through thousands. Same loop, same idea — but only the fast evaluator finishes a useful search by morning. (An on-disk tick cache adds a further ~118× on data loading, so repeated runs pay the parse cost once.)
Together: cheap enough and fast enough that "leave it running overnight" is a
real workflow, not a budget line. program.md is plain instructions, so the
harness still works wired to the raw API — but the path this repo is built for
is a Claude Code session. See docs/agent-loop.md.
The bin-packing example needs no data — instances come from a seeded PRNG — so it runs on a bare clone:
```
cargo run -p binpacking-search --release
```

```
train_score:       0.8724
holdout_score:     0.8651
wf_min_score:      0.8653
final_score:       0.8651   <- the number the agent maximises
train.vs_baseline: -0.0065  <- the starter heuristic; the agent's job is to beat it
```
| Example | Candidate the agent edits | Dataset | Score |
|---|---|---|---|
| binpacking-search | An online bin-packing heuristic | Seeded PRNG — no download | Mean packing efficiency |
| trading-search | A tick-level trading strategy | Bundled market-data sample | Gated Probabilistic Sharpe Ratio |
They share autoresearch-core and differ only in the domain plug-in. That is
the point: the harness is not a backtester with extra steps — trading is just
one Experiment implementation.
The loop is driven by an LLM coding agent (built for Claude Code). Point the
agent at an example's program.md and let it iterate. See
docs/agent-loop.md.
Implement the Experiment trait (two methods), write a program.md, drop in
a starter candidate. The harness handles train/holdout/walk-forward scoring,
the keep/discard ratchet, and the agent protocol. See
docs/adding-an-experiment.md.
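As a rough picture of what a two-method domain plug-in can look like, here is a hypothetical sketch. It is not the real Experiment trait: the method names, the `Split` enum, the toy domain, and the `evaluate` driver are all invented for illustration, and docs/adding-an-experiment.md remains the authority on the actual signatures.

```rust
/// Hypothetical data split; the real harness may name and cut these differently.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Split {
    Train,
    Holdout,
}

/// Hypothetical two-method shape of a domain plug-in. NOT the real trait.
pub trait Experiment {
    /// Human-readable name of the domain.
    fn name(&self) -> &str;
    /// Score the current candidate on one split; higher is better.
    fn score(&self, split: Split) -> f64;
}

/// Toy domain: mean fill fraction of a fixed set of bins per split.
pub struct ToyPacking {
    pub train_fill: Vec<f64>,
    pub holdout_fill: Vec<f64>,
}

impl Experiment for ToyPacking {
    fn name(&self) -> &str {
        "toy-packing"
    }

    fn score(&self, split: Split) -> f64 {
        let bins = match split {
            Split::Train => &self.train_fill,
            Split::Holdout => &self.holdout_fill,
        };
        if bins.is_empty() {
            return 0.0;
        }
        bins.iter().sum::<f64>() / bins.len() as f64
    }
}

/// Domain-free driver: it only talks to the trait, never to the domain.
/// Walk-forward scoring is omitted here to keep the sketch short.
pub fn evaluate(exp: &dyn Experiment) -> f64 {
    exp.score(Split::Train).min(exp.score(Split::Holdout))
}
```

The design point this illustrates is the one made above: the driver depends only on the trait, so swapping the toy domain for trading or bin packing would not touch it.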
```
core/                 autoresearch-core — domain-free harness (~250 LOC)
examples/binpacking/  non-trading example: bin-packing heuristic search
examples/trading/     trading example: strategy search
engine/               trading-engine — tick-level backtest engine
docs/                 architecture, the agent loop, adding a domain
```
MIT — see LICENSE.
This project is not affiliated with, endorsed by, or sponsored by Anthropic. "Claude" and "Claude Code" are products of Anthropic; this is an independent piece of software that targets that runtime.